Gates
Gates are the pass/fail criteria for your evaluation. They determine whether your agent meets the required performance threshold by checking aggregate metrics.
Common patterns:
- Average score must be 80%+: avg_score >= 0.8
- 90%+ of samples must pass: accuracy >= 0.9
- Custom threshold: Define per-sample pass criteria with pass_value
Basic Structure
```yaml
gate:
  metric_key: accuracy   # Which grader to evaluate
  metric: avg_score      # Use average score (default)
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold
```
Why Use Gates?
Gates provide automated pass/fail decisions for your evaluations, which is essential for:
CI/CD Integration: Gates let you block deployments if agent performance drops:
```sh
letta-evals run suite.yaml
# Exit code 0 = pass (continue deployment)
# Exit code 1 = fail (block deployment)
```
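If your pipeline is scripted in Python rather than shell, the same exit-code contract applies. A minimal sketch (assumes letta-evals is on your PATH; the deploy step is a hypothetical placeholder):

```python
import subprocess
import sys

# Run the eval suite; the letta-evals CLI exits 0 when the gate passes, 1 when it fails.
result = subprocess.run(["letta-evals", "run", "suite.yaml"])

if result.returncode != 0:
    print("Gate failed - blocking deployment")
    sys.exit(1)

print("Gate passed - continuing deployment")
# deploy()  # hypothetical placeholder for your actual deployment step
```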
Regression Testing: Set a baseline threshold and ensure new changes don’t degrade performance:
```yaml
gate:
  metric: avg_score
  op: gte
  value: 0.85   # Must maintain 85%+ to pass
```
Quality Enforcement: Require agents meet minimum standards before production:
```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.95   # 95% of test cases must pass
```
What Happens When Gates Fail?
When a gate condition is not met:
- Console output shows a failure message:

  ```
  ✗ FAILED (0.72/1.00 avg, 72.0% pass rate)
  Gate check failed: avg_score (0.72) not >= 0.80
  ```

- Exit code is 1 (non-zero indicates failure):

  ```sh
  letta-evals run suite.yaml
  echo $?  # Prints 1 if gate failed
  ```

- Results JSON includes gate_passed: false:

  ```json
  {
    "gate_passed": false,
    "gate_check": {
      "metric": "avg_score",
      "value": 0.72,
      "threshold": 0.80,
      "operator": "gte",
      "passed": false
    },
    "metrics": { ... }
  }
  ```

- All other data is preserved - you still get full results, scores, and trajectories even when gating fails
Required Fields
metric_key
Which grader to evaluate. Must match a key in your graders section:
```yaml
graders:
  accuracy:                   # Grader name
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy   # Must match grader name above
  op: gte                # >=
  value: 0.8             # 80% threshold
```
If you only have one grader, metric_key can be omitted; it will default to your single grader.
metric
Which aggregate statistic to compare. Two options:
avg_score
Average score across all samples (0.0 to 1.0):
```yaml
gate:
  metric_key: quality    # Check quality grader
  metric: avg_score      # Use average of all scores
  op: gte                # >=
  value: 0.7             # Must average 70%+
```
Example: if scores are [0.8, 0.9, 0.6], avg_score = 0.77
accuracy
Pass rate, expressed as a fraction from 0.0 to 1.0:
```yaml
gate:
  metric_key: accuracy   # Check accuracy grader
  metric: accuracy       # Use pass rate, not average
  op: gte                # >=
  value: 0.8             # 80% of samples must pass
```
By default, samples with score >= 1.0 are considered “passing”.
You can customize the per-sample threshold with pass_op and pass_value (see below).
op
Comparison operator:
- gte: Greater than or equal (>=)
- gt: Greater than (>)
- lte: Less than or equal (<=)
- lt: Less than (<)
- eq: Equal (==)
Most common: gte (at least X)
value
Threshold value for comparison:
- For avg_score: 0.0 to 1.0
- For accuracy: 0.0 to 1.0 (the fraction of samples that must pass)
```yaml
gate:
  metric: avg_score   # Average score
  op: gte             # >=
  value: 0.75         # 75% threshold
```
```yaml
gate:
  metric: accuracy    # Pass rate
  op: gte             # >=
  value: 0.9          # 90% must pass
```
Optional Fields
pass_op and pass_value
Customize when individual samples are considered “passing” (used for accuracy calculation):
```yaml
gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.8            # 80% must pass
  pass_op: gte          # Sample passes if >=
  pass_value: 0.7       # This threshold (70%)
```
Default behavior:
- If metric is avg_score: samples pass if score >= the gate value
- If metric is accuracy: samples pass if score >= 1.0 (perfect)
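For intuition, here is a rough sketch of how a per-sample pass check with pass_op and pass_value might be computed. This is illustrative only, not the library's internal code:

```python
import operator

# Map gate operators to Python comparisons (illustrative only, not library internals)
OPS = {"gte": operator.ge, "gt": operator.gt,
       "lte": operator.le, "lt": operator.lt, "eq": operator.eq}

def accuracy(scores, pass_op="gte", pass_value=1.0):
    """Fraction of samples whose score satisfies the per-sample threshold."""
    compare = OPS[pass_op]
    return sum(compare(s, pass_value) for s in scores) / len(scores)

scores = [1.0, 0.8, 0.6]
print(accuracy(scores))                                 # default: only 1.0 passes -> 0.333
print(accuracy(scores, pass_op="gte", pass_value=0.7))  # 1.0 and 0.8 pass -> 0.667
```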
Examples
Require 80% Average Score
```yaml
gate:
  metric_key: quality   # Check quality grader
  metric: avg_score     # Use average
  op: gte               # >=
  value: 0.8            # 80% average
```
Passes if the average score across all samples is >= 0.8.
Require 90% Pass Rate (Perfect Scores)
```yaml
gate:
  metric_key: accuracy   # Check accuracy grader
  metric: accuracy       # Use pass rate
  op: gte                # >=
  value: 0.9             # 90% must pass (default: score >= 1.0 to pass)
```
Passes if at least 90% of samples have score = 1.0.
Require 75% Pass Rate (Score >= 0.7)
```yaml
gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.75           # 75% must pass
  pass_op: gte          # Sample passes if >=
  pass_value: 0.7       # 70% threshold per sample
```
Passes if at least 75% of samples have score >= 0.7.
Maximum Error Rate
```yaml
gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.95           # 95% must pass (allows 5% failures)
  pass_op: gt           # Sample passes if >
  pass_value: 0.0       # 0.0 (any non-zero score)
```
Allows up to 5% failures.
Exact Pass Rate
```yaml
gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: eq                # Exactly equal
  value: 1.0            # 100% (all samples must pass)
```
All samples must pass.
Multi-Metric Gating
When you have multiple graders, you can only gate on one metric:
```yaml
graders:
  accuracy:                   # First metric
    kind: tool
    function: exact_match
    extractor: last_assistant

  completeness:               # Second metric
    kind: rubric
    prompt_path: completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

gate:
  metric_key: accuracy   # Only gate on accuracy (completeness still computed)
  metric: avg_score      # Use average
  op: gte                # >=
  value: 0.8             # 80% threshold
```
The evaluation passes or fails based on the gated metric, but results include scores for all metrics.
Understanding avg_score vs accuracy
avg_score
- Arithmetic mean of all scores
- Sensitive to partial credit
- Good for continuous evaluation
Example:
- Scores: [1.0, 0.8, 0.6]
- avg_score = (1.0 + 0.8 + 0.6) / 3 = 0.8
accuracy
- Percentage of samples meeting a threshold
- Binary pass/fail per sample
- Good for strict requirements
Example:
- Scores: [1.0, 0.8, 0.6]
- pass_value: 0.7
- Passing: [1.0, 0.8] = 2 out of 3
- accuracy = 2/3 = 0.667 (66.7%)
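To make the contrast concrete, here is a small illustrative computation with the scores above (a sketch for intuition, not the library's internal implementation):

```python
scores = [1.0, 0.8, 0.6]
pass_value = 0.7

avg_score = sum(scores) / len(scores)                          # 0.8   (partial credit counts)
accuracy = sum(s >= pass_value for s in scores) / len(scores)  # 0.667 (binary pass/fail per sample)

print(f"avg_score={avg_score:.3f}, accuracy={accuracy:.3f}")
```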
Errors and Attempted Samples
If a sample fails (error during evaluation), it:
- Gets a score of 0.0
- Counts toward total but not total_attempted
- Is included in avg_score_total but not avg_score_attempted
You can gate on either:
- avg_score_total: Includes errors as 0.0
- avg_score_attempted: Excludes errors (only successfully attempted samples)
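As a rough illustration of the difference (here errored samples are represented as None purely for the sketch; the actual results format is described in Understanding Results):

```python
# Scores for four samples; None marks a sample that errored during evaluation
scores = [1.0, 0.8, None, 0.6]

attempted = [s for s in scores if s is not None]

avg_score_total = sum(s if s is not None else 0.0 for s in scores) / len(scores)  # errors count as 0.0 -> 0.6
avg_score_attempted = sum(attempted) / len(attempted)                             # errors excluded    -> 0.8

print(avg_score_total, avg_score_attempted)
```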
Gate Results
After evaluation, you’ll see:
```
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)
```
or
```
✗ FAILED (1.80/3.00 avg, 60.0% pass rate)
```
The evaluation exit code reflects the gate result:
- 0: Passed
- 1: Failed
Advanced Gating
For complex gating logic (e.g., “pass if accuracy >= 80% OR avg_score >= 0.9”), you’ll need to:
- Run evaluation with one gate
- Examine the results JSON
- Apply custom logic in a post-processing script
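A minimal post-processing sketch in Python is shown below. It assumes the results have been saved to a results.json file shaped like the example earlier in this page, and that the metrics object exposes accuracy and avg_score keys; check your actual output for the exact file path and key names before relying on it:

```python
import json
import sys

# Load the results JSON from the evaluation run
# (the file path and exact metric key names are assumptions - check your own output).
with open("results.json") as f:
    results = json.load(f)

metrics = results.get("metrics", {})
accuracy = metrics.get("accuracy", 0.0)
avg_score = metrics.get("avg_score", 0.0)

# Custom OR logic: pass if either condition holds
passed = accuracy >= 0.8 or avg_score >= 0.9

print(f"accuracy={accuracy:.3f}, avg_score={avg_score:.3f}, passed={passed}")
sys.exit(0 if passed else 1)
```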
Next Steps
- Understanding Results - Interpreting evaluation output
- Multi-Metric Evaluation - Using multiple graders
- Suite YAML Reference - Complete gate configuration