Gates

Gates are the pass/fail criteria for your evaluation. They determine whether your agent meets the required performance threshold by checking aggregate metrics.

Common patterns:

  • Average score must be 80%+: avg_score >= 0.8
  • 90%+ of samples must pass: accuracy >= 0.9
  • Custom threshold: Define per-sample pass criteria with pass_value

A minimal gate configuration:

gate:
  metric_key: accuracy   # Which grader to evaluate
  metric: avg_score      # Use average score (default)
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold

Gates provide automated pass/fail decisions for your evaluations, which is essential for:

CI/CD Integration: Gates let you block deployments if agent performance drops:

letta-evals run suite.yaml
# Exit code 0 = pass (continue deployment)
# Exit code 1 = fail (block deployment)
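
In a CI pipeline this usually just means running the suite as a build step and letting the exit code decide. A minimal sketch, assuming GitHub Actions and that letta-evals is already installed in the job (the step name and suite path are illustrative):

- name: Run agent evals
  run: letta-evals run suite.yaml   # non-zero exit fails the job and blocks the deploy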

Regression Testing: Set a baseline threshold and ensure new changes don’t degrade performance:

gate:
  metric: avg_score
  op: gte
  value: 0.85   # Must maintain 85%+ to pass

Quality Enforcement: Require agents to meet minimum standards before production:

gate:
  metric: accuracy
  op: gte
  value: 0.95   # 95% of test cases must pass

When a gate condition is not met:

  1. Console output shows a failure message:

    ✗ FAILED (0.72/1.00 avg, 72.0% pass rate)
    Gate check failed: avg_score (0.72) not >= 0.80
  2. Exit code is 1 (non-zero indicates failure):

    letta-evals run suite.yaml
    echo $? # Prints 1 if gate failed
  3. Results JSON includes gate_passed: false:

    {
      "gate_passed": false,
      "gate_check": {
        "metric": "avg_score",
        "value": 0.72,
        "threshold": 0.80,
        "operator": "gte",
        "passed": false
      },
      "metrics": { ... }
    }
  4. All other data is preserved - you still get full results, scores, and trajectories even when gating fails
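
Because of this, a script can still inspect the results even after a failed gate. A minimal sketch in Python, assuming the results JSON shown above has been saved to a file (results.json is a placeholder path; how you write it out depends on your setup):

import json

# Placeholder path: point this at wherever your results JSON lives.
with open("results.json") as f:
    results = json.load(f)

if not results["gate_passed"]:
    check = results["gate_check"]
    print(f"Gate failed: {check['metric']} = {check['value']} "
          f"(threshold {check['threshold']}, op {check['operator']})")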

metric_key specifies which grader to evaluate. It must match a key in your graders section:

graders:
  accuracy:                 # Grader name
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy      # Must match grader name above
  op: gte                   # >=
  value: 0.8                # 80% threshold

If you only have one grader, metric_key can be omitted - it will default to your single grader.
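
For example, a suite with a single grader (the grader definition below just mirrors the earlier example) can drop metric_key entirely:

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  op: gte      # metric_key omitted: defaults to the only grader
  value: 0.8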

metric selects which aggregate statistic to compare. There are two options:

avg_score is the average score across all samples (0.0 to 1.0):

gate:
  metric_key: quality   # Check quality grader
  metric: avg_score     # Use average of all scores
  op: gte               # >=
  value: 0.7            # Must average 70%+

Example: If scores are [0.8, 0.9, 0.6], avg_score = 0.77

accuracy is the pass rate: the fraction of samples that pass (0.0 to 1.0):

gate:
  metric_key: accuracy   # Check accuracy grader
  metric: accuracy       # Use pass rate, not average
  op: gte                # >=
  value: 0.8             # 80% of samples must pass

By default, samples with score >= 1.0 are considered “passing”.

You can customize the per-sample threshold with pass_op and pass_value (see below).

op is the comparison operator:

  • gte: Greater than or equal (>=)
  • gt: Greater than (>)
  • lte: Less than or equal (<=)
  • lt: Less than (<)
  • eq: Equal (==)

Most common: gte (the metric must be at least the threshold)
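
The other operators come up when a lower score is better. A hedged sketch, assuming a hypothetical grader (here called hallucination) whose score you want to keep low:

gate:
  metric_key: hallucination   # hypothetical grader name
  metric: avg_score
  op: lte                     # average score must stay at or below the ceiling
  value: 0.2                  # 20% ceiling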

value is the threshold to compare against:

  • For avg_score: 0.0 to 1.0
  • For accuracy: 0.0 to 1.0 (the fraction of samples that must pass)
gate:
  metric: avg_score   # Average score
  op: gte             # >=
  value: 0.75         # 75% threshold

gate:
  metric: accuracy    # Pass rate
  op: gte             # >=
  value: 0.9          # 90% must pass

pass_op and pass_value customize when an individual sample is considered “passing” (used for the accuracy calculation):

gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.8            # 80% must pass
  pass_op: gte          # Sample passes if >=
  pass_value: 0.7       # This threshold (70%)

Default behavior:

  • If metric is avg_score: samples pass if score >= the gate value
  • If metric is accuracy: samples pass if score >= 1.0 (perfect)

gate:
  metric_key: quality   # Check quality grader
  metric: avg_score     # Use average
  op: gte               # >=
  value: 0.8            # 80% average

Passes if the average score across all samples is >= 0.8

gate:
  metric_key: accuracy   # Check accuracy grader
  metric: accuracy       # Use pass rate
  op: gte                # >=
  value: 0.9             # 90% must pass (default: score >= 1.0 to pass)

Passes if 90% of samples have a score >= 1.0 (a perfect score)

gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.75           # 75% must pass
  pass_op: gte          # Sample passes if >=
  pass_value: 0.7       # 70% threshold per sample

Passes if 75% of samples have score >= 0.7

gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: gte               # >=
  value: 0.95           # 95% must pass (allows 5% failures)
  pass_op: gt           # Sample passes if >
  pass_value: 0.0       # Any non-zero score counts as passing

Allows up to 5% failures.

gate:
  metric_key: quality   # Check quality grader
  metric: accuracy      # Use pass rate
  op: eq                # Exactly equal
  value: 1.0            # 100% (all samples must pass)

All samples must pass.

When you have multiple graders, you can only gate on one metric:

graders:
  accuracy:                 # First metric
    kind: tool
    function: exact_match
    extractor: last_assistant
  completeness:             # Second metric
    kind: rubric
    prompt_path: completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

gate:
  metric_key: accuracy      # Only gate on accuracy (completeness still computed)
  metric: avg_score         # Use average
  op: gte                   # >=
  value: 0.8                # 80% threshold

The evaluation passes/fails based on the gated metric, but results include scores for all metrics.

avg_score:

  • Arithmetic mean of all scores
  • Sensitive to partial credit
  • Good for continuous evaluation

Example:

  • Scores: [1.0, 0.8, 0.6]
  • avg_score = (1.0 + 0.8 + 0.6) / 3 = 0.8

accuracy:

  • Percentage of samples meeting a threshold
  • Binary pass/fail per sample
  • Good for strict requirements

Example:

  • Scores: [1.0, 0.8, 0.6]
  • pass_value: 0.7
  • Passing: [1.0, 0.8] = 2 out of 3
  • accuracy = 2/3 ≈ 0.667 (66.7%)
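
The same arithmetic as plain Python (not part of letta-evals), reproducing the two examples above:

scores = [1.0, 0.8, 0.6]
pass_value = 0.7

# avg_score: arithmetic mean of all scores
avg_score = sum(scores) / len(scores)                           # 0.8

# accuracy: fraction of samples at or above the per-sample threshold
accuracy = sum(s >= pass_value for s in scores) / len(scores)   # 2/3 ≈ 0.667

print(avg_score, accuracy)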

If a sample errors during evaluation, it:

  • Gets a score of 0.0
  • Counts toward total but not total_attempted
  • Is included in avg_score_total but not avg_score_attempted

You can gate on either:

  • avg_score_total: Includes errors as 0.0
  • avg_score_attempted: Excludes errors (only successfully attempted samples)
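
For example, to gate only on samples that ran successfully (hedged: confirm that avg_score_attempted is accepted as a metric value in your version of letta-evals):

gate:
  metric: avg_score_attempted   # average over successfully attempted samples only
  op: gte
  value: 0.8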

After evaluation, you’ll see:

✓ PASSED (2.25/3.00 avg, 75.0% pass rate)

or

✗ FAILED (1.80/3.00 avg, 60.0% pass rate)

The evaluation exit code reflects the gate result:

  • 0: Passed
  • 1: Failed

For complex gating logic (e.g., “pass if accuracy >= 80% OR avg_score >= 0.9”), you’ll need to:

  1. Run evaluation with one gate
  2. Examine the results JSON
  3. Apply custom logic in a post-processing script
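
A minimal sketch of such a script in Python, assuming the results JSON shown earlier is saved to results.json (placeholder path) and that the metrics object exposes avg_score and accuracy for the gated grader; inspect your own output and adjust the key lookups before relying on it:

import json
import sys

with open("results.json") as f:          # placeholder path
    results = json.load(f)

metrics = results["metrics"]             # structure assumed; check your actual output
avg_score = metrics.get("avg_score", 0.0)
accuracy = metrics.get("accuracy", 0.0)

# Custom OR logic: pass if either condition holds
passed = accuracy >= 0.8 or avg_score >= 0.9
print(f"custom gate {'PASSED' if passed else 'FAILED'} "
      f"(accuracy={accuracy:.2f}, avg_score={avg_score:.2f})")
sys.exit(0 if passed else 1)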