Understanding Results
This guide explains how to interpret evaluation results.
Result Structure
An evaluation produces three types of output:
- Console output: Real-time progress and summary
- Summary JSON: Aggregate metrics and configuration
- Results JSONL: Per-sample detailed results
Console Output
Progress Display
```
Running evaluation: my-eval-suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%

Results:
  Total samples: 3
  Attempted: 3
  Avg score: 0.83 (attempted: 0.83)
  Passed: 2 (66.7%)

Gate (quality >= 0.75): PASSED
```
Quiet Mode
```bash
letta-evals run suite.yaml --quiet
```
Output:
```
✓ PASSED
```
or
```
✗ FAILED
```
JSON Output
Saving Results
```bash
letta-evals run suite.yaml --output results/
```
Creates three files:
header.json
Evaluation metadata:
{ "suite_name": "my-eval-suite", "timestamp": "2025-01-15T10:30:00Z", "version": "0.3.0"}summary.json
Complete evaluation summary:
{ "suite": "my-eval-suite", "config": { "target": {...}, "graders": {...}, "gate": {...} }, "metrics": { "total": 10, "total_attempted": 10, "avg_score_attempted": 0.85, "avg_score_total": 0.85, "passed_attempts": 8, "failed_attempts": 2, "by_metric": { "accuracy": { "avg_score_attempted": 0.90, "pass_rate": 90.0, "passed_attempts": 9, "failed_attempts": 1 }, "quality": { "avg_score_attempted": 0.80, "pass_rate": 70.0, "passed_attempts": 7, "failed_attempts": 3 } } }, "gates_passed": true}results.jsonl
One JSON object per line, each representing one sample:
{"sample": {"id": 0, "input": "What is 2+2?", "ground_truth": "4"}, "submission": "4", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-123", "model_name": "default"}{"sample": {"id": 1, "input": "What is 3+3?", "ground_truth": "6"}, "submission": "6", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-124", "model_name": "default"}Metrics Explained
total

Total number of samples in the evaluation (including errors).
total_attempted
Number of samples that completed without errors.
If a sample fails during agent execution or grading, it’s counted in total but not total_attempted.
avg_score_attempted
Average score across samples that completed successfully.
Formula: sum(scores) / total_attempted
Range: 0.0 to 1.0
avg_score_total
Average score across all samples, treating errors as 0.0.
Formula: sum(scores) / total
Range: 0.0 to 1.0
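To see how these aggregates relate, here is a minimal sketch that recomputes them from results.jsonl. It assumes errored samples are flagged with an "error" key in grade.metadata, as in the Error Handling example later in this guide; adjust that check to however your results mark errors.

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Assumption: errored samples carry an "error" key in grade.metadata.
def is_error(r):
    return "error" in r["grade"].get("metadata", {})

attempted = [r for r in results if not is_error(r)]
scores = [r["grade"]["score"] for r in attempted]

total = len(results)
total_attempted = len(attempted)
avg_score_attempted = sum(scores) / total_attempted if total_attempted else 0.0
avg_score_total = sum(scores) / total if total else 0.0  # errors count as 0.0

print(f"total={total}, attempted={total_attempted}")
print(f"avg_score_attempted={avg_score_attempted:.3f}, avg_score_total={avg_score_total:.3f}")
```

The printed values should line up with the corresponding fields in summary.json.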
passed_attempts / failed_attempts
Number of samples that passed/failed the gate’s per-sample criteria.
By default:
- If gate metric is accuracy: sample passes if score >= 1.0
- If gate metric is avg_score: sample passes if score >= gate value
This behavior can be customized with pass_op and pass_value in the gate config, as sketched below.
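As a rough illustration of the per-sample check, this sketch re-applies a threshold over results.jsonl. The pass_op names and the pass_value here are stand-ins for whatever your gate config specifies (only "gte" appears in the examples in this guide; the other operator names are assumptions for illustration).

```python
import json
import operator

# Stand-in comparison table; map your gate's pass_op to a Python operator.
OPS = {"gte": operator.ge, "gt": operator.gt, "lte": operator.le, "lt": operator.lt, "eq": operator.eq}

pass_op = "gte"    # example: matches the default accuracy behavior
pass_value = 1.0   # example threshold

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# For simplicity this counts every line, including errored samples.
passed = sum(1 for r in results if OPS[pass_op](r["grade"]["score"], pass_value))
print(f"passed_attempts={passed}, failed_attempts={len(results) - passed}")
```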
by_metric
For multi-metric evaluation, shows aggregate stats for each metric:
"by_metric": { "accuracy": { "avg_score_attempted": 0.90, "avg_score_total": 0.85, "pass_rate": 90.0, "passed_attempts": 9, "failed_attempts": 1 }}Sample Results
Each sample result includes:
sample
The original dataset sample:
"sample": { "id": 0, "input": "What is 2+2?", "ground_truth": "4", "metadata": {...}}submission
The extracted text that was graded:
"submission": "The answer is 4"The grading result:
"grade": { "score": 1.0, "rationale": "Contains ground_truth: true", "metadata": {"model": "gpt-4o-mini", "usage": {...}}}grades (multi-metric)
For multi-metric evaluation:
"grades": { "accuracy": {"score": 1.0, "rationale": "Exact match"}, "quality": {"score": 0.85, "rationale": "Good but verbose"}}trajectory
The complete conversation history:
"trajectory": [ [ {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "The answer is 4"} ]]agent_id
agent_id

The ID of the agent that generated this response:
"agent_id": "agent-abc-123"model_name
The model configuration used:
"model_name": "gpt-4o-mini"agent_usage
Token usage statistics (if available):
"agent_usage": [ {"completion_tokens": 10, "prompt_tokens": 50, "total_tokens": 60}]Interpreting Scores
Interpreting Scores

Score Ranges
- 1.0: Perfect - fully meets criteria
- 0.8-0.99: Very good - minor issues
- 0.6-0.79: Good - notable improvements possible
- 0.4-0.59: Acceptable - significant issues
- 0.2-0.39: Poor - major problems
- 0.0-0.19: Failed - did not meet criteria
Binary vs Continuous
Tool graders typically return binary scores:
- 1.0: Passed
- 0.0: Failed
Rubric graders return continuous scores:
- Any value from 0.0 to 1.0
- Allows for partial credit
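One quick way to see whether a grader is behaving as binary or continuous is to bucket the scores from results.jsonl against the ranges above. A small sketch:

```python
import json
from collections import Counter

# Thresholds mirror the score ranges listed in this guide.
BUCKETS = [(1.0, "perfect"), (0.8, "very good"), (0.6, "good"),
           (0.4, "acceptable"), (0.2, "poor"), (0.0, "failed")]

def bucket(score):
    # Return the label of the first range the score falls into.
    for threshold, label in BUCKETS:
        if score >= threshold:
            return label
    return "failed"

with open("results/results.jsonl") as f:
    scores = [json.loads(line)["grade"]["score"] for line in f]

for label, count in Counter(bucket(s) for s in scores).most_common():
    print(f"{label}: {count}")
```

If only "perfect" and "failed" appear, the grader is effectively binary; a spread across buckets indicates continuous (partial-credit) scoring.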
Multi-Model Results
When testing multiple models:
"metrics": { "per_model": [ { "model_name": "gpt-4o-mini", "avg_score_attempted": 0.85, "passed_samples": 8, "failed_samples": 2 }, { "model_name": "claude-3-5-sonnet", "avg_score_attempted": 0.90, "passed_samples": 9, "failed_samples": 1 } ]}Console output:
```
Results by model:
  gpt-4o-mini - Avg: 0.85, Pass: 80.0%
  claude-3-5-sonnet - Avg: 0.90, Pass: 90.0%
```
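To compare models programmatically rather than reading the console output, a short sketch over summary.json (assuming the per_model layout shown above):

```python
import json

with open("results/summary.json") as f:
    summary = json.load(f)

per_model = summary["metrics"].get("per_model", [])
# Rank models by average score on attempted samples.
for entry in sorted(per_model, key=lambda m: m["avg_score_attempted"], reverse=True):
    total = entry["passed_samples"] + entry["failed_samples"]
    print(f"{entry['model_name']}: avg {entry['avg_score_attempted']:.2f}, "
          f"passed {entry['passed_samples']}/{total}")
```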
Section titled “Multiple Runs Statistics”Run evaluations multiple times to measure consistency and get aggregate statistics.
Configuration
Specify in YAML:
```yaml
name: my-eval-suite
dataset: dataset.jsonl
num_runs: 5  # Run 5 times
target:
  kind: agent
  agent_file: my_agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
gate:
  metric_key: accuracy
  op: gte
  value: 0.8
```
Or via CLI:
```bash
letta-evals run suite.yaml --num-runs 10 --output results/
```
Output Structure
```
results/
├── run_1/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── run_2/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── ...
└── aggregate_stats.json   # Statistics across all runs
```
Aggregate Statistics File
The aggregate_stats.json includes statistics across all runs:
{ "num_runs": 10, "runs_passed": 8, "mean_avg_score_attempted": 0.847, "std_avg_score_attempted": 0.042, "mean_avg_score_total": 0.847, "std_avg_score_total": 0.042, "mean_scores": { "accuracy": 0.89, "quality": 0.82 }, "std_scores": { "accuracy": 0.035, "quality": 0.051 }, "individual_run_metrics": [ { "avg_score_attempted": 0.85, "avg_score_total": 0.85, "pass_rate": 0.85, "by_metric": { "accuracy": { "avg_score_attempted": 0.9, "avg_score_total": 0.9, "pass_rate": 0.9 } } } // ... metrics from runs 2-10 ]}Key fields:
- num_runs: Total number of runs executed
- runs_passed: Number of runs that passed the gate
- mean_avg_score_attempted: Mean score across runs (only attempted samples)
- std_avg_score_attempted: Standard deviation (measures consistency)
- mean_scores: Mean for each metric (e.g., {"accuracy": 0.89})
- std_scores: Standard deviation for each metric (e.g., {"accuracy": 0.035})
- individual_run_metrics: Full metrics object from each individual run
Use Cases
Measure consistency of non-deterministic agents:
```bash
letta-evals run suite.yaml --num-runs 20 --output results/
# Check std_avg_score_attempted in aggregate_stats.json
# Low std = consistent, high std = variable
```
Get confidence intervals:
```python
import json
import math

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
n = stats["num_runs"]

# 95% confidence interval (assuming normal distribution)
margin = 1.96 * (std / math.sqrt(n))
print(f"Score: {mean:.3f} ± {margin:.3f}")
```
Compare metric consistency:
with open("results/aggregate_stats.json") as f: stats = json.load(f)
for metric_name, mean in stats["mean_scores"].items(): std = stats["std_scores"][metric_name] consistency = "consistent" if std < 0.05 else "variable" print(f"{metric_name}: {mean:.3f} ± {std:.3f} ({consistency})")Error Handling
If a sample encounters an error:
{ "sample": {...}, "submission": "", "grade": { "score": 0.0, "rationale": "Error during grading: Connection timeout", "metadata": {"error": "timeout", "error_type": "ConnectionError"} }}Errors:
- Count toward total but not total_attempted
- Get score of 0.0
- Include error details in rationale and metadata
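To pull out just the errored samples for debugging, a small sketch (assuming the error details land under grade.metadata as in the example above):

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Assumption: errored samples carry an "error" key in grade.metadata.
errors = [r for r in results if "error" in r["grade"].get("metadata", {})]
for r in errors:
    meta = r["grade"]["metadata"]
    print(f"Sample {r['sample']['id']}: {meta.get('error_type')} - {r['grade']['rationale']}")
```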
Analyzing Results
Find Low Scores
```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

low_scores = [r for r in results if r["grade"]["score"] < 0.5]
print(f"Found {len(low_scores)} samples with score < 0.5")

for result in low_scores:
    print(f"Sample {result['sample']['id']}: {result['grade']['rationale']}")
```
Compare Metrics
```python
import json

# Load summary
with open("results/summary.json") as f:
    summary = json.load(f)

metrics = summary["metrics"]["by_metric"]
for name, stats in metrics.items():
    print(f"{name}: {stats['avg_score_attempted']:.2f} avg, {stats['pass_rate']:.1f}% pass")
```
Extract Failures
```python
# Find samples that failed gate criteria
# (reuses `results` loaded in the Find Low Scores example above)
def gate_passed(score, threshold=0.8):
    """Per-sample pass check; adjust to match your gate's op and value."""
    return score >= threshold

failures = [
    r for r in results
    if not gate_passed(r["grade"]["score"])
]
print(f"{len(failures)} samples failed the gate criteria")
```
Next Steps
- Gates - Setting pass/fail criteria
- CLI Commands - Running evaluations