Suites
A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.
Typical workflow:
- Create a suite YAML defining what and how to test
- Run `letta-evals run suite.yaml`
- Review results showing scores for each metric
- Suite passes or fails based on gate criteria
Basic Structure
```yaml
name: my-evaluation                                    # Suite identifier
description: Optional description of what this tests  # Human-readable explanation
dataset: path/to/dataset.jsonl                         # Test cases

target:                            # What agent to evaluate
  kind: agent
  agent_file: agent.af             # Agent to test
  base_url: https://api.letta.com  # Letta server

graders:                       # How to evaluate responses
  my_metric:
    kind: tool                 # Deterministic grading
    function: exact_match      # Grading function
    extractor: last_assistant  # What to extract from agent response

gate:                    # Pass/fail criteria
  metric_key: my_metric  # Which metric to check
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold
```
Required Fields
name
The name of your evaluation suite. Used in output and results.
```yaml
name: question-answering-eval
```
dataset
Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.
```yaml
dataset: ./datasets/qa.jsonl  # Relative to suite YAML location
```
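For reference, each line of a JSONL dataset is a single JSON object. A minimal sketch using the `input`, `ground_truth`, and optional `tags` fields shown in the sample_tags example below (other sample fields may exist but are not shown here):
```jsonl
{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```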
Section titled “target”Specifies what agent to evaluate. See Targets for details.
graders
One or more graders to evaluate agent performance. See Graders for details.
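As a quick reference, a minimal graders block with a single deterministic tool grader, mirroring the Basic Structure example above:
```yaml
graders:
  my_metric:
    kind: tool                 # Deterministic grading
    function: exact_match      # Grading function
    extractor: last_assistant  # What to extract from the agent response
```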
gate
Pass/fail criteria. See Gates for details.
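And the matching gate, repeated from the Basic Structure example (the Rubric Grader example below also shows an optional `metric: avg_score` field):
```yaml
gate:
  metric_key: my_metric  # Which metric to check
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold
```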
Optional Fields
description
A human-readable description of what this suite tests:
```yaml
description: Tests the agent's ability to answer factual questions accurately
```
max_samples
Limit the number of samples to evaluate (useful for quick tests):
```yaml
max_samples: 10  # Only evaluate first 10 samples
```
sample_tags
Filter samples by tags (only evaluate samples with these tags):
```yaml
sample_tags: [math, easy]  # Only samples tagged with "math" AND "easy"
```
Dataset samples can include tags:
```json
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```
num_runs
Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):
```yaml
num_runs: 5  # Run the evaluation 5 times
```
Default: 1
setup_script
Path to a Python script with a setup function to run before evaluation:
```yaml
setup_script: setup.py:prepare_environment  # script.py:function_name
```
The setup function should have this signature:
```python
def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code here
    pass
```
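For example, a minimal sketch of a setup function that exports an environment variable before the run starts. The `suite.name` attribute access and the string annotation for `SuiteSpec` are assumptions for illustration; adapt them to the actual `SuiteSpec` object your version of letta-evals passes in:
```python
import os


def prepare_environment(suite: "SuiteSpec") -> None:
    # "SuiteSpec" is provided by letta-evals; annotated as a string here
    # because the import path is not shown in this guide.
    # Assumption: the spec exposes the suite's name as `suite.name`.
    os.environ["EVAL_SUITE_NAME"] = suite.name
```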
Path Resolution
Paths in the suite YAML are resolved relative to the YAML file location:
```
project/
├── suite.yaml
├── dataset.jsonl
└── agents/
    └── my_agent.af
```
```yaml
# In suite.yaml
dataset: dataset.jsonl            # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af  # Resolves to project/agents/my_agent.af
```
Absolute paths are used as-is.
Multi-Grader Suites
You can evaluate multiple metrics in one suite:
```yaml
graders:
  accuracy:        # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant
  completeness:    # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant
  tool_usage:      # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments  # Extract tool call arguments
```
The gate can check any of these metrics:
```yaml
gate:
  metric_key: accuracy  # Gate on accuracy metric (others still computed)
  op: gte               # Greater than or equal
  value: 0.9            # 90% threshold
```
Results will include scores for all graders, even if you only gate on one.
Examples
Simple Tool Grader Suite
```yaml
name: basic-qa            # Suite name
dataset: questions.jsonl  # Test questions

target:
  kind: agent
  agent_file: qa_agent.af          # Pre-configured agent
  base_url: https://api.letta.com  # Letta server

graders:
  accuracy:                    # Single metric
    kind: tool                 # Deterministic grading
    function: contains         # Check if ground truth is in response
    extractor: last_assistant  # Use final agent message

gate:
  metric_key: accuracy  # Gate on this metric
  op: gte               # Must be >=
  value: 0.75           # 75% to pass
```
Rubric Grader Suite
```yaml
name: quality-eval      # Quality evaluation
dataset: prompts.jsonl  # Test prompts

target:
  kind: agent
  agent_id: existing-agent-123     # Use existing agent
  base_url: https://api.letta.com  # Letta Cloud

graders:
  quality:                           # LLM-as-judge metric
    kind: rubric                     # Subjective evaluation
    prompt_path: quality_rubric.txt  # Rubric template
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # Evaluate final response

gate:
  metric_key: quality  # Gate on this metric
  metric: avg_score    # Use average score
  op: gte              # Must be >=
  value: 0.7           # 70% to pass
```
Multi-Model Suite
Test the same agent configuration across different models:
```yaml
name: model-comparison  # Compare model performance
dataset: test.jsonl     # Same test for all models

target:
  kind: agent
  agent_file: agent.af             # Same agent configuration
  base_url: https://api.letta.com  # Letta server
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]  # Test both models

graders:
  accuracy:  # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy  # Both models must pass this
  op: gte               # Must be >=
  value: 0.8            # 80% threshold
```
Results will show per-model metrics.
Validation
Validate your suite configuration before running:
```bash
letta-evals validate suite.yaml
```
This checks:
- Required fields are present
- Paths exist
- Configuration is valid
- Grader/extractor combinations are compatible