Suites

A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.

Typical workflow:

  1. Create a suite YAML defining what and how to test
  2. Run letta-evals run suite.yaml (see the command below)
  3. Review results showing scores for each metric
  4. Suite passes or fails based on gate criteria
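
For step 2, point the CLI at your suite file:

Terminal window
letta-evals run suite.yaml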

A complete suite specification has the following structure:

name: my-evaluation # Suite identifier
description: Optional description of what this tests # Human-readable explanation
dataset: path/to/dataset.jsonl # Test cases
target: # What agent to evaluate
  kind: agent
  agent_file: agent.af # Agent to test
  base_url: https://api.letta.com # Letta server
graders: # How to evaluate responses
  my_metric:
    kind: tool # Deterministic grading
    function: exact_match # Grading function
    extractor: last_assistant # What to extract from agent response
gate: # Pass/fail criteria
  metric_key: my_metric # Which metric to check
  op: gte # Greater than or equal
  value: 0.8 # 80% threshold

The name field identifies your evaluation suite and is used in output and results:

name: question-answering-eval

The dataset field is the path to a JSONL or CSV dataset file. It can be relative (to the suite YAML location) or absolute:

dataset: ./datasets/qa.jsonl # Relative to suite YAML location
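
Each line of a JSONL dataset is one sample. A minimal sketch of datasets/qa.jsonl, using the input and ground_truth fields shown in the sample format later on this page:

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4"}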

The target field specifies which agent to evaluate. See Targets for details.

The graders field defines one or more graders that score agent performance. See Graders for details.

The gate field sets the pass/fail criteria for the suite. See Gates for details.

The optional description field gives a human-readable explanation of what the suite tests:

description: Tests the agent's ability to answer factual questions accurately

max_samples limits the number of samples to evaluate (useful for quick tests):

max_samples: 10 # Only evaluate first 10 samples

sample_tags filters the dataset by tags (only samples carrying these tags are evaluated):

sample_tags: [math, easy] # Only samples tagged with "math" AND "easy"

Dataset samples can include tags:

{
  "input": "What is 2+2?",
  "ground_truth": "4",
  "tags": ["math", "easy"]
}

num_runs sets how many times to run the entire evaluation suite (useful for testing non-deterministic behavior):

num_runs: 5 # Run the evaluation 5 times

Default: 1
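
These optional fields can be combined. A sketch of a quick smoke-test configuration, placing each field at the top level as in the examples above:

name: smoke-test
dataset: ./datasets/qa.jsonl
max_samples: 10 # Quick pass over a small slice
sample_tags: [easy] # Only samples tagged "easy"
num_runs: 3 # Repeat to check stability
target:
  kind: agent
  agent_file: agent.af
  base_url: https://api.letta.com
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  metric_key: accuracy
  op: gte
  value: 0.8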

setup_script points to a Python script with a setup function that runs before evaluation:

setup_script: setup.py:prepare_environment # script.py:function_name

The setup function should have this signature:

def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code here
    pass
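
A slightly fuller sketch of a setup function. The SuiteSpec import path and the LETTA_API_KEY variable name are assumptions for illustration, not taken from the letta-evals docs:

import os

from letta_evals import SuiteSpec  # import path is an assumption; adjust to your installation


def prepare_environment(suite: SuiteSpec) -> None:
    # Runs once before any samples are evaluated.
    # Here we just check that credentials are present (the variable name is an assumption).
    if "LETTA_API_KEY" not in os.environ:
        raise RuntimeError("Set LETTA_API_KEY before running this suite")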

Paths in the suite YAML are resolved relative to the YAML file location:

project/
├── suite.yaml
├── dataset.jsonl
└── agents/
    └── my_agent.af

# In suite.yaml
dataset: dataset.jsonl # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af # Resolves to project/agents/my_agent.af

Absolute paths are used as-is.

You can evaluate multiple metrics in one suite:

graders:
  accuracy: # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant
  completeness: # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant
  tool_usage: # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments # Extract tool call arguments

The gate can check any of these metrics:

gate:
  metric_key: accuracy # Gate on accuracy metric (others still computed)
  op: gte # Greater than or equal
  value: 0.9 # 90% threshold

Results will include scores for all graders, even if you only gate on one.

A basic question-answering suite with a single deterministic grader:

name: basic-qa # Suite name
dataset: questions.jsonl # Test questions
target:
  kind: agent
  agent_file: qa_agent.af # Pre-configured agent
  base_url: https://api.letta.com # Letta server
graders:
  accuracy: # Single metric
    kind: tool # Deterministic grading
    function: contains # Check if ground truth is in response
    extractor: last_assistant # Use final agent message
gate:
  metric_key: accuracy # Gate on this metric
  op: gte # Must be >=
  value: 0.75 # 75% to pass

A suite that uses an LLM judge to grade response quality on an existing agent:

name: quality-eval # Quality evaluation
dataset: prompts.jsonl # Test prompts
target:
  kind: agent
  agent_id: existing-agent-123 # Use existing agent
  base_url: https://api.letta.com # Letta Cloud
graders:
  quality: # LLM-as-judge metric
    kind: rubric # Subjective evaluation
    prompt_path: quality_rubric.txt # Rubric template
    model: gpt-4o-mini # Judge model
    temperature: 0.0 # Deterministic
    extractor: last_assistant # Evaluate final response
gate:
  metric_key: quality # Gate on this metric
  metric: avg_score # Use average score
  op: gte # Must be >=
  value: 0.7 # 70% to pass

Test the same agent configuration across different models:

name: model-comparison # Compare model performance
dataset: test.jsonl # Same test for all models
target:
  kind: agent
  agent_file: agent.af # Same agent configuration
  base_url: https://api.letta.com # Letta server
  model_configs: [gpt-4o-mini, claude-3-5-sonnet] # Test both models
graders:
  accuracy: # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  metric_key: accuracy # Both models must pass this
  op: gte # Must be >=
  value: 0.8 # 80% threshold

Results will show per-model metrics.

Validate your suite configuration before running:

Terminal window
letta-evals validate suite.yaml

This checks:

  • Required fields are present
  • Paths exist
  • Configuration is valid
  • Grader/extractor combinations are compatible
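
Once validation passes, run the suite as usual:

Terminal window
letta-evals validate suite.yaml && letta-evals run suite.yaml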