Suites
A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.
Typical workflow:
- Create a suite YAML defining what and how to test
- Run `letta-evals run suite.yaml`
- Review results showing scores for each metric
- Suite passes or fails based on gate criteria
Basic Structure
```yaml
name: my-evaluation                                    # Suite identifier
description: Optional description of what this tests  # Human-readable explanation
dataset: path/to/dataset.jsonl                         # Test cases

target:                            # What agent to evaluate
  kind: agent
  agent_file: agent.af             # Agent to test
  base_url: https://api.letta.com  # Letta server

graders:                       # How to evaluate responses
  my_metric:
    kind: tool                 # Deterministic grading
    function: exact_match      # Grading function
    extractor: last_assistant  # What to extract from agent response

gate:                    # Pass/fail criteria
  metric_key: my_metric  # Which metric to check
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold
```
Required Fields
name
The name of your evaluation suite. Used in output and results.
```yaml
name: question-answering-eval
```
dataset
Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.
```yaml
dataset: ./datasets/qa.jsonl  # Relative to suite YAML location
```
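For reference, each line of a JSONL dataset is a single JSON object. A minimal sketch using the `input`, `ground_truth`, and optional `tags` fields shown in the sample_tags example below (other sample fields may exist but are not shown here):
```jsonl
{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```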
Section titled “target”Specifies what agent to evaluate. See Targets for details.
graders
One or more graders to evaluate agent performance. See Graders for details.
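As a quick reference, a minimal graders block with a single deterministic tool grader, mirroring the Basic Structure example above:
```yaml
graders:
  my_metric:
    kind: tool                 # Deterministic grading
    function: exact_match      # Grading function
    extractor: last_assistant  # What to extract from the agent response
```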
gate
Pass/fail criteria. See Gates for details.
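And the matching gate, repeated from the Basic Structure example (the Rubric Grader example below also shows an optional `metric: avg_score` field):
```yaml
gate:
  metric_key: my_metric  # Which metric to check
  op: gte                # Greater than or equal
  value: 0.8             # 80% threshold
```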
Optional Fields
description
A human-readable description of what this suite tests:
```yaml
description: Tests the agent's ability to answer factual questions accurately
```
max_samples
Limit the number of samples to evaluate (useful for quick tests):
```yaml
max_samples: 10  # Only evaluate first 10 samples
```
sample_tags
Filter samples by tags (only evaluate samples with these tags):
```yaml
sample_tags: [math, easy]  # Only samples tagged with "math" AND "easy"
```
Dataset samples can include tags:
```json
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```
num_runs
Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):
```yaml
num_runs: 5  # Run the evaluation 5 times
```
Default: 1
setup_script
Path to a Python script with a setup function to run before evaluation:
```yaml
setup_script: setup.py:prepare_environment  # script.py:function_name
```
The setup function should have this signature:
```python
def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code here
    pass
```
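For example, a minimal sketch of a setup function that exports an environment variable before the run starts. The `suite.name` attribute access and the string annotation for `SuiteSpec` are assumptions for illustration; adapt them to the actual `SuiteSpec` object your version of letta-evals passes in:
```python
import os


def prepare_environment(suite: "SuiteSpec") -> None:
    # "SuiteSpec" is provided by letta-evals; annotated as a string here
    # because the import path is not shown in this guide.
    # Assumption: the spec exposes the suite's name as `suite.name`.
    os.environ["EVAL_SUITE_NAME"] = suite.name
```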
Path Resolution
Paths in the suite YAML are resolved relative to the YAML file location:
```
project/
├── suite.yaml
├── dataset.jsonl
└── agents/
    └── my_agent.af
```
```yaml
# In suite.yaml
dataset: dataset.jsonl            # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af  # Resolves to project/agents/my_agent.af
```
Absolute paths are used as-is.
Multi-Grader Suites
You can evaluate multiple metrics in one suite:
```yaml
graders:
  accuracy:        # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant
  completeness:    # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant
  tool_usage:      # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments  # Extract tool call arguments
```
The gate can check any of these metrics:
```yaml
gate:
  metric_key: accuracy  # Gate on accuracy metric (others still computed)
  op: gte               # Greater than or equal
  value: 0.9            # 90% threshold
```
Results will include scores for all graders, even if you only gate on one.
Examples
Simple Tool Grader Suite
```yaml
name: basic-qa            # Suite name
dataset: questions.jsonl  # Test questions

target:
  kind: agent
  agent_file: qa_agent.af          # Pre-configured agent
  base_url: https://api.letta.com  # Letta server

graders:
  accuracy:                    # Single metric
    kind: tool                 # Deterministic grading
    function: contains         # Check if ground truth is in response
    extractor: last_assistant  # Use final agent message

gate:
  metric_key: accuracy  # Gate on this metric
  op: gte               # Must be >=
  value: 0.75           # 75% to pass
```
Rubric Grader Suite
```yaml
name: quality-eval      # Quality evaluation
dataset: prompts.jsonl  # Test prompts

target:
  kind: agent
  agent_id: existing-agent-123     # Use existing agent
  base_url: https://api.letta.com  # Letta Cloud

graders:
  quality:                           # LLM-as-judge metric
    kind: rubric                     # Subjective evaluation
    prompt_path: quality_rubric.txt  # Rubric template
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # Evaluate final response

gate:
  metric_key: quality  # Gate on this metric
  metric: avg_score    # Use average score
  op: gte              # Must be >=
  value: 0.7           # 70% to pass
```
Multi-Model Suite
Test the same agent configuration across different models:
```yaml
name: model-comparison  # Compare model performance
dataset: test.jsonl     # Same test for all models

target:
  kind: agent
  agent_file: agent.af             # Same agent configuration
  base_url: https://api.letta.com  # Letta server
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]  # Test both models

graders:
  accuracy:  # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy  # Both models must pass this
  op: gte               # Must be >=
  value: 0.8            # 80% threshold
```
Results will show per-model metrics.
Validation
Validate your suite configuration before running:
```bash
letta-evals validate suite.yaml
```
This checks:
- Required fields are present
- Paths exist
- Configuration is valid
- Grader/extractor combinations are compatible