Suite YAML Reference

Complete reference for suite configuration files.

A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.

See Getting Started for a tutorial, or Core Concepts for a conceptual overview.

name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)

target: object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string (one of: agent_id, agent_file, agent_script)
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array

graders: object (required)
  <metric_key>: object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string

gate: object (required)
  metric_key: string
  metric: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_op: "gte" | "gt" | "lte" | "lt" | "eq"
  pass_value: float

Suite name, used in output and results.

Type: string

name: question-answering-eval

Human-readable description of what the suite tests.

Type: string

description: Tests agent's ability to answer factual questions accurately

Path to JSONL dataset file. Relative paths are resolved from the suite YAML location.

Type: path (string)

dataset: ./datasets/qa.jsonl
dataset: /absolute/path/to/dataset.jsonl

Limit the number of samples to evaluate. Useful for quick tests.

Type: integer | Default: All samples

max_samples: 10 # Only evaluate first 10 samples

Filter samples by tags. Only samples with ALL specified tags are evaluated.

Type: array of strings

sample_tags: [math, easy] # Only samples tagged with both

Number of times to run the evaluation suite.

Type: integer | Default: 1

num_runs: 5 # Run the evaluation 5 times

Path to Python script with setup function.

Type: string (format: path/to/script.py:function_name)

setup_script: setup.py:prepare_environment
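
The Python side of this example might look roughly like the following. The exact signature letta-evals expects for the setup function is not shown here, so treat this as a sketch assuming a no-argument function that is called once before the suite runs.

# setup.py -- illustrative sketch only, not letta-evals source.
# Assumes setup_script points at a plain function that takes no
# arguments and is invoked once before the evaluation starts.
import os

def prepare_environment():
    """Hypothetical setup hook: prepare any state the agent under test needs."""
    os.makedirs("tmp/eval-artifacts", exist_ok=True)
    os.environ.setdefault("EVAL_MODE", "1")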

Configuration for the agent being evaluated.

Type of target. Currently only "agent" is supported.

target:
  kind: agent

Letta server URL. Default: https://api.letta.com

target:
  base_url: https://api.letta.com

API key for Letta authentication. Can also be set via LETTA_API_KEY environment variable.

target:
  api_key: your-api-key-here

Request timeout in seconds. Default: 300.0

target:
  timeout: 600.0 # 10 minutes

Exactly one of agent_id, agent_file, or agent_script must be specified:

ID of existing agent on the server.

target:
  agent_id: agent-123-abc

Path to .af agent file.

target:
  agent_file: ./agents/my_agent.af

Path to Python script with agent factory.

target:
  agent_script: factory.py:MyAgentFactory
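
As an illustration only, the referenced script could define a factory class along these lines. The actual interface the factory must implement (method names, arguments, return type) is documented separately; the create_agent method below is a placeholder assumption.

# factory.py -- hypothetical skeleton, not the real letta-evals interface.
# The class name matches the agent_script example above; the method name
# and signature are placeholders -- see Targets for the required contract.
class MyAgentFactory:
    def create_agent(self, client):
        """Placeholder: create or fetch the agent to evaluate and return it."""
        raise NotImplementedError("Implement per the Targets documentation")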

See Targets for details on agent sources.

List of model configuration names to test. Cannot be used with model_handles.

target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]

List of model handles for cloud deployments. Cannot be used with model_configs.

target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]

One or more graders, each with a unique key.

Grader type: "tool" or "rubric".

graders:
  my_metric:
    kind: tool

Name of the extractor to use.

graders:
  my_metric:
    extractor: last_assistant

Name of the grading function.

graders:
  accuracy:
    kind: tool
    function: exact_match
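
Conceptually, a tool grader such as exact_match compares the extracted text against the sample's expected answer and returns a score. The sketch below only illustrates that idea; the parameter names and signature are assumptions, not the library's actual grading-function API.

# Conceptual sketch of an exact-match style grading function -- not
# letta-evals source code. Parameter names (extracted, expected) are
# assumptions for illustration.
def exact_match(extracted: str, expected: str) -> float:
    """Return 1.0 when the extracted output equals the expected answer."""
    return 1.0 if extracted.strip() == expected.strip() else 0.0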

Inline rubric prompt (prompt) or path to a rubric file (prompt_path).

graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.

LLM model for judging. Default: gpt-4o-mini

graders:
  quality:
    kind: rubric
    model: gpt-4o

Temperature for LLM generation. Default: 0.0

graders:
  quality:
    kind: rubric
    temperature: 0.0

Path to .af agent file to use as judge.

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    prompt_path: rubric.txt

Pass/fail criteria for the evaluation.

Which grader to evaluate. If only one grader, this can be omitted.

gate:
  metric_key: accuracy

Which aggregate to compare: avg_score or accuracy. Default: avg_score

gate:
  metric: avg_score

Comparison operator: gte, gt, lte, lt, eq

gate:
  op: gte # Greater than or equal

Threshold value for comparison (0.0 to 1.0).

gate:
  value: 0.8 # Require >= 0.8

A minimal suite:

name: basic-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  op: gte
  value: 0.8

A more complete suite with multiple graders:

name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl
target:
  kind: agent
  agent_file: agent.af
graders:
  accuracy:
    kind: tool
    function: contains
    extractor: last_assistant
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    extractor: last_assistant
gate:
  metric_key: accuracy
  op: gte
  value: 0.85

Validate your suite before running:

letta-evals validate suite.yaml

Related pages:

  • Targets - Understanding agent sources and configuration
  • Graders - Tool graders vs rubric graders
  • Extractors - What to extract from agent responses
  • Gates - Setting pass/fail criteria