# Suite YAML Reference
Complete reference for suite configuration files.
A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.
See Getting Started for a tutorial, or Core Concepts for a conceptual overview.
## File Structure
```yaml
name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)

target: object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string       # one of: agent_id, agent_file, agent_script
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array

graders: object (required)
  <metric_key>: object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string

gate: object (required)
  metric_key: string
  metric: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_op: "gte" | "gt" | "lte" | "lt" | "eq"
  pass_value: float
```

## Top-Level Fields
### name (required)
Suite name, used in output and results.
Type: string
```yaml
name: question-answering-eval
```

### description (optional)
Human-readable description of what the suite tests.
Type: string
```yaml
description: Tests agent's ability to answer factual questions accurately
```

### dataset (required)
Path to the JSONL dataset file. Relative paths are resolved from the suite YAML's location.
Type: path (string)
```yaml
dataset: ./datasets/qa.jsonl
# or an absolute path:
dataset: /absolute/path/to/dataset.jsonl
```

### max_samples (optional)
Limit the number of samples to evaluate. Useful for quick tests.
Type: integer | Default: All samples
```yaml
max_samples: 10  # Only evaluate first 10 samples
```

### sample_tags (optional)
Filter samples by tags. Only samples with ALL specified tags are evaluated.
Type: array of strings
```yaml
sample_tags: [math, easy]  # Only samples tagged with both
```
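The two sampling fields can be combined to evaluate a small tagged subset; the values here are illustrative:

```yaml
sample_tags: [math, easy]  # keep only samples tagged with both
max_samples: 10            # evaluate at most 10 of them
```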
### num_runs (optional)

Number of times to run the evaluation suite (for example, to gauge run-to-run variance).
Type: integer | Default: 1
```yaml
num_runs: 5  # Run the evaluation 5 times
```

### setup_script (optional)
Path to a Python script with a setup function.
Type: string (format: path/to/script.py:function_name)
```yaml
setup_script: setup.py:prepare_environment
```

## target (required)
Configuration for the agent being evaluated.
### kind (required)
Type of target. Currently only "agent" is supported.
```yaml
target:
  kind: agent
```

### base_url (optional)
Letta server URL. Default: https://api.letta.com
```yaml
target:
  base_url: https://api.letta.com
```
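To target a self-hosted server, point base_url at your own deployment instead; the address below is a hypothetical local example, not a documented default:

```yaml
target:
  base_url: http://localhost:8283  # hypothetical self-hosted address; use your server's actual URL
```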
### api_key (optional)

API key for Letta authentication. Can also be set via the LETTA_API_KEY environment variable.
```yaml
target:
  api_key: your-api-key-here
```

### timeout (optional)
Request timeout in seconds. Default: 300.0
```yaml
target:
  timeout: 600.0  # 10 minutes
```

### Agent Source (required, pick one)
Exactly one of these must be specified:
#### agent_id
ID of an existing agent on the server.
```yaml
target:
  agent_id: agent-123-abc
```

#### agent_file
Path to a .af agent file.
```yaml
target:
  agent_file: ./agents/my_agent.af
```

#### agent_script
Path to a Python script with an agent factory.
```yaml
target:
  agent_script: factory.py:MyAgentFactory
```

See Targets for details on agent sources.
### model_configs (optional)
List of model configuration names to test. Cannot be used with model_handles.
```yaml
target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]
```

### model_handles (optional)
List of model handles for cloud deployments. Cannot be used with model_configs.
```yaml
target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]
```
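Either model list combines with an agent source, so the same agent can be evaluated across several models. A sketch assembled from the fields in this section (values are illustrative):

```yaml
target:
  kind: agent
  agent_file: agent.af  # agent source, as documented above
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]
```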
## graders (required)

One or more graders, each with a unique key.
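For example, two graders registered under the unique keys accuracy and quality, assembled from the field examples below:

```yaml
graders:
  accuracy:                 # one metric key
    kind: tool
    function: exact_match
    extractor: last_assistant
  quality:                  # a second grader under its own key
    kind: rubric
    prompt_path: rubrics/quality.txt
    extractor: last_assistant
```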
### kind (required)
Grader type: "tool" or "rubric".
```yaml
graders:
  my_metric:
    kind: tool
```

### extractor (required)
Name of the extractor to use.
```yaml
graders:
  my_metric:
    extractor: last_assistant
```

### Tool Grader Fields
#### function (required for tool graders)
Name of the grading function.
```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
```

### Rubric Grader Fields
#### prompt or prompt_path (required)
Inline rubric prompt, or path to a rubric file.
```yaml
graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.
```
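Or reference an external rubric file with prompt_path instead of an inline prompt:

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
```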
#### model (optional)

LLM model for judging. Default: gpt-4o-mini
```yaml
graders:
  quality:
    kind: rubric
    model: gpt-4o
```

#### temperature (optional)
Temperature for LLM generation. Default: 0.0
```yaml
graders:
  quality:
    kind: rubric
    temperature: 0.0
```

#### agent_file (agent-as-judge)
Path to a .af agent file to use as the judge.
```yaml
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    prompt_path: rubric.txt
```

## gate (required)
Pass/fail criteria for the evaluation.
### metric_key (optional)
Which grader to evaluate. If there is only one grader, this can be omitted.
```yaml
gate:
  metric_key: accuracy
```

### metric (optional)
Which aggregate to compare: avg_score or accuracy. Default: avg_score
```yaml
gate:
  metric: avg_score
```

### op (required)
Comparison operator: gte, gt, lte, lt, eq
```yaml
gate:
  op: gte  # Greater than or equal
```

### value (required)
Threshold value for comparison (0.0 to 1.0).
```yaml
gate:
  value: 0.8  # Require >= 0.8
```
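Putting the gate fields together (the grader key and threshold here are illustrative):

```yaml
gate:
  metric_key: quality   # which grader to gate on
  metric: avg_score     # which aggregate to compare
  op: gte
  value: 0.75           # illustrative threshold
```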
## Complete Examples

### Minimal Suite
```yaml
name: basic-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: agent.af

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  op: gte
  value: 0.8
```

### Multi-Metric Suite
```yaml
name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl

target:
  kind: agent
  agent_file: agent.af

graders:
  accuracy:
    kind: tool
    function: contains
    extractor: last_assistant
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    extractor: last_assistant

gate:
  metric_key: accuracy
  op: gte
  value: 0.85
```
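### Model Sweep Suite

A sketch assembled from the fields documented above, combining repeated runs with a model sweep; all names and values are illustrative:

```yaml
name: sweep-eval
dataset: dataset.jsonl
num_runs: 3  # repeat the evaluation three times

target:
  kind: agent
  agent_file: agent.af
  # evaluate the same agent across two model handles (mutually exclusive with model_configs)
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  op: gte
  value: 0.8
```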
## Validation

Validate your suite before running it:
```bash
letta-evals validate suite.yaml
```

## Next Steps
- Targets - Understanding agent sources and configuration
- Graders - Tool graders vs rubric graders
- Extractors - What to extract from agent responses
- Gates - Setting pass/fail criteria