Getting Started

Run your first Letta agent evaluation in 5 minutes.

Prerequisites:
  • Python 3.11 or higher
  • A running Letta server (local or Letta Cloud)
  • A Letta agent to test, either in agent file format or by ID (see Targets for more details)

Install letta-evals with pip:

Terminal window
pip install letta-evals

Or with uv:

Terminal window
uv pip install letta-evals

Next, prepare the agent you want to test. You can export an existing agent to a file using the Letta SDK:

from letta_client import Letta
import os

# Connect to Letta Cloud
client = Letta(token=os.getenv("LETTA_API_KEY"))

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)

Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.
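
If you're running a self-hosted Letta server instead of Letta Cloud, the export flow is the same; a minimal sketch, assuming the client accepts a base_url pointing at the default local address:

from letta_client import Letta

# Assumes a self-hosted server on the default local address; adjust if yours differs
client = Letta(base_url="http://localhost:8283")

# The export itself is the same as in the Letta Cloud example above
agent_file = client.agents.export_file(agent_id="agent-123")
with open("my_agent.af", "w") as f:
    f.write(agent_file)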

Then reference it in your suite:

target:
  kind: agent
  agent_file: my_agent.af

Let’s create your first evaluation in 3 steps:

Step 1. Create a file named dataset.jsonl:

{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}

Each line is a JSON object with:

  • input: The prompt to send to your agent
  • ground_truth: The expected answer (used for grading)

Read more about Datasets for details on how to create your dataset.
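
If your test cases live in code rather than a hand-written file, you can also generate the JSONL with Python's standard json module; a minimal sketch using placeholder cases:

import json

# Placeholder test cases; replace with your own inputs and expected answers
cases = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

# Write one JSON object per line (the JSONL format shown above)
with open("dataset.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")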

Step 2. Create a file named suite.yaml:

name: my-first-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: my_agent.af         # Path to your agent file
  base_url: https://api.letta.com # Letta Cloud (default)
  token: ${LETTA_API_KEY}         # Your API key
graders:
  quality:
    kind: tool
    function: contains            # Check if response contains the ground truth
    extractor: last_assistant     # Use the last assistant message
gate:
  metric_key: quality
  op: gte
  value: 0.75                     # Require 75% pass rate

The suite configuration defines:

  • dataset: The JSONL file of test cases to run
  • target: The agent to evaluate and how to connect to it
  • graders: How each response is scored
  • gate: The overall pass/fail criteria for the run

Read more about Suites for details on how to configure your evaluation.
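
The ${LETTA_API_KEY} placeholder is substituted from your environment, so make sure the variable is set before you run the suite (Unix-like shell shown; use your real key):

Terminal window
export LETTA_API_KEY="your-api-key"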

Step 3. Run your evaluation with the following command:

Terminal window
letta-evals run suite.yaml

You’ll see real-time progress as your evaluation runs:

Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)

Read more about CLI Commands for details about the available commands and options.

The core evaluation flow is:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

The evaluation runner:

  1. Loads your dataset
  2. Sends each input to your agent (Target)
  3. Extracts the relevant information (using the Extractor)
  4. Grades the response (using the Grader function)
  5. Computes aggregate metrics
  6. Checks if metrics pass the Gate criteria

The output shows:

  • Average score: Mean score across all samples
  • Pass rate: Percentage of samples that passed
  • Gate status: Whether the evaluation passed or failed overall
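
To make the flow concrete, here is a purely illustrative sketch of that loop in plain Python. It is not how letta-evals is implemented; it only mirrors the steps and metrics described above, using the contains grader and the 0.75 gate from the example suite:

# Purely illustrative: mirrors the Dataset -> Target -> Extractor -> Grader -> Gate
# flow described above. This is NOT the letta-evals implementation.
import json
from typing import Callable

def contains(extracted: str, ground_truth: str) -> float:
    # Grader: 1.0 if the ground truth appears in the extracted text, else 0.0
    return 1.0 if ground_truth.lower() in extracted.lower() else 0.0

def run_suite(dataset_path: str, target: Callable[[str], str], gate_value: float = 0.75) -> bool:
    scores = []
    with open(dataset_path) as f:
        for line in f:
            sample = json.loads(line)            # 1. load a dataset sample
            response = target(sample["input"])   # 2. send the input to the agent
            extracted = response                 # 3. extract (here: the raw response)
            scores.append(contains(extracted, sample["ground_truth"]))  # 4. grade
    avg_score = sum(scores) / len(scores)        # 5. aggregate metrics
    pass_rate = sum(s >= 1.0 for s in scores) / len(scores)
    print(f"avg={avg_score:.2f} pass_rate={pass_rate:.0%}")
    return pass_rate >= gate_value               # 6. gate check (op: gte)

# Example with a stand-in "agent" that always answers "Paris"
print(run_suite("dataset.jsonl", target=lambda prompt: "Paris"))

In the real runner, the target is your Letta agent and the extractor and grader are whatever you configured in suite.yaml.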

Now that you’ve run your first evaluation, explore more advanced features:

Use exact matching for cases where the answer must be precisely correct:

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
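
Compared with the contains grader, which passes when the ground truth appears anywhere in the extracted text, exact_match requires the whole extracted text to match it (the precise normalization rules may differ; treat this as an approximation). A rough illustration in plain Python, not the library's code:

ground_truth = "Paris"
response = "The capital of France is Paris."

# contains-style check: passes because "Paris" appears somewhere in the response
print(ground_truth in response)   # True

# exact-match-style check: fails unless the agent replies with exactly "Paris"
print(response == ground_truth)   # False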

Use an LLM judge to evaluate subjective qualities like helpfulness or tone:

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant

Then create rubric.txt:

Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong

Verify that your agent calls specific tools with expected arguments:

graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
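
Here the tool_arguments extractor pulls out the arguments your agent passed to the search tool (tool_name names whichever tool you want to verify), and the contains grader checks ground_truth against them. An illustrative dataset row that should pass if the agent calls search with "Paris" somewhere in its arguments:

{"input": "Look up the weather in Paris", "ground_truth": "Paris"}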

Check if the agent correctly updates its memory blocks:

graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
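
In this case the memory_block extractor reads the contents of the agent's human memory block after the conversation, so ground_truth should be something you expect the agent to have recorded there. An illustrative dataset row:

{"input": "Hi! My name is Alice and I live in Berlin.", "ground_truth": "Alice"}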

For more help, see the Troubleshooting Guide.