Getting Started
Run your first Letta agent evaluation in 5 minutes.
Prerequisites
- Python 3.11 or higher
- A running Letta server (local or Letta Cloud)
- A Letta agent to test, either in agent file format or by ID (see Targets for more details)
Installation
```bash
pip install letta-evals
```
Or with uv:
```bash
uv pip install letta-evals
```
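To confirm the install worked, you can check that the package imports cleanly; a minimal sketch (the module name letta_evals is an assumption based on the usual dash-to-underscore packaging convention):
```python
# Sanity check: confirm the package is importable after installation.
# The module name letta_evals is assumed (pip names with dashes usually
# map to underscore module names).
import letta_evals

print("letta-evals imported successfully")
```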
Getting an Agent to Test
Export an existing agent to a file using the Letta SDK:
```python
from letta_client import Letta
import os

# Connect to Letta Cloud
client = Letta(token=os.getenv("LETTA_API_KEY"))

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```
Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.
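If you run a self-hosted Letta server instead of Letta Cloud, the same export flow should work by pointing the client at your server; a minimal sketch, assuming the default local address of http://localhost:8283:
```python
# Hypothetical local-server variant of the export above.
# Assumes a self-hosted Letta server listening on the default http://localhost:8283.
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

agent_file = client.agents.export_file(agent_id="agent-123")
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```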
Then reference it in your suite:
```yaml
target:
  kind: agent
  agent_file: my_agent.af
```
Quick Start
Let’s create your first evaluation in 3 steps:
1. Create a Test Dataset
Create a file named dataset.jsonl:
{"input": "What's the capital of France?", "ground_truth": "Paris"}{"input": "Calculate 2+2", "ground_truth": "4"}{"input": "What color is the sky?", "ground_truth": "blue"}Each line is a JSON object with:
- input: The prompt to send to your agent
- ground_truth: The expected answer (used for grading)
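If you prefer, you can generate the file from a script instead of writing it by hand; a small sketch using only the standard library:
```python
# Write dataset.jsonl programmatically: one JSON object per line.
import json

samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```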
Read more about Datasets for details on how to create your dataset.
2. Create a Suite Configuration
Create a file named suite.yaml:
```yaml
name: my-first-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: my_agent.af          # Path to your agent file
  base_url: https://api.letta.com  # Letta Cloud (default)
  token: ${LETTA_API_KEY}          # Your API key

graders:
  quality:
    kind: tool
    function: contains         # Check if response contains the ground truth
    extractor: last_assistant  # Use the last assistant message

gate:
  metric_key: quality
  op: gte
  value: 0.75  # Require 75% pass rate
```
The suite configuration defines:
- dataset: the test samples to run
- target: the agent to evaluate and how to connect to it
- graders: how each response is scored
- gate: the threshold the aggregate metrics must meet for the evaluation to pass
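If you want to catch indentation mistakes early, you can check that the file parses as YAML before running; an optional sketch using PyYAML (installed separately with pip install pyyaml, not part of letta-evals itself):
```python
# Optional: catch YAML indentation/typo errors before running the evaluation.
# Requires PyYAML (pip install pyyaml).
import yaml

with open("suite.yaml") as f:
    suite = yaml.safe_load(f)

print(f"Loaded suite '{suite['name']}' with dataset '{suite['dataset']}'")
```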
Read more about Suites for details on how to configure your evaluation.
3. Run the Evaluation
Run your evaluation with the following command:
```bash
letta-evals run suite.yaml
```
You’ll see real-time progress as your evaluation runs:
```
Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)
```
Read more about CLI Commands for details about the available commands and options.
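To run the same evaluation from a script or CI job, you can shell out to the CLI; a sketch that assumes the command exits non-zero when the gate fails (check the CLI Commands docs to confirm the exact exit-code behavior):
```python
# Run the evaluation from a script/CI job and propagate the result.
# Assumption: the CLI exits with a non-zero status when the gate fails.
import subprocess
import sys

result = subprocess.run(["letta-evals", "run", "suite.yaml"])
sys.exit(result.returncode)
```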
Understanding the Results
The core evaluation flow is:
Dataset → Target (Agent) → Extractor → Grader → Gate → Result
The evaluation runner performs the following steps (see the sketch after this list):
- Loads your dataset
- Sends each input to your agent (Target)
- Extracts the relevant information (using the Extractor)
- Grades the response (using the Grader function)
- Computes aggregate metrics
- Checks if metrics pass the Gate criteria
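As a rough mental model (not the actual letta-evals internals), the steps above look roughly like this:
```python
# Simplified pseudocode of the evaluation flow; names are illustrative only.
def run_suite(dataset, target, extractor, grader, gate):
    scores = []
    for sample in dataset:                                         # 1. load dataset samples
        response = target.send(sample["input"])                    # 2. send input to the agent
        extracted = extractor(response)                            # 3. pull out the part to grade
        scores.append(grader(extracted, sample["ground_truth"]))   # 4. grade the response
    avg_score = sum(scores) / len(scores)                          # 5. aggregate metrics
    return gate(avg_score)                                         # 6. pass/fail against the gate
```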
The output shows:
- Average score: Mean score across all samples
- Pass rate: Percentage of samples that passed
- Gate status: Whether the evaluation passed or failed overall
Next Steps
Now that you’ve run your first evaluation, explore more advanced features:
- Core Concepts - Understand suites, datasets, graders, and extractors
- Grader Types - Learn about tool graders vs rubric graders
- Multi-Metric Evaluation - Test multiple aspects simultaneously
- Custom Graders - Write custom grading functions
- Multi-Turn Conversations - Test conversational memory
Common Use Cases
Strict Answer Checking
Use exact matching for cases where the answer must be precisely correct:
```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
```
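Conceptually, the difference between exact_match and the contains check used earlier is roughly the following (a simplified sketch, not the actual implementations, which may normalize text differently):
```python
# Illustrative only: the real graders may handle whitespace and casing differently.
def exact_match(output: str, ground_truth: str) -> float:
    return 1.0 if output.strip() == ground_truth.strip() else 0.0

def contains(output: str, ground_truth: str) -> float:
    return 1.0 if ground_truth.lower() in output.lower() else 0.0
```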
Subjective Quality Evaluation
Use an LLM judge to evaluate subjective qualities like helpfulness or tone:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
```
Then create rubric.txt:
```
Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong
```
Testing Tool Calls
Verify that your agent calls specific tools with expected arguments:
```yaml
graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
```
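Conceptually, this configuration pulls the arguments your agent passed to the search tool and checks that the ground truth appears in them; a simplified, hypothetical illustration (the actual message structure used by letta-evals may differ):
```python
# Hypothetical illustration of what the tool_arguments extractor + contains grader check.
tool_call = {"name": "search", "arguments": {"query": "latest Letta release notes"}}
ground_truth = "release notes"

passed = ground_truth in str(tool_call["arguments"])
print(passed)  # True if the expected text appears in the tool arguments
```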
Testing Memory Persistence
Check if the agent correctly updates its memory blocks:
```yaml
graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
```
Troubleshooting
For more help, see the Troubleshooting Guide.