Getting Started
Run your first Letta agent evaluation in 5 minutes.
Prerequisites
- Python 3.11 or higher
- A running Letta server (local or Letta Cloud)
- A Letta agent to test, either in agent file format or by ID (see Targets for more details)
Installation
```bash
pip install letta-evals
```
Or with uv:
```bash
uv pip install letta-evals
```
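To confirm the install worked, you can check that the package imports cleanly; a minimal sketch (the module name letta_evals is an assumption based on the usual dash-to-underscore packaging convention):
```python
# Sanity check: confirm the package is importable after installation.
# The module name letta_evals is assumed (pip names with dashes usually
# map to underscore module names).
import letta_evals

print("letta-evals imported successfully")
```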
Getting an Agent to Test
Export an existing agent to a file using the Letta SDK:
```python
from letta_client import Letta
import os

# Connect to Letta Cloud
client = Letta(token=os.getenv("LETTA_API_KEY"))

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```
Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.
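If you run a self-hosted Letta server instead of Letta Cloud, the same export flow should work by pointing the client at your server; a minimal sketch, assuming the default local address of http://localhost:8283:
```python
# Hypothetical local-server variant of the export above.
# Assumes a self-hosted Letta server listening on the default http://localhost:8283.
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

agent_file = client.agents.export_file(agent_id="agent-123")
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```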
Then reference it in your suite:
```yaml
target:
  kind: agent
  agent_file: my_agent.af
```
Quick Start
Let’s create your first evaluation in 3 steps:
1. Create a Test Dataset
Create a file named dataset.jsonl:
{"input": "What's the capital of France?", "ground_truth": "Paris"}{"input": "Calculate 2+2", "ground_truth": "4"}{"input": "What color is the sky?", "ground_truth": "blue"}Each line is a JSON object with:
- input: The prompt to send to your agent
- ground_truth: The expected answer (used for grading)
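If you prefer, you can generate the file from a script instead of writing it by hand; a small sketch using only the standard library:
```python
# Write dataset.jsonl programmatically: one JSON object per line.
import json

samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```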
Read more about Datasets for details on how to create your dataset.
2. Create a Suite Configuration
Create a file named suite.yaml:
```yaml
name: my-first-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: my_agent.af          # Path to your agent file
  base_url: https://api.letta.com  # Letta Cloud (default)
  token: ${LETTA_API_KEY}          # Your API key

graders:
  quality:
    kind: tool
    function: contains         # Check if response contains the ground truth
    extractor: last_assistant  # Use the last assistant message

gate:
  metric_key: quality
  op: gte
  value: 0.75  # Require 75% pass rate
```
The suite configuration defines:
- dataset: the test samples to run
- target: the agent to evaluate and how to connect to it
- graders: how each response is scored
- gate: the threshold the aggregate metrics must meet for the evaluation to pass
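If you want to catch indentation mistakes early, you can check that the file parses as YAML before running; an optional sketch using PyYAML (installed separately with pip install pyyaml, not part of letta-evals itself):
```python
# Optional: catch YAML indentation/typo errors before running the evaluation.
# Requires PyYAML (pip install pyyaml).
import yaml

with open("suite.yaml") as f:
    suite = yaml.safe_load(f)

print(f"Loaded suite '{suite['name']}' with dataset '{suite['dataset']}'")
```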
Read more about Suites for details on how to configure your evaluation.
3. Run the Evaluation
Run your evaluation with the following command:
```bash
letta-evals run suite.yaml
```
You’ll see real-time progress as your evaluation runs:
```
Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)
```
Read more about CLI Commands for details about the available commands and options.
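To run the same evaluation from a script or CI job, you can shell out to the CLI; a sketch that assumes the command exits non-zero when the gate fails (check the CLI Commands docs to confirm the exact exit-code behavior):
```python
# Run the evaluation from a script/CI job and propagate the result.
# Assumption: the CLI exits with a non-zero status when the gate fails.
import subprocess
import sys

result = subprocess.run(["letta-evals", "run", "suite.yaml"])
sys.exit(result.returncode)
```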
Understanding the Results
The core evaluation flow is:
Dataset → Target (Agent) → Extractor → Grader → Gate → Result
The evaluation runner performs the following steps (see the sketch after this list):
- Loads your dataset
- Sends each input to your agent (Target)
- Extracts the relevant information (using the Extractor)
- Grades the response (using the Grader function)
- Computes aggregate metrics
- Checks if metrics pass the Gate criteria
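As a rough mental model (not the actual letta-evals internals), the steps above look roughly like this:
```python
# Simplified pseudocode of the evaluation flow; names are illustrative only.
def run_suite(dataset, target, extractor, grader, gate):
    scores = []
    for sample in dataset:                                         # 1. load dataset samples
        response = target.send(sample["input"])                    # 2. send input to the agent
        extracted = extractor(response)                            # 3. pull out the part to grade
        scores.append(grader(extracted, sample["ground_truth"]))   # 4. grade the response
    avg_score = sum(scores) / len(scores)                          # 5. aggregate metrics
    return gate(avg_score)                                         # 6. pass/fail against the gate
```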
The output shows:
- Average score: Mean score across all samples
- Pass rate: Percentage of samples that passed
- Gate status: Whether the evaluation passed or failed overall
Next Steps
Now that you’ve run your first evaluation, explore more advanced features:
- Core Concepts - Understand suites, datasets, graders, and extractors
- Grader Types - Learn about tool graders vs rubric graders
- Multi-Metric Evaluation - Test multiple aspects simultaneously
- Custom Graders - Write custom grading functions
- Multi-Turn Conversations - Test conversational memory
Common Use Cases
Strict Answer Checking
Use exact matching for cases where the answer must be precisely correct:
```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
```
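Conceptually, the difference between exact_match and the contains check used earlier is roughly the following (a simplified sketch, not the actual implementations, which may normalize text differently):
```python
# Illustrative only: the real graders may handle whitespace and casing differently.
def exact_match(output: str, ground_truth: str) -> float:
    return 1.0 if output.strip() == ground_truth.strip() else 0.0

def contains(output: str, ground_truth: str) -> float:
    return 1.0 if ground_truth.lower() in output.lower() else 0.0
```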
Subjective Quality Evaluation
Use an LLM judge to evaluate subjective qualities like helpfulness or tone:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
```
Then create rubric.txt:
```
Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong
```
Testing Tool Calls
Verify that your agent calls specific tools with expected arguments:
```yaml
graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
```
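Conceptually, this configuration pulls the arguments your agent passed to the search tool and checks that the ground truth appears in them; a simplified, hypothetical illustration (the actual message structure used by letta-evals may differ):
```python
# Hypothetical illustration of what the tool_arguments extractor + contains grader check.
tool_call = {"name": "search", "arguments": {"query": "latest Letta release notes"}}
ground_truth = "release notes"

passed = ground_truth in str(tool_call["arguments"])
print(passed)  # True if the expected text appears in the tool arguments
```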
Testing Memory Persistence
Check if the agent correctly updates its memory blocks:
```yaml
graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
```
Troubleshooting
For more help, see the Troubleshooting Guide.