Core Concepts

Understanding how Letta Evals works and what makes it different.

Letta Evals is a testing framework specifically designed for agents that maintain state. Unlike traditional eval frameworks built for simple input-output models, Letta Evals understands that agents:

  • Maintain memory across conversations
  • Use tools and external functions
  • Evolve their behavior based on interactions
  • Have persistent context and state

This means you can test aspects of your agent that other frameworks can’t: memory updates, multi-turn conversations, tool usage patterns, and state evolution over time.

Every evaluation follows this flow:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

  1. Dataset: Your test cases (questions, scenarios, expected outputs)
  2. Target: The agent being evaluated
  3. Extractor: Pulls out the relevant information from the agent’s response
  4. Grader: Scores the extracted information
  5. Gate: Pass/fail criteria for the overall evaluation
  6. Result: Metrics, scores, and detailed results
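
For example, a sample might ask "What is my name?" with ground truth "Alice": the target agent answers, an extractor pulls the last assistant message from the trajectory, an exact-match grader scores it 1.0 or 0.0 against "Alice", and the gate passes the run if, say, at least 80% of samples score correctly.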

With Letta Evals, you can test aspects of agents that traditional frameworks can’t:

  • Memory updates: Did the agent correctly remember the user’s name?
  • Multi-turn conversations: Can the agent maintain context across multiple exchanges?
  • Tool usage: Does the agent call the right tools with the right arguments?
  • State evolution: How does the agent’s internal state change over time?

AI agents are complex systems that can behave unpredictably. Without systematic evaluation, you can’t:

  • Know if changes improve or break your agent: did that prompt tweak help or hurt?
  • Prevent regressions: catch when “fixes” break existing functionality
  • Compare approaches objectively: which model works better for your use case?
  • Build confidence before deployment: ensure quality before shipping to users
  • Track improvement over time: measure progress as you iterate

Manual testing doesn’t scale; evals let you test hundreds of scenarios in minutes, so you can:

  • Test prompt changes instantly across your entire test suite
  • Experiment with different models and compare results
  • Validate that new features work as expected
  • Prevent regressions when modifying agent behavior
  • Ensure agents handle edge cases correctly
  • Verify tool usage and memory updates
  • Compare GPT-4 vs Claude vs other models on your specific use case
  • Test different model configurations (temperature, system prompts, etc.)
  • Find the right cost/performance tradeoff
  • Measure agent performance on standard tasks
  • Track improvements over time
  • Share reproducible results with your team
  • Validate agents meet quality thresholds before deployment
  • Run continuous evaluation in CI/CD pipelines
  • Monitor production agent quality

Letta Evals is built around a few key concepts that work together to create a flexible evaluation framework.

An evaluation suite is a complete test configuration defined in a YAML file. It ties together:

  • Which dataset to use
  • Which agent to test
  • How to grade responses
  • What criteria determine pass/fail

Think of a suite as a reusable test specification.
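
As a rough sketch, a minimal suite file might look like this (the dataset and target key names here are illustrative assumptions; the graders and gate blocks follow the format shown later on this page, and the Suites page documents the exact schema):

dataset: data/memory_recall.jsonl        # test cases, one JSON object per line
target:
  agent_file: agents/support_agent.af    # illustrative: an .af file, agent ID, or creation script
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  metric_key: accuracy
  op: gte
  value: 0.8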

A dataset is a JSONL file where each line represents one test case. Each sample has:

  • An input (what to ask the agent)
  • Optional ground truth (the expected answer)
  • Optional metadata (tags, custom fields)
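
For example, a two-sample dataset might look like this (the field names are illustrative assumptions; see the Datasets page for the exact schema):

{"input": "My name is Alice. What is my name?", "ground_truth": "Alice", "metadata": {"tags": ["memory"]}}
{"input": "What is 2 + 2?", "ground_truth": "4"}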

The target is what you’re evaluating. Currently, this is a Letta agent, specified by:

  • An agent file (.af)
  • An existing agent ID
  • A Python script that creates agents programmatically

A trajectory is the complete conversation history from one test case. It’s a list of turns, where each turn contains a list of Letta messages (assistant messages, tool calls, tool returns, etc.).
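
To make the shape concrete, here is a simplified Python sketch (real Letta messages carry more fields than shown; the dictionaries below only illustrate the nesting of turns and messages):

# A trajectory is a list of turns; each turn is a list of Letta messages.
trajectory = [
    [  # turn 1: the agent stores the user's name, then replies
        {"message_type": "tool_call_message", "tool": "core_memory_append", "arguments": {"content": "User's name is Alice"}},
        {"message_type": "tool_return_message", "status": "success"},
        {"message_type": "assistant_message", "content": "Nice to meet you, Alice!"},
    ],
    [  # turn 2: the agent answers from memory
        {"message_type": "assistant_message", "content": "Your name is Alice."},
    ],
]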

An extractor determines what part of the trajectory to evaluate. For example:

  • The last thing the agent said
  • All tool calls made
  • Content from agent memory
  • Text matching a pattern

A grader scores how well the agent performed. There are two types:

  • Tool graders: Python functions that compare submission to ground truth
  • Rubric graders: LLM judges that evaluate based on custom criteria
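
As a rough sketch, a tool grader is just a Python function that returns a score between 0.0 and 1.0 (the exact signature and how custom graders are registered are covered on the Graders page; the function below is a hypothetical example):

def fuzzy_contains(submission: str, ground_truth: str) -> float:
    """Hypothetical tool grader: full credit if the ground truth appears in the submission."""
    return 1.0 if ground_truth.strip().lower() in submission.strip().lower() else 0.0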

A gate is the pass/fail threshold for your evaluation. It compares aggregate metrics (like average score or pass rate) against a target value.

You can define multiple graders in one suite to evaluate different aspects:

graders:
  accuracy:                        # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant      # Use final response
  tool_usage:                      # Check if agent called the right tool
    kind: tool
    function: contains
    extractor: tool_arguments      # Extract tool call args
    extractor_config:
      tool_name: search            # From search tool

The gate can check any of these metrics:

gate:
  metric_key: accuracy             # Gate on accuracy (tool_usage still computed)
  op: gte                          # >=
  value: 0.8                       # 80% threshold

All scores are normalized to the range [0.0, 1.0]:

  • 0.0 = complete failure
  • 1.0 = perfect success
  • Values in between = partial credit

This allows different grader types to be compared and combined.

Individual sample scores are aggregated in two ways:

  1. Average Score: Mean of all scores (0.0 to 1.0)
  2. Accuracy/Pass Rate: Percentage of samples passing a threshold

You can gate on either metric type.
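
For example, four samples scoring 1.0, 1.0, 0.5, and 0.0 average to 0.625, and with a pass threshold of 0.7 (an illustrative value) two of the four pass, for an accuracy of 0.5. The framework computes these aggregates for you; the sketch below just shows the arithmetic:

scores = [1.0, 1.0, 0.5, 0.0]
average = sum(scores) / len(scores)                     # 0.625
accuracy = sum(s >= 0.7 for s in scores) / len(scores)  # 0.5 (2 of 4 pass the threshold)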

Dive deeper into each concept:

  • Suites - Suite configuration in detail
  • Datasets - Creating effective test datasets
  • Targets - Agent configuration options
  • Graders - Understanding grader types
  • Extractors - Extraction strategies
  • Gates - Setting pass/fail criteria