Core Concepts

Understanding how Letta Evals works and what makes it different.

Letta Evals is a testing framework specifically designed for agents that maintain state. Unlike traditional eval frameworks built for simple input-output models, Letta Evals understands that agents:

  • Maintain memory across conversations
  • Use tools and external functions
  • Evolve their behavior based on interactions
  • Have persistent context and state

This means you can test aspects of your agent that other frameworks can’t: memory updates, multi-turn conversations, tool usage patterns, and state evolution over time.

Every evaluation follows this flow:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

  1. Dataset: Your test cases (questions, scenarios, expected outputs)
  2. Target: The agent being evaluated
  3. Extractor: Pulls out the relevant information from the agent’s response
  4. Grader: Scores the extracted information
  5. Gate: Pass/fail criteria for the overall evaluation
  6. Result: Metrics, scores, and detailed results
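
For example, a sample might ask "What is my name?" with ground truth "Alice": the target agent answers, an extractor pulls the last assistant message from the trajectory, an exact-match grader scores it 1.0 or 0.0 against "Alice", and the gate passes the run if, say, at least 80% of samples score correctly.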

With Letta Evals, you can test aspects of agents that traditional frameworks can’t:

  • Memory updates: Did the agent correctly remember the user’s name?
  • Multi-turn conversations: Can the agent maintain context across multiple exchanges?
  • Tool usage: Does the agent call the right tools with the right arguments?
  • State evolution: How does the agent’s internal state change over time?

AI agents are complex systems that can behave unpredictably. Without systematic evaluation, you can’t:

  • Know if changes improve or break your agent: did that prompt tweak help or hurt?
  • Prevent regressions: catch when “fixes” break existing functionality
  • Compare approaches objectively: which model works better for your use case?
  • Build confidence before deployment: ensure quality before shipping to users
  • Track improvement over time: measure progress as you iterate

Manual testing doesn’t scale; evals let you test hundreds of scenarios in minutes, so you can:

  • Test prompt changes instantly across your entire test suite
  • Experiment with different models and compare results
  • Validate that new features work as expected
  • Prevent regressions when modifying agent behavior
  • Ensure agents handle edge cases correctly
  • Verify tool usage and memory updates
  • Compare GPT-4 vs Claude vs other models on your specific use case
  • Test different model configurations (temperature, system prompts, etc.)
  • Find the right cost/performance tradeoff
  • Measure agent performance on standard tasks
  • Track improvements over time
  • Share reproducible results with your team
  • Validate agents meet quality thresholds before deployment
  • Run continuous evaluation in CI/CD pipelines
  • Monitor production agent quality

Letta Evals is built around a few key concepts that work together to create a flexible evaluation framework.

An evaluation suite is a complete test configuration defined in a YAML file. It ties together:

  • Which dataset to use
  • Which agent to test
  • How to grade responses
  • What criteria determine pass/fail

Think of a suite as a reusable test specification.
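
As a rough sketch, a minimal suite file might look like this (the dataset and target key names here are illustrative assumptions; the graders and gate blocks follow the format shown later on this page, and the Suites page documents the exact schema):

dataset: data/memory_recall.jsonl        # test cases, one JSON object per line
target:
  agent_file: agents/support_agent.af    # illustrative: an .af file, agent ID, or creation script
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  metric_key: accuracy
  op: gte
  value: 0.8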

A dataset is a JSONL file where each line represents one test case. Each sample has:

  • An input (what to ask the agent)
  • Optional ground truth (the expected answer)
  • Optional metadata (tags, custom fields)
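
For example, a two-sample dataset might look like this (the field names are illustrative assumptions; see the Datasets page for the exact schema):

{"input": "My name is Alice. What is my name?", "ground_truth": "Alice", "metadata": {"tags": ["memory"]}}
{"input": "What is 2 + 2?", "ground_truth": "4"}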

The target is what you’re evaluating. Currently, this is a Letta agent, specified by:

  • An agent file (.af)
  • An existing agent ID
  • A Python script that creates agents programmatically

A trajectory is the complete conversation history from one test case. It’s a list of turns, where each turn contains a list of Letta messages (assistant messages, tool calls, tool returns, etc.).
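
To make the shape concrete, here is a simplified Python sketch (real Letta messages carry more fields than shown; the dictionaries below only illustrate the nesting of turns and messages):

# A trajectory is a list of turns; each turn is a list of Letta messages.
trajectory = [
    [  # turn 1: the agent stores the user's name, then replies
        {"message_type": "tool_call_message", "tool": "core_memory_append", "arguments": {"content": "User's name is Alice"}},
        {"message_type": "tool_return_message", "status": "success"},
        {"message_type": "assistant_message", "content": "Nice to meet you, Alice!"},
    ],
    [  # turn 2: the agent answers from memory
        {"message_type": "assistant_message", "content": "Your name is Alice."},
    ],
]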

An extractor determines what part of the trajectory to evaluate. For example:

  • The last thing the agent said
  • All tool calls made
  • Content from agent memory
  • Text matching a pattern

A grader scores how well the agent performed. There are two types:

  • Tool graders: Python functions that compare submission to ground truth
  • Rubric graders: LLM judges that evaluate based on custom criteria
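
As a rough sketch, a tool grader is just a Python function that returns a score between 0.0 and 1.0 (the exact signature and how custom graders are registered are covered on the Graders page; the function below is a hypothetical example):

def fuzzy_contains(submission: str, ground_truth: str) -> float:
    """Hypothetical tool grader: full credit if the ground truth appears in the submission."""
    return 1.0 if ground_truth.strip().lower() in submission.strip().lower() else 0.0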

A gate is the pass/fail threshold for your evaluation. It compares aggregate metrics (like average score or pass rate) against a target value.

You can define multiple graders in one suite to evaluate different aspects:

graders:
  accuracy:                        # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant      # Use final response
  tool_usage:                      # Check if agent called the right tool
    kind: tool
    function: contains
    extractor: tool_arguments      # Extract tool call args
    extractor_config:
      tool_name: search            # From search tool

The gate can check any of these metrics:

gate:
  metric_key: accuracy             # Gate on accuracy (tool_usage still computed)
  op: gte                          # >=
  value: 0.8                       # 80% threshold

All scores are normalized to the range [0.0, 1.0]:

  • 0.0 = complete failure
  • 1.0 = perfect success
  • Values in between = partial credit

This allows different grader types to be compared and combined.

Individual sample scores are aggregated in two ways:

  1. Average Score: Mean of all scores (0.0 to 1.0)
  2. Accuracy/Pass Rate: Percentage of samples passing a threshold

You can gate on either metric type.
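
For example, four samples scoring 1.0, 1.0, 0.5, and 0.0 average to 0.625, and with a pass threshold of 0.7 (an illustrative value) two of the four pass, for an accuracy of 0.5. The framework computes these aggregates for you; the sketch below just shows the arithmetic:

scores = [1.0, 1.0, 0.5, 0.0]
average = sum(scores) / len(scores)                     # 0.625
accuracy = sum(s >= 0.7 for s in scores) / len(scores)  # 0.5 (2 of 4 pass the threshold)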

Dive deeper into each concept:

  • Suites - Suite configuration in detail
  • Datasets - Creating effective test datasets
  • Targets - Agent configuration options
  • Graders - Understanding grader types
  • Extractors - Extraction strategies
  • Gates - Setting pass/fail criteria