Targets

A target is the agent you’re evaluating. In Letta Evals, the target configuration determines how agents are created, accessed, and tested.

When to use each approach:

  • agent_file - Pre-configured agents saved as .af files (most common)
  • agent_id - Testing existing agents or multi-turn conversations with state
  • agent_script - Dynamic agent creation with per-sample customization

The target configuration specifies how to create or access the agent for evaluation.

All targets have a kind field (currently only agent is supported):

target:
  kind: agent # Currently only "agent" is supported
  # ... agent-specific configuration

You must specify exactly one of agent_file, agent_id, or agent_script:

Path to a .af (Agent File) to upload:

target:
  kind: agent
  agent_file: path/to/agent.af # Path to .af file
  base_url: https://api.letta.com # Letta server URL

The agent file will be uploaded to the Letta server and a new agent created for the evaluation.

ID of an existing agent on the server:

target:
  kind: agent
  agent_id: agent-123-abc # ID of existing agent
  base_url: https://api.letta.com # Letta server URL

Path to a Python script with an agent factory function for programmatic agent creation:

target:
  kind: agent
  agent_script: create_agent.py:create_inventory_agent # script.py:function_name
  base_url: https://api.letta.com # Letta server URL

Format: path/to/script.py:function_name

The function must be decorated with @agent_factory and must be an async function with the signature (client: AsyncLetta, sample: Sample) -> str:

from letta_client import AsyncLetta, CreateBlock
from letta_evals.decorators import agent_factory
from letta_evals.models import Sample


@agent_factory
async def create_inventory_agent(client: AsyncLetta, sample: Sample) -> str:
    """Create and return agent ID for this sample."""
    # Access custom arguments from the dataset
    item = sample.agent_args.get("item", {})

    # Create agent with sample-specific configuration
    agent = await client.agents.create(
        name="inventory-assistant",
        memory_blocks=[
            CreateBlock(
                label="item_context",
                value=f"Item: {item.get('name', 'Unknown')}",
            )
        ],
        agent_type="letta_v1_agent",
        model="openai/gpt-4.1-mini",
        embedding="openai/text-embedding-3-small",
    )
    return agent.id

Key features:

  • Creates a fresh agent for each sample
  • Can customize agents using sample.agent_args from the dataset (see the example after these lists)
  • Allows testing agent creation logic itself
  • Useful when you don’t have pre-saved agent files

When to use:

  • Testing agent creation workflows
  • Dynamic per-sample agent configuration
  • Agents that need sample-specific memory or tools
  • Programmatic agent testing
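
For example, a dataset row can pass per-sample arguments through agent_args, which the factory above reads as sample.agent_args. The row below is illustrative only (the input field name is an assumption about the dataset schema, not taken from this page):

{"input": "How many Blue Widgets are in stock?", "agent_args": {"item": {"name": "Blue Widget", "sku": "BW-001"}}}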

Letta server URL:

target:
  base_url: http://localhost:8283 # Local Letta server
  # or
  base_url: https://api.letta.com # Letta Cloud

Default: https://api.letta.com

API key for authentication (required for Letta Cloud):

target:
  api_key: your-api-key-here # Required for Letta Cloud

Or set via environment variable:

Terminal window
export LETTA_API_KEY=your-api-key-here

Letta project ID (for Letta Cloud):

target:
  project_id: proj_abc123 # Letta Cloud project

Or set via environment variable:

Terminal window
export LETTA_PROJECT_ID=proj_abc123

Request timeout in seconds:

target:
  timeout: 300.0 # Request timeout (5 minutes)

Default: 300 seconds

Test the same agent across different models:

List of model configuration names from JSON files:

target:
  kind: agent
  agent_file: agent.af
  model_configs: [gpt-4o-mini, claude-3-5-sonnet] # Test with both models

The evaluation will run once for each model config. Model configs are JSON files in letta_evals/llm_model_configs/.
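
As a rough sketch only (the exact schema is defined by the JSON files shipped in letta_evals/llm_model_configs/, which you should treat as authoritative), a config named gpt-4o-mini might contain fields along these lines:

{
  "model": "gpt-4o-mini",
  "model_endpoint_type": "openai",
  "model_endpoint": "https://api.openai.com/v1",
  "context_window": 128000
}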

List of model handles (cloud-compatible identifiers):

target:
  kind: agent
  agent_file: agent.af
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"] # Cloud model identifiers

Use this for Letta Cloud deployments.

Evaluate a local agent file:

target:
  kind: agent
  agent_file: ./agents/my_agent.af # Pre-configured agent
  base_url: http://localhost:8283 # Local server

Evaluate an existing agent on Letta Cloud:

target:
  kind: agent
  agent_id: agent-cloud-123 # Existing cloud agent
  base_url: https://api.letta.com # Letta Cloud
  api_key: ${LETTA_API_KEY} # From environment variable
  project_id: proj_abc # Your project ID

Run a model sweep with the same agent:

target:
  kind: agent
  agent_file: agent.af # Same agent configuration
  base_url: http://localhost:8283 # Local server
  model_configs: [gpt-4o-mini, gpt-4o, claude-3-5-sonnet] # Test 3 models

Results will include per-model metrics:

Model: gpt-4o-mini - Avg: 0.85, Pass: 85.0%
Model: gpt-4o - Avg: 0.92, Pass: 92.0%
Model: claude-3-5-sonnet - Avg: 0.88, Pass: 88.0%

Create agents programmatically:

target:
  kind: agent
  agent_script: setup.py:CustomAgentFactory # Programmatic creation
  base_url: http://localhost:8283 # Local server

Configuration values are resolved in this order (highest priority first):

  1. CLI arguments (--api-key, --base-url, --project-id)
  2. Suite YAML configuration
  3. Environment variables (LETTA_API_KEY, LETTA_BASE_URL, LETTA_PROJECT_ID)
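
For example, an exported variable applies only when neither the CLI flag nor the suite YAML sets the same value (the URL below is just a placeholder for a local server):

Terminal window
export LETTA_BASE_URL=http://localhost:8283 # Used only if --base-url and the suite YAML leave base_url unset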

The way your agent is specified fundamentally changes how the evaluation runs:

With agent_file or agent_script: Independent Testing

Agent lifecycle:

  1. A fresh agent instance is created for each sample
  2. Agent processes the sample input(s)
  3. Agent remains on the server after the sample completes

Testing behavior: Each sample is an independent, isolated test. Agent state (memory, message history) does not carry over between samples. This enables parallel execution and ensures reproducible results.

Use cases:

  • Testing how the agent responds to various independent inputs
  • Ensuring consistent behavior across different scenarios
  • Regression testing where each case should be isolated
  • Evaluating agent responses without prior context

With agent_id: Conversation Testing

Agent lifecycle:

  1. The same agent instance is used for all samples
  2. Agent processes each sample in sequence
  3. Agent state persists throughout the entire evaluation

Testing behavior: The dataset becomes a conversation script where each sample builds on previous ones. Agent memory and message history accumulate, and earlier interactions affect later responses. Samples must execute sequentially.
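
To illustrate (again with an assumed dataset schema; only the ordering matters here), an agent_id evaluation could use rows that read as one continuous conversation, where the second turn only makes sense if the first was remembered:

{"input": "My name is Sam and I prefer metric units."}
{"input": "What units should you use for me, and why?"}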

Use cases:

  • Testing multi-turn conversations with context
  • Evaluating how agent memory evolves over time
  • Simulating a single user session with multiple interactions
  • Testing scenarios where context should accumulate

Aspect           | agent_file / agent_script | agent_id
Agent instances  | New agent per sample      | Same agent for all samples
State isolation  | Fully isolated            | State carries over
Execution        | Can run in parallel       | Must run sequentially
Memory           | Fresh for each sample     | Accumulates across samples
Use case         | Independent test cases    | Conversation scripts
Reproducibility  | Highly reproducible       | Depends on execution order

The runner validates:

  • Exactly one of agent_file, agent_id, or agent_script is specified
  • Agent files have .af extension
  • Agent script paths are valid