Graders

Graders are the scoring functions that evaluate agent responses. They take the extracted submission (from an extractor) and assign a score between 0.0 (complete failure) and 1.0 (perfect success).

When to use each:

  • Tool graders: fast, deterministic, and free; ideal for exact matching, pattern checks, and tool call validation
  • Rubric graders: flexible and able to judge subjective criteria, but each evaluation costs API calls; ideal for quality, creativity, and nuanced evaluation


There are two types of graders:

Tool graders are Python functions that programmatically compare the submission to ground truth or apply deterministic checks.

graders:
  accuracy:
    kind: tool                   # Deterministic grading
    function: exact_match        # Built-in grading function
    extractor: last_assistant    # Use final agent response

Best for:

  • Exact matching
  • Pattern checking
  • Tool call validation
  • Deterministic criteria

Rubric graders perform LLM-as-judge evaluation using custom prompts and criteria. They can use either direct LLM API calls or a Letta agent as the judge.

Standard rubric grading (LLM API):

graders:
  quality:
    kind: rubric                 # LLM-as-judge
    prompt_path: rubric.txt      # Custom evaluation criteria
    model: gpt-4o-mini           # Judge model
    extractor: last_assistant    # What to evaluate

Agent-as-judge (Letta agent):

graders:
  agent_judge:
    kind: rubric                 # Still "rubric" kind
    agent_file: judge.af         # Judge agent with submit_grade tool
    prompt_path: rubric.txt      # Evaluation criteria
    extractor: last_assistant    # What to evaluate

Best for:

  • Subjective quality assessment
  • Open-ended responses
  • Nuanced evaluation
  • Complex criteria
  • Judges that need tools (when using agent-as-judge)

exact_match checks whether the submission exactly matches the ground truth (case-sensitive, whitespace-trimmed).

graders:
  accuracy:
    kind: tool
    function: exact_match        # Case-sensitive, whitespace-trimmed
    extractor: last_assistant    # Extract final response

Requires: ground_truth in dataset

Score: 1.0 if exact match, 0.0 otherwise
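
As a mental model of the behavior described above (not the library's actual implementation), exact_match can be sketched in Python:

def exact_match(submission: str, ground_truth: str) -> float:
    # Illustrative sketch only: case-sensitive equality after trimming
    # surrounding whitespace, as described above.
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0

# Trailing whitespace is ignored; case differences are not.
assert exact_match("Paris\n", "Paris") == 1.0
assert exact_match("paris", "Paris") == 0.0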

contains checks whether the submission contains the ground truth as a substring (case-insensitive).

graders:
  contains_answer:
    kind: tool
    function: contains           # Case-insensitive substring match
    extractor: last_assistant    # Search in final response

Requires: ground_truth in dataset

Score: 1.0 if found, 0.0 otherwise
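
A similar sketch of the described behavior, again purely illustrative:

def contains(submission: str, ground_truth: str) -> float:
    # Illustrative sketch only: case-insensitive substring check.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0

assert contains("The capital of France is Paris.", "paris") == 1.0
assert contains("I don't know.", "paris") == 0.0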

regex_match checks whether the submission matches the regex pattern supplied as the ground truth.

graders:
  pattern:
    kind: tool
    function: regex_match        # Pattern matching
    extractor: last_assistant    # Check final response

Dataset sample:

{
  "input": "Generate a UUID",
  "ground_truth": "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
}

Score: 1.0 if pattern matches, 0.0 otherwise
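
The matching semantics can be sketched as follows; note it is an assumption here that the pattern is searched for anywhere in the submission rather than required to match the whole string:

import re

def regex_match(submission: str, pattern: str) -> float:
    # Illustrative sketch only. Assumes the ground-truth pattern is searched
    # for anywhere in the submission; the built-in may require a full match.
    return 1.0 if re.search(pattern, submission) else 0.0

uuid_pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
assert regex_match("Here it is: 123e4567-e89b-42d3-a456-426614174000", uuid_pattern) == 1.0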

ascii_printable_only validates that every character in the submission is printable ASCII (useful for ASCII art and formatted output).

graders:
  ascii_check:
    kind: tool
    function: ascii_printable_only   # Validate ASCII characters
    extractor: last_assistant        # Check final response

Does not require ground truth.

Score: 1.0 if all characters are printable ASCII, 0.0 if any non-printable characters found
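
A rough sketch of the check, with the caveat that the exact set of allowed characters (for example whether newlines and tabs count) is an assumption:

def ascii_printable_only(submission: str) -> float:
    # Illustrative sketch only. Assumes "printable ASCII" means codepoints
    # 32-126, with newlines, carriage returns, and tabs also tolerated for
    # multi-line output; the built-in may define the allowed set differently.
    allowed_whitespace = {"\n", "\r", "\t"}
    ok = all(c in allowed_whitespace or 32 <= ord(c) <= 126 for c in submission)
    return 1.0 if ok else 0.0

assert ascii_printable_only(" /\\_/\\ \n( o.o )\n > ^ < ") == 1.0   # ASCII art passes
assert ascii_printable_only("café") == 0.0                          # 'é' is not ASCII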

Rubric graders use an LLM to evaluate responses based on custom criteria.

graders:
  quality:
    kind: rubric                     # LLM-as-judge
    prompt_path: quality_rubric.txt  # Evaluation criteria
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # What to evaluate

Your rubric file should describe the evaluation criteria. Use placeholders:

  • {input}: The original input from the dataset
  • {submission}: The extracted agent response
  • {ground_truth}: Ground truth from dataset (if available)

Example quality_rubric.txt:

Evaluate the response for:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Is the answer thorough?
3. Clarity: Is it well-explained?
Input: {input}
Expected: {ground_truth}
Response: {submission}
Score from 0.0 to 1.0 where:
- 1.0: Perfect response
- 0.75: Good with minor issues
- 0.5: Acceptable but incomplete
- 0.25: Poor quality
- 0.0: Completely wrong
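
Conceptually, the placeholders behave like ordinary template substitutions. A sketch of how the judge prompt might be assembled (the framework performs the actual rendering; the values below are hypothetical):

# Hypothetical values shown for illustration; the framework fills these in
# from the dataset sample and the extractor output.
rubric_template = open("quality_rubric.txt").read()

judge_prompt = rubric_template.format(
    input="What is the capital of France?",         # dataset input
    submission="The capital of France is Paris.",   # extracted agent response
    ground_truth="Paris",                           # dataset ground truth, if any
)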

Instead of a file, you can include the prompt inline:

graders:
  quality:
    kind: rubric                 # LLM-as-judge
    prompt: |                    # Inline prompt instead of file
      Evaluate the creativity and originality of the response.
      Score 1.0 for highly creative, 0.0 for generic or unoriginal.
    model: gpt-4o-mini           # Judge model
    extractor: last_assistant    # What to evaluate

Additional configuration options for rubric graders:

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt      # Evaluation criteria
    model: gpt-4o-mini           # Judge model
    temperature: 0.0             # Deterministic (0.0-2.0)
    provider: openai             # LLM provider (default: openai)
    max_retries: 5               # API retry attempts
    timeout: 120.0               # Request timeout in seconds

Supported providers:

  • openai (default)

Models:

  • Any OpenAI-compatible model
  • Reasoning models (o1, o3) get special handling: temperature is automatically adjusted to 1.0

Rubric graders use JSON mode to get structured responses:

{
  "score": 0.85,
  "rationale": "The response is accurate and complete but could be more concise."
}

The score is validated to be between 0.0 and 1.0.
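
Parsing and range-checking such a response could look like the following sketch (illustrative only; the framework handles this internally):

import json

# Hypothetical raw judge output, shaped like the example above.
raw = '{"score": 0.85, "rationale": "Accurate and complete, but could be more concise."}'

result = json.loads(raw)
score = float(result["score"])
if not 0.0 <= score <= 1.0:
    raise ValueError(f"judge returned out-of-range score: {score}")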

You can combine multiple graders to evaluate multiple aspects in one suite:

graders:
  accuracy:                      # Tool grader for factual correctness
    kind: tool
    function: contains
    extractor: last_assistant
  completeness:                  # Rubric grader for thoroughness
    kind: rubric
    prompt_path: completeness_rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
  tool_usage:                    # Tool grader for tool call validation
    kind: tool
    function: exact_match
    extractor: tool_arguments    # Extract tool call args
    extractor_config:
      tool_name: search          # Which tool to check

Each grader can use a different extractor.

Every grader must specify an extractor to select what to grade:

graders:
  my_metric:
    kind: tool
    function: contains           # Grading function
    extractor: last_assistant    # What to extract and grade

Some extractors need additional configuration:

graders:
  tool_check:
    kind: tool
    function: contains           # Check if ground truth in tool args
    extractor: tool_arguments    # Extract tool call arguments
    extractor_config:            # Configuration for this extractor
      tool_name: search          # Which tool to extract from

See Extractors for all available extractors.

You can write custom grading functions. See Custom Graders for details.
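
As a purely hypothetical illustration of the idea (the required signature and registration mechanism are documented in Custom Graders and may differ), a format-checking grader might look like this:

import json

def valid_json_grader(submission: str) -> float:
    # Hypothetical custom grader: 1.0 if the submission parses as JSON,
    # 0.0 otherwise. See Custom Graders for the actual required interface.
    try:
        json.loads(submission)
        return 1.0
    except json.JSONDecodeError:
        return 0.0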

Use Case                     Recommended Grader
Exact answer matching        exact_match
Keyword checking             contains
Pattern validation           regex_match
Tool call validation         exact_match with tool_arguments extractor
Quality assessment           Rubric grader
Creativity evaluation        Rubric grader
Format checking              Custom tool grader
Multi-criteria evaluation    Multiple graders

All scores are between 0.0 and 1.0:

  • 1.0: Perfect - meets all criteria
  • 0.75-0.99: Good - minor issues
  • 0.5-0.74: Acceptable - notable gaps
  • 0.25-0.49: Poor - major problems
  • 0.0-0.24: Failed - did not meet criteria

Tool graders typically return binary scores (0.0 or 1.0), while rubric graders can return any value in the range.

If grading fails (e.g., network error, invalid format):

  • Score is set to 0.0
  • Rationale includes error message
  • Metadata includes error details

This ensures evaluations can continue even with individual failures.