Tool Graders
Tool graders use Python functions to programmatically evaluate submissions. They’re ideal for deterministic, rule-based evaluation.
Overview
Tool graders:
- Execute Python functions that take (sample, submission) and return a GradeResult
- Are fast and deterministic
- Don’t require external API calls
- Can implement any custom logic
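As a rough illustration of the function shape described above, the sketch below defines a grader that takes (sample, submission) and returns a GradeResult. The Sample and GradeResult classes here are hypothetical stand-ins so the example runs on its own; their field names (input, ground_truth, score, rationale) are assumptions, and the grader name starts_with_answer is invented for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-ins so the sketch is self-contained; the real
# framework supplies its own sample and GradeResult types.
@dataclass
class Sample:
    input: str
    ground_truth: str


@dataclass
class GradeResult:
    score: float
    rationale: str = ""


def starts_with_answer(sample: Sample, submission: str) -> GradeResult:
    """Score 1.0 if the trimmed submission begins with the expected answer."""
    ok = submission.strip().startswith(sample.ground_truth)
    return GradeResult(
        score=1.0 if ok else 0.0,
        rationale="prefix match" if ok else "prefix mismatch",
    )


result = starts_with_answer(Sample(input="2+2?", ground_truth="4"), " 4 is the answer")
print(result.score)  # 1.0
```

The function makes no network calls and depends only on its two arguments, so the same inputs always produce the same score.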
Configuration
    graders:
      my_metric:
        kind: tool
        function: exact_match        # Function name
        extractor: last_assistant    # What to extract from trajectory

Built-in Functions
exact_match
Checks if the submission exactly matches the ground truth (case-sensitive, whitespace-trimmed).

    graders:
      accuracy:
        kind: tool
        function: exact_match
        extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if exact match, 0.0 otherwise
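The comparison that exact_match describes can be approximated in a few lines of Python. This is a sketch of the described behavior, not the built-in's actual source; the helper name exact_match_score is hypothetical, and trimming both the submission and the ground truth is an assumption.

```python
def exact_match_score(submission: str, ground_truth: str) -> float:
    # Assumption: surrounding whitespace is trimmed on both sides,
    # then the strings are compared case-sensitively.
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0


print(exact_match_score("  Paris\n", "Paris"))  # 1.0
print(exact_match_score("paris", "Paris"))      # 0.0 (case differs)
```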
contains
Checks if the submission contains the ground truth (case-insensitive).

    graders:
      contains_answer:
        kind: tool
        function: contains
        extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if found, 0.0 otherwise
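A case-insensitive substring check of this kind might look like the following. The helper name contains_score is hypothetical, and lowercasing both strings before comparing is one assumed way to get case-insensitivity; the built-in may normalize differently.

```python
def contains_score(submission: str, ground_truth: str) -> float:
    # Assumption: case-insensitivity is achieved by lowercasing both strings.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0


print(contains_score("The capital is PARIS.", "Paris"))  # 1.0
print(contains_score("The capital is Lyon.", "Paris"))   # 0.0
```

Compared with exact_match, this tolerates extra text around the answer, which suits free-form responses where the answer is embedded in a sentence.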
regex_match
Checks if the submission matches a regex pattern given in the ground truth.

    graders:
      pattern:
        kind: tool
        function: regex_match
        extractor: last_assistant

Score: 1.0 if pattern matches, 0.0 otherwise
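The sketch below shows one plausible reading of this check, using Python's standard re module. The helper name regex_match_score is hypothetical. Whether the built-in searches anywhere in the text (re.search), anchors at the start (re.match), or requires a full match (re.fullmatch) is not stated here, so the use of re.search is an assumption.

```python
import re


def regex_match_score(submission: str, pattern: str) -> float:
    # Assumption: the pattern may match anywhere in the submission.
    # Swap in re.fullmatch to require the whole string to match.
    return 1.0 if re.search(pattern, submission) else 0.0


print(regex_match_score("Order #12345 shipped", r"#\d{5}"))  # 1.0
print(regex_match_score("no digits here", r"^\d+$"))         # 0.0
```

Anchors such as ^ and $ in the pattern itself make the intended strictness explicit regardless of which matching function is used.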
ascii_printable_only
Validates that all characters are printable ASCII.

    graders:
      ascii_check:
        kind: tool
        function: ascii_printable_only
        extractor: last_assistant

Score: 1.0 if all characters are printable ASCII, 0.0 otherwise
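A printable-ASCII check can be sketched as a range test on code points. The helper name ascii_printable_only_score is hypothetical. The sketch treats printable ASCII as 0x20 (space) through 0x7E (tilde), so newlines and tabs fail; whether the built-in allows those whitespace characters is not stated here and is an assumption.

```python
def ascii_printable_only_score(submission: str) -> float:
    # Printable ASCII range: 0x20 (space) through 0x7E (~).
    # Assumption: control characters such as "\n" and "\t" count as non-printable.
    return 1.0 if all(0x20 <= ord(ch) <= 0x7E for ch in submission) else 0.0


print(ascii_printable_only_score("Hello, world!"))  # 1.0
print(ascii_printable_only_score("café"))           # 0.0 (non-ASCII é)
print(ascii_printable_only_score("line1\nline2"))   # 0.0 under this assumption
```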
Next Steps
- Rubric Graders - LLM-as-judge evaluation
- Custom Graders - Write your own grading functions
- Multi-Metric - Combine multiple graders