Rubric Graders
Rubric graders use language models to evaluate submissions based on custom criteria. They’re ideal for subjective, nuanced evaluation.
Basic Configuration
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: quality_rubric.txt   # Evaluation criteria
    model: gpt-4o-mini                # Judge model
    temperature: 0.0                  # Deterministic
    extractor: last_assistant         # What to evaluate
```

Rubric Prompt Format
Your rubric file should describe the evaluation criteria. The following placeholders are available:
- {input}: The original input from the dataset
- {submission}: The extracted agent response
- {ground_truth}: Ground truth from the dataset (if available)
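For intuition, placeholder substitution is ordinary string templating: the values from one dataset row and one agent run are dropped into the rubric text before it is sent to the judge. A minimal sketch (illustrative only; `render_rubric` is not part of the tool's API):

```python
from pathlib import Path

def render_rubric(template_path: str, input_text: str, submission: str, ground_truth: str = "") -> str:
    """Fill the rubric template's placeholders with values from one eval example."""
    template = Path(template_path).read_text()
    return template.format(
        input=input_text,           # {input} from the dataset row
        submission=submission,      # {submission} extracted from the agent run
        ground_truth=ground_truth,  # {ground_truth} if the dataset provides one
    )

# Example: build the judge prompt for a single sample
prompt = render_rubric(
    "quality_rubric.txt",
    input_text="What is the capital of France?",
    submission="The capital of France is Paris.",
    ground_truth="Paris",
)
```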
Example quality_rubric.txt:
```text
Evaluate the response for:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Is the answer thorough?
3. Clarity: Is it well-explained?

Input: {input}
Expected: {ground_truth}
Response: {submission}

Score from 0.0 to 1.0 where:
- 1.0: Perfect response
- 0.75: Good with minor issues
- 0.5: Acceptable but incomplete
- 0.25: Poor quality
- 0.0: Completely wrong
```

Model Configuration
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini     # Judge model
    temperature: 0.0       # Deterministic
    provider: openai       # LLM provider
    max_retries: 5         # API retry attempts
    timeout: 120.0         # Request timeout
```
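Conceptually, `provider`, `max_retries`, and `timeout` just configure the underlying judge-model call. A rough sketch of where each value would plug in, assuming the OpenAI Python SDK (an illustration, not the grader's actual implementation):

```python
from openai import OpenAI

# provider: openai -> the OpenAI client; max_retries and timeout mirror the config above.
client = OpenAI(max_retries=5, timeout=120.0)

# In practice this would be the rubric file with its placeholders filled in.
rubric_prompt = "Evaluate the response for accuracy, completeness, and clarity..."

response = client.chat.completions.create(
    model="gpt-4o-mini",   # model: the judge model
    temperature=0.0,       # temperature: deterministic judging
    messages=[{"role": "user", "content": rubric_prompt}],
)
print(response.choices[0].message.content)  # the judge's evaluation and score
```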
Agent-as-Judge

Use a Letta agent as the judge instead of a direct LLM API call:
```yaml
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af      # Judge agent with submit_grade tool
    prompt_path: rubric.txt   # Evaluation criteria
    extractor: last_assistant
```

Requirements: The judge agent must have a tool with the signature submit_grade(score: float, rationale: str).
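Only the signature is prescribed; the tool body can be as simple as echoing the grade back. A minimal sketch of such a tool (the docstring and return value here are illustrative):

```python
def submit_grade(score: float, rationale: str) -> str:
    """Submit the final grade for the evaluated submission.

    Args:
        score: A value between 0.0 and 1.0, following the rubric's scale.
        rationale: A short explanation of why this score was assigned.
    """
    # The grader reads the score and rationale from this tool call.
    return f"Recorded grade {score:.2f}: {rationale}"
```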
Next Steps
- Tool Graders - Deterministic grading functions
- Multi-Metric - Combine multiple graders
- Custom Graders - Write your own grading logic