Multi-Metric Evaluation
Evaluate multiple aspects of agent performance simultaneously in a single evaluation suite.
Why Multiple Metrics?
Agents are complex systems. You might want to evaluate:
- Correctness: Does the answer match the expected output?
- Quality: Is the explanation clear and complete?
- Tool usage: Does the agent call the right tools with correct arguments?
- Memory: Does the agent correctly update its memory blocks?
- Format: Does the output follow required formatting rules?
Configuration
```yaml
graders:
  accuracy:                 # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant

  completeness:             # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

  tool_usage:               # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
```
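Each key under `graders:` (`accuracy`, `completeness`, `tool_usage`) defines an independent metric, and the key name is what the gate references via `metric_key`. In this example, the `extractor` selects what gets scored: `last_assistant` points a grader at the final assistant message, while `tool_arguments` with `extractor_config.tool_name: search` scores the arguments the agent passed to the `search` tool.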
Gating on One Metric
The gate can check any of these metrics:
```yaml
gate:
  metric_key: accuracy   # Gate on accuracy (others still computed)
  op: gte
  value: 0.9
```
Results will include scores for all graders, even if you only gate on one.
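The gate is not limited to deterministic metrics. As a sketch, assuming the completeness rubric reports scores on a 0 to 1 scale, the same suite could gate on the LLM-judged score instead:

```yaml
gate:
  metric_key: completeness   # Gate on the rubric score; accuracy and tool_usage are still reported
  op: gte
  value: 0.8
```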
Next Steps
- Tool Graders - Deterministic evaluation
- Rubric Graders - LLM-as-judge evaluation
- Gates - Setting pass/fail criteria