# Datasets
Datasets are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.
Typical workflow:
- Create a JSONL or CSV file with test cases
- Reference it in your suite YAML: `dataset: test_cases.jsonl` (see the sketch below)
- Run the evaluation; each sample is tested independently
- Results show per-sample and aggregate scores
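For example, a minimal suite entry might look like this (a sketch: only the dataset-related keys documented on this page are shown, and any other keys your suite needs are omitted):

```yaml
# Minimal sketch -- only the dataset-related keys from this page are shown.
dataset: test_cases.jsonl    # path is relative to the suite YAML file
max_samples: 10              # optional: evaluate only the first 10 samples
sample_tags: [math]          # optional: keep only samples with ALL these tags
```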
Datasets can be created in two formats: JSONL or CSV. Choose based on your team’s workflow and complexity needs.
## Dataset Formats
### JSONL Format
Each line is a JSON object representing one test case:
{"input": "What's the capital of France?", "ground_truth": "Paris"}{"input": "Calculate 2+2", "ground_truth": "4"}{"input": "What color is the sky?", "ground_truth": "blue"}Best for:
- Complex data structures (nested objects, arrays)
- Multi-turn conversations
- Advanced features (agent_args, rubric_vars)
- Teams comfortable with JSON/code
- Version control (clean line-by-line diffs)
### CSV Format
Standard CSV with headers:
input,ground_truth"What's the capital of France?","Paris""Calculate 2+2","4""What color is the sky?","blue"Best for:
- Simple question-answer pairs
- Teams that prefer spreadsheets (Excel, Google Sheets)
- Non-technical collaborators creating test cases
- Quick dataset creation and editing
- Easy sharing with non-developers
## Quick Reference
| Field | Required | Type | Purpose |
|---|---|---|---|
| `input` | ✅ | string or array | Prompt(s) to send to the agent |
| `ground_truth` | ❌ | string | Expected answer (for tool graders) |
| `tags` | ❌ | array of strings | For filtering samples |
| `agent_args` | ❌ | object | Per-sample agent customization |
| `rubric_vars` | ❌ | object | Per-sample rubric variables |
| `metadata` | ❌ | object | Arbitrary extra data |
| `id` | ❌ | integer | Sample ID (auto-assigned if omitted) |
## Field Reference
### Required Fields
#### input

The prompt(s) to send to the agent. Can be a string or an array of strings.
Single message:
{ "input": "Hello, who are you?" }Multi-turn conversation:
{ "input": ["Hello", "What's your name?", "Tell me about yourself"] }Optional Fields
#### ground_truth
The expected answer or content to check against. Required for most tool graders (`exact_match`, `contains`, etc.):
{ "input": "What is 2+2?", "ground_truth": "4" }metadata
Arbitrary additional data about the sample:
{ "input": "What is photosynthesis?", "ground_truth": "process where plants convert light into energy", "metadata": { "category": "biology", "difficulty": "medium" }}List of tags for filtering samples:
{ "input": "Solve x^2 = 16", "ground_truth": "4", "tags": ["math", "algebra"] }Filter by tags in your suite:
```yaml
sample_tags: [math]  # Only samples tagged "math" will be evaluated
```

#### agent_args
Custom arguments passed to programmatic agent creation when using `agent_script`. This allows per-sample agent customization.
JSONL:
{ "input": "What items do we have?", "agent_args": { "item": { "sku": "SKU-123", "name": "Widget A", "price": 19.99 } }}CSV:
input,agent_args"What items do we have?","{""item"": {""sku"": ""SKU-123"", ""name"": ""Widget A"", ""price"": 19.99}}"Your agent factory function can access these values via sample.agent_args to customize agent configuration.
See Targets - agent_script for details on programmatic agent creation.
#### rubric_vars
Variables to inject into rubric templates when using rubric graders. This allows you to provide per-sample context or examples to the LLM judge.
Example: Evaluating code quality against a reference implementation.
JSONL:
{ "input": "Write a function to calculate fibonacci numbers", "rubric_vars": { "reference_code": "def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)", "required_features": "recursion, base case" }}CSV:
input,rubric_vars"Write a function to calculate fibonacci numbers","{""reference_code"": ""def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)"", ""required_features"": ""recursion, base case""}"In your rubric template file, reference variables with {variable_name}:
rubric.txt:
```text
Evaluate the submitted code against this reference implementation:
{reference_code}
Required features: {required_features}
Score on correctness (0.6) and code quality (0.4).
```

When the rubric grader runs, variables are replaced with values from `rubric_vars`:
Final formatted prompt sent to LLM:
```text
Evaluate the submitted code against this reference implementation:
def fib(n):
 if n <= 1: return n
 return fib(n-1) + fib(n-2)
Required features: recursion, base case
Score on correctness (0.6) and code quality (0.4).
```

This lets you customize evaluation criteria per sample using the same rubric template.
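Conceptually this is plain string templating. A minimal sketch of the idea (illustrative only, not the framework's actual implementation):

```python
# Minimal sketch of rubric variable substitution: Python's str.format fills
# the {variable_name} slots with values from the sample's rubric_vars.
rubric_template = (
    "Evaluate the submitted code against this reference implementation:\n"
    "{reference_code}\n"
    "Required features: {required_features}\n"
    "Score on correctness (0.6) and code quality (0.4)."
)

rubric_vars = {
    "reference_code": "def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case",
}

prompt = rubric_template.format(**rubric_vars)  # formatted prompt sent to the judge
print(prompt)
```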
See Rubric Graders for details on rubric templates.
#### id

The sample ID is automatically assigned (0-based index) if not provided. You can override it:
{ "id": 42, "input": "Test case 42" }Complete Example
```jsonl
{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"region": "Europe"}}
{"id": 2, "input": "Calculate the square root of 144", "ground_truth": "12", "tags": ["math", "medium"]}
{"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]}
```

## Dataset Best Practices
### 1. Clear Ground Truth
Make ground truth specific enough to grade but flexible enough to match valid responses:
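For example (an illustrative pair, not taken from a shipped dataset): with a `contains`-style grader, the first ground truth below matches any answer that mentions the author, while the second is so specific that only a verbatim reply would pass.

```jsonl
{"input": "Who wrote 1984?", "ground_truth": "George Orwell"}
{"input": "Who wrote 1984?", "ground_truth": "The dystopian novel 1984 was written by George Orwell and published in 1949."}
```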
### 2. Diverse Test Cases
Include edge cases and variations:
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}{"input": "What is 0.1 + 0.2?", "ground_truth": "0.3", "tags": ["math", "floating_point"]}{"input": "What is 999999999 + 1?", "ground_truth": "1000000000", "tags": ["math", "large_numbers"]}3. Use Tags for Organization
Organize samples by type, difficulty, or feature:
{"tags": ["tool_usage", "search"]}{"tags": ["memory", "recall"]}{"tags": ["reasoning", "multi_step"]}4. Multi-Turn Conversations
Test conversational context and memory updates:
{"input": ["My name is Alice", "What's my name?"], "ground_truth": "Alice", "tags": ["memory", "recall"]}{"input": ["Please remember that I like bananas.", "Actually, sorry, I meant I like apples."], "ground_truth": "apples", "tags": ["memory", "correction"]}{"input": ["I work at Google", "Update my workplace to Microsoft", "Where do I work?"], "ground_truth": "Microsoft", "tags": ["memory", "multi_step"]}5. No Ground Truth for LLM Judges
If using rubric graders, ground truth is optional:
{"input": "Write a creative story about a robot", "tags": ["creative"]}{"input": "Explain quantum computing simply", "tags": ["explanation"]}The LLM judge evaluates based on the rubric, not ground truth.
## Loading Datasets
Datasets are automatically loaded by the runner:
```yaml
dataset: path/to/dataset.jsonl  # Path to your test cases (JSONL or CSV)
```

Paths are relative to the suite YAML file location.
## Dataset Filtering
### Limit Sample Count
```yaml
max_samples: 10  # Only evaluate the first 10 samples (useful for testing)
```

### Filter by Tags
```yaml
sample_tags: [math, medium]  # Only samples with ALL these tags
```

## Creating Datasets Programmatically
You can generate datasets with Python:
```python
import json

samples = []
for i in range(100):
    samples.append({
        "input": f"What is {i} + {i}?",
        "ground_truth": str(i + i),
        "tags": ["math", "addition"]
    })

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

## Dataset Format Validation
The runner validates:
- Each line is valid JSON
- Required fields are present
- Field types are correct
Validation errors will be reported with line numbers.
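If you want to catch problems before the runner does, a small pre-check script along these lines can help (a sketch: the field names come from the table above, but the exact checks the runner performs may differ):

```python
import json

# Fields documented on this page; the runner's own validation may be stricter.
REQUIRED = {"input"}
OPTIONAL = {"ground_truth", "tags", "agent_args", "rubric_vars", "metadata", "id"}


def check_jsonl(path):
    """Return a list of human-readable problems found in a JSONL dataset."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            missing = REQUIRED - sample.keys()
            if missing:
                errors.append(f"line {lineno}: missing field(s) {sorted(missing)}")
            unknown = sample.keys() - REQUIRED - OPTIONAL
            if unknown:
                errors.append(f"line {lineno}: unknown field(s) {sorted(unknown)}")
            if "input" in sample and not isinstance(sample["input"], (str, list)):
                errors.append(f"line {lineno}: 'input' must be a string or array")
    return errors


for err in check_jsonl("dataset.jsonl"):
    print(err)
```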
## Examples by Use Case
### Question Answering
JSONL:
{"input": "What is the capital of France?", "ground_truth": "Paris"}{"input": "Who wrote Romeo and Juliet?", "ground_truth": "Shakespeare"}CSV:
input,ground_truth"What is the capital of France?","Paris""Who wrote Romeo and Juliet?","Shakespeare"Tool Usage Testing
JSONL:
{"input": "Search for information about pandas", "ground_truth": "search"}{"input": "Calculate 15 * 23", "ground_truth": "calculator"}CSV:
input,ground_truth"Search for information about pandas","search""Calculate 15 * 23","calculator"Ground truth = expected tool name.
### Memory Testing (Multi-turn)
JSONL:
{"input": ["Remember that my favorite color is blue", "What's my favorite color?"], "ground_truth": "blue"}{"input": ["I live in Tokyo", "Where do I live?"], "ground_truth": "Tokyo"}CSV (using JSON array strings):
input,ground_truth"[""Remember that my favorite color is blue"", ""What's my favorite color?""]","blue""[""I live in Tokyo"", ""Where do I live?""]","Tokyo"Code Generation
JSONL:
{"input": "Write a function to reverse a string in Python"}{"input": "Create a SQL query to find users older than 21"}CSV:
input"Write a function to reverse a string in Python""Create a SQL query to find users older than 21"Use rubric graders to evaluate code quality.
## CSV Advanced Features
CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells.
Multi-turn conversations (requires escaped JSON array string):
input,ground_truth"[""Hello"", ""What's your name?""]","Alice"Agent arguments (requires escaped JSON object string):
input,agent_args"What items do we have?","{""initial_inventory"": [""apple"", ""banana""]}"Rubric variables (requires escaped JSON object string):
input,rubric_vars"Write a story","{""max_length"": 500, ""genre"": ""sci-fi""}"Next Steps
## Next Steps

- Suite YAML Reference - Complete configuration options including filtering
- Graders - How to evaluate agent responses
- Multi-Turn Conversations - Testing conversational flows