Datasets

Datasets are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.

Typical workflow:

  1. Create a JSONL or CSV file with test cases
  2. Reference it in your suite YAML: dataset: test_cases.jsonl
  3. Run the evaluation; each sample is tested independently
  4. Results show per-sample and aggregate scores

Datasets can be created in two formats: JSONL or CSV. Choose based on your team’s workflow and complexity needs.

In the JSONL format, each line is a JSON object representing one test case:

{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}

Best for:

  • Complex data structures (nested objects, arrays)
  • Multi-turn conversations
  • Advanced features (agent_args, rubric_vars)
  • Teams comfortable with JSON/code
  • Version control (clean line-by-line diffs)
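
Because each line is a standalone JSON object, JSONL datasets are easy to inspect or post-process with a few lines of Python. A minimal sketch (the filename is illustrative):

import json

# Load a JSONL dataset: one JSON object per line, skipping blank lines.
with open("test_cases.jsonl") as f:
    samples = [json.loads(line) for line in f if line.strip()]
for sample in samples:
    print(sample["input"], "->", sample.get("ground_truth"))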

The CSV format is a standard CSV file with a header row:

input,ground_truth
"What's the capital of France?","Paris"
"Calculate 2+2","4"
"What color is the sky?","blue"

Best for:

  • Simple question-answer pairs
  • Teams that prefer spreadsheets (Excel, Google Sheets)
  • Non-technical collaborators creating test cases
  • Quick dataset creation and editing
  • Easy sharing with non-developers
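
One caveat: CSV cells are always plain strings, so if you read a CSV dataset yourself, JSON-encoded columns (such as agent_args, covered later on this page) need to be decoded. A rough sketch, assuming a test_cases.csv file:

import csv
import json

with open("test_cases.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Every cell comes back as a string; decode JSON-encoded columns if present.
        if row.get("agent_args"):
            row["agent_args"] = json.loads(row["agent_args"])
        print(row["input"], "->", row.get("ground_truth"))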

Each sample supports the following fields:

Field        | Required | Type             | Purpose
input        | Yes      | string or array  | Prompt(s) to send to the agent
ground_truth | No       | string           | Expected answer (for tool graders)
tags         | No       | array of strings | For filtering samples
agent_args   | No       | object           | Per-sample agent customization
rubric_vars  | No       | object           | Per-sample rubric variables
metadata     | No       | object           | Arbitrary extra data
id           | No       | integer          | Sample ID (auto-assigned if omitted)

input: the prompt(s) to send to the agent. Can be a string or an array of strings:

Single message:

{ "input": "Hello, who are you?" }

Multi-turn conversation:

{ "input": ["Hello", "What's your name?", "Tell me about yourself"] }

ground_truth: the expected answer or content to check against. Required for most tool graders (exact_match, contains, etc.):

{ "input": "What is 2+2?", "ground_truth": "4" }

metadata: arbitrary additional data about the sample:

{
  "input": "What is photosynthesis?",
  "ground_truth": "process where plants convert light into energy",
  "metadata": {
    "category": "biology",
    "difficulty": "medium"
  }
}

tags: a list of tags for filtering samples:

{ "input": "Solve x^2 = 16", "ground_truth": "4", "tags": ["math", "algebra"] }

Filter by tags in your suite:

sample_tags: [math] # Only samples tagged "math" will be evaluated

agent_args: custom arguments passed to programmatic agent creation when using agent_script, allowing per-sample agent customization.

JSONL:

{
  "input": "What items do we have?",
  "agent_args": {
    "item": { "sku": "SKU-123", "name": "Widget A", "price": 19.99 }
  }
}

CSV:

input,agent_args
"What items do we have?","{""item"": {""sku"": ""SKU-123"", ""name"": ""Widget A"", ""price"": 19.99}}"

Your agent factory function can access these values via sample.agent_args to customize agent configuration.
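
For illustration, a factory might use these values as in the sketch below. The MyAgent placeholder and the create_agent signature are assumptions, not the framework's actual API; only the sample.agent_args access reflects the description above.

from dataclasses import dataclass

@dataclass
class MyAgent:  # placeholder stand-in for your real agent class
    system_prompt: str

def create_agent(sample):
    # Per-sample customization: read the values injected via agent_args.
    item = (sample.agent_args or {}).get("item", {})
    prompt = (
        "You manage a small inventory. Current item: "
        f"{item.get('name', 'unknown')} (SKU {item.get('sku', 'n/a')}, "
        f"price {item.get('price', 'n/a')})."
    )
    return MyAgent(system_prompt=prompt)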

See Targets - agent_script for details on programmatic agent creation.

rubric_vars: variables to inject into rubric templates when using rubric graders. This lets you provide per-sample context or examples to the LLM judge.

Example: Evaluating code quality against a reference implementation.

JSONL:

{
  "input": "Write a function to calculate fibonacci numbers",
  "rubric_vars": {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case"
  }
}

CSV:

input,rubric_vars
"Write a function to calculate fibonacci numbers","{""reference_code"": ""def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)"", ""required_features"": ""recursion, base case""}"

In your rubric template file, reference variables with {variable_name}:

rubric.txt:

Evaluate the submitted code against this reference implementation:
{reference_code}
Required features: {required_features}
Score on correctness (0.6) and code quality (0.4).

When the rubric grader runs, variables are replaced with values from rubric_vars:

Final formatted prompt sent to LLM:

Evaluate the submitted code against this reference implementation:
def fib(n):
    if n <= 1: return n
    return fib(n-1) + fib(n-2)
Required features: recursion, base case
Score on correctness (0.6) and code quality (0.4).

This lets you customize evaluation criteria per sample using the same rubric template.
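
Conceptually, this substitution is plain template formatting. A rough Python equivalent (not the grader's actual implementation):

# Illustrative only: how rubric_vars might be substituted into the template.
rubric_template = open("rubric.txt").read()
rubric_vars = {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case",
}
final_prompt = rubric_template.format(**rubric_vars)
print(final_prompt)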

See Rubric Graders for details on rubric templates.

id: a sample ID is automatically assigned (0-based index) if not provided. You can override it:

{ "id": 42, "input": "Test case 42" }
{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"region": "Europe"}}
{"id": 2, "input": "Calculate the square root of 144", "ground_truth": "12", "tags": ["math", "medium"]}
{"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]}

Make ground truth specific enough to grade but flexible enough to match valid responses.

Include edge cases and variations:

{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
{"input": "What is 0.1 + 0.2?", "ground_truth": "0.3", "tags": ["math", "floating_point"]}
{"input": "What is 999999999 + 1?", "ground_truth": "1000000000", "tags": ["math", "large_numbers"]}

Organize samples by type, difficulty, or feature:

{"tags": ["tool_usage", "search"]}
{"tags": ["memory", "recall"]}
{"tags": ["reasoning", "multi_step"]}

Test conversational context and memory updates:

{"input": ["My name is Alice", "What's my name?"], "ground_truth": "Alice", "tags": ["memory", "recall"]}
{"input": ["Please remember that I like bananas.", "Actually, sorry, I meant I like apples."], "ground_truth": "apples", "tags": ["memory", "correction"]}
{"input": ["I work at Google", "Update my workplace to Microsoft", "Where do I work?"], "ground_truth": "Microsoft", "tags": ["memory", "multi_step"]}

If using rubric graders, ground truth is optional:

{"input": "Write a creative story about a robot", "tags": ["creative"]}
{"input": "Explain quantum computing simply", "tags": ["explanation"]}

The LLM judge evaluates based on the rubric, not ground truth.

Datasets are automatically loaded by the runner:

dataset: path/to/dataset.jsonl # Path to your test cases (JSONL or CSV)

Paths are relative to the suite YAML file location.

You can limit or filter which samples are evaluated:

max_samples: 10 # Only evaluate first 10 samples (useful for testing)
sample_tags: [math, medium] # Only samples with ALL these tags
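
To make the ALL semantics concrete: a sample is selected only when its tags list contains every entry in sample_tags. Roughly:

# Illustrative filter only -- the runner applies this logic for you.
samples = [
    {"input": "What is 2+2?", "tags": ["math", "easy"]},
    {"input": "Integrate x^2 dx", "tags": ["math", "medium"]},
]
requested = ["math", "medium"]
selected = [s for s in samples
            if all(tag in s.get("tags", []) for tag in requested)]
print(selected)  # keeps only the sample tagged both "math" and "medium"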

You can generate datasets with Python:

import json
samples = []
for i in range(100):
    samples.append({
        "input": f"What is {i} + {i}?",
        "ground_truth": str(i + i),
        "tags": ["math", "addition"]
    })
with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

The runner validates:

  • Each line is valid JSON
  • Required fields are present
  • Field types are correct

Validation errors will be reported with line numbers.
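
If you want to catch these problems before a run, a similar check is easy to script locally. The sketch below is a standalone helper, not the runner's validator:

import json

def check_jsonl(path):
    # Report malformed lines and missing or mistyped required fields, with line numbers.
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as err:
                print(f"line {lineno}: invalid JSON ({err})")
                continue
            if "input" not in sample:
                print(f"line {lineno}: missing required field 'input'")
            elif not isinstance(sample["input"], (str, list)):
                print(f"line {lineno}: 'input' must be a string or an array")

check_jsonl("dataset.jsonl")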

Some example datasets for common scenarios follow.

Simple question-answer pairs:

JSONL:

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "Who wrote Romeo and Juliet?", "ground_truth": "Shakespeare"}

CSV:

input,ground_truth
"What is the capital of France?","Paris"
"Who wrote Romeo and Juliet?","Shakespeare"

Tool usage:

JSONL:

{"input": "Search for information about pandas", "ground_truth": "search"}
{"input": "Calculate 15 * 23", "ground_truth": "calculator"}

CSV:

input,ground_truth
"Search for information about pandas","search"
"Calculate 15 * 23","calculator"

The ground truth is the expected tool name.

Multi-turn memory:

JSONL:

{"input": ["Remember that my favorite color is blue", "What's my favorite color?"], "ground_truth": "blue"}
{"input": ["I live in Tokyo", "Where do I live?"], "ground_truth": "Tokyo"}

CSV (using JSON array strings):

input,ground_truth
"[""Remember that my favorite color is blue"", ""What's my favorite color?""]","blue"
"[""I live in Tokyo"", ""Where do I live?""]","Tokyo"

Code generation:

JSONL:

{"input": "Write a function to reverse a string in Python"}
{"input": "Create a SQL query to find users older than 21"}

CSV:

input
"Write a function to reverse a string in Python"
"Create a SQL query to find users older than 21"

Use rubric graders to evaluate code quality.

CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells:

Multi-turn conversations (requires escaped JSON array string):

input,ground_truth
"[""Hello"", ""What's your name?""]","Alice"

Agent arguments (requires escaped JSON object string):

input,agent_args
"What items do we have?","{""initial_inventory"": [""apple"", ""banana""]}"

Rubric variables (requires escaped JSON object string):

input,rubric_vars
"Write a story","{""max_length"": 500, ""genre"": ""sci-fi""}"