Datasets

Datasets are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.

Typical workflow:

  1. Create a JSONL or CSV file with test cases
  2. Reference it in your suite YAML: dataset: test_cases.jsonl
  3. Run the evaluation; each sample is tested independently
  4. Results show per-sample and aggregate scores

Datasets can be created in two formats: JSONL or CSV. Choose based on your team’s workflow and complexity needs.

In the JSONL format, each line is a JSON object representing one test case:

{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}

Best for:

  • Complex data structures (nested objects, arrays)
  • Multi-turn conversations
  • Advanced features (agent_args, rubric_vars)
  • Teams comfortable with JSON/code
  • Version control (clean line-by-line diffs)
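
Because each line is a standalone JSON object, JSONL datasets are easy to inspect or post-process with a few lines of Python. A minimal sketch (the filename is illustrative):

import json

# Load a JSONL dataset: one JSON object per line, skipping blank lines.
with open("test_cases.jsonl") as f:
    samples = [json.loads(line) for line in f if line.strip()]
for sample in samples:
    print(sample["input"], "->", sample.get("ground_truth"))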

The CSV format is a standard CSV file with a header row:

input,ground_truth
"What's the capital of France?","Paris"
"Calculate 2+2","4"
"What color is the sky?","blue"

Best for:

  • Simple question-answer pairs
  • Teams that prefer spreadsheets (Excel, Google Sheets)
  • Non-technical collaborators creating test cases
  • Quick dataset creation and editing
  • Easy sharing with non-developers
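
One caveat: CSV cells are always plain strings, so if you read a CSV dataset yourself, JSON-encoded columns (such as agent_args, covered later on this page) need to be decoded. A rough sketch, assuming a test_cases.csv file:

import csv
import json

with open("test_cases.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Every cell comes back as a string; decode JSON-encoded columns if present.
        if row.get("agent_args"):
            row["agent_args"] = json.loads(row["agent_args"])
        print(row["input"], "->", row.get("ground_truth"))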

Each sample supports the following fields:

Field        | Required | Type             | Purpose
input        | Yes      | string or array  | Prompt(s) to send to the agent
ground_truth | No       | string           | Expected answer (for tool graders)
tags         | No       | array of strings | For filtering samples
agent_args   | No       | object           | Per-sample agent customization
rubric_vars  | No       | object           | Per-sample rubric variables
metadata     | No       | object           | Arbitrary extra data
id           | No       | integer          | Sample ID (auto-assigned if omitted)

input: the prompt(s) to send to the agent. Can be a string or an array of strings:

Single message:

{ "input": "Hello, who are you?" }

Multi-turn conversation:

{ "input": ["Hello", "What's your name?", "Tell me about yourself"] }

ground_truth: the expected answer or content to check against. Required for most tool graders (exact_match, contains, etc.):

{ "input": "What is 2+2?", "ground_truth": "4" }

metadata: arbitrary additional data about the sample:

{
  "input": "What is photosynthesis?",
  "ground_truth": "process where plants convert light into energy",
  "metadata": {
    "category": "biology",
    "difficulty": "medium"
  }
}

tags: a list of tags for filtering samples:

{ "input": "Solve x^2 = 16", "ground_truth": "4", "tags": ["math", "algebra"] }

Filter by tags in your suite:

sample_tags: [math] # Only samples tagged "math" will be evaluated

agent_args: custom arguments passed to programmatic agent creation when using agent_script, allowing per-sample agent customization.

JSONL:

{
  "input": "What items do we have?",
  "agent_args": {
    "item": { "sku": "SKU-123", "name": "Widget A", "price": 19.99 }
  }
}

CSV:

input,agent_args
"What items do we have?","{""item"": {""sku"": ""SKU-123"", ""name"": ""Widget A"", ""price"": 19.99}}"

Your agent factory function can access these values via sample.agent_args to customize agent configuration.
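
For illustration, a factory might use these values as in the sketch below. The MyAgent placeholder and the create_agent signature are assumptions, not the framework's actual API; only the sample.agent_args access reflects the description above.

from dataclasses import dataclass

@dataclass
class MyAgent:  # placeholder stand-in for your real agent class
    system_prompt: str

def create_agent(sample):
    # Per-sample customization: read the values injected via agent_args.
    item = (sample.agent_args or {}).get("item", {})
    prompt = (
        "You manage a small inventory. Current item: "
        f"{item.get('name', 'unknown')} (SKU {item.get('sku', 'n/a')}, "
        f"price {item.get('price', 'n/a')})."
    )
    return MyAgent(system_prompt=prompt)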

See Targets - agent_script for details on programmatic agent creation.

rubric_vars: variables to inject into rubric templates when using rubric graders. This lets you provide per-sample context or examples to the LLM judge.

Example: Evaluating code quality against a reference implementation.

JSONL:

{
  "input": "Write a function to calculate fibonacci numbers",
  "rubric_vars": {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case"
  }
}

CSV:

input,rubric_vars
"Write a function to calculate fibonacci numbers","{""reference_code"": ""def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)"", ""required_features"": ""recursion, base case""}"

In your rubric template file, reference variables with {variable_name}:

rubric.txt:

Evaluate the submitted code against this reference implementation:
{reference_code}
Required features: {required_features}
Score on correctness (0.6) and code quality (0.4).

When the rubric grader runs, variables are replaced with values from rubric_vars:

Final formatted prompt sent to LLM:

Evaluate the submitted code against this reference implementation:
def fib(n):
    if n <= 1: return n
    return fib(n-1) + fib(n-2)
Required features: recursion, base case
Score on correctness (0.6) and code quality (0.4).

This lets you customize evaluation criteria per sample using the same rubric template.
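
Conceptually, this substitution is plain template formatting. A rough Python equivalent (not the grader's actual implementation):

# Illustrative only: how rubric_vars might be substituted into the template.
rubric_template = open("rubric.txt").read()
rubric_vars = {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case",
}
final_prompt = rubric_template.format(**rubric_vars)
print(final_prompt)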

See Rubric Graders for details on rubric templates.

id: a sample ID is automatically assigned (0-based index) if not provided. You can override it:

{ "id": 42, "input": "Test case 42" }
{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"region": "Europe"}}
{"id": 2, "input": "Calculate the square root of 144", "ground_truth": "12", "tags": ["math", "medium"]}
{"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]}

Make ground truth specific enough to grade but flexible enough to match valid responses.

Include edge cases and variations:

{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
{"input": "What is 0.1 + 0.2?", "ground_truth": "0.3", "tags": ["math", "floating_point"]}
{"input": "What is 999999999 + 1?", "ground_truth": "1000000000", "tags": ["math", "large_numbers"]}

Organize samples by type, difficulty, or feature:

{"tags": ["tool_usage", "search"]}
{"tags": ["memory", "recall"]}
{"tags": ["reasoning", "multi_step"]}

Test conversational context and memory updates:

{"input": ["My name is Alice", "What's my name?"], "ground_truth": "Alice", "tags": ["memory", "recall"]}
{"input": ["Please remember that I like bananas.", "Actually, sorry, I meant I like apples."], "ground_truth": "apples", "tags": ["memory", "correction"]}
{"input": ["I work at Google", "Update my workplace to Microsoft", "Where do I work?"], "ground_truth": "Microsoft", "tags": ["memory", "multi_step"]}

If using rubric graders, ground truth is optional:

{"input": "Write a creative story about a robot", "tags": ["creative"]}
{"input": "Explain quantum computing simply", "tags": ["explanation"]}

The LLM judge evaluates based on the rubric, not ground truth.

Datasets are automatically loaded by the runner:

dataset: path/to/dataset.jsonl # Path to your test cases (JSONL or CSV)

Paths are relative to the suite YAML file location.

You can limit or filter which samples are evaluated:

max_samples: 10 # Only evaluate first 10 samples (useful for testing)
sample_tags: [math, medium] # Only samples with ALL these tags
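
To make the ALL semantics concrete: a sample is selected only when its tags list contains every entry in sample_tags. Roughly:

# Illustrative filter only -- the runner applies this logic for you.
samples = [
    {"input": "What is 2+2?", "tags": ["math", "easy"]},
    {"input": "Integrate x^2 dx", "tags": ["math", "medium"]},
]
requested = ["math", "medium"]
selected = [s for s in samples
            if all(tag in s.get("tags", []) for tag in requested)]
print(selected)  # keeps only the sample tagged both "math" and "medium"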

You can generate datasets with Python:

import json
samples = []
for i in range(100):
    samples.append({
        "input": f"What is {i} + {i}?",
        "ground_truth": str(i + i),
        "tags": ["math", "addition"]
    })
with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

The runner validates:

  • Each line is valid JSON
  • Required fields are present
  • Field types are correct

Validation errors will be reported with line numbers.
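
If you want to catch these problems before a run, a similar check is easy to script locally. The sketch below is a standalone helper, not the runner's validator:

import json

def check_jsonl(path):
    # Report malformed lines and missing or mistyped required fields, with line numbers.
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as err:
                print(f"line {lineno}: invalid JSON ({err})")
                continue
            if "input" not in sample:
                print(f"line {lineno}: missing required field 'input'")
            elif not isinstance(sample["input"], (str, list)):
                print(f"line {lineno}: 'input' must be a string or an array")

check_jsonl("dataset.jsonl")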

Some example datasets for common scenarios follow.

Simple question-answer pairs:

JSONL:

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "Who wrote Romeo and Juliet?", "ground_truth": "Shakespeare"}

CSV:

input,ground_truth
"What is the capital of France?","Paris"
"Who wrote Romeo and Juliet?","Shakespeare"

Tool usage:

JSONL:

{"input": "Search for information about pandas", "ground_truth": "search"}
{"input": "Calculate 15 * 23", "ground_truth": "calculator"}

CSV:

input,ground_truth
"Search for information about pandas","search"
"Calculate 15 * 23","calculator"

The ground truth is the expected tool name.

Multi-turn memory:

JSONL:

{"input": ["Remember that my favorite color is blue", "What's my favorite color?"], "ground_truth": "blue"}
{"input": ["I live in Tokyo", "Where do I live?"], "ground_truth": "Tokyo"}

CSV (using JSON array strings):

input,ground_truth
"[""Remember that my favorite color is blue"", ""What's my favorite color?""]","blue"
"[""I live in Tokyo"", ""Where do I live?""]","Tokyo"

Code generation:

JSONL:

{"input": "Write a function to reverse a string in Python"}
{"input": "Create a SQL query to find users older than 21"}

CSV:

input
"Write a function to reverse a string in Python"
"Create a SQL query to find users older than 21"

Use rubric graders to evaluate code quality.

CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells:

Multi-turn conversations (requires escaped JSON array string):

input,ground_truth
"[""Hello"", ""What's your name?""]","Alice"

Agent arguments (requires escaped JSON object string):

input,agent_args
"What items do we have?","{""initial_inventory"": [""apple"", ""banana""]}"

Rubric variables (requires escaped JSON object string):

input,rubric_vars
"Write a story","{""max_length"": 500, ""genre"": ""sci-fi""}"