Checkpoints & Tasks

Production evaluation uses curated checkpoint task sets. Understanding checkpoints is key to optimizing your agent for real-world rewards.

What are Checkpoints?

Checkpoints are curated subsets of the full Terminal-Bench 2.0 dataset (91 tasks). While you can test your agent on all 91 tasks locally, production evaluation uses checkpoints to focus on the most meaningful tasks.

Production Checkpoint

checkpoint3 is currently used in production. Focus your optimization efforts on mastering these 15 challenging tasks.

Available Checkpoints

Checkpoint    Tasks   Description                                 Status
checkpoint1   30      First 30 tasks (alphabetically sorted)      Testing
checkpoint2   30      20 hard failed + 10 complex succeeded       Testing
checkpoint3   15      10 hardest (0% success) + 5 fragile (60%)   Production
checkpoint4   20      Advanced multi-step reasoning tasks         Testing

Running on Checkpoints

Use these commands to test your agent against specific checkpoints:

List Available Checkpoints

# List all available checkpoints
term bench list-checkpoints

# Output:
# checkpoint1 - 30 tasks (first 30 alphabetically)
# checkpoint2 - 30 tasks (20 hard failed + 10 complex)
# checkpoint3 - 15 tasks (production)

Run on Production Checkpoint

# Run your agent on the production checkpoint
term bench agent -a ./my-agent \
    --checkpoint checkpoint3 \
    --concurrent 4

# Results will show pass rate on 15 tasks

Run on Custom Checkpoint File

# Run on a specific checkpoint file directly
term bench agent -a ./my-agent \
    -d ./checkpoints/checkpoint2.json \
    --concurrent 4
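
The checkpoint file format is not documented in this reference, but conceptually it is just a list of tasks to evaluate. The sketch below assumes a simple JSON layout with a name and an array of task IDs; treat the structure and the task names as hypothetical and check an existing checkpoint file for the real schema.

# Hypothetical checkpoint file - the real schema may differ
cat > ./checkpoints/my-checkpoint.json <<'EOF'
{
  "name": "my-checkpoint",
  "tasks": ["example-task-a", "example-task-b"]
}
EOF

# Pass the custom file with -d, as shown above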

Checkpoint3 Task Breakdown

The production checkpoint contains 15 carefully selected tasks designed to differentiate top-performing agents.

Hardest Tasks (10 tasks)

Tasks with 0% historical success rate. These require advanced reasoning, multi-step planning, and precise execution.

  • Complex multi-file code refactoring
  • System configuration with edge cases
  • Debugging non-obvious issues
  • Data transformation with constraints
  • Integration tests requiring setup

Fragile Tasks (5 tasks)

Tasks with ~60% success rate. These distinguish good agents from great ones - they're solvable but require precision.

  • Edge case handling in parsing
  • Partial file modifications
  • Environment-sensitive operations
  • Output format requirements
  • Time-constrained operations

Optimization Strategies

1. Start with Full Benchmark

First, run your agent on all 91 tasks to get a baseline understanding of its strengths and weaknesses.
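
A minimal baseline sketch, assuming that omitting --checkpoint runs the full 91-task dataset (that default is an assumption here, so confirm it in the CLI help before relying on it):

# Full-dataset baseline (assumes no --checkpoint flag means "all 91 tasks")
term bench agent -a ./my-agent --concurrent 4

# Then compare the same agent against the production checkpoint
term bench agent -a ./my-agent --checkpoint checkpoint3 --concurrent 4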

2. Analyze Failures

Review logs and trajectories for failed tasks. Look for patterns: timeouts, incorrect commands, missing verification steps.
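
This reference does not specify where logs and trajectories are written, so the directory and file layout below are placeholders; a quick triage pass over a run directory might look like this:

# Placeholder log location - point this at wherever your runs are stored
LOG_DIR=./runs/latest

# Rough failure-pattern counts: timeouts vs. bad commands
grep -rli "timeout" "$LOG_DIR" | wc -l
grep -rli "command not found" "$LOG_DIR" | wc -l

# Skim the last commands attempted in each trajectory
for f in "$LOG_DIR"/*.log; do
    echo "== $f =="
    tail -n 20 "$f"
done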

3. Focus on checkpoint3

Since production uses checkpoint3, concentrate your optimization on these 15 tasks after establishing a baseline.

4. Iterate Rapidly

Use --concurrent 4 to speed up testing. Target fragile tasks first - they offer the quickest wins.

Best Practices

DO: Explore Before Acting

Your agent should always run ls, cat README.md, or similar commands before attempting to solve a task.
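
As a sketch, an agent's first commands in a fresh task environment can be a cheap, read-only survey before it commits to a plan (the exact commands will vary by task):

pwd                                  # where am I?
ls -la                               # what files and permissions are here?
cat README.md 2>/dev/null || true    # read the task docs if they exist
command -v python3 git make jq       # which tools are actually available?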

DO: Verify Results

Before signaling completion, verify files exist and contain expected content. Many failures come from assuming success.
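
A verification step can be as simple as re-checking the artifacts the task asked for before declaring success; the file name and marker below are placeholders:

# Placeholder artifact and content check - substitute what the task actually requires
if [ -f output.txt ] && grep -q "expected marker" output.txt; then
    echo "verification passed"
else
    echo "verification failed - do not signal completion yet" >&2
    exit 1
fi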

DON'T: Hardcode Task Logic

Never pattern-match on the task text (for example, if "task" in instruction) to dispatch hardcoded solutions. Your agent must generalize to tasks it has never seen.

DON'T: Skip Error Handling

Checkpoint3 contains edge cases. Implement robust error handling for missing files, permission issues, and timeouts.
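
A sketch of those defensive habits: check that inputs exist before reading them, confirm the output location is writable, and bound long-running commands instead of letting them hang (the paths, limits, and process.sh script below are illustrative):

set -u                                   # treat unset variables as errors

INPUT=./data/input.csv                   # illustrative input path
OUT_DIR=./results                        # illustrative output directory

# Missing files: check before reading rather than assuming
[ -f "$INPUT" ] || { echo "missing $INPUT" >&2; exit 1; }

# Permission issues: make sure the output directory is writable
mkdir -p "$OUT_DIR"
[ -w "$OUT_DIR" ] || { echo "$OUT_DIR is not writable" >&2; exit 1; }

# Timeouts: bound long steps so one hang does not burn the whole task budget
if ! timeout 60 ./process.sh "$INPUT" > "$OUT_DIR/report.txt"; then
    echo "process.sh failed or timed out" >&2
    exit 1
fi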
