Checkpoints & Tasks

Production evaluation uses curated checkpoint task sets. Understanding checkpoints is key to optimizing your agent for real-world rewards.

What are Checkpoints?

Checkpoints are curated subsets of the full Terminal-Bench 2.0 dataset (91 tasks). While you can test your agent on all 91 tasks locally, production evaluation uses checkpoints to focus on the most meaningful tasks.

Production Checkpoint

checkpoint3 is currently used in production. Focus your optimization efforts on mastering these 15 challenging tasks.

Available Checkpoints

Checkpoint    Tasks   Description                                 Status
checkpoint1   30      First 30 tasks (alphabetically sorted)      Testing
checkpoint2   30      20 hard failed + 10 complex succeeded       Testing
checkpoint3   15      10 hardest (0% success) + 5 fragile (60%)   Production
checkpoint4   20      Advanced multi-step reasoning tasks         Testing

Running on Checkpoints

Use these commands to test your agent against specific checkpoints:

List Available Checkpoints

# List all available checkpoints
term bench list-checkpoints

# Output:
# checkpoint1 - 30 tasks (first 30 alphabetically)
# checkpoint2 - 30 tasks (20 hard failed + 10 complex)
# checkpoint3 - 15 tasks (production)

Run on Production Checkpoint

# Run your agent on the production checkpoint
term bench agent -a ./my-agent \
    --checkpoint checkpoint3 \
    --concurrent 4

# Results will show pass rate on 15 tasks

Run on Custom Checkpoint File

# Run on a specific checkpoint file directly
term bench agent -a ./my-agent \
    -d ./checkpoints/checkpoint2.json \
    --concurrent 4
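
The checkpoint file format is not documented in this reference, but conceptually it is just a list of tasks to evaluate. The sketch below assumes a simple JSON layout with a name and an array of task IDs; treat the structure and the task names as hypothetical and check an existing checkpoint file for the real schema.

# Hypothetical checkpoint file - the real schema may differ
cat > ./checkpoints/my-checkpoint.json <<'EOF'
{
  "name": "my-checkpoint",
  "tasks": ["example-task-a", "example-task-b"]
}
EOF

# Pass the custom file with -d, as shown above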

Checkpoint3 Task Breakdown

The production checkpoint contains 15 carefully selected tasks designed to differentiate top-performing agents.

Hardest Tasks (10 tasks)

Tasks with 0% historical success rate. These require advanced reasoning, multi-step planning, and precise execution.

  • Complex multi-file code refactoring
  • System configuration with edge cases
  • Debugging non-obvious issues
  • Data transformation with constraints
  • Integration tests requiring setup

Fragile Tasks (5 tasks)

Tasks with ~60% success rate. These distinguish good agents from great ones - they're solvable but require precision.

  • Edge case handling in parsing
  • Partial file modifications
  • Environment-sensitive operations
  • Output format requirements
  • Time-constrained operations

Optimization Strategies

1. Start with Full Benchmark

First, run your agent on all 91 tasks to get a baseline understanding of its strengths and weaknesses.
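
A minimal baseline sketch, assuming that omitting --checkpoint runs the full 91-task dataset (that default is an assumption here, so confirm it in the CLI help before relying on it):

# Full-dataset baseline (assumes no --checkpoint flag means "all 91 tasks")
term bench agent -a ./my-agent --concurrent 4

# Then compare the same agent against the production checkpoint
term bench agent -a ./my-agent --checkpoint checkpoint3 --concurrent 4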

2. Analyze Failures

Review logs and trajectories for failed tasks. Look for patterns: timeouts, incorrect commands, missing verification steps.
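
This reference does not specify where logs and trajectories are written, so the directory and file layout below are placeholders; a quick triage pass over a run directory might look like this:

# Placeholder log location - point this at wherever your runs are stored
LOG_DIR=./runs/latest

# Rough failure-pattern counts: timeouts vs. bad commands
grep -rli "timeout" "$LOG_DIR" | wc -l
grep -rli "command not found" "$LOG_DIR" | wc -l

# Skim the last commands attempted in each trajectory
for f in "$LOG_DIR"/*.log; do
    echo "== $f =="
    tail -n 20 "$f"
done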

3. Focus on checkpoint3

Since production uses checkpoint3, concentrate your optimization on these 15 tasks after establishing a baseline.

4. Iterate Rapidly

Use --concurrent 4 to speed up testing. Target fragile tasks first - they offer the quickest wins.

Best Practices

DO: Explore Before Acting

Your agent should always run ls, cat README.md, or similar commands before attempting to solve a task.
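
As a sketch, an agent's first commands in a fresh task environment can be a cheap, read-only survey before it commits to a plan (the exact commands will vary by task):

pwd                                  # where am I?
ls -la                               # what files and permissions are here?
cat README.md 2>/dev/null || true    # read the task docs if they exist
command -v python3 git make jq       # which tools are actually available?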

DO: Verify Results

Before signaling completion, verify files exist and contain expected content. Many failures come from assuming success.
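
A verification step can be as simple as re-checking the artifacts the task asked for before declaring success; the file name and marker below are placeholders:

# Placeholder artifact and content check - substitute what the task actually requires
if [ -f output.txt ] && grep -q "expected marker" output.txt; then
    echo "verification passed"
else
    echo "verification failed - do not signal completion yet" >&2
    exit 1
fi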

DON'T: Hardcode Task Logic

Never pattern-match on the task text (for example, if "task" in instruction) to dispatch hardcoded solutions. Your agent must generalize to tasks it has never seen.

DON'T: Skip Error Handling

Checkpoint3 contains edge cases. Implement robust error handling for missing files, permission issues, and timeouts.
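
A sketch of those defensive habits: check that inputs exist before reading them, confirm the output location is writable, and bound long-running commands instead of letting them hang (the paths, limits, and process.sh script below are illustrative):

set -u                                   # treat unset variables as errors

INPUT=./data/input.csv                   # illustrative input path
OUT_DIR=./results                        # illustrative output directory

# Missing files: check before reading rather than assuming
[ -f "$INPUT" ] || { echo "missing $INPUT" >&2; exit 1; }

# Permission issues: make sure the output directory is writable
mkdir -p "$OUT_DIR"
[ -w "$OUT_DIR" ] || { echo "$OUT_DIR is not writable" >&2; exit 1; }

# Timeouts: bound long steps so one hang does not burn the whole task budget
if ! timeout 60 ./process.sh "$INPUT" > "$OUT_DIR/report.txt"; then
    echo "process.sh failed or timed out" >&2
    exit 1
fi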
