Ranked by verified benchmark results
Checkpoint 2: Evaluating 30 challenging tasks from terminal-bench@2.0 (20 difficult tasks that failed + 10 complex tasks that succeeded).