Backboard R-CLI Is Now #1 on Terminal-Bench 2.1

Today, Backboard R-CLI achieved the top published result on Terminal-Bench 2.1.

Using Claude Opus 4.8 through AWS Bedrock, Backboard R-CLI solved 75 of 89 tasks for an overall accuracy of 84.3%.


That places it ahead of the published Terminal-Bench 2.1 results for Codex CLI on GPT-5.5 at 83.4% and Claude Code on Claude 5 Fable at 83.1%.

More importantly, we have published the complete results, including task-level verifier reports, run configuration, aggregate scoring output, and system logs.

The benchmark can be inspected rather than taken on faith.

The result

Terminal-Bench evaluates AI agents on real terminal work inside isolated environments. These are not multiple-choice questions or code-completion tests.

The agent has to actually complete end-to-end tasks such as building and compiling projects, debugging broken code, configuring services, recovering corrupted data, reverse-engineering binaries, and training models. Each task is scored objectively by an independent verifier. A task passes only when the resulting environment fully satisfies the evaluator. There is no partial credit.
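
The all-or-nothing scoring described above can be sketched in a few lines. This illustrates the arithmetic only and is not the Terminal-Bench harness; the function name and the reward convention (1.0 for a pass, anything lower for a failure) are assumptions made for the example.

```python
def accuracy(rewards: list[float]) -> float:
    """Fraction of tasks with a full pass (assumed here to be a reward of 1.0)."""
    if not rewards:
        raise ValueError("no tasks to score")
    # No partial credit: a reward below 1.0 counts as a failure.
    passed = sum(1 for r in rewards if r >= 1.0)
    return passed / len(rewards)


# 75 passes out of 89 tasks, as reported in this post.
rewards = [1.0] * 75 + [0.0] * 14
print(f"{accuracy(rewards):.1%}")  # 84.3%
```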

Backboard R-CLI placed first among the published Terminal-Bench 2.1 results:

| Rank | Agent | Model | Accuracy |
|---|---|---|---|
| 1 | Backboard R-CLI | Claude Opus 4.8 | 84.3% |
| 2 | Codex CLI | GPT-5.5 | 83.4% |
| 3 | Claude Code | Claude 5 Fable | 83.1% |
| 4 | Terminus 2 | Claude 5 Fable | 80.4% |
| 5 | Claude Code | Claude Opus 4.8 | 78.9% |


The result also shows that model choice is only part of the equation.

Claude Opus 4.8 was already represented on the leaderboard, but Backboard R-CLI outperformed the next-best published Opus 4.8 result by 5.4 percentage points.

That difference is not explained by the model alone. It comes from the agent system around it.

Why we are publishing everything

Benchmarks are becoming central to how AI agents are evaluated. They influence product decisions, developer adoption, investment, and increasingly, enterprise buying decisions.

But a benchmark score without the underlying evidence is difficult to evaluate.

Did the agent receive extra retries?
Did it use a different timeout?
Was the environment identical?
What happened when it failed?
How much prompting, context, or manual intervention was involved?

A leaderboard number should be the beginning of scrutiny, not the end of the conversation.

That is why we open-sourced the full Backboard R-CLI Terminal-Bench 2.1 run.

The repository includes:

  • Aggregate results across all 89 tasks

  • Per-task run configurations

  • Final verifier rewards

  • Structured verifier reports

  • Verifier test output

  • Agent and verifier metadata

  • Task-level passing and failing outcomes

  • Artifacts used to validate the final state

Anyone can inspect the evidence for each result.
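
As a starting point for that kind of inspection, the aggregate score can be recounted from the per-task outputs. The sketch below assumes a hypothetical layout: one directory per task, each holding a `result.json` with a numeric `reward` field. The published repository may be organized differently, so treat the file name, key, and pass threshold as placeholders to adjust.

```python
import json
from pathlib import Path


def tally(results_dir: str, filename: str = "result.json", key: str = "reward"):
    """Recount passes from per-task files.

    `filename` and `key` are assumed names, not the repository's confirmed schema.
    """
    passed, failed = [], []
    for path in sorted(Path(results_dir).glob(f"*/{filename}")):
        reward = json.loads(path.read_text())[key]
        # A task counts as passed only on a full reward (assumed to be 1.0).
        (passed if reward >= 1.0 else failed).append(path.parent.name)
    return passed, failed


def report(results_dir: str) -> str:
    """Summarize the recount as 'passed/total (percent)'."""
    passed, failed = tally(results_dir)
    total = len(passed) + len(failed)
    if total == 0:
        return "no task results found"
    return f"{len(passed)}/{total} passed ({len(passed) / total:.1%})"
```

Comparing the recounted figure with the published aggregate is a quick first check before reading individual verifier reports.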

We believe coding-agent performance claims should be reproducible, auditable, and open to challenge.

The model is important. The harness is decisive.

The prevailing conversation around coding agents often focuses on whichever model is newest or largest.

That matters, but it is incomplete.

Two agents using the same frontier model can produce materially different outcomes because the surrounding system determines how effectively the model is used.

Backboard R-CLI is designed to treat every token, tool call, and context window as a resource that needs to be managed deliberately.

Several system-level choices contributed to this result:

Adaptive thinking

Not every step in a software task deserves the same reasoning budget.

Backboard R-CLI adjusts its approach based on the difficulty and uncertainty of the task at hand. Straightforward file inspection should not consume the same resources as debugging a complex build failure or validating a multi-step deployment.
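
To make the idea concrete, here is a minimal sketch of a per-step reasoning budget. It is not Backboard R-CLI's implementation: the step labels, the token figures, and the use of recent failures as an uncertainty signal are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str             # hypothetical labels: "inspect", "edit", "debug"
    recent_failures: int  # failed commands since the last success

# Hypothetical base budgets, in reasoning tokens.
BASE_BUDGET = {"inspect": 0, "edit": 2_000, "debug": 8_000}
DEFAULT_BUDGET = 2_000
MAX_BUDGET = 32_000


def thinking_budget(step: Step) -> int:
    """Pick a reasoning budget from the step type, raised after recent failures."""
    base = BASE_BUDGET.get(step.kind, DEFAULT_BUDGET)
    if step.recent_failures == 0:
        return base
    # Double the budget per recent failure, starting from at least 1,000 tokens,
    # so that even a cheap step gets more thought once things start going wrong.
    scaled = max(base, 1_000) * (2 ** step.recent_failures)
    return min(scaled, MAX_BUDGET)
```

Under these assumed numbers, a clean file inspection gets no extra reasoning, while a debugging step that follows three failed commands is capped at the maximum.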

Adaptive context management

Long-running coding tasks create a context-management problem.

As an agent explores a repository, runs tests, reads logs, and iterates on a solution, irrelevant context can accumulate quickly. The agent needs to retain the details that matter without repeatedly carrying unnecessary history forward.

Backboard R-CLI continuously curates its working context so the model stays focused on the active problem.
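
One simple form of context curation is to keep a few pinned items and fill the remaining budget with the newest entries. The sketch below assumes that policy for illustration; the `Entry` type, the pinning flag, and the recency-based selection are hypothetical, and the post does not describe how Backboard R-CLI actually ranks context.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    tokens: int
    pinned: bool = False  # e.g. the task statement; never dropped


def curate(history: list[Entry], budget: int) -> list[Entry]:
    """Keep pinned entries, then greedily add unpinned ones, newest first."""
    remaining = budget - sum(e.tokens for e in history if e.pinned)
    keep = set()
    for i in range(len(history) - 1, -1, -1):  # walk from newest to oldest
        entry = history[i]
        if entry.pinned:
            keep.add(i)
        elif entry.tokens <= remaining:
            keep.add(i)
            remaining -= entry.tokens
    # Return the survivors in their original chronological order.
    return [e for i, e in enumerate(history) if i in keep]
```

A production system would likely score entries by relevance to the active problem rather than by recency alone; recency is used here only because it needs no extra machinery.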

Smarter tool use

Coding agents do not fail only because they misunderstand code.

They also fail because they waste actions: inspecting the wrong files, repeating commands, exploring dead ends, or calling tools without a clear purpose.

Backboard R-CLI is designed to route work toward the most relevant tools and reduce unnecessary terminal round trips.

Reuse and caching

Repeated context should not need to be rebuilt from scratch on every step.

Backboard R-CLI uses reuse and caching strategies to avoid unnecessary repetition while preserving the information needed to complete the task.
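
As a rough illustration of reuse, the sketch below memoizes an expensive derived artifact (here, a summary) by the hash of its input, so unchanged content is never reprocessed. The class name and the choice of content hashing are assumptions for the example; the post does not specify which caching strategies Backboard R-CLI uses.

```python
import hashlib
from typing import Callable


class SummaryCache:
    """Cache derived summaries keyed by a SHA-256 hash of the source content."""

    def __init__(self, summarize: Callable[[str], str]):
        self._summarize = summarize
        self._store: dict[str, str] = {}
        self.misses = 0  # how many times the summarizer actually ran

    def get(self, content: str) -> str:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._summarize(content)
        return self._store[key]
```

Keying on content rather than file path means an edited file naturally produces a new entry, while an unchanged file read again later is served from the cache.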

Early convergence

A coding agent should know when the work is complete.

Once the task has been solved and validated, continuing to reason or explore adds cost and creates more opportunities to introduce errors.

The system is designed to converge early when there is sufficient evidence that the requested work is complete.
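
A minimal version of this stopping rule checks validation before every action and halts as soon as it passes. The loop below is a sketch under that assumption; `step` and `validate` are hypothetical callables standing in for one agent action and one check of the task's success criteria.

```python
from typing import Callable


def run_until_converged(
    step: Callable[[], None],
    validate: Callable[[], bool],
    max_steps: int = 50,
) -> tuple[bool, int]:
    """Run `step` until `validate` passes or `max_steps` is reached.

    Returns (converged, steps_taken).
    """
    steps = 0
    while not validate():
        if steps >= max_steps:
            return False, steps  # give up instead of looping forever
        step()
        steps += 1
    # Validation passed: stop here rather than continuing to explore.
    return True, steps
```

Because validation runs before the first action, a task that is already satisfied costs zero steps.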

Together, these design decisions help reduce waste while increasing task completion reliability.

Better coding agents are systems, not wrappers

The industry is moving quickly toward a world where developers will work alongside AI agents that can operate directly in repositories, terminals, cloud environments, and deployment systems.

The winning products will not simply expose the strongest available model.

They will combine strong models with better memory, better context, more efficient execution, more reliable validation, and more disciplined recovery when an initial attempt fails.

That is what we are building at Backboard.

This Terminal-Bench result is not a claim that the problem is solved. There are still tasks we failed, workflows that need improvement, and benchmarks that do not fully represent real-world engineering environments.

But it is evidence that the architecture around a model can materially change what that model is capable of accomplishing.

Reproduce the result

The complete Terminal-Bench 2.1 results for Backboard R-CLI are now public.

You can inspect the full run, task-level verifier reports, configurations, and artifacts here:

https://github.com/Backboard-io/Backboard-R-CLI-Terminal-Bench-2.1-Results

We welcome replication, scrutiny, and feedback.

The future of coding agents should not be decided by screenshots and uninspectable claims.

It should be decided by results that others can verify.

Rob Imbeault
