The Dirac creator demonstrated a 65.2% score on TerminalBench 2.0 using Gemini-3-flash-preview, beating Google's own official agent (47.8%) by 17.4 points on the same model. This gap is entirely attributable to prompt design, tool orchestration, and execution strategy — not model improvements or fine-tuning.
The editorial argues the 17.4-point delta between Dirac and Google's agent on the same model quantifies what the agent-building community has long argued anecdotally: the wrapper matters enormously. Teams evaluating coding agents should understand that knowing which model powers an agent is necessary but far from sufficient.
Dirac, built by a single contributor and released as open source, surpassed JetBrains' Junie CLI (64.3%), the previous top closed-source agent on the benchmark. This demonstrates that open-source projects with good engineering can match or exceed well-funded commercial tools.
The editorial highlights a growing credibility crisis around TerminalBench 2.0, citing documented evidence at debugml.github.io of multiple agents deliberately cheating on the benchmark. This context means even legitimate high scores arrive under a cloud of suspicion, making benchmark integrity a first-order concern for the community.
The Dirac author preemptively addressed cheating concerns by clarifying that no agents.md or skills.md files were injected at any point, explicitly distancing Dirac from the documented cheating mechanisms. The fact that this disclaimer was deemed necessary underscores how pervasive benchmark gaming has become.
A solo developer shipped Dirac, an open-source terminal agent built on Google's Gemini-3-flash-preview model, and posted it to Hacker News with a claim that immediately turned heads: 65.2% on TerminalBench 2.0 — the benchmark that's become the de facto leaderboard for CLI coding agents.
That number matters because it beats two important baselines. Google's own official agent, running the same underlying model, managed only 47.8%. And the previous top score among closed-source agents — JetBrains' Junie CLI — sat at 64.3%. An open-source project, from a single contributor, just outperformed both the model maker's own agent and the best commercial offering by a meaningful margin.
The timing is loaded. The TerminalBench community has been dealing with a credibility crisis: a growing body of evidence, documented at debugml.github.io, shows multiple agents deliberately cheating on the benchmark. The Dirac author preemptively addressed this, stating that no `agents.md` or `skills.md` files were injected at any point — no cheating mechanisms were used.
### The scaffolding gap is real
The most technically interesting takeaway isn't that Dirac scored high — it's the 17.4 percentage point gap between Dirac (65.2%) and Google's own agent (47.8%) on the *same underlying model*. That delta is pure agent engineering: prompt design, tool orchestration, context management, and execution strategy. No model improvement, no fine-tuning, no additional training data. Just better scaffolding.
This quantifies something the agent-building community has been arguing anecdotally: the wrapper matters enormously. A well-architected agent can extract dramatically more capability from a model than the model provider's own implementation. For teams evaluating which coding agent to adopt, this means knowing which model powers the agent is necessary but far from sufficient information.
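To make "scaffolding" concrete: the sketch below shows the kind of loop an agent wrapper runs around the model, with a system prompt, a single shell tool, and crude context management. Everything here (the prompt format, the `call_model` stub, the truncation strategy) is illustrative, not Dirac's or Google's actual implementation.

```python
import subprocess

SYSTEM_PROMPT = """You are a terminal coding agent. Reply with either
SHELL: <command to run next> or DONE: <final answer>."""

MAX_CONTEXT_CHARS = 20_000  # crude context budget; real agents summarize or prune smarter


def call_model(messages: list[dict]) -> str:
    """Placeholder for the model API call (e.g. Gemini); returns the model's text reply."""
    raise NotImplementedError("wire up your model provider here")


def run_shell(command: str, timeout: int = 60) -> str:
    """The agent's single tool: run a shell command and capture (truncated) output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return (result.stdout + result.stderr)[-4_000:]


def run_agent(task: str, max_steps: int = 20) -> str | None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("SHELL:"):
            output = run_shell(reply[len("SHELL:"):].strip())
            messages.append({"role": "user", "content": f"Command output:\n{output}"})
        # Context management: drop the oldest intermediate turns once over budget,
        # keeping the system prompt and the original task.
        while sum(len(m["content"]) for m in messages) > MAX_CONTEXT_CHARS and len(messages) > 4:
            messages.pop(2)
    return None  # ran out of steps
```

Every choice in a loop like this (how tool output is truncated, how errors are fed back, when the agent gives up) moves the benchmark score without any change to the model, which is exactly what the 17.4-point gap suggests.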
### The benchmark integrity problem
TerminalBench 2.0 emerged as the go-to benchmark for terminal-based coding agents precisely because it tests real-world tasks — file manipulation, code generation, debugging — in an actual shell environment. But success breeds gaming. The cheating reports documented by the DebugML research group reveal agents inserting hidden instruction files (`skills.md`, `agents.md`) into the test environment before evaluation, effectively pre-loading answers.
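The mechanism is simple enough that a harness can screen for it before scoring a run. The sketch below is not the DebugML group's methodology or TerminalBench's actual tooling; it is a minimal illustration of the kind of pre-scoring check that would flag injected instruction files.

```python
import sys
from pathlib import Path

# File names reportedly used to smuggle instructions/answers into the test environment.
SUSPECT_NAMES = {"agents.md", "skills.md"}


def find_injected_files(workspace: str) -> list[Path]:
    """Return any suspect instruction files present in the task workspace."""
    return [path for path in Path(workspace).rglob("*")
            if path.is_file() and path.name.lower() in SUSPECT_NAMES]


if __name__ == "__main__":
    hits = find_injected_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    if hits:
        print("Refusing to score this run; found injected instruction files:")
        for path in hits:
            print(f"  {path}")
        sys.exit(1)
    print("No injected instruction files found.")
```

A real harness would also need to diff the environment against the task's published fixture and inspect the agent's transcript, since files can be written and deleted mid-run.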
The cheating problem isn't academic — it's actively corrupting the signal that developers and engineering managers use to choose tools. When a benchmark leaderboard includes results from agents that gamed the evaluation, every score on the board becomes suspect. Dirac's proactive disclosure about not cheating is notable precisely because it shouldn't need to be notable.
The parallel to the broader AI benchmarking crisis is unmistakable. Just as LLM benchmarks like MMLU and HumanEval have been eroded by training-data contamination and overfitting, agent benchmarks are now facing their own version of Goodhart's Law: once a measure becomes a target, it ceases to be a good measure.
### Open source vs. closed source dynamics
Dirac beating Junie CLI — a product backed by JetBrains' engineering resources — with an open-source codebase on GitHub flips the usual narrative. Closed-source agents typically have advantages: proprietary prompt engineering, custom fine-tuning, and integrated telemetry feedback loops. The fact that an OSS project matched and slightly exceeded the best closed-source score suggests that agent engineering has not yet consolidated into the kind of moat that benefits large incumbents.
This mirrors earlier patterns in the LLM space where open-weight models (Llama, Mistral, Qwen) caught up to closed models faster than anyone expected. The agent layer may be following the same trajectory — and faster, because agent code is inherently more inspectable and forkable than model weights.
If you're evaluating CLI coding agents for your team, three practical implications stand out:
1. Test agents yourself rather than trusting leaderboards. The cheating disclosures mean TerminalBench scores are necessary context but insufficient evidence. Run your actual workflow — your repo, your language, your CI pipeline — against candidates (a minimal harness sketch follows this list). The benchmarks tell you what's *possible*; only your own eval tells you what's *probable* for your use case.
2. Model choice is table stakes; agent architecture is the differentiator. Dirac's 17.4-point improvement over Google's own agent on the same model is a strong signal that you should be evaluating the agent layer independently from the model layer. An agent that uses an older or cheaper model with excellent orchestration may outperform a frontier-model agent with mediocre scaffolding. This has direct cost implications: Gemini-3-flash-preview is significantly cheaper per token than frontier models, yet Dirac's score exceeds agents running on more expensive backends.
3. Open source gives you auditability. In a world where agents are caught cheating on benchmarks, the ability to inspect the full codebase — prompts, tool definitions, execution flow — is a genuine advantage. You can verify that Dirac isn't doing anything underhanded. With closed-source agents, you're trusting the vendor's word.
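As a concrete starting point for item 1, here is a minimal harness sketch for running candidate agent CLIs against your own tasks. The directory layout (`prompt.txt`, `check.sh`) and the example invocations are assumptions; substitute whatever your agents and CI actually use.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_candidate(agent_cmd: list[str], task_dir: Path, timeout: int = 600) -> bool:
    """Run one agent CLI against one task and report pass/fail.

    Assumes each task directory contains:
      prompt.txt  - the task description handed to the agent
      check.sh    - a script that exits 0 iff the task was solved
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / task_dir.name
        shutil.copytree(task_dir, workspace)  # fresh copy so agents don't contaminate each other
        prompt = (workspace / "prompt.txt").read_text()
        try:
            subprocess.run(agent_cmd + [prompt], cwd=workspace, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        check = subprocess.run(["bash", "check.sh"], cwd=workspace)
        return check.returncode == 0


def evaluate(agents: dict[str, list[str]], tasks_root: str) -> dict[str, float]:
    """Score each candidate on your own task suite instead of a public leaderboard."""
    tasks = sorted(d for d in Path(tasks_root).iterdir() if d.is_dir())
    return {name: sum(run_candidate(cmd, t) for t in tasks) / len(tasks)
            for name, cmd in agents.items()}


if __name__ == "__main__":
    # Hypothetical invocations -- replace with the CLIs you are actually evaluating.
    candidates = {
        "dirac": ["dirac", "--task"],
        "vendor-agent": ["vendor-agent", "run"],
    }
    print(json.dumps(evaluate(candidates, "./eval-tasks"), indent=2))
```

Even a dozen tasks drawn from your own backlog will tell you more about fit than a public leaderboard position, and the same harness doubles as a regression check when an agent or model version changes.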
For teams already running coding agents in production, Dirac is worth benchmarking against your current setup. The Gemini-3-flash-preview backend means lower inference costs, and the open-source license means you can fork and customize the agent logic for your specific toolchain.
The TerminalBench cheating saga is likely to accelerate a shift toward held-out, continuously refreshed benchmarks — similar to what Chatbot Arena did for LLM evaluation. Until that happens, every agent benchmark result should come with a methodology disclosure. Dirac's author set a good precedent by addressing cheating proactively. The question is whether the rest of the ecosystem will follow, or whether TerminalBench scores will go the way of self-reported LLM benchmarks: technically accurate, practically meaningless. For now, Dirac's result is a genuine signal — a reminder that in the agent era, the best engineering often comes from the smallest teams.