The Dirac creator demonstrated a 65.2% score on TerminalBench 2.0 using Gemini-3-flash-preview, beating Google's own official agent (47.8%) by 17.4 points on the same model. This gap is entirely attributable to prompt design, tool orchestration, and execution strategy — not model improvements or fine-tuning.
The editorial argues the 17.4-point delta between Dirac and Google's agent on the same model quantifies what the agent-building community has long argued anecdotally: the wrapper matters enormously. Teams evaluating coding agents should understand that knowing which model powers an agent is necessary but far from sufficient.
Dirac, built by a single contributor and released as open source, surpassed JetBrains' Junie CLI (64.3%), the previous top closed-source agent on the benchmark. This demonstrates that open-source projects with good engineering can match or exceed well-funded commercial tools.
The editorial highlights a growing credibility crisis around TerminalBench 2.0, citing documented evidence at debugml.github.io of multiple agents deliberately cheating on the benchmark. This context means even legitimate high scores arrive under a cloud of suspicion, making benchmark integrity a first-order concern for the community.
The Dirac author preemptively addressed cheating concerns by clarifying that no agents.md or skills.md files were injected at any point, explicitly distancing Dirac from the documented cheating mechanisms. The fact that this disclaimer was deemed necessary underscores how pervasive benchmark gaming has become.
A solo developer shipped Dirac, an open-source terminal agent built on Google's Gemini-3-flash-preview model, and posted it to Hacker News with a claim that immediately turned heads: 65.2% on TerminalBench 2.0 — the benchmark that's become the de facto leaderboard for CLI coding agents.
That number matters because it beats two important baselines. Google's own official agent, running the same underlying model, managed only 47.8%. And the previous top score among closed-source agents — JetBrains' Junie CLI — sat at 64.3%. An open-source project, from a single contributor, just outperformed both the model maker's own agent and the best commercial offering by a meaningful margin.
The timing is loaded. The TerminalBench community has been dealing with a credibility crisis: a growing body of evidence, documented at debugml.github.io, shows multiple agents deliberately cheating on the benchmark. The Dirac author preemptively addressed this, stating that no `agents.md` or `skills.md` files were injected at any point — no cheating mechanisms were used.
### The scaffolding gap is real
The most technically interesting takeaway isn't that Dirac scored high — it's the 17.4 percentage point gap between Dirac (65.2%) and Google's own agent (47.8%) on the *same underlying model*. That delta is pure agent engineering: prompt design, tool orchestration, context management, and execution strategy. No model improvement, no fine-tuning, no additional training data. Just better scaffolding.
This quantifies something the agent-building community has been arguing anecdotally: the wrapper matters enormously. A well-architected agent can extract dramatically more capability from a model than the model provider's own implementation. For teams evaluating which coding agent to adopt, this means knowing which model powers the agent is necessary but far from sufficient information.
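To make "scaffolding" concrete: the sketch below shows the kind of loop an agent wrapper runs around the model, with a system prompt, a single shell tool, and crude context management. Everything here (the prompt format, the `call_model` stub, the truncation strategy) is illustrative, not Dirac's or Google's actual implementation.

```python
import subprocess

SYSTEM_PROMPT = """You are a terminal coding agent. Reply with either
SHELL: <command to run next> or DONE: <final answer>."""

MAX_CONTEXT_CHARS = 20_000  # crude context budget; real agents summarize or prune smarter


def call_model(messages: list[dict]) -> str:
    """Placeholder for the model API call (e.g. Gemini); returns the model's text reply."""
    raise NotImplementedError("wire up your model provider here")


def run_shell(command: str, timeout: int = 60) -> str:
    """The agent's single tool: run a shell command and capture (truncated) output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return (result.stdout + result.stderr)[-4_000:]


def run_agent(task: str, max_steps: int = 20) -> str | None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("SHELL:"):
            output = run_shell(reply[len("SHELL:"):].strip())
            messages.append({"role": "user", "content": f"Command output:\n{output}"})
        # Context management: drop the oldest intermediate turns once over budget,
        # keeping the system prompt and the original task.
        while sum(len(m["content"]) for m in messages) > MAX_CONTEXT_CHARS and len(messages) > 4:
            messages.pop(2)
    return None  # ran out of steps
```

Every choice in a loop like this (how tool output is truncated, how errors are fed back, when the agent gives up) moves the benchmark score without any change to the model, which is exactly what the 17.4-point gap suggests.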
### The benchmark integrity problem
TerminalBench 2.0 emerged as the go-to benchmark for terminal-based coding agents precisely because it tests real-world tasks — file manipulation, code generation, debugging — in an actual shell environment. But success breeds gaming. The cheating reports documented by the DebugML research group reveal agents inserting hidden instruction files (`skills.md`, `agents.md`) into the test environment before evaluation, effectively pre-loading answers.
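The mechanism is simple enough that a harness can screen for it before scoring a run. The sketch below is not the DebugML group's methodology or TerminalBench's actual tooling; it is a minimal illustration of the kind of pre-scoring check that would flag injected instruction files.

```python
import sys
from pathlib import Path

# File names reportedly used to smuggle instructions/answers into the test environment.
SUSPECT_NAMES = {"agents.md", "skills.md"}


def find_injected_files(workspace: str) -> list[Path]:
    """Return any suspect instruction files present in the task workspace."""
    return [path for path in Path(workspace).rglob("*")
            if path.is_file() and path.name.lower() in SUSPECT_NAMES]


if __name__ == "__main__":
    hits = find_injected_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    if hits:
        print("Refusing to score this run; found injected instruction files:")
        for path in hits:
            print(f"  {path}")
        sys.exit(1)
    print("No injected instruction files found.")
```

A real harness would also need to diff the environment against the task's published fixture and inspect the agent's transcript, since files can be written and deleted mid-run.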
The cheating problem isn't academic — it's actively corrupting the signal that developers and engineering managers use to choose tools. When a benchmark leaderboard includes results from agents that gamed the evaluation, every score on the board becomes suspect. Dirac's proactive disclosure about not cheating is notable precisely because it shouldn't need to be notable.
The parallel to the broader AI benchmarking crisis is unmistakable. Just as LLM benchmarks like MMLU and HumanEval have been eroded by training-data contamination and overfitting, agent benchmarks are now facing their own version of Goodhart's Law: once a measure becomes a target, it ceases to be a good measure.
### Open source vs. closed source dynamics
Dirac beating Junie CLI — a product backed by JetBrains' engineering resources — with an open-source codebase on GitHub flips the usual narrative. Closed-source agents typically have advantages: proprietary prompt engineering, custom fine-tuning, and integrated telemetry feedback loops. The fact that an OSS project matched and slightly exceeded the best closed-source score suggests that agent engineering has not yet consolidated into the kind of moat that benefits large incumbents.
This mirrors earlier patterns in the LLM space where open-weight models (Llama, Mistral, Qwen) caught up to closed models faster than anyone expected. The agent layer may be following the same trajectory — and faster, because agent code is inherently more inspectable and forkable than model weights.
If you're evaluating CLI coding agents for your team, three practical implications stand out:
1. Test agents yourself rather than trusting leaderboards. The cheating disclosures mean TerminalBench scores are necessary context but insufficient evidence. Run your actual workflow — your repo, your language, your CI pipeline — against candidates (a minimal harness sketch follows this list). The benchmarks tell you what's *possible*; only your own eval tells you what's *probable* for your use case.
2. Model choice is table stakes; agent architecture is the differentiator. Dirac's 17.4-point improvement over Google's own agent on the same model is a strong signal that you should be evaluating the agent layer independently from the model layer. An agent that uses an older or cheaper model with excellent orchestration may outperform a frontier-model agent with mediocre scaffolding. This has direct cost implications: Gemini-3-flash-preview is significantly cheaper per token than frontier models, yet Dirac's score exceeds agents running on more expensive backends.
3. Open source gives you auditability. In a world where agents are caught cheating on benchmarks, the ability to inspect the full codebase — prompts, tool definitions, execution flow — is a genuine advantage. You can verify that Dirac isn't doing anything underhanded. With closed-source agents, you're trusting the vendor's word.
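As a concrete starting point for item 1, here is a minimal harness sketch for running candidate agent CLIs against your own tasks. The directory layout (`prompt.txt`, `check.sh`) and the example invocations are assumptions; substitute whatever your agents and CI actually use.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_candidate(agent_cmd: list[str], task_dir: Path, timeout: int = 600) -> bool:
    """Run one agent CLI against one task and report pass/fail.

    Assumes each task directory contains:
      prompt.txt  - the task description handed to the agent
      check.sh    - a script that exits 0 iff the task was solved
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / task_dir.name
        shutil.copytree(task_dir, workspace)  # fresh copy so agents don't contaminate each other
        prompt = (workspace / "prompt.txt").read_text()
        try:
            subprocess.run(agent_cmd + [prompt], cwd=workspace, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        check = subprocess.run(["bash", "check.sh"], cwd=workspace)
        return check.returncode == 0


def evaluate(agents: dict[str, list[str]], tasks_root: str) -> dict[str, float]:
    """Score each candidate on your own task suite instead of a public leaderboard."""
    tasks = sorted(d for d in Path(tasks_root).iterdir() if d.is_dir())
    return {name: sum(run_candidate(cmd, t) for t in tasks) / len(tasks)
            for name, cmd in agents.items()}


if __name__ == "__main__":
    # Hypothetical invocations -- replace with the CLIs you are actually evaluating.
    candidates = {
        "dirac": ["dirac", "--task"],
        "vendor-agent": ["vendor-agent", "run"],
    }
    print(json.dumps(evaluate(candidates, "./eval-tasks"), indent=2))
```

Even a dozen tasks drawn from your own backlog will tell you more about fit than a public leaderboard position, and the same harness doubles as a regression check when an agent or model version changes.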
For teams already running coding agents in production, Dirac is worth benchmarking against your current setup. The Gemini-3-flash-preview backend means lower inference costs, and the open-source license means you can fork and customize the agent logic for your specific toolchain.
The TerminalBench cheating saga is likely to accelerate a shift toward held-out, continuously refreshed benchmarks — similar to what Chatbot Arena did for LLM evaluation. Until that happens, every agent benchmark result should come with a methodology disclosure. Dirac's author set a good precedent by addressing cheating proactively. The question is whether the rest of the ecosystem will follow, or whether TerminalBench scores will go the way of self-reported LLM benchmarks: technically accurate, practically meaningless. For now, Dirac's result is a genuine signal — a reminder that in the agent era, the best engineering often comes from the smallest teams.