OpenAI Drops SWE-bench Verified — and They're Right to Do It

5 min read 1 source clear_take
├── "SWE-bench Verified has saturated and no longer discriminates between frontier models"
│  └── OpenAI (OpenAI Blog) → read

OpenAI argues that frontier models have converged to the point where score differences fall within run-to-run variance, meaning the benchmark measures noise rather than capability. With top models clustering in the 75-85% range, SWE-bench Verified cannot reliably tell you which model is better at software engineering.

├── "Benchmark saturation is a well-known pattern and the field needs harder, more realistic evaluations"
│  └── top10.dev editorial (top10.dev) → read below

The editorial draws a direct parallel to ImageNet's saturation arc, arguing this is a recurring pattern in ML evaluation where benchmarks go from revolutionary to noise-measuring. The trajectory from 20-25% in early 2024 to 75-85% by 2026 follows the classic benchmark lifecycle, and the community now needs evaluations that test what actually matters for real-world software engineering.

└── "Retiring a benchmark you've topped is self-serving — labs shouldn't unilaterally decide when evaluation standards expire"
  └── @kmdupree (Hacker News, 152 pts)

The Hacker News submission drew 152 points and 106 comments, sparking significant community debate. The implicit tension in the framing is that OpenAI — a competitor on the leaderboard — is the one declaring the benchmark obsolete, raising questions about whether this is genuine technical insight or strategic positioning to move the goalposts.

What happened

OpenAI published a post explaining why it will no longer evaluate its models against SWE-bench Verified, the curated subset of SWE-bench that became the de facto leaderboard for AI coding capability. The core argument: frontier models have converged to the point where SWE-bench Verified no longer discriminates between them. When the top five models on a benchmark cluster within a few percentage points of each other — and the variance between runs on the *same* model exceeds the gap between models — the benchmark has stopped measuring capability and started measuring noise.

This is a significant move. SWE-bench, created by researchers at Princeton, became the benchmark that mattered for AI coding tools starting in late 2024. The original dataset contains 2,294 real GitHub issues paired with their ground-truth pull requests. SWE-bench Verified, a human-validated 500-problem subset released in mid-2024, was supposed to be the more reliable signal. Every major lab — OpenAI, Anthropic, Google — published SWE-bench scores prominently. Startups like Cognition (Devin) and Factory used it as a core marketing metric.

OpenAI's decision to walk away from the benchmark isn't just a technical quibble. It's a public acknowledgment that the leaderboard race on SWE-bench was producing diminishing returns — for the labs building the models and for the developers trying to make sense of the numbers.

Why it matters

### The saturation problem is real

The trajectory tells the story. In early 2024, the best SWE-bench Verified scores were around 20-25%. By mid-2025, frontier models were hitting 60-70%. By early 2026, the top scores clustered in the 75-85% range. At this density, the benchmark can't tell you whether Model A is genuinely better than Model B at software engineering — it can only tell you which model happened to get lucky on which subset of 500 problems.
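
To make the noise argument concrete, here is a rough back-of-the-envelope check. It treats each of the 500 Verified problems as an independent pass/fail trial at an 80% resolve rate — a simplification, since real agentic runs have correlated failures and extra nondeterminism — but it shows why a one- or two-point leaderboard gap is hard to distinguish from run-to-run noise:

```python
import math
import random

N_PROBLEMS = 500   # size of SWE-bench Verified
PASS_RATE = 0.80   # roughly where frontier models cluster

# If every problem were an independent pass/fail trial, the standard error
# of a resolved-rate estimate would be sqrt(p * (1 - p) / n).
se = math.sqrt(PASS_RATE * (1 - PASS_RATE) / N_PROBLEMS)
print(f"standard error: +/- {se:.3f}")           # ~0.018, i.e. ~1.8 points
print(f"95% interval:   +/- {1.96 * se:.3f}")    # ~3.5 points

# Simulate two runs of the *same* model and count how often they land more
# than a full point apart -- i.e. how easily a 1-2 point leaderboard gap
# can be pure run-to-run noise.
random.seed(0)
trials = 10_000
over_one_point = 0
for _ in range(trials):
    run_a = sum(random.random() < PASS_RATE for _ in range(N_PROBLEMS)) / N_PROBLEMS
    run_b = sum(random.random() < PASS_RATE for _ in range(N_PROBLEMS)) / N_PROBLEMS
    if abs(run_a - run_b) > 0.01:
        over_one_point += 1

print(f"same model, >1 pt apart across two runs: {over_one_point / trials:.0%}")
```

If anything this floor understates the problem, because agentic scaffolding, tool use, and sampling add nondeterminism on top of the dataset-size limit.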

This is a well-known pattern in ML evaluation. ImageNet went through the same arc: revolutionary when accuracy was 60%, meaningless when it was 98%. The difference is that ImageNet saturation took nearly a decade. SWE-bench Verified saturated in roughly 18 months. That speed should make us ask whether the benchmark was measuring something deep or something shallow.

### The task distribution doesn't match reality

Here's the more uncomfortable critique, and one that practitioners have been raising for over a year: SWE-bench tasks are overwhelmingly single-file bug fixes in well-tested Python repositories. The median patch in SWE-bench touches 1-2 files and changes fewer than 30 lines. The test oracle is a pre-existing test suite — the model doesn't have to figure out *what* to test, only *how* to make a failing test pass.
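
The patch-size claim is easy to check against the dataset itself. A minimal sketch, assuming the Hugging Face mirror at princeton-nlp/SWE-bench_Verified exposes the gold diff in a `patch` column (adjust the name and field if your copy differs):

```python
# Sanity-check the "small, single-file fixes" claim from the gold patches.
from statistics import median
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

files_touched, lines_changed = [], []
for row in ds:
    diff_lines = row["patch"].splitlines()
    # Files appear once per "diff --git" header; changed lines start with +/-
    # (excluding the +++/--- file markers themselves).
    files_touched.append(sum(1 for l in diff_lines if l.startswith("diff --git")))
    lines_changed.append(sum(
        1 for l in diff_lines
        if (l.startswith("+") or l.startswith("-"))
        and not l.startswith(("+++", "---"))
    ))

print("median files touched per gold patch:", median(files_touched))
print("median lines changed per gold patch:", median(lines_changed))
```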

Real software engineering is almost nothing like this. Real work involves multi-file refactors across poorly documented codebases, ambiguous requirements conveyed in Slack messages, performance constraints that don't show up in unit tests, and the judgment call of whether to fix the bug or redesign the interface. SWE-bench measures the easiest 10% of a senior engineer's job and calls it "software engineering."

This isn't SWE-bench's fault — it was a genuinely innovative dataset when it launched. But the gap between "can fix a bug given a failing test" and "can do software engineering" is the gap between parallel parking and driving cross-country. Measuring one tells you almost nothing about the other.

### The Goodhart's Law problem

When a benchmark becomes the primary metric for an industry, models get optimized for that benchmark — often in ways that don't generalize. There's strong circumstantial evidence that major labs have been doing exactly this with SWE-bench. Agentic scaffolding, retrieval strategies, and even fine-tuning choices have been tuned to maximize SWE-bench scores specifically. OpenAI retiring the benchmark is, in part, an admission that the metric had become the target — and therefore ceased to be a good metric.

The Hacker News discussion around this post has been pointed. Multiple senior engineers noted that their experience with AI coding tools doesn't track with SWE-bench scores at all. A model scoring 80% on SWE-bench can still produce confidently wrong multi-file refactors, hallucinate API surfaces, and fail at the kind of "read the room" reasoning that separates a useful coding assistant from a dangerous one.

What this means for your stack

If you're evaluating AI coding tools for your team, stop using SWE-bench scores as a primary signal. They were always a proxy, and now they're a saturated proxy. Here's what to look at instead:

Multi-file reasoning. Give the tool a real task in your codebase — not a bug fix, but a feature addition that touches 3+ files. See if it understands the dependency graph, respects existing patterns, and produces code that a reviewer wouldn't immediately reject. This is harder to score on a leaderboard, which is precisely why it matters.

Failure-mode transparency. The most important capability difference between current AI coding tools isn't how often they succeed — it's how they fail. Does the tool tell you when it's uncertain? Does it ask clarifying questions? Or does it confidently generate plausible-looking code that passes a type checker but introduces a subtle regression? This is what separates tools you can trust from tools that create work.

Codebase navigation and context management. SWE-bench gives the model a perfect issue description and a bounded search space. Real codebases are 500K+ lines with inconsistent naming, stale docs, and critical context buried in commit messages. Evaluate tools on their ability to find relevant code, not just modify it.
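
One way to turn the navigation criterion into something scoreable: use the files your real merged fix touched as ground truth and check whether the tool proposes them before it edits anything. The sketch below is a hypothetical scorecard, not a full harness; `propose_files` stands in for however your tool of choice reports its plan.

```python
# Minimal file-localization scorecard. Ground truth is the set of files the
# actual merged fix touched; propose_files() is a hypothetical stand-in for
# whatever wrapper you write around the coding tool.
from dataclasses import dataclass

@dataclass
class NavTask:
    description: str      # the issue text, as messy as the real thing
    gold_files: set[str]  # files changed by the actual merged fix

def localization_score(proposed: set[str], gold: set[str]) -> dict:
    hits = proposed & gold
    return {
        "recall": len(hits) / len(gold) if gold else 0.0,
        "precision": len(hits) / len(proposed) if proposed else 0.0,
        "missed": sorted(gold - proposed),
    }

# Usage sketch:
# task = NavTask("Checkout totals drift when stacked coupons are applied",
#                {"billing/discounts.py", "billing/cart.py"})
# print(localization_score(propose_files(task.description), task.gold_files))
```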

For teams building internal evals, the lesson is clear: build benchmarks from your own codebase, your own PR history, your own bug tracker. A private eval suite of 50 real tasks from your repo will tell you more than any public leaderboard.
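
Much of that suite can be seeded mechanically from version control. Here is one rough way to do it with GitPython; the `main` branch name, the "fix"/"bug" message filter, and the file-name test heuristic are all assumptions to adapt to your own history:

```python
# Mine candidate eval tasks from merged bug-fix commits that also touched tests.
import json
from git import Repo  # pip install GitPython

repo = Repo(".")
tasks = []
for commit in repo.iter_commits("main", max_count=2000):
    msg = commit.message.lower()
    if "fix" not in msg and "bug" not in msg:
        continue
    changed = list(commit.stats.files)                 # file paths touched
    test_files = [f for f in changed if "test" in f]   # crude heuristic
    src_files = [f for f in changed if f not in test_files]
    if test_files and src_files:                       # a fix paired with tests
        tasks.append({
            "base_commit": commit.parents[0].hexsha if commit.parents else None,
            "task": commit.message.strip(),            # or the linked issue body
            "gold_files": src_files,
            "grading_tests": test_files,
        })

with open("private_swe_eval.jsonl", "w") as f:
    f.writelines(json.dumps(t) + "\n" for t in tasks[:50])
print(f"collected {min(len(tasks), 50)} candidate tasks")
```

Each collected task can then be replayed by checking out `base_commit`, handing the tool the task text, and grading against `grading_tests`: the SWE-bench recipe, applied to your own distribution of work.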

Looking ahead

The AI coding benchmark vacuum is now a real problem. SWE-bench was imperfect, but it was *something* — a shared reference point that let the industry track progress. Without it, we're likely headed into a period where every lab publishes its own cherry-picked evals, and practitioners have even less signal to work with. The next useful benchmark will need to measure multi-step reasoning across large codebases, handle ambiguous specifications, and resist the kind of rapid saturation that killed SWE-bench Verified. Several efforts are underway — including expanded versions of SWE-bench with harder tasks and multi-repo scenarios — but none has reached critical adoption yet. In the meantime, the best benchmark for your team is your team's actual work.

Hacker News 332 pts 172 comments

Why SWE-bench Verified no longer measures frontier coding capabilities

→ read on Hacker News
ofirpress · Hacker News

I'm a co-creator of SWE-bench:
1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still uns

Jcampuzano2 · Hacker News

It's pretty clear that any benchmark that comes out will be outdated and exist within the training data with short measure. There will always be an incentive to optimize specifically for these benchmarks even if just for marketing material. Sure there is a training cutoff, but it's usually only 3-6 mo

cpard · Hacker News

Benchmarks/evals are really hard and they become harder when there’s huge incentive to game them at an industry scale. ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago. A few days ago, a follow-up paper

jddj · Hacker News

For the most part I think we get the benchmarks we deserve.
Many SWE-bench passing PRs would not be merged: https://news.ycombinator.com/item?id=47341645
Top model SWE bench scores may be skewed by git history leaks: https://news.ycombinator.com/item?id=45214670

threepts · Hacker News

Why don't they ask their premier model to generate a bench for them? Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy. Leaderboard: https://arcprize.org/leaderboard (Most premier models don't even pas
