Your AI Coding Assistant Has a Rewriting Problem

4 min read 1 source clear_take
├── "AI coding models are systematically biased toward over-editing, and benchmarks are to blame"
│  └── nrehiew (Blog Post) → read

Argues that current coding models, from Copilot to Claude to GPT-4, consistently make far more changes than necessary when asked to fix a bug or add a feature. Traces this behavior to training and evaluation benchmarks that reward correct final output but don't penalize unnecessary modifications to surrounding code, meaning a model that rewrites an entire function scores the same as one that changes a single character.

├── "Over-editing undermines the core value proposition of AI coding tools by shifting time from writing to reviewing"
│  └── @pella (Hacker News, 328 pts)

Submitted the post, which drew 328 upvotes, reflecting widespread developer frustration. The core issue is that time saved generating code is lost — and then some — reviewing unnecessary changes like variable renames, style changes, and unsolicited refactors that bloat a one-line fix into a 200-line diff.

└── "The solution is training models to optimize for minimal, targeted diffs rather than correct final output"
  └── nrehiew (Blog Post) → read

Proposes that the fix lies in changing how models are trained and evaluated — benchmarks should penalize unnecessary modifications to surrounding code, not just reward correct outputs. A model that makes the minimal edit necessary should score higher than one that produces a correct but heavily rewritten result, aligning model incentives with real-world developer workflows.

What happened

A blog post titled "Coding Models Are Doing Too Much" has struck a nerve on Hacker News, pulling 328 upvotes and igniting a discussion that clearly resonates with developers who've watched their AI coding assistant turn a one-line bug fix into a 200-line diff. The core thesis is deceptively simple: when you ask an AI model to fix a bug or add a feature, it should make the minimal edit necessary — not rewrite your function signatures, rename your variables, refactor your error handling, and reorganize your imports along the way.

The author presents the case that current coding models — from GitHub Copilot to Claude to GPT-4 — are systematically biased toward over-editing. When prompted to change one thing, they change five, and the four extras aren't improvements — they're liability. The post traces this behavior to how models are trained and evaluated: benchmarks reward "correct final output" but don't penalize unnecessary modifications to surrounding code. A model that rewrites an entire function to fix a typo scores the same as one that changes a single character, despite the former being dramatically worse for real-world use.
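The incentive gap is easy to see in miniature. The sketch below is illustrative, not from the post: `passes_tests` is a toy stand-in for a benchmark harness, and both patches are invented examples. Under correctness-only scoring, the minimal fix and the full rewrite are indistinguishable, even though their diffs differ by 4x:

```python
import difflib

def passes_tests(patched_code: str) -> bool:
    # Toy stand-in for a benchmark's test harness: only the final
    # output is checked, never how much of the file was rewritten.
    return "if x is None" in patched_code

def diff_lines(before: str, after: str) -> int:
    # Count added/removed lines between the two versions.
    return sum(
        1 for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="")
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---")))

original = "def f(x):\n    return x.value\n"

minimal_fix = (
    "def f(x):\n"
    "    if x is None:\n"
    "        return None\n"
    "    return x.value\n")

full_rewrite = (
    "def handle(x):\n"
    "    # refactored entirely\n"
    "    if x is None:\n"
    "        return None\n"
    "    result = x.value\n"
    "    return result\n")

for patch in (minimal_fix, full_rewrite):
    # Both print True, so a correctness-only benchmark scores
    # them identically, despite very different diff sizes.
    print(passes_tests(patch), diff_lines(original, patch))
```

Running this shows both patches passing, with the minimal fix touching 2 diff lines and the rewrite touching 8 — a gap the scoring never sees.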

Why it matters

This isn't a theoretical complaint. Any developer who's spent time with AI coding tools has experienced the frustration of reviewing an AI-generated diff that's 10x larger than it should be. You asked it to add null checking to one parameter. It added null checking, switched your `var` to `const`, renamed `data` to `responseData`, extracted a helper function you didn't ask for, and added three comments in a tone that doesn't match your codebase. Now you're spending more time reviewing the AI's work than it would have taken to write the fix yourself.

The deeper problem is that over-editing actively undermines the value proposition of AI coding tools. The entire point is to save time. But time saved generating code is lost — and then some — reviewing unnecessary changes, hunting for regressions in untouched logic, and re-running test suites that fail because the model touched code it shouldn't have. Senior engineers know that the best patches are the smallest ones. Every line of diff is a line that could harbor a bug, a line that needs review, and a line that shows up in `git blame` forever.

The Hacker News discussion amplifies this with war stories that any practitioner will recognize. Developers describe asking models to fix a CSS alignment issue and getting back a complete component rewrite. Others report models that "helpfully" upgrade API patterns to newer versions mid-fix, breaking compatibility with the rest of the codebase. The consensus is clear: the industry is optimizing coding models for impressive demos rather than for integration into real development workflows where predictability and minimalism matter more than cleverness.

There's a training data dimension here too. Models learn from open-source commits, pull requests, and code review discussions. But the training signal doesn't distinguish between "this is the minimal fix" and "this is a large refactor that happens to include a fix." Without explicit optimization pressure toward minimal diffs, models default to the statistical average of their training data — which includes plenty of kitchen-sink commits.
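If you wanted to apply that pressure at the data level, one crude filter (a hypothetical heuristic, not something the post proposes) would drop kitchen-sink commits from the training set based on diff size; the thresholds here are made-up knobs:

```python
def looks_like_kitchen_sink(files_changed: int, lines_changed: int,
                            max_files: int = 3,
                            max_lines: int = 40) -> bool:
    # Crude heuristic: a "fix" commit that touches many files or
    # hundreds of lines probably bundles a refactor with the fix.
    return files_changed > max_files or lines_changed > max_lines

commits = [
    {"msg": "fix null deref in parser", "files": 1, "lines": 4},
    {"msg": "fix typo + refactor error handling", "files": 9, "lines": 312},
]

# Keep only commits that plausibly represent minimal fixes.
training_set = [c for c in commits
                if not looks_like_kitchen_sink(c["files"], c["lines"])]
print([c["msg"] for c in training_set])
```

A real pipeline would need to be smarter — some legitimate fixes are large — but even a blunt filter like this would shift the statistical average the model learns from.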

What this means for your stack

If you're integrating AI coding assistants into your team's workflow, the minimal editing principle should inform both your tool selection and your prompting strategy. Prompt engineering matters here. Explicitly instructing the model to "change only what is necessary" or "do not modify any code outside the specified function" measurably reduces over-editing in most current models. Some teams are adding these constraints to their system prompts or editor configurations as standard practice.
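A minimal sketch of what that looks like in practice, assuming a generic chat-completion message format; the rule wording and the `build_messages` helper are illustrative, not taken from any particular tool's configuration:

```python
# Hypothetical minimal-edit constraints for a system prompt.
MINIMAL_EDIT_RULES = """\
- Change only what is necessary to complete the task.
- Do not modify any code outside the specified function.
- Do not rename variables, reorder imports, or reformat untouched lines.
- Return a unified diff of the change, not the whole file."""

def build_messages(task: str, code: str) -> list[dict]:
    # Prepend the constraints as a system message so they apply
    # to every request, not just ones where the user remembers.
    return [
        {"role": "system", "content": MINIMAL_EDIT_RULES},
        {"role": "user", "content": f"{task}\n\n```\n{code}\n```"},
    ]

msgs = build_messages("Add a null check to the `user` parameter.",
                      "def greet(user):\n    return user.name")
print(msgs[0]["content"].splitlines()[0])
```

Asking for a diff rather than a full file has a second benefit: it makes any out-of-scope edits immediately visible in review.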

When evaluating AI coding tools, start measuring diff size relative to task scope. A model that produces a 5-line diff for a 5-line task is more valuable than one that produces a 50-line diff, even if both pass the test suite. This metric — call it "edit efficiency" — isn't tracked by any major benchmark today, but it's the single best predictor of whether an AI coding tool will actually save time in a production codebase. Teams adopting AI coding tools should track this informally: how often do you accept the AI's full diff versus cherry-picking pieces of it?
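Tracking it could be as simple as the sketch below: a toy implementation of the edit-efficiency idea, where `expected_lines` is a reviewer's estimate of how many diff lines the task should require (the difflib-based counting and the ratio definition are assumptions of this sketch, not an established formula):

```python
import difflib

def edit_efficiency(before: str, after: str,
                    expected_lines: int) -> float:
    # Ratio of the diff lines a reviewer expected to the diff lines
    # the model actually produced; 1.0 means a perfectly scoped edit,
    # smaller values mean over-editing. An empty diff scores 0.0
    # (the task wasn't attempted at all).
    changed = sum(
        1 for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="")
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---")))
    return expected_lines / changed if changed else 0.0

before = "a = 1\nb = 2\nc = 3\n"
minimal = "a = 1\nb = 20\nc = 3\n"   # one line edited (2 diff lines)
rewrite = "x = 1\ny = 20\nz = 3\n"   # everything renamed too

print(edit_efficiency(before, minimal, expected_lines=2))  # 1.0
print(edit_efficiency(before, rewrite, expected_lines=2))
```

The rewrite scores about 0.33 here: two-thirds of its diff is churn a reviewer never asked for.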

For tool builders and model trainers, this is a clear signal. The next frontier in coding model quality isn't generating more code — it's generating less. Training with edit-distance penalties, evaluating on diff minimality alongside correctness, and building UI that makes it easy to see exactly what changed (and reject the rest) would move the entire ecosystem forward. Some early work in this direction includes constrained decoding and edit-aware fine-tuning, but it's still far from mainstream.
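An edit-distance penalty can be sketched as simple reward shaping. This is a toy illustration under stated assumptions — the character-level distance, the penalty weight, and the binary test signal are all made-up choices, not a published training recipe:

```python
import difflib

def reward(original: str, patched: str, tests_pass: bool,
           penalty: float = 0.02) -> float:
    # Correctness gates the reward; every changed character then
    # subtracts a small penalty, so a minimal fix outscores a
    # correct-but-sprawling rewrite. The weight is an arbitrary knob.
    if not tests_pass:
        return 0.0
    sm = difflib.SequenceMatcher(a=original, b=patched)
    edit_distance = sum(
        max(i2 - i1, j2 - j1)
        for op, i1, i2, j1, j2 in sm.get_opcodes()
        if op != "equal")
    return max(0.0, 1.0 - penalty * edit_distance)

original = "return x"
minimal = "return x or 0"                      # 5 chars inserted
sprawl = "value = x\nif value is None:\n    value = 0\nreturn value"

print(reward(original, minimal, tests_pass=True))   # 0.9
print(reward(original, sprawl, tests_pass=True))    # much lower
```

Gating on correctness first keeps the incentive honest: a tiny diff that fails the tests still scores zero, so the model can't game the penalty by doing nothing.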

Looking ahead

The minimal editing debate mirrors a pattern we've seen before in software tooling: the first generation optimizes for capability ("can it do the thing?"), and the second generation optimizes for integration ("does it fit into how we actually work?"). AI coding tools are entering that second phase. The models that win the next round won't be the ones that write the most code — they'll be the ones that write the least code necessary, and leave everything else exactly as they found it. That's not a lower bar. It's a dramatically higher one.

Hacker News 343 pts 193 comments

Coding Models Are Doing Too Much

→ read on Hacker News
hathawsh · Hacker News

I'm either in a minority or a silent majority. Claude Code surpasses all my expectations. When it makes a mistake like over-editing, I explain the mistake, it fixes it, and I ask it to record what it learned in the relevant project-specific skills. It rarely makes that mistake again. When the s…

jstanley · Hacker News

Conversely, I often find coding agents privileging the existing code when they could do a much better job if they changed it to suit the new requirement. I guess it comes down to how ossified you want your existing code to be. If it's a big production application that's been running for deca…

Isolated_Routes · Hacker News

I think building something really well with AI takes a lot of work. You can certainly ask it to do things and it will comply, and produce something pretty good. But you don't know what you don't know, especially when it speaks to you authoritatively. So checking its work from many differen…

foo12bar · Hacker News

I've noticed AIs often try to hide failure by catching exceptions and returning some dummy value, maybe with some log message buried in tons of extraneous other log messages. And the logs themselves are often over-abbreviated and missing key data to successfully debug what is happening. I…

anonu · Hacker News

Here, the author means the agent over-edits code. But agents also do "too much": they touch multiple files, run tests, do deployments, run smoke tests, etc. And all of this gets abstracted away. On one hand, it's incredible. But on the other hand I have deep anxiety over this: 1. I h…
