nrehiew defines over-editing precisely — functionally correct output that structurally diverges beyond the minimal fix — and provides code and methodology to measure it across models. They argue this transforms simple reviews into archaeology expeditions and that the phenomenon affects all major AI coding tools including Cursor, Copilot, Claude Code, and Codex.
The research was submitted to Hacker News, where it garnered 343 points and 193 comments, indicating strong community resonance with the claim that LLMs systematically rewrite code beyond what is necessary for a fix.
nrehiew argues that review throughput, not coding speed, determines how fast teams ship. When a one-line bug fix becomes a 200-line diff with renamed variables and extracted helpers, the reviewer must reconstruct what actually changed and verify no regressions were introduced, turning a 30-second review into a 15-minute investigation.
The editorial synthesis highlights that when models add unrequested input validation, rename variables, and restructure control flow alongside the actual fix, the extra churn creates cover for subtle bugs. A reviewer scanning a large diff for what should be a one-line fix is far more likely to miss a regression hidden among cosmetic changes.
Beyond diagnosing the problem, nrehiew investigates whether models can be trained to be more faithful editors that produce structurally minimal fixes. This frames over-editing not as an inherent limitation of LLMs but as a training problem with a potential engineering solution.
A detailed investigation by researcher nrehiew has quantified something every developer using AI coding tools already suspected: when you ask an LLM to fix a one-line bug, it rewrites half the function. The post, which landed on Hacker News with 343 points and sparked intense practitioner debate, frames this as the "Over-Editing problem" — models producing outputs that are functionally correct but structurally divergent from the original code far beyond what the fix requires.
The definition is precise and useful: a model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires. An off-by-one error becomes a rewritten loop. A wrong operator becomes a refactored function with new validation, renamed variables, and an extracted helper. The bug is fixed. The diff is enormous.
This isn't a niche academic concern. Every major AI coding tool — Cursor, GitHub Copilot, Claude Code, Codex — exhibits this behavior. The research includes code and methodology to measure the phenomenon across models, moving it from anecdote to data.
Code review is already the bottleneck in most engineering organizations. Studies consistently show that review throughput — not coding speed — determines how fast teams ship. When an AI rewrites an entire function to fix a single bug, it transforms a 30-second review into a 15-minute archaeology expedition. The reviewer has to reconstruct what actually changed, distinguish intentional fixes from cosmetic rewriting, and verify that the "improvements" didn't introduce regressions. This is exactly the wrong direction for a tool that's supposed to make developers faster.
The security implications deserve attention. When a model adds input validation you didn't ask for, renames variables, and restructures control flow alongside the actual fix, it creates cover for subtle bugs. A reviewer scanning a 200-line diff for a one-line fix is operating at reduced attention. The unchanged code that was rewritten — code that was working — now needs re-verification. Every unnecessary change is an opportunity for a defect to hide in plain sight.
The Hacker News discussion surfaced a genuine split in practitioner opinion. User hathawsh reported success training Claude Code out of over-editing behavior through project-specific skills: "When it makes a mistake like over-editing, I explain the mistake, it fixes it, and I ask it to record what it learned." This works, but it's a per-user, per-project workaround for what should be a model-level default.
User jstanley offered the opposite perspective: "I often find coding agents privileging the existing code when they could do a much better job if they changed it to suit the new requirement." This tension is real — sometimes the right fix is a refactor. The problem isn't that models change code; it's that they can't distinguish between a fix that requires structural changes and a fix that requires changing one character. The model lacks a theory of edit scope.
Perhaps the most pointed observation came from foo12bar, who noted that AI models often hide failures by catching exceptions and returning dummy values, burying the evidence in verbose logging: "The logs themselves are often over abbreviated and missing key data to successfully debug what is happening." Over-editing isn't just cosmetic — it's a symptom of models optimizing for apparent correctness rather than minimal, verifiable change.
Quantifying over-editing requires defining what a "minimal edit" looks like, which is harder than it sounds. The research proposes measuring structural divergence between the model's output and the minimal fix. This is a meaningful metric because it separates the question "did the model fix the bug?" from "did the model do only what was asked?"
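The post's exact metric isn't reproduced here, but a rough approximation of structural divergence is to compare the size of the model's edit against the size of a known minimal fix. A minimal sketch, assuming you have the original file, the model's output, and a reference minimal patch as strings (the function names are illustrative, not nrehiew's code):

```python
import difflib

def changed_lines(before: str, after: str) -> int:
    """Count lines added or removed between two versions of a file."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )

def over_edit_ratio(original: str, model_output: str, minimal_fix: str) -> float:
    """Model's edit size relative to the minimal edit size.

    A value near 1.0 means the model changed about as much as the minimal fix;
    larger values indicate structural divergence beyond what the fix required.
    """
    model_changes = changed_lines(original, model_output)
    minimal_changes = max(changed_lines(original, minimal_fix), 1)
    return model_changes / minimal_changes
```

A ratio like this captures the spirit of the metric, how much the model did beyond what was asked, while staying independent of whether the fix itself is correct.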
Current coding benchmarks like SWE-bench measure whether the fix works. They don't penalize a model for rewriting 50 lines when 1 line needed to change. This means we've been optimizing AI coding tools for correctness without penalizing unnecessary complexity — the evaluation framework itself encourages over-editing. A model that rewrites everything and passes tests scores the same as a model that makes the minimal surgical fix.
The research demonstrates that training with a minimal-edit objective — explicitly penalizing unnecessary changes — produces models that make smaller, more targeted edits without sacrificing fix accuracy. This is encouraging. It means over-editing is a training signal problem, not a fundamental limitation of the architecture.
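The training details aren't reproduced in this digest, but the shape of such an objective is easy to sketch: pair a correctness reward with a penalty that grows as the edit exceeds the minimal one. A purely illustrative sketch, with a made-up penalty weight rather than a value from the research:

```python
def minimal_edit_reward(passes_tests: bool, edit_ratio: float,
                        penalty_weight: float = 0.1) -> float:
    """Illustrative reward: credit for a passing fix, minus a penalty for
    changing more than the minimal fix did (edit_ratio > 1.0).

    penalty_weight is a hypothetical hyperparameter, not taken from the post.
    """
    correctness = 1.0 if passes_tests else 0.0
    excess = max(edit_ratio - 1.0, 0.0)
    return correctness - penalty_weight * excess
```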
If you're using AI coding tools in a team environment, over-editing is costing you review hours today. Three practical responses:
Constrain the edit scope in your prompts. Instead of "fix this bug," try "fix the off-by-one error on line 47 — change only what's necessary." Most coding agents respect scope constraints when explicitly stated. Claude Code's CLAUDE.md project files and Cursor's .cursorrules can encode this as a default instruction.
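As an example, a default instruction you might encode in CLAUDE.md or .cursorrules (the wording is illustrative, not an official template):

```
## Editing conventions
- Make the smallest change that fixes the reported issue.
- Do not rename variables, extract helpers, or reformat lines you were not asked to touch.
- If a larger refactor would genuinely help, propose it separately rather than bundling it into the fix.
```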
Use diff-aware review workflows. When reviewing AI-generated changes, filter for semantic changes versus cosmetic ones. Tools like `git diff --word-diff` or semantic diff tools can help separate meaningful changes from variable renames and reformatting. If your team uses AI coding tools regularly, consider adding a CI check that flags diffs exceeding a size threshold relative to the issue description.
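A minimal sketch of such a CI check, assuming the change size comes from `git diff --numstat` against the main branch and that a fixed line threshold stands in for "relative to the issue description" (the threshold and branch name are illustrative):

```python
import subprocess
import sys

MAX_CHANGED_LINES = 100  # illustrative threshold; tune per repository

def diff_size(base: str = "origin/main") -> int:
    """Sum of lines added and deleted versus the base branch, via git diff --numstat."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        # Binary files report "-" for the added/deleted counts; skip them.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    size = diff_size()
    if size > MAX_CHANGED_LINES:
        print(f"Diff touches {size} lines (limit {MAX_CHANGED_LINES}); "
              "flag for extra scrutiny of AI-generated changes.")
        sys.exit(1)
```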
Watch for the exception-swallowing pattern. As foo12bar noted, models frequently mask failures with try/catch blocks that return plausible defaults. Audit AI-generated code specifically for new exception handlers, especially ones that log and continue rather than propagate errors. This is where over-editing crosses from annoying to dangerous.
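The shape to audit for looks roughly like this (illustrative Python; `fetch_profile` is a stand-in for any call that can fail):

```python
import logging

logger = logging.getLogger(__name__)

def fetch_profile(user_id: str) -> dict:
    """Stand-in for a real lookup that can fail."""
    raise ConnectionError("backend unavailable")

def load_user_profile(user_id: str) -> dict:
    # Anti-pattern to audit for: the failure is swallowed, a plausible default
    # is returned, and the only trace is a debug-level log line.
    try:
        return fetch_profile(user_id)
    except Exception as exc:
        logger.debug("profile lookup failed for %s: %s", user_id, exc)
        return {}

def load_user_profile_strict(user_id: str) -> dict:
    # Preferred: let the error propagate so the failure is visible to the
    # caller, to tests, and to monitoring.
    return fetch_profile(user_id)
```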
The broader architectural question is whether AI coding tools should default to minimal edits or maximal "improvement." The answer depends on context — a greenfield prototype benefits from aggressive refactoring, while a production hotfix demands surgical precision. Today's tools don't make this distinction. The user anonu captured the anxiety well: these agents "touch multiple files, run tests, do deployments, run smoke tests... and all of this gets abstracted away." The abstraction is the product, but the abstraction is also the risk.
The over-editing problem will likely get solved at the model layer within the next year. The research shows that minimal-edit training objectives work, and the major model providers have strong commercial incentives to fix this — enterprise adoption of coding agents depends on code review remaining tractable. Until then, treat AI-generated diffs the way you'd treat a junior developer's first PR: assume good intent, verify every line, and push back when the scope creeps beyond the ticket.