A METR analysis finds that AI-generated pull requests that pass the SWE-bench benchmark often would not survive real code review, raising hard questions about how we measure AI coding ability.