Your M4 Max Burns More Money on LLMs Than OpenRouter Does

├── "Local LLM inference on Apple Silicon costs more in electricity than cloud API pricing due to utilization economics"
│  └── William Angel (datadrivenangel blog)

Angel measured wall power draw during inference with a hardware power meter and compared per-token electricity costs against OpenRouter's pricing. His data shows that for most model sizes, the MacBook's electricity bill exceeds cloud API costs because local hardware serves one user while cloud GPUs amortize costs across thousands of concurrent requests.

├── "The utilization gap is the real driver — local inference pays for idle silicon while cloud providers spread costs across many users"
│  └── top10.dev editorial (top10.dev)

The editorial emphasizes that the finding isn't about Apple Silicon being power-hungry — it's actually efficient per watt. The fundamental issue is that a local machine pays for 100% of silicon whether generating tokens or idle, while a data center GPU at 80% utilization serves hundreds of simultaneous requests, replicating the same economic logic that made cloud computing win over self-hosted servers.

└── "Apple Silicon's per-watt efficiency is impressive but irrelevant when the comparison is total cost per token against hyperscale infrastructure"
  └── William Angel (datadrivenangel blog)

Angel's methodology highlights that while M-series chips draw only 15-45W during inference compared to an H100's 700W, raw wattage is misleading. The H100 serves orders of magnitude more concurrent users, making its per-token energy cost far lower despite the higher absolute power draw.

## What happened

William Angel published a detailed analysis measuring the actual electricity cost of running open-weight LLMs on Apple Silicon hardware versus paying per-token through OpenRouter's cloud API. The methodology was straightforward: measure wall power draw during inference using a hardware power meter, calculate the cost per token at real electricity rates, and compare that against OpenRouter's published pricing for equivalent models.

The headline result is counterintuitive: for most model sizes and workloads, your MacBook's electricity bill exceeds what you'd pay OpenRouter to run the same inference in the cloud. The analysis tested multiple quantized models across the range that Apple Silicon users typically run — from 7B parameter models up through the larger models that push the M4 Max's unified memory to its limits.

The post hit 305 points on Hacker News, sparking the kind of vigorous debate that only happens when data challenges a community's foundational assumptions.

## Why it matters

### The economics of utilization

The core insight isn't that Apple Silicon is power-hungry — it's actually remarkably efficient per watt. The issue is utilization. When you run a local LLM, you're paying for 100% of the silicon 100% of the time the model is loaded, whether it's generating tokens or waiting for you to read the output. Cloud providers spread their GPU costs across thousands of concurrent users. A data center GPU running at 80% utilization amortizes its energy cost across hundreds of simultaneous requests; your M4 Max serves an audience of one.
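To make the utilization argument concrete, here is a minimal sketch of how idle time inflates the energy attributed to each token. The function name, the power figures, the throughput, and the duty cycles are all illustrative assumptions, not measurements from Angel's post.

```python
def energy_per_token_joules(active_watts, idle_watts, tokens_per_second, duty_cycle):
    """Average energy per generated token when idle time is charged to the tokens.

    duty_cycle is the fraction of wall-clock time spent generating (0 < duty_cycle <= 1).
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Average power drawn over one second of wall-clock time.
    average_watts = active_watts * duty_cycle + idle_watts * (1 - duty_cycle)
    # Tokens actually produced per second of wall-clock time.
    effective_tokens_per_second = tokens_per_second * duty_cycle
    return average_watts / effective_tokens_per_second


# Hypothetical machine: 40 W while generating, 10 W while waiting, 20 tokens/s.
saturated = energy_per_token_joules(40.0, 10.0, 20.0, duty_cycle=1.0)
mostly_idle = energy_per_token_joules(40.0, 10.0, 20.0, duty_cycle=0.1)
print(f"saturated:   {saturated:.2f} J/token")
print(f"mostly idle: {mostly_idle:.2f} J/token")
```

Under these assumed numbers, the same hardware spends more than three times as much energy per token when it generates only 10% of the time, which is the single-user situation the article describes.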

This is the same economic logic that made cloud computing win in the first place — and it applies to inference just as much as it applied to web servers in 2008.

### The watt-hour reality

Apple's M-series chips draw 15-45W during active inference, depending on model size and quantization. That sounds modest compared to an NVIDIA H100 pulling 700W, but the H100 is serving orders of magnitude more tokens per second. When you divide watts by tokens per second to get energy per token, the efficiency gap is stark.

At typical US residential electricity rates (~$0.16/kWh), running a quantized 70B model on an M4 Max for sustained inference costs roughly $0.01-0.02 per 1,000 tokens in electricity alone. OpenRouter often prices equivalent models at $0.001-0.005 per 1,000 tokens. Taken at face value, those ranges put the local option at 2-20x the cloud price in pure energy, before you even factor in amortizing the $3,000+ hardware.
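The spread between the two quoted price ranges can be checked with a few lines of arithmetic. This is a sketch that takes the article's figures at face value; the function name `ratio_bounds` is hypothetical.

```python
def ratio_bounds(local_range, cloud_range):
    """Smallest and largest local/cloud cost ratio for two (low, high) price ranges."""
    local_low, local_high = local_range
    cloud_low, cloud_high = cloud_range
    if min(local_low, cloud_low) <= 0:
        raise ValueError("prices must be positive")
    # Best case for local: cheapest local price against the priciest cloud price.
    # Worst case for local: priciest local price against the cheapest cloud price.
    return local_low / cloud_high, local_high / cloud_low


# Dollar cost per 1,000 tokens, as quoted in the paragraph above.
local_electricity = (0.01, 0.02)
openrouter_price = (0.001, 0.005)

low, high = ratio_bounds(local_electricity, openrouter_price)
print(f"local costs {low:.0f}x to {high:.0f}x the cloud price")
```

The lower bound pairs the cheapest local figure with the most expensive cloud figure, and the upper bound does the reverse, so the result is only as reliable as the four endpoints it is built from.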

The math gets even worse for smaller models. A 7B model on Apple Silicon doesn't draw proportionally less power — the chip's base power draw creates a floor. Meanwhile, cloud providers run small models on optimized infrastructure where a single GPU handles dozens of concurrent requests.

### What the community got wrong

The Hacker News discussion revealed a common reasoning error in the local-LLM community. Many developers think of local inference as "free" because they already own the hardware and electricity feels invisible. This is the same blind spot for hidden costs that makes people drive 30 minutes to save $5 on gas.

The "local is free" mental model collapses under any rigorous accounting. Your MacBook has a finite lifespan measured in charge cycles and thermal wear. Every hour of sustained inference workload shortens that lifespan. If you're running inference 8 hours a day, the hardware depreciation alone — spread across the machine's useful life — likely exceeds cloud API costs.
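A straight-line depreciation estimate shows how the hardware term can be put on an hourly basis. The $3,000 price and the 8 hours a day come from the article; the four-year lifespan, the zero resale value, and the `inference_share` parameter are assumptions added for illustration.

```python
def depreciation_per_hour(purchase_price, lifespan_years, hours_per_day,
                          resale_value=0.0, inference_share=1.0):
    """Straight-line hardware cost per hour of inference.

    inference_share is the fraction of the machine's cost attributed to
    inference rather than to its other uses (1.0 charges the whole laptop).
    """
    if lifespan_years <= 0 or hours_per_day <= 0:
        raise ValueError("lifespan_years and hours_per_day must be positive")
    if not 0 <= inference_share <= 1:
        raise ValueError("inference_share must be in [0, 1]")
    total_inference_hours = lifespan_years * 365 * hours_per_day
    depreciable_cost = (purchase_price - resale_value) * inference_share
    return depreciable_cost / total_inference_hours


# Assumed: 4-year useful life, no resale value.
whole_machine = depreciation_per_hour(3000.0, lifespan_years=4, hours_per_day=8)
half_machine = depreciation_per_hour(3000.0, lifespan_years=4, hours_per_day=8,
                                     inference_share=0.5)
print(f"whole laptop charged to inference: ${whole_machine:.3f}/hour")
print(f"half the laptop charged:           ${half_machine:.3f}/hour")
```

How much of the laptop to charge to inference is a judgment call, since the same machine also serves as a general-purpose computer; the `inference_share` knob makes that choice explicit rather than hiding it.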

Several commenters pushed back with legitimate points: electricity rates vary enormously by region (someone in Norway paying $0.04/kWh has very different math than someone in California at $0.30/kWh), and the comparison assumes you're using the LLM primarily for the kind of one-off queries where API pricing works well.
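Because electricity cost scales linearly with the rate, the article's per-token figure can be rescaled to other regions. The sketch below assumes the energy used per token stays the same everywhere and reuses the $0.01-0.02 range quoted earlier for $0.16/kWh; `rescale_cost` is a hypothetical helper.

```python
def rescale_cost(cost_at_base_rate, base_rate, new_rate):
    """Scale an electricity cost linearly from one $/kWh rate to another."""
    if base_rate <= 0:
        raise ValueError("base_rate must be positive")
    return cost_at_base_rate * new_rate / base_rate


BASE_RATE = 0.16                  # $/kWh, the US residential rate used above
LOCAL_COST_RANGE = (0.01, 0.02)   # $ per 1,000 tokens at BASE_RATE, from the article
REGIONAL_RATES = {"Norway": 0.04, "California": 0.30}  # $/kWh, from the comments

regional = {
    region: tuple(rescale_cost(c, BASE_RATE, rate) for c in LOCAL_COST_RANGE)
    for region, rate in REGIONAL_RATES.items()
}
for region, (low, high) in regional.items():
    print(f"{region}: ${low:.4f}-${high:.4f} per 1,000 tokens")
```

On these figures the Norwegian range overlaps the upper end of the quoted OpenRouter pricing, while the Californian range sits well above it, which is why the commenters' regional objection changes the conclusion for some readers and not others.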

### Where local still wins

The analysis doesn't mean local inference is irrational. It means the justification needs to be honest.

Privacy is the strongest argument. If you're feeding proprietary code, medical data, or legal documents into an LLM, the cost premium for local inference is a privacy budget, not a computing expense. No amount of cloud provider promises about data handling equals the certainty of never transmitting the data.

Latency matters for interactive development workflows. Local inference on Apple Silicon skips the network round trip, which can add hundreds of milliseconds to every cloud API call before the first token arrives. For coding assistants embedded in your editor, that difference is felt on every keystroke.

Availability is underrated. Local inference works on a plane, in a cabin, during an AWS outage. There's no rate limiting, no API deprecation notice, no sudden pricing change.

But "it's cheaper" isn't on that list.

## What this means for your stack

If you're building products that use LLM inference, this analysis should sharpen your build-vs-buy thinking. For any workload where you're making fewer than ~10,000 inference calls per day, cloud APIs are almost certainly cheaper than self-hosted inference — even when you ignore the engineering time to manage local deployment.

The break-even calculation shifts for high-volume, sustained workloads. If you're running batch processing jobs that keep the GPU saturated for hours — embedding generation, document classification pipelines, bulk summarization — the per-token cost advantage of cloud shrinks because your utilization approaches what data centers achieve. At that point, the comparison becomes cloud GPU rental (not API pricing) versus local hardware, and the math is model-specific.
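One way to frame the break-even point is as the daily token volume at which per-token savings cover a fixed daily hardware cost. This is a simplified sketch: it compares against API pricing rather than GPU rental, and every number in the usage lines is a hypothetical input, not a figure from the analysis.

```python
import math


def break_even_tokens_per_day(fixed_cost_per_day, local_cost_per_1k, api_price_per_1k):
    """Daily token volume above which local inference is cheaper than the API.

    Returns math.inf when local marginal cost is not below the API price,
    because no volume can then recover the fixed cost.
    """
    saving_per_1k = api_price_per_1k - local_cost_per_1k
    if saving_per_1k <= 0:
        return math.inf
    return fixed_cost_per_day / saving_per_1k * 1000


# Hypothetical: $2/day of hardware cost, local energy cheaper than the API.
favourable = break_even_tokens_per_day(2.0, local_cost_per_1k=0.001,
                                       api_price_per_1k=0.005)
# Hypothetical: local energy alone costs more than the API price.
unfavourable = break_even_tokens_per_day(2.0, local_cost_per_1k=0.01,
                                         api_price_per_1k=0.005)
print(f"favourable case:   {favourable:,.0f} tokens/day")
print(f"unfavourable case: {unfavourable}")
```

The second case mirrors the situation the article describes for a single user: when electricity per token already exceeds the API price, volume cannot rescue the economics, and only higher utilization that lowers the local cost per token can.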

For individual developers using local LLMs as coding assistants or writing aids, the honest framing is: you're paying a premium for privacy, offline access, and the satisfying feeling of running your own infrastructure. Those are real benefits. Just don't pretend the electricity is free.

Practical implications:

- Audit your actual usage patterns. If you're running Ollama for 20 queries a day, you'd save money with an API key.
- Consider hybrid approaches. Use local inference for sensitive data and cloud APIs for everything else.
- Factor in hardware depreciation when calculating local inference costs. That M4 Max won't last forever, especially under sustained thermal load.
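The hybrid approach can be sketched as a small routing function that decides where a prompt should go. The patterns and the `choose_backend` name are hypothetical, and pattern matching is only a crude stand-in for a real data-classification policy; the sketch makes no actual API calls.

```python
import re

# Hypothetical markers of sensitive content; a real policy would be broader.
SENSITIVE_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),               # US SSN-shaped numbers
    re.compile(r"(?i)\bconfidential\b"),                # explicit labels
]


def choose_backend(prompt: str, force_local: bool = False) -> str:
    """Return "local" for prompts that look sensitive, otherwise "cloud"."""
    if force_local or any(p.search(prompt) for p in SENSITIVE_PATTERNS):
        return "local"
    return "cloud"


print(choose_backend("Explain how a hash map handles collisions."))
print(choose_backend("Summarize this CONFIDENTIAL contract draft."))
```

A `force_local` override is included because automated detection will miss things, and the cost of a false negative (leaking data) is much higher than the cost of a false positive (paying more for a local answer).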

## Looking ahead

This analysis will age as both sides of the equation evolve. Apple's next-generation chips will improve perf-per-watt. Cloud providers will face pressure to maintain margins as competition intensifies. Open-weight models will get more efficient through better quantization and architecture improvements. But the fundamental economic logic — utilization determines unit economics — won't change. The developer who runs local inference for the right reasons (privacy, latency, availability) will always make a better decision than the one who runs it because they haven't done the math.

Hacker News 322 pts 275 comments

Apple Silicon costs more than OpenRouter

bastawhiz · Hacker News

This isn't a good analysis, and it's because it keeps rounding everything up. He rounds up the cost of electricity by 10%. He has a range of power use, takes the high end (which is 2x the low end) and multiplies it by the inflated electricity cost. But then they talk about using a newly pur

applfanboysbgon · Hacker News

Unless I'm misunderstanding, this is counting the entire laptop in the cost of generating tokens. The calculation seems to omit that, in addition to receiving LLM output, you have also received a laptop in exchange for your money. If you intend to put this machine in a dark corner and run it so

dijit · Hacker News

Frontier AI companies are selling at a loss. Excusing everything else that u/bastawhiz said[0]; the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through 100's of billions of dollars and selling it back to you for pennies on the dollar in the hopes that

sleepyeldrazi · Hacker News

If you want a good dense model, use qwen3.6 27B instead, speed will be up, and if you don't take my word for it being smarter, take openrouter's prices of it against the bigger, slower and less memory-efficient gemma do the talking. If you want a faster model, go for qwen3.6 35B (or gemma 4

konaraddi · Hacker News

A lot of comments here are about the issues with the analysis in OP’s post but much of them are “a distinction without a difference” with respect to the broader conclusion. When we look at purely cost and performance (setting aside privacy) then it’s better for individual devs to pay for hosted then
