Your M4 Max Burns More Money on LLMs Than OpenRouter Does

What happened

William Angel published a detailed analysis measuring the actual electricity cost of running open-weight LLMs on Apple Silicon hardware versus paying per-token through OpenRouter's cloud API. The methodology was straightforward: measure wall power draw during inference using a hardware power meter, calculate the cost per token at real electricity rates, and compare that against OpenRouter's published pricing for equivalent models.

The headline result is counterintuitive: for most model sizes and workloads, your MacBook's electricity bill exceeds what you'd pay OpenRouter to run the same inference in the cloud. The analysis tested multiple quantized models across the range that Apple Silicon users typically run — from 7B parameter models up through the larger models that push the M4 Max's unified memory to its limits.

The post hit 305 points on Hacker News, sparking the kind of vigorous debate that only happens when data challenges a community's foundational assumptions.

Why it matters

### The economics of utilization

The core insight isn't that Apple Silicon is power-hungry — it's actually remarkably efficient per watt. The issue is utilization. When you run a local LLM, you're paying for 100% of the silicon 100% of the time the model is loaded, whether it's generating tokens or waiting for you to read the output. Cloud providers spread their GPU costs across thousands of concurrent users. A data center GPU running at 80% utilization amortizes its energy cost across hundreds of simultaneous requests; your M4 Max serves an audience of one.
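As a rough sketch of the utilization argument, the function below averages power and throughput over a session in which the machine only generates tokens part of the time. The function name, its parameters, and every number in the usage lines are illustrative assumptions, not measurements from the analysis.

```python
def energy_per_token_joules(active_watts, idle_watts, tokens_per_second, duty_cycle):
    """Average joules charged to each generated token over a session.

    duty_cycle is the fraction of wall-clock time spent generating,
    in the range (0, 1]. All inputs are hypothetical placeholders.
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Average power over the session, weighted by active vs. idle time.
    avg_watts = active_watts * duty_cycle + idle_watts * (1 - duty_cycle)
    # Tokens produced per wall-clock second, averaged over the session.
    avg_tokens_per_second = tokens_per_second * duty_cycle
    return avg_watts / avg_tokens_per_second


# Placeholder inputs: same machine, busy all the time vs. busy 10% of the time.
always_busy = energy_per_token_joules(40, 10, 10, duty_cycle=1.0)
mostly_idle = energy_per_token_joules(40, 10, 10, duty_cycle=0.1)
print(always_busy, mostly_idle)
```

Under these placeholder inputs the mostly idle session charges more energy to each token, because the idle draw is still being paid for while nothing is generated. A shared server that stays busy avoids most of that overhead.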

This is the same economic logic that made cloud computing win in the first place — and it applies to inference just as much as it applied to web servers in 2008.

### The watt-hour reality

Apple's M-series chips draw roughly 15 to 45 W during active inference, depending on model size and quantization. That sounds modest compared to an NVIDIA H100 pulling 700 W, but the H100 is serving orders of magnitude more tokens per second. Divide watts by tokens per second to get joules per token, and the efficiency gap is stark.

At typical US residential electricity rates (~$0.16/kWh), running a quantized 70B model on an M4 Max for sustained inference costs roughly $0.01-0.02 per 1,000 tokens in electricity alone. OpenRouter often prices equivalent models at $0.001-0.005 per 1,000 tokens. Taking those ranges at face value, the local option costs 2-20x more in pure energy, before you even factor in amortizing $3,000+ of hardware.
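The per-token electricity figure comes from a short chain of unit conversions: watts divided by tokens per second gives joules per token, and joules convert to kilowatt-hours at 3.6 million joules per kWh. A minimal sketch follows, assuming you supply your own measured wall power and throughput. The function name and the example inputs are hypothetical and are not taken from the analysis.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def electricity_cost_per_1k_tokens(watts, tokens_per_second, usd_per_kwh):
    """Electricity cost in USD of generating 1,000 tokens.

    watts: sustained wall power during generation (measure this yourself).
    tokens_per_second: sustained generation throughput.
    usd_per_kwh: your electricity rate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    joules_per_token = watts / tokens_per_second
    kwh_per_1k_tokens = joules_per_token * 1000 / JOULES_PER_KWH
    return kwh_per_1k_tokens * usd_per_kwh


# Placeholder inputs only; substitute your own measurements and rate.
for rate in (0.04, 0.16, 0.30):
    cost = electricity_cost_per_1k_tokens(watts=45, tokens_per_second=8, usd_per_kwh=rate)
    print(f"${rate:.2f}/kWh -> ${cost:.6f} per 1,000 tokens")
```

The result scales linearly with both wall power and electricity rate and inversely with throughput, so it is worth measuring all three yourself before comparing against API pricing.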

The math gets even worse for smaller models. A 7B model on Apple Silicon doesn't draw proportionally less power — the chip's base power draw creates a floor. Meanwhile, cloud providers run small models on optimized infrastructure where a single GPU handles dozens of concurrent requests.

### What the community got wrong

The Hacker News discussion revealed a common reasoning error in the local-LLM community. Many developers think of local inference as "free" because they already own the hardware and electricity feels invisible. This is the same hidden-cost blind spot that makes people drive 30 minutes to save $5 on gas.

The "local is free" mental model collapses under any rigorous accounting. Your MacBook has a finite lifespan measured in charge cycles and thermal wear. Every hour of sustained inference workload shortens that lifespan. If you're running inference 8 hours a day, the hardware depreciation alone — spread across the machine's useful life — likely exceeds cloud API costs.
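One way to put a number on depreciation is straight-line: spread the purchase price, minus expected resale value, evenly over the hours the machine spends on inference. The sketch below assumes all depreciation is attributed to inference, which overstates the cost if the laptop is also your everyday machine. The function name, parameters, and dollar figures are hypothetical.

```python
def depreciation_usd_per_inference_hour(purchase_usd, resale_usd,
                                        useful_life_years, inference_hours_per_day):
    """Straight-line depreciation allocated entirely to inference hours."""
    if useful_life_years <= 0 or inference_hours_per_day <= 0:
        raise ValueError("useful life and daily hours must be positive")
    total_inference_hours = useful_life_years * 365 * inference_hours_per_day
    return (purchase_usd - resale_usd) / total_inference_hours


# Placeholder figures: a $3,000 machine, $1,000 resale, 4 years, 8 hours a day.
hourly = depreciation_usd_per_inference_hour(3000, 1000, 4, 8)
print(f"${hourly:.3f} of depreciation per inference hour")
```

Dividing the hourly figure by the number of tokens generated in an hour gives a per-token depreciation charge that can be added to the electricity cost.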

Several commenters pushed back with legitimate points: electricity rates vary enormously by region (someone in Norway paying $0.04/kWh has very different math than someone in California at $0.30/kWh), and the comparison assumes you're using the LLM primarily for the kind of one-off queries where API pricing works well.

### Where local still wins

The analysis doesn't mean local inference is irrational. It means the justification needs to be honest.

Privacy is the strongest argument. If you're feeding proprietary code, medical data, or legal documents into an LLM, the cost premium for local inference is a privacy budget, not a computing expense. No amount of cloud provider promises about data handling equals the certainty of never transmitting the data.

Latency matters for interactive development workflows. Local inference on Apple Silicon can deliver first-token latency in milliseconds for short prompts, with no network round trip, versus hundreds of milliseconds for a call to a cloud API. For coding assistants embedded in your editor, that difference is felt on every keystroke.

Availability is underrated. Local inference works on a plane, in a cabin, during an AWS outage. There's no rate limiting, no API deprecation notice, no sudden pricing change.

But "it's cheaper" isn't on that list.

What this means for your stack

If you're building products that use LLM inference, this analysis should sharpen your build-vs-buy thinking. For any workload where you're making fewer than ~10,000 inference calls per day, cloud APIs are almost certainly cheaper than self-hosted inference — even when you ignore the engineering time to manage local deployment.

The break-even calculation shifts for high-volume, sustained workloads. If you're running batch processing jobs that keep the GPU saturated for hours — embedding generation, document classification pipelines, bulk summarization — the per-token cost advantage of cloud shrinks because your utilization approaches what data centers achieve. At that point, the comparison becomes cloud GPU rental (not API pricing) versus local hardware, and the math is model-specific.
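The break-even point can be sketched as a fixed-versus-marginal cost comparison: local hardware carries a daily fixed cost (depreciation, idle power) plus a small per-call cost, while a cloud API charges only per call. The function below is a minimal illustration under that simplified model. Its name, parameters, and the example numbers are assumptions.

```python
import math


def break_even_calls_per_day(local_fixed_usd_per_day, local_usd_per_call,
                             cloud_usd_per_call):
    """Daily call volume above which local inference becomes cheaper.

    Returns math.inf when the local per-call cost is not below the
    cloud per-call price, because volume can then never close the gap.
    """
    saving_per_call = cloud_usd_per_call - local_usd_per_call
    if saving_per_call <= 0:
        return math.inf
    return local_fixed_usd_per_day / saving_per_call


# Placeholder inputs: $2/day fixed, $0.0005 local vs $0.0010 cloud per call.
print(break_even_calls_per_day(2.0, 0.0005, 0.0010))
```

If the local per-call electricity cost already exceeds the API price, the function returns infinity: no amount of volume makes local cheaper under this model, and only a lower electricity rate or higher throughput changes the answer.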

For individual developers using local LLMs as coding assistants or writing aids, the honest framing is: you're paying a premium for privacy, offline access, and the satisfying feeling of running your own infrastructure. Those are real benefits. Just don't pretend the electricity is free.

Practical implications:

- Audit your actual usage patterns. If you're running Ollama for 20 queries a day, you'd save money with an API key.
- Consider hybrid approaches. Use local inference for sensitive data and cloud APIs for everything else.
- Factor in hardware depreciation when calculating local inference costs. That M4 Max won't last forever, especially under sustained thermal load.
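The hybrid approach can be as simple as a routing rule applied before each request. The sketch below is one possible shape for it. The `Request` type, its fields, and the `choose_backend` function are hypothetical names for illustration, and how a request gets flagged as sensitive is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    contains_sensitive_data: bool  # set by the caller's own policy
    offline: bool = False          # True when no network is available


def choose_backend(request: Request) -> str:
    """Route sensitive or offline requests locally, everything else to the cloud."""
    if request.contains_sensitive_data or request.offline:
        return "local"
    return "cloud"


print(choose_backend(Request("Summarize this contract", contains_sensitive_data=True)))
print(choose_backend(Request("Explain Python decorators", contains_sensitive_data=False)))
```

Keeping the rule in one function makes the cost trade-off explicit: the local premium is paid only on requests where privacy or availability actually requires it.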

Looking ahead

This analysis will age as both sides of the equation evolve. Apple's next-generation chips will improve perf-per-watt. Cloud providers will face pressure to maintain margins as competition intensifies. Open-weight models will get more efficient through better quantization and architecture improvements. But the fundamental economic logic — utilization determines unit economics — won't change. The developer who runs local inference for the right reasons (privacy, latency, availability) will always make a better decision than the one who runs it because they haven't done the math.

// read from source

Apple Silicon costs more than OpenRouter
