Someone Just Ran a 400B LLM on an iPhone 17 Pro

2 min read · 1 source · explainer

A developer behind the ANEMLL project — an open-source framework for running large language models on Apple's Neural Engine — has demonstrated a 400 billion parameter model running locally on an iPhone 17 Pro. The demo, posted on Twitter and picked up on Hacker News (125 points), is the kind of thing that sounds like a benchmark stunt until you think through the implications.

First, the reality check. Running a 400B model on a phone does not mean running it well. At that scale on mobile hardware, you're looking at aggressive quantization (likely 2-4 bit), significant latency per token, and the kind of throughput that makes a conversation feel like talking to someone on a satellite phone. The Apple Neural Engine in the A-series chips is genuinely impressive silicon for on-device ML — 35+ TOPS on recent generations — but 400B parameters is an enormous amount of state to move through a mobile memory bus, even with the iPhone 17 Pro's expected 12GB of unified memory.
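The weight math alone shows why this is hard. Here is a back-of-envelope sketch (my own illustrative numbers, ignoring activation memory, KV cache, and runtime overhead) of how large 400B parameters are at various quantization widths:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a dense model."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Footprint of a dense 400B model at common quantization widths.
for bits in (16, 8, 4, 2):
    print(f"400B @ {bits}-bit: {model_size_gb(400, bits):.0f} GB")
```

Even at an aggressive 2 bits per weight, a dense 400B model is on the order of 100 GB, far beyond 12GB of unified memory. Whatever the demo is doing, the weights cannot all be resident in RAM at once, which is why techniques like streaming from flash storage come up in the discussion below.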

So why does this matter?

Because the direction is the story, not the snapshot. Two years ago, running a 7B model on a phone was the demo that got attention. Last year it was 70B. Now it's 400B. Quantization formats and techniques (GGUF, AWQ) and Apple's own Core ML optimizations keep improving. The hardware keeps shipping more Neural Engine cores and memory bandwidth. The curve is steep.

For practitioners, there are three things worth tracking here:

1. Apple Neural Engine as an inference target is maturing. ANEMLL and similar projects (llama.cpp's Metal backend, MLX on macOS) are proving that ANE isn't just for Core ML vision tasks. It's becoming a viable path for generative AI inference, and Apple clearly wants this — their MLX framework investments signal where they're headed.

2. On-device inference changes the privacy calculus. A 400B model running locally — even slowly — means sensitive queries never leave the device. For enterprise mobile apps handling medical, legal, or financial data, 'slow but private' beats 'fast but cloud-dependent' in specific use cases.

3. The gap between 'runs' and 'useful' is closing faster than expected. If the iPhone 18 Pro ships with 16GB RAM and a more capable Neural Engine, running quantized 70-100B models at conversational speed on-device becomes plausible. That's the real benchmark to watch — not the headline number.
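A quick sanity check on that 70-100B projection, using a hypothetical 16 GB memory budget (the RAM figure is the article's speculation; the script ignores OS overhead, KV cache, and activations):

```python
def weights_gb(params_b: float, bits: float) -> float:
    """Approximate weight footprint in GB for a dense model."""
    return params_b * 1e9 * bits / 8 / 1e9

BUDGET_GB = 16  # speculative iPhone 18 Pro RAM, per the article

for params in (70, 100):
    for bits in (4, 3, 2):
        size = weights_gb(params, bits)
        verdict = "fits" if size <= BUDGET_GB else "needs streaming or MoE"
        print(f"{params}B @ {bits}-bit: {size:.1f} GB -> {verdict}")
```

The arithmetic is sobering: a dense 70B model at 4-bit is still ~35 GB, so "conversational speed on-device" at that scale likely depends on mixture-of-experts architectures (where only a fraction of parameters are active per token) or clever weight streaming, not raw RAM alone.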

The demo is impressive as engineering. Whether it's practical today is beside the point. The trajectory says that within 1-2 hardware generations, your phone will run models that required a data center GPU three years ago. Plan accordingly.

Hacker News · 690 pts · 317 comments

iPhone 17 Pro Demonstrated Running a 400B LLM

→ read on Hacker News
firstbabylonian · Hacker News

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

[1]: https://arxiv.org/abs/2312.11514

CrzyLngPwd · Hacker News

I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.

yencabulator · Hacker News

Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.

Quantizing is also a cheat code that makes the numbers lie; next up someone is going to claim running a large model when they're running a 1-bit quantization of it.

andix · Hacker News

My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.

PinkMilkshake · Hacker News

"That is a profound observation, and you are absolutely right..."

With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
