Someone Just Ran a 400B LLM on an iPhone 17 Pro

2 min read · 1 source · explainer

A developer behind the ANEMLL project — an open-source framework for running large language models on Apple's Neural Engine — has demonstrated a 400 billion parameter model running locally on an iPhone 17 Pro. The demo, posted on Twitter and picked up on Hacker News (125 points), is the kind of thing that sounds like a benchmark stunt until you think through the implications.

First, the reality check. Running a 400B model on a phone does not mean running it well. At that scale on mobile hardware, you're looking at aggressive quantization (likely 2-4 bit), significant latency per token, and the kind of throughput that makes a conversation feel like talking to someone on a satellite phone. The Apple Neural Engine in the A-series chips is genuinely impressive silicon for on-device ML — 35+ TOPS on recent generations — but 400B parameters is an enormous amount of state to move through a mobile memory bus, even with the iPhone 17 Pro's expected 12GB of unified memory.
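The weight math alone shows why this is hard. Here is a back-of-envelope sketch (my own illustrative numbers, ignoring activation memory, KV cache, and runtime overhead) of how large 400B parameters are at various quantization widths:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a dense model."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Footprint of a dense 400B model at common quantization widths.
for bits in (16, 8, 4, 2):
    print(f"400B @ {bits}-bit: {model_size_gb(400, bits):.0f} GB")
```

Even at an aggressive 2 bits per weight, a dense 400B model is on the order of 100 GB, far beyond 12GB of unified memory. Whatever the demo is doing, the weights cannot all be resident in RAM at once, which is why techniques like streaming from flash storage come up in the discussion below.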

So why does this matter?

Because the direction is the story, not the snapshot. Two years ago, running a 7B model on a phone was the demo that got attention. Last year it was 70B. Now it's 400B. Quantization formats and techniques (GGUF, AWQ) and Apple's own Core ML optimizations keep improving. The hardware keeps shipping more Neural Engine cores and memory bandwidth. The curve is steep.

For practitioners, there are three things worth tracking here:

1. Apple Neural Engine as an inference target is maturing. ANEMLL and similar projects (llama.cpp's Metal backend, MLX on macOS) are proving that ANE isn't just for Core ML vision tasks. It's becoming a viable path for generative AI inference, and Apple clearly wants this — their MLX framework investments signal where they're headed.

2. On-device inference changes the privacy calculus. A 400B model running locally — even slowly — means sensitive queries never leave the device. For enterprise mobile apps handling medical, legal, or financial data, 'slow but private' beats 'fast but cloud-dependent' in specific use cases.

3. The gap between 'runs' and 'useful' is closing faster than expected. If the iPhone 18 Pro ships with 16GB RAM and a more capable Neural Engine, running quantized 70-100B models at conversational speed on-device becomes plausible. That's the real benchmark to watch — not the headline number.
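A quick sanity check on that 70-100B projection, using a hypothetical 16 GB memory budget (the RAM figure is the article's speculation; the script ignores OS overhead, KV cache, and activations):

```python
def weights_gb(params_b: float, bits: float) -> float:
    """Approximate weight footprint in GB for a dense model."""
    return params_b * 1e9 * bits / 8 / 1e9

BUDGET_GB = 16  # speculative iPhone 18 Pro RAM, per the article

for params in (70, 100):
    for bits in (4, 3, 2):
        size = weights_gb(params, bits)
        verdict = "fits" if size <= BUDGET_GB else "needs streaming or MoE"
        print(f"{params}B @ {bits}-bit: {size:.1f} GB -> {verdict}")
```

The arithmetic is sobering: a dense 70B model at 4-bit is still ~35 GB, so "conversational speed on-device" at that scale likely depends on mixture-of-experts architectures (where only a fraction of parameters are active per token) or clever weight streaming, not raw RAM alone.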

The demo is impressive as engineering. Whether it's practical today is beside the point. The trajectory says that within 1-2 hardware generations, your phone will run models that required a data center GPU three years ago. Plan accordingly.

Hacker News · 690 pts · 317 comments

iPhone 17 Pro Demonstrated Running a 400B LLM

→ read on Hacker News
firstbabylonian · Hacker News

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

[1]: https://arxiv.org/abs/2312.11514

CrzyLngPwd · Hacker News

I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.

yencabulator · Hacker News

Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.

Quantizing is also a cheat code that makes the numbers lie; next up someone is going to claim running a large model when they're running a 1-bit quantization of it.

andix · Hacker News

My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.

PinkMilkshake · Hacker News

"That is a profound observation, and you are absolutely right..."

With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
