A 26M-Parameter Model That Does Tool Calling at 1200 tok/s on Your Phone

├── "Tool calling is fundamentally retrieval-and-assembly, not reasoning — massive models are overkill for it"
│  └── HenryNdubuaku (Hacker News, 346 pts)

Henry from Cactus Compute argues that agentic experiences are built upon tool calling, and their investigations showed it's fundamentally a retrieval-and-assembly task. A 26M parameter model can replicate the structured output patterns of much larger models — selecting functions and mapping arguments — without needing chain-of-thought reasoning or world knowledge.

├── "Edge AI for agentic tasks is underserved — capable models should run on budget phones, not just datacenter GPUs"
│  └── HenryNdubuaku (Hacker News, 346 pts)

Henry expresses frustration at the lack of effort toward building agentic models for budget phones. Needle's 6,000 tok/s prefill and 1,200 tok/s decode on consumer devices demonstrate that meaningful AI capabilities can run on hardware already in people's pockets, eliminating the latency, cost, and network dependency imposed by API calls to remote infrastructure.

└── "Distillation from frontier models is a viable path to building tiny, task-specific models"
  └── Cactus Compute (Hacker News)

Cactus Compute distilled Gemini's tool-calling behavior into a model roughly 100-1,000x smaller, using Gemini's outputs as training signal. This demonstrates that frontier model capabilities can be compressed into task-specific architectures when the target behavior is well-defined and constrained, like structured function calling.

What happened

Cactus Compute, a startup focused on edge AI, open-sourced Needle — a 26-million-parameter model purpose-built for function calling (tool use). The model is available on GitHub under the cactus-compute/needle repository and hit the front page of Hacker News with 346 points, a strong signal that the developer community sees something worth paying attention to.

The numbers are striking. Needle runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on consumer devices — the kind of hardware sitting in budget Android phones, not datacenter GPUs. For context, most production tool-calling workflows today route through models 100-1,000x larger: GPT-4, Claude, or Gemini, all of which require API calls to remote infrastructure, adding latency, cost, and a hard dependency on network connectivity.
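As a rough back-of-the-envelope check, the quoted throughput figures can be turned into a per-call latency estimate. The token counts below (a 300-token prompt holding the tool schemas and the query, and a 30-token generated call) are illustrative assumptions, not measurements from the announcement, and the estimate ignores model load time and tokenization overhead.

```python
# Throughput figures quoted in the announcement.
PREFILL_TOK_PER_S = 6_000
DECODE_TOK_PER_S = 1_200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate latency in milliseconds for one tool call.

    Prefill processes the prompt; decode generates the call token by token.
    Model load time, tokenization, and scheduling overhead are ignored.
    """
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return (prefill_s + decode_s) * 1000.0


# Hypothetical workload: 300 prompt tokens, 30 generated tokens.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency:.0f} ms")
```

Under these assumed token counts the estimate comes to roughly 75 ms (50 ms of prefill plus 25 ms of decode), with no network round trip added on top.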

The model was distilled from Gemini's tool-calling behavior, meaning Cactus used Gemini's outputs as training signal to teach a dramatically smaller architecture to replicate the same structured output patterns — selecting the right function, mapping arguments correctly, and formatting the call.
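To make "using Gemini's outputs as training signal" concrete, here is a minimal sketch of how teacher outputs might be collected into training records. This is not Cactus Compute's pipeline, which the post does not describe in detail; the `stub_teacher` function, the record fields, and the JSON call shape are all hypothetical, and a real teacher would wrap a frontier-model API.

```python
import json
from typing import Callable

# A teacher is any callable mapping (query, tools) to a tool-call dict.
# Hypothetical interface: a real one would call a frontier-model API.
Teacher = Callable[[str, list], dict]


def stub_teacher(query: str, tools: list) -> dict:
    """Hypothetical stand-in for a frontier model's tool-calling output."""
    return {"name": "get_weather", "arguments": {"city": "Tokyo"}}


def build_distillation_records(queries: list, tools: list, teacher: Teacher) -> list:
    """Return JSONL lines pairing each input with the teacher's tool call."""
    records = []
    for query in queries:
        target = teacher(query, tools)  # the teacher's output is the label
        records.append(json.dumps({"query": query, "tools": tools, "target": target}))
    return records


tools = [{"name": "get_weather", "parameters": {"city": "string"}}]
lines = build_distillation_records(["what's the weather in Tokyo?"], tools, stub_teacher)
print(lines[0])
```

The small student model would then be trained to reproduce `target` given `query` and `tools`; that training step is omitted here.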

Why it matters

The intellectual core of Needle isn't the model itself — it's the claim underneath it. Henry from Cactus Compute put it directly: tool calling is fundamentally retrieval-and-assembly, and massive models are overkill for it. This is a falsifiable architectural thesis, not marketing, and it deserves serious scrutiny.

Think about what happens when a model "calls a tool." It receives a user intent ("what's the weather in Tokyo?"), matches it against a set of available function signatures, selects the right one (`get_weather`), and populates the arguments (`{"city": "Tokyo"}`). This is pattern matching and structured output generation. It does not require chain-of-thought reasoning, world knowledge, or the kind of emergent capabilities that justify 70B+ parameter models. It's closer to what a well-trained classifier with a structured decoder does.
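The structured output in that example can be sketched as a small validation step on the receiving side. The JSON shape (`name` plus `arguments`) and the `TOOLS` registry below are assumptions for illustration; the post does not specify Needle's actual output format.

```python
import json

# Hypothetical tool registry: function name -> required argument names.
TOOLS = {
    "get_weather": {"city"},
    "set_alarm": {"time"},
}


def parse_tool_call(raw: str) -> tuple:
    """Parse a JSON tool call emitted by a model and validate it.

    Checks that the selected function exists and that every required
    argument has been populated.
    """
    call = json.loads(raw)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = TOOLS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return name, args


name, args = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Tokyo"}}')
print(name, args)
```

Everything the model has to get right here is visible in the registry it was given: pick one of the listed names and fill the listed argument slots from the user's text.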

The counter-argument is real, though. Simple tool calls are easy. But agentic workflows increasingly involve multi-step planning: call tool A, interpret the result, decide whether to call tool B or tool C, handle errors, and compose a final response. A 26M model can almost certainly handle `get_weather(city="Tokyo")`. Whether it can handle a five-step workflow where step three depends on an ambiguous result from step two is a different question entirely. The Hacker News discussion predictably split along this line — practitioners building simple integrations were excited, while those building complex agent chains were skeptical.

The real insight is that tool calling probably shouldn't be a monolithic capability inside a single model. The industry has been treating it that way because frontier models happen to be good at it, and the API interface makes it convenient. But there's no architectural reason your reasoning model and your function-routing model need to be the same model. Needle is an existence proof that they can be separated — and that the function-routing half can be absurdly small.

This mirrors a pattern we've seen before in systems design: the decomposition of monolithic capabilities into specialized components. Just as microservices separated concerns that monoliths bundled together, we may be entering an era where "the AI" decomposes into a reasoning model, a tool-routing model, a retrieval model, and a generation model — each sized appropriately for its actual task.

What this means for your stack

If you're building agentic applications today, Needle is worth benchmarking against your current tool-calling setup, even if you don't ship it. The exercise of measuring where a 26M model succeeds and fails on your specific function schemas will teach you something about your own tool-calling complexity. If most of your calls (say, 80%) are simple single-function dispatches, you may be routing them through a model 100x larger than they need.
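One way to run that benchmark is to score predicted calls against hand-labeled expected calls, broken down by function, so the failures point at specific schemas. The sketch below assumes calls are represented as `{"name": ..., "arguments": ...}` dicts and uses strict exact matching; both are illustrative choices, and the sample data is made up.

```python
from collections import defaultdict


def per_tool_accuracy(predicted: list, expected: list) -> dict:
    """Exact-match accuracy per expected function name.

    A prediction counts as correct only if both the function name and
    all arguments equal the expected call.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pred, gold in zip(predicted, expected):
        totals[gold["name"]] += 1
        hits[gold["name"]] += int(pred == gold)
    return {name: hits[name] / totals[name] for name in totals}


# Made-up labels and predictions for illustration.
expected = [
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "book_flight", "arguments": {"origin": "SFO", "destination": "NRT"}},
]
predicted = [
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "book_flight", "arguments": {"origin": "NRT", "destination": "SFO"}},
]
accuracy = per_tool_accuracy(predicted, expected)
print(accuracy)
```

In this made-up sample the single-argument function scores perfectly while the two-argument one fails on swapped arguments, which is the kind of per-schema signal the exercise is meant to surface.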

The on-device angle is where this gets practically interesting. A model that runs at 1,200 tok/s on a phone opens a design space that API-dependent tool calling cannot: offline-capable agents, function routing with no network latency, and, critically, tool calling without sending user intent data to a remote server. For mobile developers building assistant features, voice interfaces, or IoT control layers, this removes the network round-trip from the critical path.

The distillation approach also matters for teams with proprietary tool schemas. If Cactus can distill Gemini's general tool-calling capability into 26M parameters, a team could potentially fine-tune a similarly small model on their specific API surface: 50 internal endpoints instead of the open-ended function space Gemini handles. Smaller function space, smaller model, faster inference. The economics of running this at the edge become compelling quickly: no API costs, no rate limits, and no server-side cold starts.

There's a practical caveat: 26M parameters means this model has essentially no world knowledge. It cannot interpret ambiguous user requests that require understanding context beyond the function signatures provided. You'll need a larger model upstream to handle intent disambiguation, and Needle (or something like it) downstream to handle the structured dispatch. This is a router, not a replacement for your reasoning model.
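The upstream/downstream split described above can be sketched as a two-stage pipeline. Both model functions below are hypothetical stubs with hard-coded behavior; in a real system `disambiguate` would call a larger reasoning model and `route` would call a small function-calling model such as Needle.

```python
from typing import Callable

ToolCall = dict


def handle_request(
    request: str,
    disambiguate: Callable[[str], str],
    route: Callable[[str], ToolCall],
) -> ToolCall:
    """Resolve ambiguity upstream, then dispatch downstream."""
    explicit_intent = disambiguate(request)  # larger model: needs context
    return route(explicit_intent)  # small model: structured dispatch only


def stub_large_model(request: str) -> str:
    """Hypothetical stub: rewrites a vague request as an explicit intent."""
    return "what's the weather in Tokyo?"


def stub_small_router(intent: str) -> ToolCall:
    """Hypothetical stub: maps an explicit intent to a tool call."""
    city = intent.rstrip("?").split(" in ")[-1]
    return {"name": "get_weather", "arguments": {"city": city}}


call = handle_request(
    "is it nice out where my sister lives?",
    stub_large_model,
    stub_small_router,
)
print(call)
```

The design point is that only the first stage needs knowledge beyond the function signatures; once the intent is explicit, the second stage is the retrieval-and-assembly task the small model targets.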

Looking ahead

Needle is one data point in what's becoming a clear trend: the unbundling of capabilities that frontier models packaged together. We've already seen this with retrieval (RAG moved knowledge out of model weights), code generation (specialized coding models compete with general ones on coding tasks), and now tool calling. The question isn't whether specialized small models can handle structured tasks; for simple dispatch, Needle suggests they can. The question is whether the tooling and orchestration layers will mature fast enough to make multi-model architectures practical for teams that aren't Google-sized. Right now, running one big model is operationally simpler than orchestrating three small ones. That's an engineering problem, not a fundamental one, and engineering problems get solved.

Hacker News 737 pts 207 comments

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were

nl · Hacker News

Do you have any examples or data on the discriminatory power of the model for tool use? The examples are things like "What is the weather in San Francisco", where you are only passed a tool like tools='[{"name":"get_weather","parameters":{"location

ilaksh · Hacker News

Hmm.. this might make it feasible to build something like a command line program where you can optionally just specify the arguments in natural language. Although I know people will object to including an extra 14 MB and the computation for "parsing" and it could be pretty bad if everyone

varenc · Hacker News

Are you worried about Google's response to this? Google reportedly reacts to distillation attempts "with real-time proactive defenses that can degrade student model performance". So if they detected you, they could have intentionally fed you a dumber but plausible variant of Gemini: h

simonw · Hacker News

Suggestion: publish a live demo of the "needle playground". It's small enough that it should be pretty cheap to run this on a little VPS somewhere!

kgeist · Hacker News

> Experiments at Cactus showed that MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source.

Heh, what a coincidence, just today one of my students presented research results which also confirmed this. He removed MLP from Qwen and the model
