A 26M-Parameter Model That Does Tool Calling at 1200 tok/s on Your Phone

What happened

Cactus Compute, a startup focused on edge AI, open-sourced Needle — a 26-million-parameter model purpose-built for function calling (tool use). The model is available on GitHub under the cactus-compute/needle repository and hit the front page of Hacker News with 346 points, a strong signal that the developer community sees something worth paying attention to.

The numbers are striking. Needle runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on consumer devices: the kind of hardware sitting in budget Android phones, not datacenter GPUs. For context, most production tool-calling workflows today route through models that are orders of magnitude larger, such as GPT-4, Claude, or Gemini. All of these require API calls to remote infrastructure, which adds latency, cost, and a hard dependency on network connectivity.
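
To put those throughput figures in perspective, here is a back-of-the-envelope latency estimate. The two throughput constants come from the figures quoted above; the prompt and output token counts are illustrative assumptions, not measurements of Needle.

```python
# Rough per-call latency from the quoted throughput figures.
PREFILL_TOK_PER_S = 6_000  # prefill speed quoted above
DECODE_TOK_PER_S = 1_200   # decode speed quoted above


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate on-device latency in milliseconds for a single tool call."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return (prefill_s + decode_s) * 1000.0


# Assumed sizes: ~600 tokens of tool schemas plus the user request,
# and ~30 tokens for the emitted function call.
latency = estimate_latency_ms(prompt_tokens=600, output_tokens=30)
print(f"{latency:.0f} ms")  # prints "125 ms"
```

Under those assumptions the whole call finishes in roughly an eighth of a second, with no network round-trip involved.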

The model was distilled from Gemini's tool-calling behavior. In other words, Cactus used Gemini's outputs as a training signal to teach a dramatically smaller architecture to replicate the same structured output patterns: selecting the right function, mapping arguments correctly, and formatting the call.

Why it matters

The intellectual core of Needle isn't the model itself — it's the claim underneath it. Henry from Cactus Compute put it directly: tool calling is fundamentally retrieval-and-assembly, and massive models are overkill for it. This is a falsifiable architectural thesis, not marketing, and it deserves serious scrutiny.

Think about what happens when a model "calls a tool." It receives a user intent ("what's the weather in Tokyo?"), matches it against a set of available function signatures, selects the right one (`get_weather`), and populates the arguments (`{"city": "Tokyo"}`). This is pattern matching and structured output generation. It does not require chain-of-thought reasoning, world knowledge, or the kind of emergent capabilities that justify 70B+ parameter models. It's closer to what a well-trained classifier with a structured decoder does.
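
The host-side half of that loop can be sketched in a few lines. This is a minimal illustration, not Needle's actual interface: the JSON shape (`name` plus `arguments`), the `TOOLS` registry, and the stubbed `get_weather` handler are all assumptions made for the example.

```python
import json
from typing import Any


def get_weather(city: str) -> str:
    """Hypothetical tool handler; a real one would call a weather service."""
    return f"(stub) weather for {city}"


# Hypothetical registry: tool name -> handler and required argument types.
TOOLS: dict[str, dict[str, Any]] = {
    "get_weather": {"handler": get_weather, "required": {"city": str}},
}


def dispatch(raw_call: str) -> Any:
    """Parse, validate, and execute one JSON tool call emitted by a model."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    spec = TOOLS[name]
    for arg, expected_type in spec["required"].items():
        if arg not in args:
            raise ValueError(f"missing argument {arg!r} for {name}")
        if not isinstance(args[arg], expected_type):
            raise TypeError(f"argument {arg!r} must be {expected_type.__name__}")
    return spec["handler"](**args)


# Assumed model output for the weather example above.
model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
print(dispatch(model_output))
```

The model's only job is to emit the short JSON string; everything else is ordinary validation and dispatch code, which is part of why the task looks tractable for a small model.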

The counter-argument is real, though. Simple tool calls are easy. But agentic workflows increasingly involve multi-step planning: call tool A, interpret the result, decide whether to call tool B or tool C, handle errors, and compose a final response. A 26M model can almost certainly handle `get_weather(city="Tokyo")`. Whether it can handle a five-step workflow where step three depends on an ambiguous result from step two is a different question entirely. The Hacker News discussion predictably split along this line — practitioners building simple integrations were excited, while those building complex agent chains were skeptical.

The real insight is that tool calling probably shouldn't be a monolithic capability inside a single model. The industry has been treating it that way because frontier models happen to be good at it, and the API interface makes it convenient. But there's no architectural reason your reasoning model and your function-routing model need to be the same model. Needle is an existence proof that they can be separated — and that the function-routing half can be absurdly small.

This mirrors a pattern we've seen before in systems design: the decomposition of monolithic capabilities into specialized components. Just as microservices separated concerns that monoliths bundled together, we may be entering an era where "the AI" decomposes into a reasoning model, a tool-routing model, a retrieval model, and a generation model — each sized appropriately for its actual task.

What this means for your stack

If you're building agentic applications today, Needle is worth benchmarking against your current tool-calling setup, even if you don't ship it. The exercise of measuring where a 26M model succeeds and fails on your specific function schemas will teach you something about your own tool-calling complexity. If 80% of your calls are simple single-function dispatches, you may be paying 100x more compute than necessary for those calls.
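
A benchmark like that does not need much machinery. The sketch below assumes you can wrap any model behind a `predict(prompt)` callable that returns a function name and an arguments dict; `Case`, `evaluate`, and `stub_predict` are hypothetical names for this example, and the stub is a keyword matcher standing in for a real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected_name: str
    expected_args: dict


def evaluate(predict: Callable[[str], tuple[str, dict]], cases: list[Case]) -> dict:
    """Score a tool-calling model on function selection and exact argument match."""
    if not cases:
        raise ValueError("need at least one test case")
    name_hits = 0
    exact_hits = 0
    failures = []
    for case in cases:
        name, args = predict(case.prompt)
        if name == case.expected_name:
            name_hits += 1
            if args == case.expected_args:
                exact_hits += 1
                continue
        failures.append(case.prompt)  # keep failing prompts for inspection
    total = len(cases)
    return {
        "name_accuracy": name_hits / total,
        "exact_match": exact_hits / total,
        "failures": failures,
    }


def stub_predict(prompt: str) -> tuple[str, dict]:
    """Stand-in for a real model: only recognizes weather requests."""
    if "weather" in prompt:
        city = prompt.rstrip("?").rsplit(" ", 1)[-1]
        return "get_weather", {"city": city}
    return "unknown", {}


cases = [
    Case("what's the weather in Tokyo?", "get_weather", {"city": "Tokyo"}),
    Case("set a timer for 5 minutes", "set_timer", {"minutes": 5}),
]
report = evaluate(stub_predict, cases)
print(report["exact_match"], report["failures"])
```

Running the same cases through your current model and a small one, then reading the `failures` list, shows which of your schemas actually need the larger model.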

The on-device angle is where this gets practically interesting. A model that runs at 1,200 tok/s on a phone opens a design space that API-dependent tool calling cannot: offline-capable agents, function routing with no network latency, and, critically, tool calling without sending user intent data to a remote server. For mobile developers building assistant features, voice interfaces, or IoT control layers, this removes the network round-trip from the critical path.

The distillation approach also matters for teams with proprietary tool schemas. If Cactus can distill Gemini's general tool-calling capability into 26M parameters, a team could potentially fine-tune a similarly small model on their specific API surface — 50 internal endpoints instead of the open-ended function space Gemini handles. Smaller function space, smaller model, faster inference. The economics of running this at the edge become compelling quickly: no API costs, no rate limits, no cold starts.

There's a practical caveat: 26M parameters means this model has essentially no world knowledge. It cannot interpret ambiguous user requests that require understanding context beyond the function signatures provided. You'll need a larger model upstream to handle intent disambiguation, and Needle (or something like it) downstream to handle the structured dispatch. This is a router, not a replacement for your reasoning model.
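
That two-stage split can be sketched as a simple pipeline. Both stages here are stubs: `stub_disambiguate` stands in for a larger reasoning model and `stub_small_router` for a small dispatch model such as Needle. The function names, the `context` dict, and the output shape are assumptions for illustration only.

```python
from typing import Callable


def stub_disambiguate(user_text: str, context: dict) -> str:
    """Stand-in for a larger model: resolves 'there' from conversation context."""
    return user_text.replace("there", context.get("last_city", "there"))


def stub_small_router(explicit_text: str) -> dict:
    """Stand-in for a small router model: maps explicit text to a tool call."""
    city = explicit_text.rstrip("?").rsplit(" ", 1)[-1]
    return {"name": "get_weather", "arguments": {"city": city}}


def handle(
    user_text: str,
    context: dict,
    disambiguate: Callable[[str, dict], str] = stub_disambiguate,
    router: Callable[[str], dict] = stub_small_router,
) -> dict:
    """Stage 1 makes the request explicit; stage 2 emits the structured call."""
    explicit = disambiguate(user_text, context)
    return router(explicit)


call = handle("what's the weather there?", {"last_city": "Tokyo"})
print(call)
```

Because each stage is just a callable, either one can be swapped for a real model without touching the other, which is the separation of concerns the router framing implies.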

Looking ahead

Needle is one data point in what's becoming a clear trend: the unbundling of capabilities that frontier models packaged together. We've already seen this with retrieval (RAG moved knowledge out of model weights), code generation (specialized coding models outperform general ones), and now tool calling. The question isn't whether specialized small models can handle structured tasks — Needle answers that. The question is whether the tooling and orchestration layers will mature fast enough to make multi-model architectures practical for teams that aren't Google-sized. Right now, running one big model is operationally simpler than orchestrating three small ones. That's an engineering problem, not a fundamental one, and engineering problems get solved.

// read from source

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
