Henry from Cactus Compute argues that agentic experiences are built upon tool calling, and their investigations showed it's fundamentally a retrieval-and-assembly task. A 26M parameter model can replicate the structured output patterns of much larger models — selecting functions and mapping arguments — without needing chain-of-thought reasoning or world knowledge.
Henry expresses frustration at the lack of effort toward building agentic models for budget phones. Needle's 6,000 tok/s prefill and 1,200 tok/s decode on consumer devices demonstrate that meaningful AI capabilities can run on hardware already in people's pockets, eliminating the network latency, cost, and connectivity dependency imposed by API calls to remote infrastructure.
Cactus Compute distilled Gemini's tool-calling behavior into a model roughly 100-1,000x smaller, using Gemini's outputs as training signal. This demonstrates that frontier model capabilities can be compressed into task-specific architectures when the target behavior is well-defined and constrained, like structured function calling.
Cactus Compute, a startup focused on edge AI, open-sourced Needle — a 26-million-parameter model purpose-built for function calling (tool use). The model is available on GitHub under the cactus-compute/needle repository and hit the front page of Hacker News with 346 points, a strong signal that the developer community sees something worth paying attention to.
The numbers are striking. Needle runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on consumer devices — the kind of hardware sitting in budget Android phones, not datacenter GPUs. For context, most production tool-calling workflows today route through models 100-1,000x larger: GPT-4, Claude, or Gemini, all of which require API calls to remote infrastructure, adding latency, cost, and a hard dependency on network connectivity.
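To make those throughput figures concrete, here is a back-of-the-envelope sketch that converts them into per-request latency. The function name and the token counts are hypothetical, and the estimate ignores model load time, tokenization, and any device-specific overhead, so treat it as a rough lower bound rather than a measurement.

```python
def estimate_latency_ms(prompt_tokens: int, output_tokens: int,
                        prefill_tps: float = 6000.0,
                        decode_tps: float = 1200.0) -> float:
    """Rough lower bound: prefill time plus decode time, in milliseconds.

    Defaults are the throughput figures quoted for Needle; everything else
    about the request is an assumption.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return (prefill_s + decode_s) * 1000.0


# Hypothetical request: 300 tokens of tool schemas plus user text,
# and a 30-token structured call as output.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency:.0f} ms")  # prints "75 ms"
```

Under these assumed sizes, the whole call fits in well under a tenth of a second before any network round-trip would even have completed.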
The model was distilled from Gemini's tool-calling behavior, meaning Cactus used Gemini's outputs as training signal to teach a dramatically smaller architecture to replicate the same structured output patterns — selecting the right function, mapping arguments correctly, and formatting the call.
The intellectual core of Needle isn't the model itself — it's the claim underneath it. Henry from Cactus Compute put it directly: tool calling is fundamentally retrieval-and-assembly, and massive models are overkill for it. This is a falsifiable architectural thesis, not marketing, and it deserves serious scrutiny.
Think about what happens when a model "calls a tool." It receives a user intent ("what's the weather in Tokyo?"), matches it against a set of available function signatures, selects the right one (`get_weather`), and populates the arguments (`{"city": "Tokyo"}`). This is pattern matching and structured output generation. It does not require chain-of-thought reasoning, world knowledge, or the kind of emergent capabilities that justify 70B+ parameter models. It's closer to what a well-trained classifier with a structured decoder does.
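The select-and-populate step described above can be sketched as a check on the model's output. This is a minimal illustration, not Needle's actual input or output format, which the source does not spell out: the `TOOLS` registry layout, the `arguments` key, the `set_timer` tool, and the `validate_call` helper are all assumptions made for the example.

```python
import json

# Hypothetical tool registry. The layout is an assumption for illustration,
# not the format Needle consumes.
TOOLS = [
    {"name": "get_weather", "parameters": {"city": str}},
    {"name": "set_timer", "parameters": {"minutes": int}},
]


def validate_call(raw: str, tools: list[dict]) -> dict:
    """Parse a JSON tool call and check it against the registry.

    Raises ValueError if the JSON is malformed, the function is unknown,
    or the arguments do not match the declared parameters.
    """
    call = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    if spec is None:
        raise ValueError(f"unknown function: {call.get('name')!r}")
    args = call.get("arguments", {})
    expected = spec["parameters"]
    if set(args) != set(expected):
        raise ValueError(f"expected arguments {sorted(expected)}, got {sorted(args)}")
    for key, typ in expected.items():
        if not isinstance(args[key], typ):
            raise ValueError(f"argument {key!r} should be {typ.__name__}")
    return call


# The kind of string a function-calling model might emit for the Tokyo example.
model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
call = validate_call(model_output, TOOLS)
```

The point of the sketch is how narrow the output space is: a name drawn from a short list and a handful of typed fields, which is what makes the task look like structured classification rather than open-ended generation.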
The counter-argument is real, though. Simple tool calls are easy. But agentic workflows increasingly involve multi-step planning: call tool A, interpret the result, decide whether to call tool B or tool C, handle errors, and compose a final response. A 26M model can almost certainly handle `get_weather(city="Tokyo")`. Whether it can handle a five-step workflow where step three depends on an ambiguous result from step two is a different question entirely. The Hacker News discussion predictably split along this line — practitioners building simple integrations were excited, while those building complex agent chains were skeptical.
The real insight is that tool calling probably shouldn't be a monolithic capability inside a single model. The industry has been treating it that way because frontier models happen to be good at it, and the API interface makes it convenient. But there's no architectural reason your reasoning model and your function-routing model need to be the same model. Needle is an existence proof that they can be separated — and that the function-routing half can be absurdly small.
This mirrors a pattern we've seen before in systems design: the decomposition of monolithic capabilities into specialized components. Just as microservices separated concerns that monoliths bundled together, we may be entering an era where "the AI" decomposes into a reasoning model, a tool-routing model, a retrieval model, and a generation model — each sized appropriately for its actual task.
If you're building agentic applications today, Needle is worth benchmarking against your current tool-calling setup, even if you don't ship it. The exercise of measuring where a 26M model succeeds and fails on your specific function schemas will teach you something about your own tool-calling complexity. If 80% of your calls are simple single-function dispatches, you may be paying 100x more compute than necessary for those calls.
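One way to run that exercise is a small scoring harness that separates "picked the right function" from "got the whole call right". The sketch below is a hypothetical setup: `CASES`, `score`, and `stub_predict` are illustrative names, and the stub stands in for whatever model you are actually testing, since the source does not describe Needle's inference API.

```python
from typing import Callable

# Hypothetical labeled cases: user text plus the call you expect back.
CASES = [
    ("what's the weather in Tokyo?",
     {"name": "get_weather", "arguments": {"city": "Tokyo"}}),
    ("set a timer for 5 minutes",
     {"name": "set_timer", "arguments": {"minutes": 5}}),
    ("weather in Paris please",
     {"name": "get_weather", "arguments": {"city": "Paris"}}),
]


def score(predict: Callable[[str], dict], cases) -> dict:
    """Return function-selection accuracy and exact-match accuracy."""
    name_hits = 0
    exact_hits = 0
    for text, expected in cases:
        got = predict(text)
        name_hits += got.get("name") == expected["name"]
        exact_hits += got == expected
    n = len(cases)
    return {"name_acc": name_hits / n, "exact_acc": exact_hits / n}


def stub_predict(text: str) -> dict:
    """Stand-in for a real model call; always answers with the Tokyo weather call."""
    return {"name": "get_weather", "arguments": {"city": "Tokyo"}}


result = score(stub_predict, CASES)
```

With the stub above, function selection is right on two of three cases and the full call on one of three. Swapping in a real predictor and your own schemas shows where argument mapping, rather than function selection, is the weak point.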
The on-device angle is where this gets practically interesting. A model that decodes at 1,200 tok/s on a phone opens a design space that API-dependent tool calling cannot: offline-capable agents, function routing with no network latency, and, critically, tool calling without sending user intent data to a remote server. For mobile developers building assistant features, voice interfaces, or IoT control layers, this removes the network round-trip from the critical path.
The distillation approach also matters for teams with proprietary tool schemas. If Cactus can distill Gemini's general tool-calling capability into 26M parameters, a team could potentially fine-tune a similarly small model on their specific API surface — 50 internal endpoints instead of the open-ended function space Gemini handles. Smaller function space, smaller model, faster inference. The economics of running this at the edge become compelling quickly: no API costs, no rate limits, no cold starts.
There's a practical caveat: 26M parameters means this model has essentially no world knowledge. It cannot interpret ambiguous user requests that require understanding context beyond the function signatures provided. You'll need a larger model upstream to handle intent disambiguation, and Needle (or something like it) downstream to handle the structured dispatch. This is a router, not a replacement for your reasoning model.
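The split described here can be sketched as a two-stage function. Everything in this example is hypothetical: `handle`, `stub_disambiguate`, and `stub_dispatch` are placeholder names, and the stubs stand in for a larger upstream model and a small function-calling model respectively.

```python
from typing import Callable, Optional


def handle(user_text: str,
           disambiguate: Callable[[str], Optional[str]],
           dispatch: Callable[[str], dict]) -> dict:
    """Two-stage flow: a larger model resolves intent, a small model emits the call."""
    intent = disambiguate(user_text)
    if intent is None:
        # The upstream model could not resolve the request; ask the user instead.
        return {"type": "clarification_needed", "text": user_text}
    return {"type": "tool_call", "call": dispatch(intent)}


def stub_disambiguate(text: str) -> Optional[str]:
    """Stand-in for a larger model that rewrites vague requests into explicit ones."""
    known = {"is it nice out where my sister lives?": "what's the weather in Tokyo?"}
    return known.get(text)


def stub_dispatch(intent: str) -> dict:
    """Stand-in for a small function-calling model given an explicit request."""
    return {"name": "get_weather", "arguments": {"city": "Tokyo"}}


resolved = handle("is it nice out where my sister lives?",
                  stub_disambiguate, stub_dispatch)
unresolved = handle("do the thing", stub_disambiguate, stub_dispatch)
```

Keeping the two stages behind plain callables means either side can be replaced independently, which is the practical payoff of treating dispatch as its own component.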
Needle is one data point in what's becoming a clear trend: the unbundling of capabilities that frontier models packaged together. We've already seen this with retrieval (RAG moved knowledge out of model weights), code generation (specialized coding models outperform general ones), and now tool calling. The question isn't whether specialized small models can handle structured tasks — Needle answers that. The question is whether the tooling and orchestration layers will mature fast enough to make multi-model architectures practical for teams that aren't Google-sized. Right now, running one big model is operationally simpler than orchestrating three small ones. That's an engineering problem, not a fundamental one, and engineering problems get solved.
Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were
Hmm.. this might make it feasible to build something like a command line program where you can optionally just specify the arguments in natural language. Although I know people will object to including an extra 14 MB and the computation for "parsing" and it could be pretty bad if everyone
Are you worried about Google's response to this? Google reportedly reacts to distillation attempts "with real-time proactive defenses that can degrade student model performance". So if they detected you, they could have intentionally fed you a dumber but plausible variant of Gemini: h
Suggestion: publish a live demo of the "needle playground". It's small enough that it should be pretty cheap to run this on a little VPS somewhere!
> Experiments at Cactus showed that MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source.
Heh, what a coincidence, just today one of my students presented research results which also confirmed this. He removed MLP from Qwen and the model
Do you have any examples or data on the discriminatory power of the model for tool use?
The examples are things like "What is the weather in San Francisco", where you are only passed a tool like tools='[{"name":"get_weather","parameters":{"location