Through a systematic 27,000-query experiment, the author demonstrated that no LLM could produce the same carbohydrate estimate twice for identical food images. Estimates for the same plate varied by 20–30% or more, revealing that LLMs' stochastic sampling makes them structurally unsuitable for deterministic numerical tasks like carb counting.
As a Type 1 diabetic, the author highlights that a swing from 45g to 75g of carbs for the same bowl of pasta directly translates to insulin dosing errors — the difference between normal blood sugar and an emergency room visit. The study is framed as a direct challenge to health-tech startups raising money on the promise that photo-based AI can deliver reliable macronutrient data to diabetics.
A researcher and Type 1 diabetic who runs the Diabettech blog conducted what may be the most methodical public test of LLM numerical reliability ever published. The setup was simple: take photos of meals and ask AI models to estimate the carbohydrate content — the exact use case that dozens of health-tech startups are building products around right now. Then do it not once, not ten times, but 27,000 times across the same set of food items.
The result? Across 27,000 queries, the AI could not produce the same carbohydrate estimate twice for identical inputs. The variance wasn't subtle rounding differences. We're talking about swings of 20–30% or more on the same plate of food, depending on when you asked and how the model's sampling happened to land. For a bowl of pasta, estimates might range from 45g to 75g of carbs. For a diabetic calculating an insulin dose, that's the difference between a normal blood sugar reading and a trip to the emergency room.
The study tested multiple prominent LLMs — the models that power the very apps being marketed to diabetics as meal-tracking assistants. The author didn't just run a few spot checks and call it a day. This was a systematic, repeated-measures approach that strips away the anecdotal and reveals the structural: LLMs are fundamentally stochastic systems being asked to do deterministic work.
This story matters on two levels: the immediate health implications and the much broader engineering lesson about what LLMs are and aren't good at.
On the health side, the timing is pointed. AI-powered nutrition apps are a booming category. Companies are raising real money on the promise that you can snap a photo of your lunch and get reliable macronutrient data. Some of these apps are explicitly targeting diabetics — people for whom "close enough" isn't a rounding error, it's a medical event. The 27,000-query dataset is a direct challenge to every health-tech startup claiming AI can reliably quantify nutritional content from images.
But the deeper lesson is for every developer building on top of LLMs, regardless of domain. The core finding — that the same input produces meaningfully different numerical outputs on every call — isn't a bug. It's a feature of how these models work. LLMs generate tokens probabilistically. When you ask for a number, you're not querying a lookup table; you're sampling from a distribution shaped by training data, temperature settings, and the model's internal state. For creative writing, that stochasticity is a feature. For numerical estimation, it's a landmine.
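To make that concrete, here is a toy sketch of the sampling step. The logit values are invented for illustration and nothing here comes from the study; the point is only that a numeric answer's leading digit is itself a draw from a distribution:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample a token index from logits after temperature scaling (softmax sampling)."""
    scaled = logits / max(temperature, 1e-8)   # temperature < 1 sharpens, > 1 flattens
    probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Invented logits for candidate digit tokens "4".."7" as the leading digit of a carb estimate.
logits = np.array([2.1, 2.0, 1.7, 1.5])
rng = np.random.default_rng(0)
draws = [sample_token(logits, temperature=1.0, rng=rng) for _ in range(10)]
print(draws)  # e.g. [0, 1, 0, 3, ...]: the leading digit of the answer is a coin flip
```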
The Hacker News discussion around this study surfaced a familiar split. Some developers argued that this is obvious — "of course LLMs aren't calculators." Others pointed out that the marketing of these tools actively obscures this limitation, and that end users (especially non-technical ones managing chronic conditions) have no reason to suspect that "AI-powered" means "gives a different answer every time." The gap between how LLMs are marketed and how they actually behave is widest in exactly the domains where consistency matters most: health, finance, and engineering calculations.
There's a deeper technical nuance here too. Even setting temperature to 0 (greedy decoding) doesn't fully eliminate variance in many API implementations, because batching, floating-point nondeterminism in GPU operations, and infrastructure-level routing can all introduce variation. The author's 27,000 data points put a number on what many practitioners have suspected but few have measured at scale.
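The study's own harness isn't reproduced here, but a scaled-down repeated-measures check is easy to sketch. This assumes the OpenAI Python SDK and a placeholder model name; any provider's API would do, and the regex-based number extraction is a deliberate simplification:

```python
import re
import statistics
from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def carb_estimate(prompt: str) -> float:
    """One temperature-0 query; returns the first number in the reply.
    Assumes the reply actually contains a number."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: whatever chat/vision model the app would call
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # greedy decoding, and still not fully deterministic in practice
    )
    return float(re.search(r"\d+(\.\d+)?", resp.choices[0].message.content).group())

samples = [carb_estimate("Estimate the grams of carbs in a 300g bowl of spaghetti "
                         "with tomato sauce. Reply with a single number.")
           for _ in range(20)]
print(f"min={min(samples)} max={max(samples)} "
      f"mean={statistics.mean(samples):.1f} stdev={statistics.stdev(samples):.1f}")
```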
If you're shipping any product where an LLM's numerical output feeds into a downstream calculation — dosing, pricing, scoring, resource allocation — this study is your wake-up call to build explicit guardrails.
First, treat LLM outputs as distributions, not point estimates. Run the same query N times, compute the mean and standard deviation, and present confidence intervals to users. If the variance exceeds an acceptable threshold for your domain, fall back to a deterministic system. This is more expensive, but it's the honest engineering choice.
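A minimal sketch of that guardrail, assuming a `query_fn` like the `carb_estimate` above and an invented tolerance threshold; the right threshold is a domain decision, not a constant to copy:

```python
import statistics

ACCEPTABLE_CV = 0.10  # assumption: max coefficient of variation your domain tolerates

def estimate_with_interval(query_fn, prompt: str, n: int = 10):
    """Run the same query n times and treat the answers as a distribution.

    Returns (mean, low, high) when the spread is acceptable, or None to signal
    that the caller should fall back to a deterministic system (lookup table,
    manual entry, ...).
    """
    samples = [query_fn(prompt) for _ in range(n)]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if mean == 0 or stdev / mean > ACCEPTABLE_CV:
        return None                   # too noisy: do not ship this number
    margin = 1.96 * stdev / n ** 0.5  # ~95% confidence interval on the mean
    return mean, mean - margin, mean + margin
```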
Second, implement hard bounds. For any numerical estimation task, define domain-specific min/max ranges and reject outputs that fall outside them. A carb estimate of 200g for a single apple should be caught before it reaches the user. This is basic input validation applied to model outputs — something most LLM integrations skip entirely.
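For example, a sanity-check layer might look like the sketch below. The bounds are illustrative stand-ins; a real app would source them from a nutrition database such as USDA FoodData Central:

```python
# Assumption: illustrative per-item carb bounds in grams, not real reference data.
CARB_BOUNDS_G = {
    "apple": (10, 40),
    "pasta_serving": (30, 90),
    "slice_of_bread": (8, 30),
}

def validate_carb_estimate(food: str, grams: float) -> float:
    """Reject model outputs that fall outside the plausible range for the item."""
    low, high = CARB_BOUNDS_G[food]
    if not (low <= grams <= high):
        raise ValueError(f"{grams}g of carbs is implausible for {food!r} "
                         f"(expected {low}-{high}g); refusing to display it")
    return grams
```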
Third, separate what LLMs are good at from what they're not. LLMs excel at classification, summarization, and fuzzy reasoning. They are not databases, calculators, or measurement instruments. If your product needs a number, consider using the LLM for the perception layer ("this image contains pasta, sauce, and bread") and a traditional lookup table or regression model for the quantification layer ("pasta typically contains X grams of carbs per serving"). Hybrid architectures like this consistently outperform pure LLM pipelines on benchmarks that measure numerical accuracy.
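A sketch of that split, with the perception call stubbed out so the example runs end to end; the table values and the `identify_foods` interface are invented for illustration:

```python
# Deterministic quantification layer: fixed carbs-per-serving values.
# Assumption: illustrative numbers, not real nutrition data.
CARBS_PER_SERVING_G = {"pasta": 43.0, "tomato_sauce": 7.0, "bread": 14.0}

def identify_foods(image_path: str) -> list[tuple[str, float]]:
    """Hypothetical LLM perception layer: names the foods and portion counts.
    Stubbed with a fixed answer here; a real version would call a vision model
    and parse structured output."""
    return [("pasta", 1.5), ("tomato_sauce", 1.0)]

def estimate_meal_carbs(image_path: str) -> float:
    """LLM identifies; the lookup table quantifies."""
    return sum(CARBS_PER_SERVING_G[food] * servings
               for food, servings in identify_foods(image_path))

print(estimate_meal_carbs("lunch.jpg"))  # 71.5
```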
Fourth, be honest with users. If your app uses AI estimation, say so — and communicate the uncertainty. A carb counter that says "approximately 55–70g" is more useful and more ethical than one that confidently states "62g" when it would say "48g" if you asked again in five seconds.
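Presenting the spread can be as simple as the sketch below, reusing the repeated samples gathered above; min/max is a deliberately crude choice, and percentiles would work as well:

```python
def format_carb_range(samples: list[float]) -> str:
    """Present the observed spread honestly instead of a false point estimate."""
    return f"approximately {min(samples):.0f}–{max(samples):.0f}g of carbs"

print(format_carb_range([55, 62, 58, 70, 61]))  # approximately 55–70g of carbs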
This study will likely accelerate two trends: regulatory scrutiny of AI in health applications, and a broader industry reckoning with the gap between LLM capabilities and LLM marketing. The 27,000-query methodology deserves to become a standard benchmark approach — pick a task, run it thousands of times, and publish the variance. Most AI product claims wouldn't survive that treatment, and that's exactly why we need more of it. For developers, the lesson is old but freshly quantified: know your tools, measure their failure modes, and never let a probabilistic system make a deterministic promise.
There is a lot of hate in the comments, but there is some merit to the post existing: 1. Even if the task is unreasonable, it is good to showcase that the LLM will perform poorly, as a warning not to rely on it for diabetes. 2. As it is a probabilistic model, the approach was to execute it multiple times…
It’s just an impossible problem. Photons don’t provide sufficient information to determine calories (at least not in any way they could practically be captured). The inside of that sandwich could be drenched with olive oil, or it could be hollow, just cheese and lettuce. It’s impossible to tell.
This will surprise nobody here, but it’s important to communicate to audiences that are new to LLMs. This is targeted at people with diabetes because there are AI carb-counting apps appearing in app stores.
> If you’re using AI carb counting in a diabetes app
These apps are probably not even using the…
I am... unsure why anyone would think LLMs would be able to do this. They are not magic oracles. Like I think even most humans would be extremely bad at this. Like, are people actually using LLMs for this? Please do not, it won't work.
There's an incredibly serious lack of education about how LLMs and carb counting work. This entire article would be better suited to astrology.com than Hacker News. When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even placed something like the…