The editorial argues the real story is structural, not the discount itself — DeepSeek is publishing a 'tariff schedule' rather than running a sale. This mirrors how AWS spot instances and industrial electricity markets emerged once the underlying resource became fungible, making DeepSeek the first frontier lab to publicly admit inference is a commodity.
Bloomberg's reporting frames the discount as 'capacity-matched pricing' per internal DeepSeek sources. The company's H800 and Ascend clusters are sized for Beijing/Shanghai working hours plus North American daytime, leaving an 8-hour trough that was previously burned on keepalives and internal training — now productized into a paid off-peak window.
The Hacker News submission, which put the Bloomberg story on the front page (166 points), highlights pricing that is roughly an order of magnitude below GPT-4o-class APIs during off-peak hours and 40–60% below Western frontier APIs even at peak. The engagement suggests readers see this reshaping the economics of any workload that can tolerate batch or async scheduling.
DeepSeek is making its off-peak inference discount permanent at 75% off the flagship V3 model, according to a Bloomberg report this weekend. The discount, originally introduced as a 'limited time' incentive in late 2024 and bumped to 75% on cached input tokens earlier this year, will now apply year-round during a fixed UTC 16:30–00:30 window. That is 00:30–08:30 in Beijing: the overnight hours after Chinese enterprise traffic has tapered off.
The headline number is dramatic, but the structural change is bigger. DeepSeek isn't running a sale; it's publishing a tariff schedule. During the discount window, cached input tokens now price at roughly $0.035 per million, non-cached input at around $0.135 per million, and output at around $0.55 per million. Those numbers put the model roughly an order of magnitude below GPT-4o-class pricing during off-peak hours, and still 40–60% below most Western frontier APIs at peak.
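To make the tariff arithmetic concrete, here is a minimal Python sketch that backs out the peak rates implied by those off-peak prices and prices a sample job both ways. It assumes the 75% discount applies uniformly to all three token classes, which the report does not spell out, and the token volumes are made up for illustration.

```python
# Off-peak prices quoted above, in USD per million tokens.
OFF_PEAK_USD_PER_M = {"cached_input": 0.035, "input": 0.135, "output": 0.55}

# Assumption: the 75% discount applies equally to every token class.
DISCOUNT = 0.75


def implied_peak_rates(off_peak, discount):
    """Back out the undiscounted rate for each token class."""
    return {kind: rate / (1.0 - discount) for kind, rate in off_peak.items()}


def job_cost_usd(tokens, rates):
    """Cost of a job given raw token counts and per-million-token rates."""
    return sum(count / 1_000_000 * rates[kind] for kind, count in tokens.items())


# Hypothetical nightly batch: token volumes are illustrative only.
nightly_job = {
    "cached_input": 200_000_000,
    "input": 50_000_000,
    "output": 20_000_000,
}

peak_rates = implied_peak_rates(OFF_PEAK_USD_PER_M, DISCOUNT)
off_peak_cost = job_cost_usd(nightly_job, OFF_PEAK_USD_PER_M)
peak_cost = job_cost_usd(nightly_job, peak_rates)

print(f"off-peak: ${off_peak_cost:.2f}  peak: ${peak_cost:.2f}")
```

Under a uniform discount the ratio between the two totals is fixed at 4x regardless of the token mix; the mix only matters if the discount differs by token class.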
The company framed the move internally as 'capacity-matched pricing,' per sources cited by Bloomberg. In plain English: their H800 and Ascend clusters are sized for Beijing-and-Shanghai working hours plus North American daytime traffic. The 8-hour trough between those peaks was being burned on keepalives and internal training. Now it's a product.
The interesting story here isn't the discount. It's the pricing primitive.
Every hyperscaler eventually arrives at time-of-use pricing once its underlying resource becomes a true commodity. AWS spot instances are the canonical example: the moment EC2 capacity was fungible enough that Amazon could resell stranded inventory at 70–90% discounts with a preemption clause, an entire class of fault-tolerant workloads (Spark jobs, CI runners, ML training) migrated to spot overnight. Electric utilities did the same dance decades earlier with industrial off-peak rates and interruptible-load contracts. DeepSeek is the first frontier lab to publicly admit that its inference is a commodity with a duck curve.
Compare this to how OpenAI, Anthropic, and Google price their flagships: flat per-token rates that quietly bake in a 'peak' assumption, plus a Batch API that nominally offers 50% off but with a 24-hour SLA that makes it useless for anything interactive. The Batch APIs were a half-step — they let providers smooth load without admitting their price was elastic. DeepSeek is skipping the abstraction and just publishing the curve.
The community response on Hacker News (166 points, 220+ comments at time of writing) is split between 'this is dumping' and 'this is honest pricing.' The dumping argument — voiced loudest by a partner at a Bay Area infra fund in the comments — is that DeepSeek can only sustain this because Chinese state-adjacent capital is subsidizing the H800/Ascend depreciation. The honest-pricing argument, which I find more credible, is that the marginal cost of inference on already-paid-for GPUs during a known low-utilization window is genuinely close to zero — and DeepSeek is the only lab with the operational courage to charge accordingly.
The second-order effect is what should worry incumbents. Once builders rewire their pipelines around time-of-use inference, that infrastructure doesn't unwire when OpenAI eventually responds. A startup that has spent three months getting its overnight evals, document-ingestion pipeline, and async agent runs to fire between 16:30 and 00:30 UTC has built a moat against its own API bill. They will not migrate back to flat-rate pricing even if Anthropic launches a competitive batch tier next quarter. The behavioral lock-in is real.
There's also a quieter benchmark story. V3 on the latest evals (SWE-bench Verified 51.2%, LiveCodeBench 65.9%, AIME 2025 ~79%) is no longer the 'cheap-and-cheerful alternative' it was 18 months ago. It's a frontier-adjacent model with frontier-adjacent benchmark numbers. The 75% discount applied to a model that was already a Pareto outlier on cost-per-token now puts it in a category by itself for any workload that can tolerate an 8-hour scheduling constraint — which, for non-chatbot use cases, is most of them.
Three concrete things to do this week if you ship anything that calls an LLM API.
First, audit which of your inference calls actually need to happen synchronously. The honest answer for most teams is 'fewer than half.' Embeddings refreshes, summarization backfills, eval suites, RAG re-indexing, fine-tuning data generation, agent-loop completions for non-user-facing tasks — all of these can be deferred 8 hours with zero product impact. If your job scheduler can't time-shift workloads to a pricing window, that's a one-sprint fix that will pay back for years.
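As a rough illustration of that time-shifting fix, the sketch below checks whether a timestamp falls inside the UTC 16:30–00:30 window and, if not, how long a deferrable job should wait. The function names and the run-or-defer shape are hypothetical, not part of any scheduler's API; wire the delay into whatever queue you already use.

```python
from datetime import datetime, time, timezone

# Discount window from the report: 16:30 to 00:30 UTC (wraps past midnight).
WINDOW_START = time(16, 30)
WINDOW_END = time(0, 30)


def in_off_peak(now):
    """True if a timezone-aware datetime falls inside the discount window."""
    t = now.astimezone(timezone.utc).time()
    # The window crosses midnight, so it is the union of two ranges.
    return t >= WINDOW_START or t < WINDOW_END


def seconds_until_window(now):
    """Seconds to wait before the window opens; 0 if it is already open."""
    now_utc = now.astimezone(timezone.utc)
    if in_off_peak(now_utc):
        return 0.0
    # Outside the window means 00:30 <= t < 16:30, so today's start is ahead.
    start = now_utc.replace(hour=16, minute=30, second=0, microsecond=0)
    return (start - now_utc).total_seconds()


def run_or_defer(deferrable, now):
    """Hypothetical policy: defer only jobs that are explicitly deferrable."""
    delay = seconds_until_window(now) if deferrable else 0.0
    return ("run", 0.0) if delay == 0.0 else ("defer", delay)


if __name__ == "__main__":
    print(run_or_defer(deferrable=True, now=datetime.now(timezone.utc)))
```

Always pass timezone-aware datetimes: a naive datetime would be interpreted as local time by `astimezone`, which silently shifts the window.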
Second, treat your AI bill like an electricity bill. Build a dashboard that segments spend by 'must-run-now' vs. 'time-shiftable,' and track the ratio. If 80% of your inference is in the must-run bucket, you have a product-architecture problem, not a pricing problem. The teams that win this transition will be the ones that decompose their LLM calls the way the data-engineering generation decomposed batch-vs-streaming a decade ago.
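The ratio that dashboard tracks can start as a few lines of Python. This sketch assumes you already log a cost per call and can tag each workload as deferrable or not; the `CallRecord` fields and the sample workloads are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    workload: str      # hypothetical label, e.g. "chat" or "eval-suite"
    cost_usd: float    # what the call cost at the rate actually paid
    deferrable: bool   # True if the call could wait for an off-peak window


def must_run_share(records):
    """Fraction of total spend that cannot be time-shifted."""
    total = sum(r.cost_usd for r in records)
    if total == 0:
        return 0.0
    must_run = sum(r.cost_usd for r in records if not r.deferrable)
    return must_run / total


# Illustrative numbers only.
spend = [
    CallRecord("chat", 40.0, deferrable=False),
    CallRecord("eval-suite", 35.0, deferrable=True),
    CallRecord("rag-reindex", 25.0, deferrable=True),
]

print(f"must-run share: {must_run_share(spend):.0%}")
```

Tracking this one number week over week is enough to tell whether the architecture is moving toward time-shiftable work or away from it.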
Third, stop modeling your unit economics on flat per-token rates. The realistic 2026 cost curve for inference is bimodal: prime-time rates that look roughly like 2025, and off-peak rates that are 3–10x cheaper depending on provider and willingness to commit to scheduling constraints. If you're a startup pitching investors on 'gross margin will improve as model costs decline,' the more accurate story is 'gross margin will improve as we move workloads off prime time.' That's a story about your engineering discipline, not Jensen's roadmap.
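One way to fold that bimodal curve into a unit-economics model is a blended rate: the share of tokens still run at prime time pays the full rate, and the shifted share pays the prime-time rate divided by the off-peak multiple. The sketch below is a simplification that assumes a single prime-time rate and a single off-peak multiple; the parameter names are illustrative.

```python
def blended_rate(prime_rate, shifted_fraction, off_peak_multiple):
    """Effective per-token rate when part of the workload moves off-peak.

    prime_rate:         rate paid at prime time (any unit, e.g. USD per 1M tokens)
    shifted_fraction:   share of tokens moved to the off-peak window, 0.0 to 1.0
    off_peak_multiple:  how many times cheaper off-peak is (3 to 10 in the text)
    """
    if not 0.0 <= shifted_fraction <= 1.0:
        raise ValueError("shifted_fraction must be between 0 and 1")
    if off_peak_multiple <= 0:
        raise ValueError("off_peak_multiple must be positive")
    prime_part = (1.0 - shifted_fraction) * prime_rate
    off_peak_part = shifted_fraction * prime_rate / off_peak_multiple
    return prime_part + off_peak_part


# Sweep the 3x to 10x range with 60% of tokens shifted (illustrative figure).
for multiple in (3, 4, 10):
    rate = blended_rate(1.0, 0.6, multiple)
    print(f"{multiple}x cheaper off-peak -> blended rate {rate:.2f} of prime")
```

The sweep shows why the shifted fraction matters more than the multiple: with 60% of tokens moved, going from 3x to 10x cheaper only moves the blended rate from 0.60 to 0.46 of the prime-time rate, because the remaining 40% still pays full price.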
Expect at least one of OpenAI, Anthropic, or Google to launch an explicit time-of-use SKU within two quarters — probably positioned as a 'scheduled inference tier' or 'reserved off-peak capacity' to avoid the optical comparison to spot pricing. The economic logic is too strong to ignore, and the moment a serious enterprise customer starts modeling its AWS spot and DeepSeek off-peak rates side by side on the same procurement spreadsheet, the conversation is over. The era of flat-rate frontier inference, which lasted roughly three years, is ending. What replaces it looks a lot like every other mature compute market.