Epoch AI's component cost breakdown shows HBM stacks, silicon interposers, and CoWoS-style packaging now make up nearly two-thirds of an AI chip's BOM, with the compute die itself a minority cost. They document that earlier logic-heavy designs have given way to accelerators that are effectively small mountains of DRAM welded to a compute die.
intelkishan submitted the Epoch AI breakdown to Hacker News, where it drew 126 points and 125 comments, amplifying the framing that memory economics, not logic die fabrication, now define the cost structure of modern accelerators. The submission presents this as the underappreciated story of AI hardware.
The editorial argues that Nvidia's Blackwell shipment constraints have been CoWoS capacity and HBM allocation rather than TSMC's leading-edge logic nodes, with HBM sold out through 2026 across SK Hynix, Micron, and Samsung. This reframes two years of conventional wisdom about where the real supply-chain choke points sit in the AI hardware stack.
The editorial frames the BOM shift as the physical manifestation of transformer arithmetic intensity being low enough that H100s spend most of their time waiting for weights and KV cache from memory. HBM3e's ~4.8 TB/s on H200 and Blackwell's 8+ TB/s per package show the industry is literally paying for bandwidth — that's what the money buys.
Epoch AI published a component cost breakdown for modern AI accelerators showing that memory has grown to nearly two-thirds of the bill of materials. The compute die, the thing everyone calls 'the chip', is now a minority cost on the parts list of an AI chip. HBM stacks, the silicon interposer that wires them to the logic die, and the advanced CoWoS-style packaging that holds it all together dominate the BOM.
The trend isn't subtle. Earlier accelerator generations were logic-heavy: a big GPU die, GDDR memory soldered around it on a PCB, packaging that was mostly mechanical. Today's parts (H100, H200, B200, MI300X, TPU v5p) place several HBM stacks, eight on B200 and MI300X, right beside the logic die on a shared silicon interposer, each stack connected through a 1024-bit-wide interface. Each HBM stack is itself a 3D-stacked tower of DRAM dies on a base logic die, linked by through-silicon vias. You are buying, in effect, a small mountain of DRAM with a compute die welded to one side.
The pricing reflects this. HBM3e sells at multiples of commodity DDR5 per gigabyte, and HBM supply is sold out through 2026 at SK Hynix, Micron, and Samsung. Nvidia's bottleneck on Blackwell shipments has been CoWoS capacity and HBM allocation, not its relationship with TSMC's leading-edge logic nodes.
The cost shift is the physical manifestation of a workload shift. Large transformer workloads, inference decoding above all, become bandwidth-bound long before they become FLOP-bound. The arithmetic intensity of a transformer layer during decoding (FLOPs per byte moved) is low enough that on an H100 you spend most of your time waiting for weights and KV cache to arrive from memory. The industry's response has been to throw memory bandwidth at the problem: HBM3e on H200 hits ~4.8 TB/s, and Blackwell reaches roughly 8 TB/s per package. That bandwidth isn't free; it is, almost literally, what you're paying for.
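To make "arithmetic intensity" concrete, here is a back-of-envelope roofline check for one linear layer during decoding. It is a sketch under stated assumptions: the H100 figures are approximate spec-sheet numbers (about 989 TFLOP/s dense BF16 and 3.35 TB/s HBM3 on the SXM part), the 8192x8192 layer shape is hypothetical, and only weight traffic is counted.

```python
# Back-of-envelope roofline check for y = x @ W during decoding.
# Hardware numbers are approximate H100 SXM spec-sheet figures (assumptions).
H100_PEAK_FLOPS = 989e12  # dense BF16 tensor-core FLOP/s, approximate
H100_HBM_BW = 3.35e12     # HBM3 bandwidth in bytes/s, approximate


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs per byte a kernel needs before it stops being bandwidth-bound."""
    return peak_flops / bandwidth


def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of a linear layer, counting only weight reads.

    FLOPs: 2 * batch * d_in * d_out (one multiply and one add per weight per token).
    Bytes: d_in * d_out * bytes_per_param (activation traffic ignored for simplicity).
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = d_in * d_out * bytes_per_param
    return flops / bytes_moved


ridge = ridge_point(H100_PEAK_FLOPS, H100_HBM_BW)
for batch in (1, 8, 64, 512):
    ai = linear_layer_intensity(batch, d_in=8192, d_out=8192)
    bound = "compute-bound" if ai >= ridge else "bandwidth-bound"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  ridge={ridge:.0f}  -> {bound}")
```

With 2-byte weights the intensity works out to exactly the batch size, so single-stream decoding sits at about 1 FLOP/byte against a ridge near 295: the tensor cores idle while HBM streams weights.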
This reframes a lot of conventional wisdom about the AI hardware stack. The narrative for two years has been 'TSMC is the bottleneck' and 'Nvidia's moat is CUDA.' Both are still true, but they're not where the marginal dollar goes. The marginal dollar goes to Korean and American DRAM fabs, and the marginal capacity constraint is advanced packaging — the CoWoS-L and CoWoS-S lines at TSMC that bond HBM to logic. When Jensen Huang said on earnings calls that supply is constrained, the constraint he was describing lives in Hsinchu's packaging fabs and Icheon's DRAM fabs, not in the 3nm logic line.
It also explains the competitive dynamics. AMD's MI300X competes with H100 not because AMD finally caught up on logic — it competes because AMD bought more HBM per package (192GB vs 80GB) and priced accordingly. The MI300X's selling point in inference benchmarks is largely a memory-capacity story: it fits bigger models in a single GPU's address space, which collapses tensor parallelism overhead. Groq's LPU, Cerebras's wafer-scale engine, and SambaNova's RDU all make different memory-bandwidth bets, and each company's pitch eventually reduces to a slide about bytes-per-flop.
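The capacity argument is simple arithmetic. The sketch below uses the per-GPU HBM figures cited in this article (80, 141 and 192 GB) and a hypothetical 70B-parameter model stored at 2 bytes per parameter; it counts weights only, ignoring KV cache and activations, so it is a lower bound rather than a deployment plan.

```python
import math

# Approximate per-GPU HBM capacity in GB, using the figures cited in this article.
HBM_GB = {"H100": 80, "H200": 141, "MI300X": 192}


def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float, bytes_per_param: float, gpu: str) -> int:
    """Lower bound on GPUs needed just to hold the weights (no KV cache, no activations)."""
    return math.ceil(weight_gb(params_billions, bytes_per_param) / HBM_GB[gpu])


# Hypothetical 70B-parameter model in FP16/BF16 (2 bytes per parameter).
for gpu in HBM_GB:
    print(gpu, min_gpus_for_weights(70, 2, gpu))
```

The 140 GB of weights forces a split across two H100s, and with it tensor parallelism, while a single H200 or MI300X holds them; the H200 case leaves only about 1 GB spare, which is where the extra MI300X capacity starts to matter.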
The community reaction on Hacker News tracked the obvious second-order question: if memory is the bottleneck, why isn't the memory vendor capturing more of the margin? SK Hynix's HBM3e is sold out and prices have roughly doubled year-over-year, but Nvidia's gross margin on a fully-built H200 system is still north of 70%. The answer is the integration premium — the interposer, the packaging, the validated thermal solution, and the software stack. DRAM is a commodity; integrated AI accelerators are not, yet. The interesting question for the next five years is whether HBM vendors can climb the value chain by selling more integrated subsystems, or whether they remain the supplier that captures the volume but not the margin.
If you're a practitioner doing capacity planning for 2026, the operational implication is direct: when you model GPU availability and pricing, the supply curve to watch is HBM, not logic. Cloud providers' GPU shortages aren't going to ease because TSMC opens another N3 line — they ease when Micron's Taichung HBM expansion comes online and when CoWoS-L capacity ramps. That's a different timeline and a different set of public earnings calls to track.
For inference workloads specifically, the math now favors fitting your model in fewer, fatter GPUs over more, smaller ones — because the marginal cost of HBM-per-GB is what you're optimizing against, and tensor parallelism across nodes is paying the latency tax twice (NVLink isn't HBM, and Ethernet really isn't HBM). MI300X's 192GB and H200's 141GB are the targets to think in. If your serving stack is built around H100-era 80GB assumptions — heavy tensor parallelism, aggressive KV cache eviction, paged attention — you may be able to simplify the topology in 2026 with parts that just hold the whole model.
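A rough way to see the "fewer, fatter GPUs" point is to add the KV cache to the weight footprint. This is a minimal sketch: the model shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 70B parameters in BF16) and the serving load (32 concurrent 4096-token sequences) are hypothetical, and the HBM capacities are the figures cited above.

```python
import math

# Approximate HBM capacity per GPU in GB (figures cited in this article).
HBM_GB = {"H100": 80, "H200": 141, "MI300X": 192}


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V tensors of every layer for every cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def min_gpus(total_gb: float, gpu: str) -> int:
    """Lower bound on GPUs needed to hold total_gb; ignores activations and runtime overhead."""
    return math.ceil(total_gb / HBM_GB[gpu])


# Hypothetical 70B-class model with grouped-query attention (shape is an assumption).
WEIGHTS_GB = 70 * 2  # 70B parameters at 2 bytes each
kv_gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                       tokens=32 * 4096) / 1e9  # 32 concurrent 4K-token sequences
total_gb = WEIGHTS_GB + kv_gb
for gpu in HBM_GB:
    print(f"{gpu:7s} needs >= {min_gpus(total_gb, gpu)} GPU(s) for {total_gb:.1f} GB")
```

Under these assumptions the KV cache adds roughly 43 GB, and the whole working set fits on one MI300X, needs two H200s, and needs at least three H100s. A real deployment also needs headroom for activations and framework overhead, and tensor-parallel degrees are typically powers of two, so the H100 case would likely become four GPUs in practice.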
For training, the same logic suggests that the next architectural win for frontier labs isn't a smarter optimizer — it's any technique that reduces bytes moved per FLOP. That's why mixture-of-experts, sparse attention variants, FP8 weights, and KV cache quantization keep getting research attention disproportionate to their accuracy gains. They directly attack the line item that's eating the BOM.
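KV cache quantization is the easiest of these to show in a few lines. The article names FP8; the sketch below swaps in simpler symmetric per-tensor int8 (absmax) quantization, which has the same effect on bytes per element (1 instead of 2 for FP16) and needs only the standard library. The function names and the sample values are hypothetical.

```python
def quantize_absmax_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: scale = max|x| / 127, q = round(x / scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    # Clamp to the symmetric int8 range in case of floating-point edge effects.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical slice of a KV cache block.
kv_block = [0.12, -1.5, 0.77, 2.54, -0.03, 1.01, -2.2, 0.0]
q, scale = quantize_absmax_int8(kv_block)

fp16_bytes = 2 * len(kv_block)  # 2 bytes per FP16 value
int8_bytes = len(q) + 4         # 1 byte per value plus one float32 scale
max_err = max(abs(a - b) for a, b in zip(kv_block, dequantize(q, scale)))
print(f"fp16={fp16_bytes} B  int8={int8_bytes} B  max_err={max_err:.4f}")
```

The rounding error is bounded by half the scale, and for realistic blocks of thousands of values the 4-byte scale is negligible, so the bytes moved per attention step approach half of the FP16 figure.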
The 'AI chip' as a discrete object is a useful fiction; what's actually being sold is a memory-bandwidth subsystem with enough logic attached to consume the bandwidth. Expect the next two generations of accelerators to lean harder into this — HBM4 arrives in 2026 with wider interfaces and stacked logic, and the long-rumored 'memory-centric computing' research from Samsung and SK Hynix (processing-in-memory, near-memory compute) suddenly looks less speculative when memory is already two-thirds of the cost. The companies that win the next round won't be the ones with the best matrix multipliers; they'll be the ones who figured out how to move fewer bytes.