May 18, 2026

HBM: The Chip Nobody Planned For

AI GPU prices have roughly doubled in the past 18 months. Cloud AI compute costs keep climbing. And at the centre of all of it is a chip most people outside the semiconductor industry had never heard of three years ago: High Bandwidth Memory, or HBM.

Here is the uncomfortable part. The memory industry came out of a brutal 2022–2023 oversupply bust — manufacturers slashed production to defend prices — then watched AI demand explode in 2024 with almost no spare capacity left. The buffer was gone.

This article is the technical foundation for the AI memory crunch series. It gives you a plain-English explanation of what HBM is, why it is so expensive to make, and why the shortage is structural rather than a temporary blip. It assumes you are technically literate — you have written software, you understand servers — but you are not a semiconductor engineer. Good. Let’s get into it.

What is High Bandwidth Memory and why does every AI chip need it?

High Bandwidth Memory is a type of DRAM, built on the same base technology as the DDR5 in a server but physically restructured to deliver radically more bandwidth. Instead of sitting in a slot on the PCB, away from the processor, HBM stacks multiple memory dies vertically and mounts them within approximately 1 mm of the GPU die, connected via a silicon interposer.

The bandwidth gap is the clearest way to understand the difference. HBM3E delivers approximately 1.2 TB/s per stack. DDR5 in dual-channel delivers approximately 102 GB/s. That is roughly a 12-to-1 ratio, and an accelerator carries several stacks.

Modern large language models use a transformer architecture. During inference — generating each token — the GPU must load model weights and KV cache values from memory on every single step. The processor can compute far faster than commodity DRAM can supply data. This is the memory wall, and generative AI has weaponised it at commercial scale.
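To make the memory wall concrete, here is a minimal back-of-envelope sketch. It assumes every generated token requires streaming all model weights from memory once, and it uses a hypothetical 70-billion-parameter FP16 model (an illustrative assumption, not a figure from this article). The two bandwidth numbers are the per-stack HBM3E and dual-channel DDR5 figures quoted above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed when memory bandwidth is the limit."""
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical model for illustration: 70 billion parameters at 2 bytes each (FP16).
PARAMS = 70e9
BYTES_PER_PARAM = 2
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # GB streamed per generated token

# Bandwidth figures quoted in this article.
HBM3E_STACK_GB_S = 1200.0        # ~1.2 TB/s per stack
DDR5_DUAL_CHANNEL_GB_S = 102.0   # ~102 GB/s

hbm_bound = max_tokens_per_second(HBM3E_STACK_GB_S, weights_gb)
ddr_bound = max_tokens_per_second(DDR5_DUAL_CHANNEL_GB_S, weights_gb)

print(f"One HBM3E stack:   at most {hbm_bound:.1f} tokens/s")
print(f"Dual-channel DDR5: at most {ddr_bound:.2f} tokens/s")
```

The sketch ignores capacity (a single stack could not hold this model), batching, and the KV cache. Its only point is that the ceiling on token rate scales directly with memory bandwidth, whatever the compute throughput.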

The KV cache holds the “key” and “value” attention representations of every prior token in the context window. For a 128,000-token context window at FP16 precision, that can run to tens of gigabytes per active request once every transformer layer is counted; the exact figure depends on the model’s layer count and attention configuration. Reading all of it back on every step is why inference is memory bandwidth-bound, not compute-bound. HBM is not a nice-to-have.
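The size of a KV cache follows from simple multiplication: two tensors (keys and values) per layer, per token. The sketch below uses a hypothetical model configuration (80 layers, 16 key/value heads, head dimension 128). Those three values are assumptions picked for illustration; real models differ widely, and the result moves with them.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """Bytes needed to cache keys and values for one request."""
    tensors_per_layer = 2  # one key tensor and one value tensor
    return tensors_per_layer * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration, for illustration only.
CONFIG = dict(layers=80, kv_heads=16, head_dim=128)

for context in (8_000, 128_000, 1_000_000):
    size_gb = kv_cache_bytes(context, **CONFIG) / 1e9
    print(f"{context:>9,} tokens -> {size_gb:7.1f} GB per request")
```

With these assumed values the cache comes to about 84 GB at 128,000 tokens and about 655 GB at 1 million tokens. The cache grows linearly with context length, so every doubling of the window doubles the memory held per active request.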

The H100 carries 80 GB of HBM3; the H200 carries 141 GB of HBM3E; the Blackwell B300 carries 288 GB. Capacity per GPU grew by roughly 1.8x and then 2x across those steps. HBM now accounts for 50–60% of GPU manufacturing cost, which is why it sits at the centre of this structural memory shortage.

How does die stacking work, and what are Through-Silicon Vias?

HBM achieves its bandwidth advantage by stacking up to 16 DRAM dies vertically and connecting them through tiny copper conductors drilled through the silicon itself. Picture a multi-storey car park: each floor is a memory die, and the elevator shafts running floor-to-floor are Through-Silicon Vias (TSVs).

A TSV is a vertical electrical connection drilled through the full thickness of a silicon die and filled with copper. In an HBM3E 12-Hi stack, each die is ground to approximately 30 µm thick — roughly a third of the diameter of a human hair. The dies are joined at each TSV endpoint by microbumps, tiny solder connections completing the circuit.

The result: a 1,024-bit-wide memory bus compared to DDR5’s 64-bit channel. That is why bandwidth is so dramatically higher even though the underlying DRAM cell technology is identical.
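Peak bandwidth is bus width multiplied by the data rate of each pin. The sketch below applies that relation. The per-pin rates (9.6 Gbit/s for HBM3E and 6.4 Gbit/s for DDR5-6400) are assumptions chosen to be consistent with the bandwidth figures in this article, not values taken from it.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth in GB/s: every pin moves pin_rate gigabits per second."""
    return bus_width_bits * pin_rate_gbit_s / 8  # 8 bits per byte


# Assumed per-pin data rates (illustrative).
HBM3E_PIN_RATE = 9.6   # Gbit/s per pin
DDR5_PIN_RATE = 6.4    # Gbit/s per pin, i.e. DDR5-6400

hbm3e_stack = peak_bandwidth_gb_s(1024, HBM3E_PIN_RATE)
ddr5_channel = peak_bandwidth_gb_s(64, DDR5_PIN_RATE)
ddr5_dual = 2 * ddr5_channel

print(f"HBM3E stack:       {hbm3e_stack:.1f} GB/s")
print(f"DDR5 dual-channel: {ddr5_dual:.1f} GB/s")
print(f"Ratio:             {hbm3e_stack / ddr5_dual:.1f}x")
```

Under these assumptions the per-pin rates differ by only 1.5x. Almost all of the gap comes from the sixteen-fold difference in bus width, which is exactly what the TSV stack makes possible.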

Both the HBM stack and the GPU die sit on a silicon interposer — a passive substrate providing the dense wiring to connect HBM’s wide bus to the processor. The assembly is integrated through TSMC’s CoWoS (Chip on Wafer on Substrate) process, which is itself a supply constraint covered in CoWoS advanced packaging — the second bottleneck in the chain.

All of those steps — TSV drilling, wafer thinning, microbump formation — compound to produce the most important economic fact in the shortage.

Why does building one gigabyte of HBM consume three times the wafer capacity of DDR5?

Micron has stated on record that producing one unit of HBM output requires the wafer capacity that would otherwise have yielded three units of DDR5 output. That 3-to-1 ratio is the single most important economic fact in this shortage.

Three compounding causes explain it.

TSV area overhead. Each TSV and its surrounding keep-out zone consumes silicon area that would otherwise hold memory cells. An HBM die is significantly less bit-dense per unit of wafer area than an equivalent DDR5 die. You are trading storage density for interconnect capability on every square millimetre of silicon.

Multi-step back-end processing. Standard DRAM finishes at the wafer level with relatively straightforward packaging. HBM requires TSV etch, TSV fill, wafer thinning, microbump formation, die-to-die stacking, and CoWoS integration — each step adding its own yield loss.

Wafer thinning yield penalty. At 30 µm thickness, silicon is fragile — closer to glass than rigid chip material. Breakage here is not trivial.
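Yield losses multiply across steps rather than add. The sketch below shows the arithmetic for the back-end steps listed above. The 98% per-step yield is a hypothetical placeholder; actual step yields are not given in this article.

```python
import math


def compound_yield(step_yields: list[float]) -> float:
    """Fraction of units surviving every step, assuming independent losses."""
    return math.prod(step_yields)


# Hypothetical per-step yields, for illustration only.
BACK_END_STEPS = {
    "TSV etch": 0.98,
    "TSV fill": 0.98,
    "wafer thinning": 0.98,
    "microbump formation": 0.98,
    "die-to-die stacking": 0.98,
    "CoWoS integration": 0.98,
}

overall = compound_yield(list(BACK_END_STEPS.values()))
print(f"Overall back-end yield: {overall:.1%}")
```

Six steps at an assumed 98% each leave under 89% of units intact. Each extra step, or each extra die in the stack, pushes the figure lower, which is one reason HBM consumes more wafer capacity per shipped gigabyte.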

Every gigabyte of HBM shipped removes roughly three gigabytes of DDR5 output from the market, because a wafer diverted to HBM yields only about a third of the bits it would have produced as DDR5. DRAM lead times have stretched beyond 40 weeks. Every HBM wafer is a wafer denied to the server RAM in your cloud provider’s next refresh.
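The effect of Micron's 3-to-1 ratio on total output can be sketched with a small allocation model. Output is measured in normalised units (one unit is whatever one wafer yields as DDR5), and the 100-wafer fab and 30% HBM share are hypothetical numbers for illustration.

```python
HBM_WAFER_RATIO = 3.0  # wafer capacity per unit of HBM relative to DDR5, per Micron


def dram_output(total_wafers: float, hbm_share: float) -> tuple[float, float]:
    """Return (ddr5_units, hbm_units) for a given split of wafer starts."""
    hbm_wafers = total_wafers * hbm_share
    ddr5_wafers = total_wafers - hbm_wafers
    ddr5_units = ddr5_wafers * 1.0             # one unit per DDR5 wafer
    hbm_units = hbm_wafers / HBM_WAFER_RATIO   # a third of a unit per HBM wafer
    return ddr5_units, hbm_units


# Hypothetical fab: 100 wafer starts, 30% redirected to HBM.
ddr5_units, hbm_units = dram_output(100, 0.30)
print(f"DDR5 output: {ddr5_units:.0f} units (down from 100)")
print(f"HBM output:  {hbm_units:.0f} units")
print(f"Total bits:  {ddr5_units + hbm_units:.0f} units")
```

In this example the fab gives up 30 units of DDR5 to gain 10 units of HBM, so total bit output falls from 100 to 80 even though no equipment went idle.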

What is wafer reallocation, and why is the shortage a policy decision rather than a physics problem?

Wafer reallocation is the deliberate decision by a DRAM manufacturer to redirect wafer starts — units of production scheduled on fab equipment — from lower-margin products like DDR5 and LPDDR5 to higher-margin HBM. It is a scheduling decision, not a physical constraint.

The cleanroom equipment that runs DDR5 can, with modification, run HBM dies, so nothing physical stops a manufacturer from switching back. The allocation reverses only when the economics shift.

Right now, the economics are not shifting. HBM3E commands prices approximately 6–8x higher per gigabyte than conventional DDR5. SK Hynix posted record operating profit for full-year 2025, surpassing Samsung for the first time since 1992. Micron’s HBM capacity for both 2025 and 2026 is fully committed.
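The incentive can be expressed per wafer rather than per gigabyte. A minimal sketch, combining the 6–8x price premium above with Micron's 3-to-1 wafer ratio, and ignoring the extra processing cost of HBM (a simplifying assumption):

```python
def revenue_per_wafer_ratio(price_premium_per_gb: float, wafer_ratio: float = 3.0) -> float:
    """HBM revenue per wafer relative to DDR5.

    A wafer yields 1/wafer_ratio as many gigabytes as HBM, but each
    gigabyte sells for price_premium_per_gb times the DDR5 price.
    """
    return price_premium_per_gb / wafer_ratio


low = revenue_per_wafer_ratio(6.0)
high = revenue_per_wafer_ratio(8.0)
print(f"HBM revenue per wafer: {low:.1f}x to {high:.1f}x that of DDR5")
```

Even after losing two thirds of the bits, a wafer allocated to HBM brings in roughly two to two-and-a-half times the revenue under these figures. Margins will be lower than that once back-end costs are counted, but the direction of the incentive is clear.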

SK Hynix holds approximately 45% of HBM market share and was first to ship HBM3E at scale. Samsung’s HBM3E modules did not qualify at Nvidia until November 2025. That yield situation is counter-intuitive: capacity that cannot ship qualifying product still consumes wafer starts and removes them from DDR5 supply. Bad yield tightens supply. The full competitive picture is in SK Hynix and Samsung — who makes HBM and why one dominates.

HBM3, HBM3E, HBM4 — what changes with each generation and why does it compound the shortage?

Each generation of HBM delivers higher bandwidth and capacity per stack but adds manufacturing complexity in doing so. The shortage deepens with each generation rather than plateauing — and AI chip roadmaps are already committed to the most complex generation before supply is established.

HBM3 (JEDEC January 2022) was first shipped in Nvidia’s H100, delivering up to 819 GB/s per stack via a 1,024-bit bus across stacks of 8 to 12 dies. It established HBM as the default AI accelerator memory interface.

HBM3E (current mainstream, 2024–2025) raises per-stack bandwidth to approximately 1.2 TB/s over the same 1,024-bit bus by running each pin faster. The H200 and AMD MI325X are among its primary consumers (the earlier MI300X uses HBM3). SK Hynix was first to mass production; Samsung qualified at Nvidia only in November 2025.

HBM4 (JEDEC April 2025, shipping 2026) doubles the interface to 2,048 bits and targets 2 TB/s, with up to 16-Hi stacks and 64 GB capacity. The base die is fabricated at TSMC’s N5 node — the first time HBM’s logic layer has required a leading-edge logic process, adding a new dependency on TSMC capacity. SK Hynix shipped its first HBM4 samples to Nvidia in March 2026, six to nine months ahead of Samsung and Micron.
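One way to see where each generation gets its bandwidth is to work backwards from the headline figure to the data rate each pin must sustain. The sketch below does that using only the bus widths and per-stack bandwidths quoted in this section; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmGeneration:
    name: str
    bus_width_bits: int
    bandwidth_gb_s: float  # per stack

    def implied_pin_rate_gbit_s(self) -> float:
        """Per-pin data rate needed to reach the stated bandwidth."""
        return self.bandwidth_gb_s * 8 / self.bus_width_bits


GENERATIONS = [
    HbmGeneration("HBM3", 1024, 819.0),
    HbmGeneration("HBM3E", 1024, 1200.0),
    HbmGeneration("HBM4", 2048, 2000.0),
]

for gen in GENERATIONS:
    print(f"{gen.name:<6} {gen.bus_width_bits:>5}-bit  "
          f"{gen.bandwidth_gb_s:>6.0f} GB/s  "
          f"~{gen.implied_pin_rate_gbit_s():.1f} Gbit/s per pin")
```

On these numbers HBM3E gains its bandwidth from faster pins, while the HBM4 target needs a lower per-pin rate than HBM3E and gets its gain from doubling the bus width. A wider bus means more connections through the stack, which is where the added manufacturing complexity comes from.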

Nvidia’s Vera Rubin platform was designed around HBM4 before HBM4 supply was established. That is exactly why the fab timeline in the next section matters.

Why won’t new fabs coming online in 2027 fix this?

New fab capacity will not quickly resolve the shortage. The gap between a groundbreaking announcement and qualified production output is measured in years — and the capacity arriving first is already spoken for.

The construction lag is 3–5 years from groundbreaking to first qualified output. Qualifying a finished fab for HBM production adds another 6–12 months of test runs and customer sign-off. “Fab running” and “qualified HBM shipping at volume” are not the same milestone.
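The two lags stack on top of each other. A minimal sketch of the arithmetic, using the 3–5 year construction range and 6–12 month qualification range above; the groundbreaking date in the example is hypothetical.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, pinned to the first of the month."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def qualified_output_window(groundbreaking: date) -> tuple[date, date]:
    """Earliest and latest dates for qualified HBM output from a new fab."""
    earliest = add_months(groundbreaking, 3 * 12 + 6)   # 3 years build + 6 months qualification
    latest = add_months(groundbreaking, 5 * 12 + 12)    # 5 years build + 12 months qualification
    return earliest, latest


# Hypothetical groundbreaking date, for illustration only.
earliest, latest = qualified_output_window(date(2025, 1, 1))
print(f"Qualified output between {earliest:%B %Y} and {latest:%B %Y}")
```

A fab that broke ground at the start of 2025 would, on these ranges, ship qualified HBM somewhere between mid-2028 and early 2031.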

Known timelines: Micron’s Singapore and Boise, Idaho fabs come online mid-2027. SK Hynix Cheongju follows in 2027, Indiana in late 2028. Samsung Pyeongtaek is 2028. Intel CEO Lip-Bu Tan was direct: “There’s no relief until 2028.”

When that new capacity arrives, it will not be on the open market. Hyperscalers are signing multi-year supply agreements to lock in capacity before fabs are even operational. SK Hynix has stated that HBM4 demand over the next three years already exceeds its supply capacity.

GPU availability and cloud AI pricing are unlikely to ease materially before 2028. Build that into your infrastructure planning — the budget implications are in what these price surges mean for enterprise hardware budgets.

Why is agentic AI worse for memory demand than large model training?

Training gets most of the attention when people talk about AI memory demand. But agentic inference is structurally more memory-intensive per unit of useful output.

Training is bounded and predictable: data flows through in batches, gradients are computed, weights are updated. You can tune a cluster for it. Agentic inference is a loop: reason, call tools, retrieve data, generate intermediate outputs, arrive at a response. Each step reloads context into the KV cache. SK Hynix said this explicitly in Q1 2026 earnings: “As AI evolves from large-scale model training to agentic AI… demand for memory is expected to continue growing.” That assessment comes from Nvidia’s primary HBM supplier.

As context windows grow from 8,000 to 1 million tokens, the KV cache grows proportionally — exceeding 600 GB per request at 1 million tokens. Cloud providers serve many simultaneous sessions, each maintaining its own cache. An agentic fleet’s memory requirements scale with adoption and context length in ways no training job ever did. Micron predicts the HBM market will grow from $35 billion in 2025 to $100 billion by 2028 — larger than the entire DRAM market in 2024.
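To see how fleet memory scales, the sketch below multiplies concurrent sessions by cache size and divides by the HBM on one accelerator. It takes the 600 GB figure at 1 million tokens and the 288 GB B300 capacity from this article, scales the cache linearly with context length, and ignores model weights and any cache sharing between sessions. The session count of 1,000 is a hypothetical value.

```python
import math

CACHE_GB_AT_1M_TOKENS = 600.0  # figure quoted in this article
HBM_PER_GPU_GB = 288.0         # Blackwell B300 capacity quoted in this article


def kv_cache_gb(context_tokens: int) -> float:
    """Cache size per session, scaled linearly with context length."""
    return CACHE_GB_AT_1M_TOKENS * context_tokens / 1_000_000


def gpus_for_cache(sessions: int, context_tokens: int) -> int:
    """GPUs needed to hold the KV caches alone, with no room for weights."""
    total_gb = sessions * kv_cache_gb(context_tokens)
    return math.ceil(total_gb / HBM_PER_GPU_GB)


# Hypothetical fleet of 1,000 concurrent sessions.
for context in (8_000, 128_000, 1_000_000):
    print(f"{context:>9,} tokens -> {gpus_for_cache(1_000, context):>5,} GPUs of HBM")
```

Demand here is the product of two growing quantities, sessions and context length. Doubling either one doubles the HBM required, and doubling both quadruples it.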

The HBM shortage is not a temporary supply blip. It is the simultaneous compounding of three forces: a physics constraint (the memory wall), a deliberate economic choice (wafer reallocation), and a demand acceleration (agentic AI creating memory requirements that scale with both adoption and context length).

The question to carry out of this article is whether this is structural or cyclical. Cyclical corrections require a period of oversupply. Given advance commitment of new fab capacity and demand forecasts that continue to outpace supply projections, that oversupply window is not visible in the near term.

The manufacturer concentration that makes this shortage so acute — and so resistant to market correction — is the subject of SK Hynix and Samsung — who makes HBM and why one dominates. The packaging bottleneck that constrains GPU production even when HBM dies are available is in CoWoS advanced packaging — the second bottleneck in the chain. What all of this means for enterprise hardware budgets is in what these price surges mean for enterprise hardware budgets. For a complete overview of how all these forces interact, see our comprehensive AI memory crunch resource.

Frequently Asked Questions

Is HBM the same as GPU memory?

Not exactly. HBM is the memory inside modern AI GPUs, but consumer gaming GPUs use GDDR memory such as GDDR6X, which is cheaper and lower-bandwidth. HBM is reserved for data-centre AI accelerators (H100, H200, B200, MI300X) where bandwidth is the binding constraint.

Can HBM be used in regular servers or laptops?

No. HBM requires a silicon interposer and CoWoS packaging that is incompatible with standard DIMM slots or laptop form factors, and costs roughly 6–8x more per gigabyte than DDR5. Apple’s M-series unified memory is a different approach — it is not HBM.

Is the HBM shortage the same as the DDR5 shortage?

Related but distinct. The HBM shortage is driven directly by AI demand. The DDR5 shortage is a secondary effect: every gigabyte of HBM produced consumes the wafer capacity of roughly three gigabytes of DDR5. DDR5 prices are collateral damage.

Why do AI companies need so much memory if they already have fast processors?

Processors can only work on data already in fast memory. If data cannot be loaded fast enough, the processor waits idle — this is the memory wall. For LLMs, model weights and KV cache are enormous relative to on-chip SRAM. Without HBM bandwidth, raw compute is wasted.

What is wafer capacity and why does it matter to me?

Wafer starts per month is the fundamental production unit for semiconductor fabs, and it is fixed in the short term. Because HBM consumes roughly 3x the wafer capacity per gigabyte versus DDR5, every gigabyte of HBM produced removes about three gigabytes of DDR5 supply from the market. That lands on your cloud and hardware costs.

Why is RAM so expensive right now in 2026?

Two compounding factors: wafer reallocation from DDR5 to HBM cut commodity DRAM output, and production capacity was slashed in 2022–2023, leaving no buffer when AI demand surged in 2024. Samsung and SK Hynix raised server memory prices by up to 70% in Q1 2026, following 50% increases throughout 2025.

What is the difference between HBM3, HBM3E, and HBM4?

HBM3 (2022) delivers up to 819 GB/s per stack via a 1,024-bit bus and was first used in the H100. HBM3E (2024) delivers approximately 1.2 TB/s on the same bus and is used in the H200 and AMD MI325X. HBM4 (2025) doubles the bus to 2,048 bits and targets 2 TB/s. Each generation adds capability and manufacturing complexity.

How does HBM connect to a GPU physically?

HBM stacks and the GPU die both sit on a silicon interposer — a passive substrate providing the dense wiring to connect HBM’s wide bus to the processor. The assembly is integrated through TSMC’s CoWoS process, which is itself a supply constraint covered in CoWoS advanced packaging — the second bottleneck in the chain.

What is a KV cache and why does it need so much GPU memory?

The KV cache stores the “key” and “value” attention representations of every prior token in the context window, so the model does not recompute them on each generation step. At a 1 million-token context the cache can exceed 600 GB per active request, depending on the model. At long contexts, context window size, not model weight size, becomes the dominant memory constraint.

Will the HBM shortage ease once new fabs open?

Partially, but slowly. Micron’s Singapore and Boise fabs arrive in 2027; SK Hynix Indiana and Samsung Pyeongtaek follow in 2028. That capacity is pre-committed to hyperscalers. SK Group Chairman Chey Tae-won (March 2026) put the shortage persisting until 2030, with a projected wafer shortfall exceeding 20%.

Who makes HBM and why are there only three suppliers?

SK Hynix (~45% share), Samsung, and Micron. Producing HBM requires mastery of TSV processing, advanced packaging, and a decade of co-development with GPU makers. A greenfield DRAM fab costs $15–25 billion with a 4–5 year lead time; no new entrant is plausible at scale. The full competitive picture is in SK Hynix and Samsung — who makes HBM and why one dominates.
