DRAM prices have doubled. HBM is sold out through at least 2027. And unless you’ve got the kind of procurement leverage that comes with being a hyperscaler, you’re wearing the full impact. Every infrastructure line item — cloud GPU instances, on-prem servers, endpoint devices — costs more than it did six months ago. And it’s staying that way through the end of 2027.
You need to work two tracks at the same time:
- Hardware-side decisions: cloud vs on-prem, procurement tactics, hardware refresh timing
- Software-side mitigations: model quantisation, KV cache optimisation, inference architecture
This article walks through both, with a clear recommendation at the end of each section. If you want the full context of the AI memory shortage first, start with the overview piece.
What is the actual cost impact of the memory shortage on enterprise infrastructure budgets?
Let’s talk numbers. DRAM contract prices surged 90–95% quarter-on-quarter in Q1 2026 — revised upward from an initial 55–60% forecast. DDR5 32GB modules have crossed the $500 mark. Samsung bumped their 32GB DDR5 module prices from $149 to $239 — a 60% increase, with contract pricing surging more than 100%.
That flows through to everything you buy. Enterprise PC and laptop prices are expected to rise 17% in 2026 because memory now makes up 23% of the total bill of materials, up from 16% in 2025. Cloud GPU instance pricing is climbing too — AWS raised EC2 Capacity Block prices for ML instances by around 15% in January 2026.
The shortage hits both CapEx and OpEx. There’s no cost-free path right now. And it hits smaller organisations harder — the hyperscalers lock in pricing through Long-Term Supply Agreements with multi-year, high-volume commitments. If you’re buying on shorter contracts or spot pricing, you’re absorbing the worst of the volatility.
Gartner projects a 130% year-on-year rise in DRAM prices in 2026, with prices not expected to normalise through the end of 2027. Intel CEO Lip-Bu Tan put it bluntly: “There’s no relief until 2028.”
There are actually two distinct shortages happening at once. HBM production eats roughly three times the wafer capacity of standard DRAM per gigabyte, and SK Hynix (~50% market share), Samsung (~40%), and Micron (~10%) have it sold out through at least 2027. Then the DDR5/NAND shortage affects everything else as manufacturers shift capacity toward HBM. The companion pieces on the price data to use in budget modelling and on why AI applications need so much memory cover the foundations.
The recommendation: Budget for sustained elevated pricing through 2027. Model your infrastructure spend across cloud GPU instances, on-prem server DRAM, and enterprise endpoint devices — understand your exposure in each.
Cloud or on-prem: which is the better choice during a memory shortage?
Neither is free. Cloud providers are passing HBM cost increases through. On-prem server DRAM has doubled. The right answer depends on five things: your existing infrastructure investment, your team’s operational capability, the type of workload (training vs inference), how much contract flexibility you need, and whether you prefer CapEx or OpEx.
Cloud has some real advantages during the shortage: no upfront capital for hardware that may depreciate faster than expected, the ability to blend spot and reserved instances, and access to alternative silicon like AWS Trainium for compatible workloads. On-prem has its own: predictable costs once you’ve bought the hardware, better economics at sustained high utilisation, and no exposure to future cloud price increases.
Be honest about the TCO comparison. A proper apples-to-apples model needs a 3-year horizon that includes hardware amortisation, staffing costs to manage the infrastructure, power, networking, and opportunity cost. Cloud looks expensive on a per-hour basis but it includes all of that. On-prem looks cheap on paper until you add the hidden costs — data egress fees, idle resource costs, and the people to keep it running. And don’t assume cloud availability is a given — as hyperscalers prioritise scarce GPU capacity, other workloads can experience throttling or degraded performance.
But utilisation makes or breaks on-prem. Nearly half of enterprises waste millions on underutilised GPU capacity. Industry data shows inference GPU utilisation as low as 15–30% without active management. If your GPUs are sitting idle, cloud is cheaper. It really is that simple.
Here are the rough breakpoints:
- 1–8 GPUs: Cloud wins. The operational overhead of managing a small on-prem cluster usually costs more than the compute itself.
- 8–64 GPUs, sustained utilisation above 70%: On-prem becomes competitive over a 3-year TCO horizon.
- In between or uncertain: Hybrid — cloud for burst, on-prem for baseline. This is where most organisations in the 50–500 employee range end up.
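To see where those breakpoints come from, here is a minimal 3-year TCO sketch. Every number in it — the $4/hour cloud rate, the $30,000 GPU price, the opex and staffing figures — is an illustrative assumption, not a quote; substitute your own pricing and utilisation data.

```python
# Illustrative 3-year TCO comparison: cloud vs on-prem GPU capacity.
# All prices below are placeholder assumptions -- use your actual quotes.

def cloud_tco(gpus, hourly_rate, utilisation, years=3):
    """Cloud: you pay only for the hours you actually use."""
    hours = years * 365 * 24 * utilisation
    return gpus * hours * hourly_rate

def onprem_tco(gpus, gpu_price, years=3,
               annual_opex_per_gpu=4_000,   # power, networking, space (assumed)
               annual_staff_cost=150_000):  # ops headcount (assumed)
    """On-prem: full capital cost plus opex, regardless of utilisation."""
    return gpus * gpu_price + years * (gpus * annual_opex_per_gpu + annual_staff_cost)

for util in (0.15, 0.40, 0.70):
    cloud = cloud_tco(16, hourly_rate=4.00, utilisation=util)
    onprem = onprem_tco(16, gpu_price=30_000)
    winner = "cloud" if cloud < onprem else "on-prem"
    print(f"utilisation {util:.0%}: cloud ${cloud:,.0f} vs on-prem ${onprem:,.0f} -> {winner}")
```

With these assumed inputs the crossover lands near 70% sustained utilisation — cloud wins comfortably at 15% and 40%, on-prem edges ahead at 70% — which is exactly why honest utilisation data matters more than list prices.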
How do reserved instances and spot pricing compare for AI workloads?
Spot instances cut your cloud GPU costs by 60–70% below on-demand pricing, but they can be interrupted. For training: implement checkpoint/resume patterns (saving model state to S3 at regular intervals) and interruptions become manageable. For production inference: use reserved capacity for baseline demand, spot for overflow. Running a mixed fleet across availability zones improves spot availability. Orchestration frameworks like Kubeflow or Ray automate instance selection and failover.
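The checkpoint/resume pattern is simple enough to sketch in a few lines. The `save_state`/`load_state` helpers below are hypothetical stand-ins for S3 uploads and downloads (in practice you'd use boto3 or your framework's checkpointing); local files and a placeholder training step keep the sketch self-contained.

```python
# Sketch of a checkpoint/resume loop for spot-instance training.
# save_state/load_state stand in for S3 transfers; local files keep it runnable.
import json
import os

CHECKPOINT = "checkpoint.json"   # in production: an S3 key, not a local path
CHECKPOINT_EVERY = 100           # steps between saves (tune to interruption risk)

def save_state(step, state):
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}     # no checkpoint found: fresh start

def train(total_steps=1_000):
    step, state = load_state()   # resume wherever the last instance died
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)   # placeholder for a real training step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_state(step, state)        # cheap insurance against interruption
    return step

train()
```

A spot interruption simply kills the process mid-loop; the replacement instance calls `train()` again and loses at most `CHECKPOINT_EVERY` steps of work.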
The recommendation: Model the decision using your actual utilisation data. If you don’t have utilisation data, start on cloud until you do.
That covers the hardware-side decisions. But there’s a parallel track that’s faster to implement and doesn’t require any new procurement at all — the software side.
How can you reduce your AI applications’ memory footprint without sacrificing performance?
Software-side mitigation requires no new procurement. You can start today.
Start with quantisation. 4-bit quantisation cuts memory footprint by approximately 75% versus 16-bit models; 8-bit cuts it by approximately 50%. Weight-only quantisation (A16W4) works best for memory-bound workloads with small batch sizes — that’s the typical inference pattern for organisations running their own models. Activation quantisation (A8W8) suits high-throughput serving with large batch sizes.
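The arithmetic behind those percentages is straightforward: weight memory is parameter count times bits per weight. A back-of-envelope calculator, using a 7B-parameter model as the example (weights only — KV cache, activations, and optimiser state are extra):

```python
# Back-of-envelope model-weight memory at different precisions.

def weight_memory_gb(params_billions, bits_per_weight):
    """Weight storage only -- excludes KV cache, activations, optimiser state."""
    # params * (bits/8) bytes, expressed in decimal GB
    return params_billions * bits_per_weight / 8

baseline = weight_memory_gb(7, 16)  # a 7B model served in FP16/BF16
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(7, bits)
    print(f"{label}: {gb:.1f} GB ({1 - gb / baseline:.0%} smaller than FP16)")
```

For the 7B example: 14 GB at FP16, 7 GB at INT8 (50% smaller), 3.5 GB at INT4 (75% smaller) — which is the difference between needing a data-centre GPU and fitting on a workstation card.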
Dropbox Dash is a good reference point. Their engineering team deployed FP8, INT8, and INT4 quantisation along with KV cache quantisation across their AI product, and they documented concrete cost and performance results. Worth looking at if you want to see what this looks like in production rather than just benchmarks.
Which quantisation format should you try first?
For models under 13B parameters on workstation GPUs: start with INT4 weight-only. Larger models on data centre GPUs (A100, H100): FP8 preserves more accuracy while cutting memory roughly in half. INT8 is a solid general-purpose choice. Newer formats like MXFP and NVFP4 are worth watching as hardware support broadens. The accuracy tradeoff depends on the task — summarisation tolerates quantisation well, precise numerical reasoning less so. Test on your actual workload.
What is the KV cache and how does it drive inference costs?
The KV cache is the model’s short-term memory — it stores intermediate attention results during inference so the model doesn’t have to recompute them. Every new token needs to “look back” to every previous token. Longer context windows mean proportionally larger KV caches. If you’re running a 128K context window when 16K would do, you’re burning memory for nothing.
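The cache’s size follows directly from the model’s shape: two tensors (K and V) per layer, per attention head, per token of context. A back-of-envelope calculator, using an illustrative 7B-class configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache — assumed round numbers, not any specific model’s published spec):

```python
# Back-of-envelope KV cache size per request.
# Formula: 2 (K and V) x layers x kv_heads x head_dim x context x bytes/element.

def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 7B-class config, FP16 (2-byte) cache entries:
for ctx in (16_000, 128_000):
    print(f"{ctx:>7} tokens: {kv_cache_gb(32, 32, 128, ctx):.1f} GB per request")
```

Under these assumptions a single request costs roughly 8 GB of cache at 16K tokens and about 67 GB at 128K — the cache scales linearly with context, so an oversized context window is pure wasted memory.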
PagedAttention, implemented in vLLM, treats GPU memory like virtual memory — it eliminates fragmentation and enables prefix sharing across requests. If you’re running LLM inference in production without vLLM or something similar, that’s your next move.
The concept tying all of this together is the memory wall. As Sha Rabii, co-founder of Majestic Labs, puts it: “Your performance is limited by the amount of memory and the speed of the memory that you have, and if you keep adding more GPUs, it’s not a win.” Majestic Labs is building a 128TB inference system that deliberately avoids HBM — showing that memory-efficient architecture is a legitimate design direction, not just a workaround. For the technical foundation — the memory wall and the KV cache — the companion piece covers why AI applications need so much memory and why adding more GPUs alone doesn’t solve the problem.
The memory wall is also why the remaining techniques matter. Speculative decoding uses a smaller draft model to generate candidate tokens that the main model verifies in parallel — you get better throughput without a proportional memory increase. Disaggregated serving separates the prefill phase from the decode phase, running each on different hardware tiers. Red Hat’s llm-d project implements this and reports 30–50% cost reductions. Knowledge distillation trains permanently smaller models from larger ones — it takes weeks to months, but the results are durable.
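The draft-then-verify loop at the heart of speculative decoding can be sketched in a toy form. The “models” below are deterministic stand-ins (next token is a fixed function of the previous one) so the accept/reject mechanics are visible — real systems compare draft and target probability distributions and verify all slots in one batched forward pass, not by exact token matching.

```python
# Toy sketch of speculative decoding's accept/reject loop.
# draft_model and target_model are deterministic stand-ins, not real LLMs.

def draft_model(token):
    return (token * 3 + 1) % 97      # cheap guesser; sometimes wrong

def target_model(token):
    return (token * 3 + 1) % 101     # authoritative but expensive

def speculative_step(last_token, k=4):
    """Draft k candidate tokens, then verify them against the target model."""
    candidates, t = [], last_token
    for _ in range(k):               # cheap sequential drafting
        t = draft_model(t)
        candidates.append(t)
    accepted, t = [], last_token
    for cand in candidates:          # verifications are parallelisable in practice
        correct = target_model(t)
        if cand == correct:
            accepted.append(cand)    # draft agreed: token accepted "for free"
            t = cand
        else:
            accepted.append(correct) # first disagreement: take target's token, stop
            break
    return accepted

print(speculative_step(5))
```

The payoff: whenever the draft model agrees with the target, you emit several tokens per expensive verification round instead of one — output stays identical to what the target model alone would produce, and memory grows only by the small draft model.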
Rank by implementation effort: quantisation (days), vLLM/PagedAttention (framework selection), speculative decoding (configuration tuning), disaggregated serving (architecture change), distillation (medium-term training investment). Start at the top.
The recommendation: Implement quantisation this week. Evaluate vLLM this month. Queue the rest after you’ve captured the quick wins.
What procurement tactics are available for organisations that cannot match hyperscaler purchasing power?
You can’t get the Long-Term Supply Agreements the hyperscalers use. That’s a market structure issue, not a negotiation skill gap. Samsung recently signed a “non-cancellable, non-returnable” contract with a key server customer — supply earmarked for that customer provides zero relief for the rest of the market. These agreements require volume commitments and multi-year horizons that simply aren’t available at smaller scales.
Here’s the tactical checklist.
Multi-vendor sourcing. Engage SK Hynix, Samsung, and Micron (or their distributors) at the same time. Competitive tension between suppliers improves pricing even at modest volumes. You won’t get hyperscaler terms, but you can avoid being captive to a single vendor’s allocation decisions.
Pre-load inventory. If you know you need hardware in the next 12 months, buying now locks in current pricing. Deloitte projects further DRAM price increases of ~50% are possible. Gartner’s Ranjit Atwal puts it directly: “Buy now, or wait, because whatever you’re getting at the moment is going to be the best price.”
Use a procurement aggregator. Aggregators combine demand from multiple organisations to negotiate enterprise-tier pricing. This is the closest thing you’ll get to hyperscaler leverage.
Lock extended quote validity. Vendors are guaranteeing prices for only two or three weeks. Push for 60–90 day windows.
Choose contract duration carefully. Six-month contracts preserve flexibility but offer no price protection. Annual contracts provide modest discounts but risk overpayment if prices stabilise faster than expected. Match your commitment length to your business certainty, not to the vendor’s sales cycle. Cloud GPU spot instances also work as a hedge — if on-prem hardware becomes uneconomical, spot capacity gives you a fallback without long-term commitment.
The recommendation: Source from multiple vendors, pre-load known-needed hardware, engage a procurement aggregator. The companion piece on the price data to use in budget modelling has the specific numbers.
Should you delay your hardware refresh cycle, and how do you decide?
15% of organisations are extending PC refresh timelines. But deferral doesn’t save money if prices keep rising. And OEMs are reducing default RAM and SSD configurations to manage headline pricing — so you might end up paying the same price for a lower-spec device later.
Windows 10 End-of-Life (October 2025) adds a forcing function for compliance-related endpoints. Windows 11 has higher RAM and SSD requirements, and that’s hitting at exactly the wrong time. If you’re thinking about AI PCs, those need a minimum of 16GB RAM (the Microsoft Copilot+ PC spec), making memory even more of a budget factor.
The right approach is category-by-category, not all-or-nothing.
Accelerate refresh for compliance-related endpoints — anything still on Windows 10 that handles sensitive data or faces regulatory requirements. Lock pricing on approved quotes immediately. Accelerate developer workstations too — these are productivity multipliers, and skimping on developer hardware costs you more in lost output than you save on the purchase price.
Extend general office hardware 6–12 months. Standard office endpoints running Windows 11 can wait. There’s no functional reason to replace them at inflated prices if they’re doing the job. Deprioritise meeting room hardware and shared devices entirely.
For AI infrastructure specifically, fab expansion timelines suggest meaningful relief is unlikely before Q4 2027. If the workload can stay on cloud in the interim, that’s the safer path.
The recommendation: Prioritise by category. Lock pricing on compliance and developer hardware. Extend everything else. Budget for 15–20% cost inflation on all endpoint hardware through 2027.
How do you build a memory cost scenario into your 2026 and 2027 infrastructure budget?
Point estimates won’t hold. IDC’s December 2025 “most pessimistic scenario” was already exceeded by February 2026 — the speed at which pricing has moved has shocked everybody, including the analysts. Model three scenarios instead.
Base case: prices plateau at current levels. DRAM stays at roughly current pricing. Cloud GPU instances hold at current rates. This is your planning floor.
Downside case: further increases through 2027. Supply continues tightening and prices climb another 30–50%. DRAM capex from memory makers is expected to rise only 14% in 2026, and NAND capex only 5% — the manufacturers aren’t rushing to fix the shortage. Stress-test against it.
Upside case: relief begins Q4 2027. New fab capacity from SK Hynix (Cheongju, 2027), Micron (Singapore, 2027; Idaho, 2027–2028), and Samsung (Pyeongtaek, 2028) begins delivering. Prices start declining in early 2028. This is your planning horizon for deferred purchases.
Project costs across all three budget categories for each scenario: cloud AI compute, on-prem server hardware, and enterprise endpoint refresh. The practical allocation: commit 70% of budget against the base case, reserve 20% as downside contingency, and identify which purchases can be deferred to capture the upside.
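The three-scenario allocation can be sketched as a small model. Every figure below — the category baselines and the scenario multipliers — is an illustrative assumption standing in for your own spend data; only the 70/20 split and the 30–50% downside range come from the text above.

```python
# Sketch of the three-scenario budget model described above.
# Category baselines and multipliers are illustrative assumptions, in $k.

baseline_2026 = {
    "cloud_ai_compute": 600,
    "onprem_server_hw": 250,
    "endpoint_refresh": 150,
}
scenarios = {                 # price movement vs today (assumed)
    "base":     1.00,         # prices plateau at current levels
    "downside": 1.40,         # a further ~40% climb (mid of the 30-50% range)
    "upside":   0.85,         # relief begins late 2027
}

for name, mult in scenarios.items():
    total = sum(spend * mult for spend in baseline_2026.values())
    print(f"{name:>8}: ${total:,.0f}k total")

base_total = sum(baseline_2026.values())
commit = 0.70 * base_total        # committed against the base case
contingency = 0.20 * base_total   # reserved for the downside
print(f"commit ${commit:,.0f}k, hold ${contingency:,.0f}k contingency")
```

The remaining 10% is your deferral pool: the purchases you tag as candidates to push out if the upside scenario materialises.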
Focus your optimisation effort where it counts. Inference accounts for 80–90% of total AI spend. Every dollar saved through quantisation or KV cache optimisation reduces your exposure across all three scenarios. Build ongoing visibility through FinOps practices — track cost-per-inference as your primary unit metric. Tools like OpenCost give you real-time cost data across namespaces, pods, and nodes, so you can measure the actual impact of each optimisation rather than guessing.
The recommendation: Build all three scenarios into your infrastructure budget. Allocate against the base, reserve for downside, identify deferral candidates for upside. Track cost-per-inference quarterly. Revisit every quarter — conditions are moving faster than the forecasts.
For a complete overview of the memory supply crisis driving these costs — including the root causes, market dynamics, and what comes next — see the full context of the AI memory shortage.
FAQ
How much does model quantisation actually reduce memory usage?
4-bit quantisation cuts your memory footprint by roughly 75% compared to FP16 models. 8-bit brings it down by about 50%. These ratios hold across most transformer-based LLMs. If you’re running typical inference workloads with small batch sizes, weight-only quantisation (A16W4) is the place to start.
Is it cheaper to run AI workloads on cloud or on-premises hardware right now?
It depends on scale and utilisation. For 1–8 GPUs, cloud is usually the better deal. At 8–64 GPUs with sustained utilisation above 70%, on-prem starts to make sense over a 3-year TCO horizon. Neither option escapes the memory cost inflation.
What is the KV cache and why does it matter for AI infrastructure costs?
The KV cache stores intermediate attention computation results during LLM inference so the model doesn’t have to re-compute context for every generated token. Longer context windows need proportionally larger KV caches, which directly increases your memory demand and cost.
How long will the memory shortage last?
New fab capacity from SK Hynix, Samsung, and Micron won’t deliver meaningful supply relief before Q4 2027 at the earliest. Budget for sustained elevated pricing through at least 2027 — meaningful price declines aren’t likely before early 2028.
What is vLLM and should my organisation use it?
vLLM is an open-source LLM inference framework that implements PagedAttention, continuous batching, speculative decoding, and quantisation support. It’s become the de facto standard for production LLM serving. If you’re running LLM inference in production, vLLM should be your first evaluation point.
Can spot instances reliably handle production AI workloads?
Spot instances cut cloud GPU costs by 60–70% but they can be interrupted. They work well for training with checkpoint/resume patterns. For production inference, use reserved capacity for your baseline demand and spot for overflow. A mixed fleet across availability zones improves spot availability.
What is AWS Trainium and is it a viable alternative to NVIDIA GPUs?
AWS Trainium (Trn1/Trn2 instances) offers meaningful cost reductions compared to P5 (H100) instances for compatible workloads. It requires the AWS Neuron SDK and isn’t universally compatible. Worth evaluating if your workloads run on AWS and you can tolerate some framework lock-in.
Should I accelerate or delay enterprise PC purchases given current memory prices?
Accelerate for compliance-related endpoints and developer workstations. Extend general office hardware by 6–12 months. OEMs are reducing default configurations to manage pricing, so deferred purchases may get you lower-spec devices at similar prices.
What is disaggregated serving and how does it reduce AI infrastructure costs?
Disaggregated serving separates the prefill phase (processing input) from the decode phase (generating output). Prefill runs on high-performance GPUs while decode runs on cheaper hardware. The result is 30–50% infrastructure cost reductions by matching hardware to what each phase actually needs.
How do I track whether my AI cost optimisation efforts are working?
Start with cost-per-inference as your primary metric. Layer in GPU utilisation tracking and cloud spend breakdowns by workload type. Measure each optimisation individually so you know what’s actually moving the needle. OpenCost and cloud provider tools handle the instrumentation.
What is the memory wall and why does it matter for AI scaling?
The memory wall is the point where adding more GPUs without proportional memory bandwidth doesn’t actually improve performance. That’s why software-side optimisations like quantisation and KV cache management are structurally valuable — they reduce memory demand rather than throwing more hardware at the problem.