Cost-per-hour figures are not the metric you should be building decisions around. The correct unit for production inference economics is cost per million tokens ($/M). Until your organisation is measuring in $/M, every infrastructure comparison is sitting on the wrong foundation.
This article fills a gap most inference benchmark pieces leave open: an end-to-end ROI modelling framework you can adapt to your own workload and hand directly to a CFO or board. It’s part of the broader analysis in Nvidia’s AI hardware empire and its pricing implications.
What does inference cost per token actually mean — and what is it commonly confused with?
Inference cost is the cost of running data through a trained model to produce an output. In production, that cost is driven by tokens — and you express it as cost per million tokens ($/M).
Here’s the formula: (GPU hourly rate ÷ utilisation rate) ÷ (tokens per second × 3,600) × 1,000,000 = $/M. Not at peak throughput. At the batch size and utilisation rate you’re actually running in production.
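Here is that formula as a minimal sketch you can drop into a notebook. The hourly rate and utilisation figures below are taken from elsewhere in this article; the 1,500 tokens-per-second figure is an illustrative assumption, not a benchmark, so substitute your own measured production numbers.

```python
def cost_per_million_tokens(hourly_rate: float, utilisation: float,
                            tokens_per_second: float) -> float:
    """Cost per million tokens: (GPU $/hr ÷ utilisation) ÷ (TPS × 3,600) × 1,000,000."""
    effective_hourly_cost = hourly_rate / utilisation       # idle hours still get paid for
    tokens_per_hour = tokens_per_second * 3_600             # throughput at your production batch size
    return effective_hourly_cost / tokens_per_hour * 1_000_000

# Illustrative inputs: $2.50/hr H100 on-demand, 60% utilisation, 1,500 TPS (an assumption).
print(cost_per_million_tokens(hourly_rate=2.50, utilisation=0.60, tokens_per_second=1_500))
# ≈ $0.77/M
```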
Training cost doesn’t belong in that formula. Training is a one-time investment. Inference is the ongoing expense — industry analyses put it at roughly 80% of AI budget once a system is in production. Conflating the two makes every infrastructure decision harder to justify.
The other thing people get confused about is the memory-bandwidth-bound regime. At low batch sizes — fewer than eight concurrent requests — a GPU’s speed is limited by how quickly it can load model weights from HBM, not by raw compute throughput. In this regime, a cheaper GPU with similar memory bandwidth can match a more expensive one on tokens per second. Push batch sizes up to 64–512 and you’re compute-saturated — that’s where more powerful hardware earns its price.
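A rough roofline sketch makes the regime concrete. It assumes decode throughput at batch size 1 is capped by streaming the full set of model weights from memory once per generated token; the bandwidth and model-size figures are illustrative round numbers, not measured results.

```python
def decode_tps_ceiling(memory_bandwidth_gb_s: float, model_weights_gb: float) -> float:
    """Approximate upper bound on decode tokens/sec at batch size 1: each generated
    token requires streaming every model weight from memory once, so bandwidth
    divided by weight size caps throughput."""
    return memory_bandwidth_gb_s / model_weights_gb

# Illustrative round numbers: a 13B model at FP16 (~26 GB of weights).
print(decode_tps_ceiling(memory_bandwidth_gb_s=3_350, model_weights_gb=26))  # H100 SXM class: ~129 tok/s
print(decode_tps_ceiling(memory_bandwidth_gb_s=864, model_weights_gb=26))    # L40S class: ~33 tok/s
```

FLOPS never enter that estimate, which is why two GPUs with similar memory bandwidth land on similar tokens per second at batch size 1, regardless of what they cost.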
One caveat worth noting: CUDA-optimised workloads and non-CUDA workloads aren’t directly comparable. A benchmark run on Nvidia’s stack will reflect software optimisation differences as much as hardware differences.
What does the data actually show about Nvidia’s “10x inference cost reduction” claim for Vera Rubin?
Jensen Huang announced at GTC that Vera Rubin delivers approximately 10x more inference throughput per watt compared to Blackwell. The claim is real. But the comparison point matters enormously.
Vera Rubin is being compared to Blackwell (B200) — not to the H100 or H200 that most organisations are actually running in production right now. So if your fleet is H100s, the “10x” multiplier is not your planning starting point.
Nvidia’s own data is instructive here. The GB300 NVL72 comparison shows H200 at $1.41/hr generating 90 tokens per second at $4.20/M tokens, versus GB300 NVL72 at $2.65/hr generating 6,000 tokens per second at $0.12/M tokens. That’s a 35x cost-per-token reduction from Hopper to Blackwell — already delivered. Vera Rubin’s “10x” sits on top of that.
The catch: the “10x” figure assumes full utilisation, FP4 quantisation, and optimal batching. Hitting all three simultaneously is uncommon in practice. Artificial Analysis publishes regularly updated cost-per-token and throughput benchmarks — use their data as the independent check on any vendor-supplied improvement claim. For vendor negotiations, the “10x” is a useful anchor, but validate it with third-party throughput data at your expected utilisation rate before it goes into a board presentation.
How do GPU, LPU, and wafer-scale architectures compare on inference cost per token in 2025–2026?
Three architectural approaches are competing for inference workloads right now: Nvidia GPUs (HBM-based, general purpose), Groq LPU (SRAM-based, latency-optimised), and Cerebras WSE-3 (wafer-scale, high bandwidth for large dense models). Each suits different workloads. None is the universal winner.
Here’s how the main options stack up at representative utilisation:
Nvidia H100 SXM — small-model workloads (7B–34B): ~$0.026/M; 70B FP8 serving: ~$0.227/M on-demand at $2.50/hr. Best for 13B–70B production APIs with multi-tenancy.
Nvidia H200 SXM — 70B FP8 serving: ~$0.288/M on-demand. Higher cost per token than H100 for standard serving, but earns it through 141 GB VRAM and 128K-plus context windows.
Nvidia L40S — small-model workloads (7B–34B): ~$0.023/M at $0.72/hr on-demand. Lower cost per token than H100 for small models. Best for cost-sensitive 7B–34B endpoints at low batch sizes.
Groq LPU (GroqCloud) — Llama 3.1 8B: $0.05/$0.08 per million tokens at 840 tokens/second; Llama 3.3 70B: $0.59/$0.79 per million tokens at 394 tokens/second. Best for low-latency, low-concurrency workloads.
Cerebras WSE-3 — Llama 3.3 70B: $0.85/$1.20 per million tokens at 2,100 tokens/second. Best for very large dense models where inter-chip communication overhead is a constraint.
AMD MI355X — competitive cost-per-token for organisations that own hardware; 3.7x interactivity gap vs GB200 NVL72 at high throughput. Best for workloads where Nvidia vendor lock-in is a concern.
CUDO Compute’s controlled benchmark puts H100 SXM at $0.026/M and L40S at $0.023/M for small-model workloads — cheaper despite lower raw speed because the L40S costs 35% less per hour. The selection framework is straightforward: 7B–34B models at low batch sizes go on the L40S; 13B–70B production APIs with multi-tenancy go on the H100 SXM; 70B-plus at FP16 or long-context go on the H200.
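If it helps to make that framework operational, here it is as a small routing sketch. The thresholds come straight from the paragraph above; the function and its defaults are ours, not a published sizing tool, so treat the output as a starting point for benchmarking rather than a decision.

```python
def recommend_gpu(model_params_b: int, batch_size: int, long_context_or_fp16: bool = False) -> str:
    """Rule-of-thumb routing using the thresholds stated above; a starting point, not a benchmark."""
    if model_params_b >= 70 and long_context_or_fp16:
        return "H200"        # 141 GB VRAM, 128K-plus context windows
    if 13 <= model_params_b <= 70 and batch_size >= 8:
        return "H100 SXM"    # multi-tenant production APIs in the compute-saturated regime
    if model_params_b <= 34 and batch_size < 8:
        return "L40S"        # cost-sensitive endpoints in the bandwidth-bound regime
    return "benchmark H100 SXM and L40S at your production batch size"

print(recommend_gpu(model_params_b=13, batch_size=4))    # L40S
print(recommend_gpu(model_params_b=70, batch_size=64))   # H100 SXM
```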
The Groq LPU is built around large on-chip SRAM and compiler-scheduled deterministic execution; for full detail on the LPU architecture behind Groq’s inference benchmark results, see the deal analysis. It eliminates the model-weight-loading bottleneck that makes GPUs slow at low batch sizes. GroqCloud pricing as of January 2026: Llama 3.1 8B at $0.05/$0.08 input/output per million tokens at 840 tokens/second; Llama 3.3 70B at $0.59/$0.79 per million tokens. The constraint is SRAM capacity: the LPU fills a niche rather than replacing the H100 as a general-purpose accelerator.
Cerebras WSE-3 pricing as of January 2026: Llama 3.3 70B at $0.85/M input and $1.20/M output at 2,100 tokens/second. The wafer-scale design eliminates inter-chip communication overhead, making it the leading independent alternative for very large dense models after the Groq deal. For MoE models or variable batch sizes, GPU clusters remain more flexible.
SemiAnalysis InferenceX benchmarks position the AMD MI355X as competitive on cost-per-token for organisations that own hardware, but Nvidia wins on perf/$ for short-to-medium cloud rentals. For risk diversification away from Nvidia, MI355X is the reference point.
How do you build a cloud vs on-premises inference ROI model?
Utilisation rate is the dominant variable in the buy-vs-rent decision; no other input moves the model as much. Build the model with four inputs.
GPU CapEx: hardware purchase price plus installation and integration. An 8x H100 NVL server runs approximately $833,806 in total system cost.
OpEx: power, cooling, networking, and staff over a five-year horizon. Power and cooling alone come to approximately $0.87/hr at $0.15/kWh for an 8x H100 server.
Cloud equivalent cost: AWS p5.48xlarge (8x H100) at $98.32/hr on-demand. Cloud GPU specialists like CoreWeave cut that to $4.76/hr for H100 on-demand — 40–80% savings over hyperscalers.
Utilisation rate: the highest-sensitivity variable. Lenovo’s TCO analysis for Llama 70B on 8x H100 shows on-premises at $0.11/M versus Azure at $0.89/M — but that assumes sustained high utilisation. Change that assumption and the numbers change with it.
Breakeven: (GPU purchase price + five-year OpEx) ÷ (cloud hourly cost × 8,760 hours × utilisation rate) = breakeven years. Lenovo puts this at approximately 11.9 months for an 8x H100 against AWS on-demand.
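Here is the breakeven arithmetic as a sketch, using the system cost and cloud rates quoted above. The five-year OpEx figure is a placeholder assumption, not a quoted number; swap in your own power, cooling, networking, and staffing estimates.

```python
def breakeven_years(capex: float, five_year_opex: float,
                    cloud_hourly: float, utilisation: float) -> float:
    """(GPU purchase price + five-year OpEx) ÷ (cloud hourly cost × 8,760 h × utilisation)."""
    annual_cloud_spend = cloud_hourly * 8_760 * utilisation
    return (capex + five_year_opex) / annual_cloud_spend

# 8x H100 NVL system vs AWS p5.48xlarge on-demand; the $150k OpEx is an assumption.
print(breakeven_years(capex=833_806, five_year_opex=150_000,
                      cloud_hourly=98.32, utilisation=0.60))   # ≈ 1.9 years at 60% utilisation
print(breakeven_years(capex=833_806, five_year_opex=0,
                      cloud_hourly=98.32, utilisation=1.00))   # ≈ 0.97 years, close to Lenovo's 11.9 months
```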
Before purchasing new hardware, have a look at what continuous batching does to your current utilisation numbers. Enabling it increases effective utilisation without changing hardware — improving the on-premises case before any new purchase is evaluated.
When does on-premises win and when does cloud win — and what are the utilisation thresholds that matter?
Cloud wins when inference demand is unpredictable or bursty, your team lacks GPU infrastructure expertise, or the workload is early-stage with uncertain volume. On-premises wins when sustained utilisation exceeds roughly 60–70% over the amortisation period, the workload is stable and forecastable, or data sovereignty constraints apply.
The honest utilisation assumptions are: conservative, 25–35% (bursty demand, overnight low traffic, development load); base case, 50–60% with active continuous batching; optimistic, 70–75%, but only if you have documented historical utilisation data to back it up.
The problem many organisations run into is that an “average utilisation of 65%” often conceals peaks at 95% and troughs at 15%. Those trough periods are what undermine on-premises ROI. At conservative utilisation rates, cloud rental almost always wins on $/M.
The hybrid approach helps: owned hardware for baseline load, cloud for peak overflow. Right-sizing and buy-vs-rent are linked decisions — an L40S at high utilisation delivers better ROI than an H100 at moderate utilisation for 7B–34B workloads. Make both decisions together. For the full procurement decision framework, see GPU procurement for AI infrastructure.
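One way to sanity-check the hybrid split is a traffic-weighted blend of the two $/M figures: owned hardware for the baseline share, cloud for the overflow. Everything in this sketch is a placeholder; the 80/20 split, the $0.11/M owned figure and the $0.69/M cloud figure are illustrative inputs rather than benchmarks.

```python
def blended_cost_per_million(baseline_share: float, owned_cost_per_m: float,
                             cloud_cost_per_m: float) -> float:
    """Traffic-weighted $/M when owned hardware serves the baseline and cloud absorbs the peaks."""
    return baseline_share * owned_cost_per_m + (1.0 - baseline_share) * cloud_cost_per_m

# Illustrative inputs only: 80% of tokens on owned hardware at $0.11/M,
# 20% overflow to cloud at $0.69/M. Both figures and the split are assumptions.
print(blended_cost_per_million(baseline_share=0.80, owned_cost_per_m=0.11, cloud_cost_per_m=0.69))
# ≈ $0.23/M, with the owned hardware staying highly utilised
```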
How do you build an inference ROI model your CFO will accept — the four numbers that matter?
The section above helps you work out what to buy. This one is about how to present it. Frame everything around four numbers, each in three scenarios: conservative (low utilisation), base (expected utilisation), and optimistic (high utilisation). That’s the format boards use for capital allocation decisions.
Current cost per million tokens. Your baseline. Run the formula at your actual production batch size and measured utilisation.
Projected cost per million tokens under the proposed infrastructure. Use the same formula with new hardware specs across all three scenarios. Use Artificial Analysis or Lenovo’s TCO analysis as your TPS source — vendor figures assume conditions that may not match your workload.
The breakeven point, expressed as time. “We break even in 2.3 years at base utilisation” is something a non-technical board can interpret. “The breakeven utilisation rate is 58%” is not. Convert the rate to a time horizon.
The five-year NPV delta. A 10-percentage-point shift in utilisation rate has a larger impact on five-year ROI than a 20% change in GPU purchase price for most workloads. Show that sensitivity explicitly — a CFO who sees it in the model trusts the model.
Make all assumptions explicit: GPU purchase price, amortisation period, utilisation rate, TPS at production batch size, power cost per kWh, cloud instance pricing. Add a downside scenario for the probability of utilisation falling below breakeven — expressed as a downside NPV impact.
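To show that sensitivity concretely, here is a minimal NPV sketch that frames ownership as discounted cloud spend avoided, minus CapEx and discounted OpEx. The discount rate, annual OpEx and base utilisation are placeholder assumptions; the comparison also depends on which cloud rate you benchmark against, and it narrows considerably if that benchmark is a specialist provider rather than hyperscaler on-demand.

```python
def npv_of_owning(capex: float, annual_opex: float, cloud_hourly: float,
                  utilisation: float, years: int = 5, discount_rate: float = 0.08) -> float:
    """NPV of buying: discounted cloud spend avoided each year, minus CapEx and discounted OpEx.
    The 8% discount rate is a placeholder assumption."""
    annuity = sum(1 / (1 + discount_rate) ** t for t in range(1, years + 1))
    avoided_cloud_spend = cloud_hourly * 8_760 * utilisation   # what the same hours would cost to rent
    return (avoided_cloud_spend - annual_opex) * annuity - capex

# CapEx and cloud rate from the figures above; $30k annual OpEx and 55% base utilisation are assumptions.
base = npv_of_owning(capex=833_806, annual_opex=30_000, cloud_hourly=98.32, utilisation=0.55)
util_up_10pp = npv_of_owning(capex=833_806, annual_opex=30_000, cloud_hourly=98.32, utilisation=0.65)
price_down_20pct = npv_of_owning(capex=833_806 * 0.80, annual_opex=30_000, cloud_hourly=98.32, utilisation=0.55)
print(round(util_up_10pp - base))      # ≈ +344,000 from 10 points more utilisation
print(round(price_down_20pct - base))  # ≈ +167,000 from 20% cheaper hardware
```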
One more lever worth putting in the board presentation: quantisation. Moving from FP16 to FP8 — typically 1.5–2x throughput on H100/H200 hardware — reduces $/M by 33–50% without new hardware. For the strategic context behind these economics, see the full analysis.
What does the Groq deal mean for inference pricing — and how does it change the landscape going forward?
There’s one variable that deserves its own sensitivity column in the ROI model: GroqCloud pricing. Because the Groq deal changes what you can rely on it for.
In December 2025, Nvidia completed a $20 billion acquisition of Groq’s intellectual property — structured as a non-exclusive licence combined with a broad hiring initiative. GroqCloud continues to operate independently for now. The $20B price tag was 2.9x Groq’s $6.9B valuation from just three months earlier, and a Senate inquiry was launched in March 2026 questioning whether the deal amounts to an acquisition without regulatory scrutiny. But the practical outcome is the same: Groq is now on Nvidia’s roadmap.
Cerebras WSE-3 is now the leading independent inference-specialist alternative. AMD MI355X is the primary GPU-based Nvidia alternative. Nvidia controls approximately 93% of the AI accelerator market — a moat that deepens as LPU efficiency moves under Nvidia’s umbrella.
For your ROI model: treat GroqCloud pricing as a reference point, not a contract anchor. Build it as a sensitivity variable and use a three-year primary horizon for cloud API cost inputs. See Vera Rubin’s 28.8 exaflops claim in competitive context for how the hardware roadmap shapes the longer-term cost picture.
Once you have your inference ROI model built, the next step is incorporating inference ROI into your GPU procurement framework — the complete decision framework for determining whether to stay all-in on Nvidia or begin vendor diversification.
Frequently Asked Questions
What is the difference between cost per token and cost per GPU hour?
Cost per GPU hour is the price to run a GPU for one hour. Cost per token divides that by how many tokens the GPU generates in that hour. The formula: ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000 = $/M. A lower hourly rate does not guarantee a lower cost per token — throughput is what translates the hourly rate into the actual unit of production. Always convert to $/M before comparing hardware options.
How do I calculate cost per million tokens from a cloud GPU instance price?
Formula: (hourly price ÷ utilisation rate) ÷ (tokens per second × 3,600) × 1,000,000 = $/M. At $3.00/hr, 60% utilisation, 2,000 TPS: approximately $0.69/M tokens. Adjust TPS for your model size and batch configuration — benchmark figures assume conditions that are often more favourable than real production workloads.
How much does FP8 quantisation reduce inference cost in practice?
FP8 on H100/H200 delivers roughly 1.5–2x throughput over FP16 with less than 1–2% quality degradation for Llama, Mistral, and DeepSeek models. Code generation and mathematical reasoning are more sensitive to quantisation degradation than conversational tasks. FP4 (Blackwell-only) achieves roughly 3–4x FP16 throughput but needs task-specific quality validation first.
When is Groq’s LPU cheaper than an Nvidia H100?
The LPU cost advantage shows up in low-latency, low-batch-size workloads where GPUs operate in the memory-bandwidth-bound regime. For real-time applications requiring time-to-first-token under 100ms at low concurrency, GroqCloud pricing can be competitive with equivalent H100 cloud instances. For high-throughput, high-concurrency workloads, Nvidia GPUs reclaim their advantage through batching efficiency.
What utilisation rate should I assume when building a buy-vs-rent model?
At conservative utilisation rates — 25–35%, reflecting bursty activity, overnight low traffic, and development load — cloud rental almost always wins on $/M. Base case is 50–60% with active continuous batching. Optimistic is 70–75%, but only if you have documented historical data to back it up. If you’re unsure, start conservative.
Where can I find independent inference benchmark data?
Artificial Analysis publishes regularly updated cost-per-token and throughput benchmarks across GPU and API providers. CUDO Compute, Spheron Network, and Lenovo TechPress have published real-world $/M benchmarks for H100, H200, A100, L40S, and AMD MI355X as of 2025–2026.
Is the Nvidia “10x inference cost reduction” claim for Vera Rubin based on current hardware?
No — the “10x” claim compares Vera Rubin to Blackwell (B200), not to the H100 or H200 most organisations are currently running. Nvidia’s own data shows Blackwell already delivers 35x lower cost-per-million-tokens than Hopper. Vera Rubin adds another 10x on top — a multiplier on a multiplier, not a comparison to where your fleet is today.