AI inference — not training — is now the dominant cost line for companies scaling AI products. By early 2026, inference workloads account for over 55% of AI-optimised infrastructure spending. The assumption that you can simply scale API calls indefinitely breaks at production volume. The question is not whether the economics shift, but when.
Three deployment models exist: cloud, on-premises, and hybrid. Deloitte’s Tech Trends 2026 research gives you a specific, actionable decision trigger — the 60-70% cloud threshold — that tells you when the on-premises evaluation is worth running.
This article gives you a structured decision framework: the TCO methodology, the GPU utilisation problem that changes the on-premises calculation, and the three-tier hybrid architecture that most enterprises arrive at as the pragmatic outcome. For broader context on the full scope of this challenge, see our AI inference cost crisis guide.
What are the three AI inference deployment models and how do their cost structures differ?
The three deployment models are not equally suited to all workloads. You need to understand their cost structures before you make any infrastructure decision.
Cloud AI inference (AWS, Azure, GCP, OpenAI, Anthropic) is OpEx-heavy. You pay per GPU-hour or per token, with no upfront capital commitment. The elasticity is real and valuable. What is less visible is the pricing premium — cloud providers charge 2-3x wholesale GPU rates. Data egress adds another layer on top: for data-intensive AI workloads, egress fees typically add 15-30% to your total cloud AI spend. Use reserved instance rates as your comparison baseline, not on-demand rates.
On-premises AI inference (NVIDIA H100/H200, AMD MI300X, Lenovo ThinkSystem servers) flips the economics entirely. The capital cost is substantial — a Lenovo ThinkSystem SR675 V3 with 8× NVIDIA H100 GPUs runs approximately $833,806, with ongoing operational costs around $0.87/hour. No egress fees. Fixed costs that get cheaper per inference as your volume grows. The trade-off is CapEx exposure, operational overhead, and hardware refresh cycles every 3-5 years.
Hybrid AI inference splits workloads across both tiers based on their characteristics, and adds a third tier at the edge for latency-critical use cases. You keep cloud elasticity for burst and experimental workloads while moving consistent high-volume production inference on-premises.
Here is how the three models stack up:
- Cost at scale: Cloud is higher (premium pricing plus egress); on-premises is lower at sufficient volume; hybrid is optimised by workload tier
- CapEx requirement: Cloud requires none; on-premises requires $120k–$833k+; hybrid requires moderate CapEx for the on-premises tier only
- Operational overhead: Cloud is low (managed); on-premises is high (0.5–1.5 FTE); hybrid is moderate, concentrated in the on-premises tier
- Egress costs: Cloud adds 15-30% of total AI spend; on-premises has none; hybrid incurs egress only on cloud-tier workloads
Cloud is OpEx — ongoing and variable. On-premises is CapEx — upfront and fixed. Hybrid combines both, managed by workload classification. The decision is not binary. It is a spectrum.
What is the 60-70% cloud threshold and how do you calculate it for your workload?
The 60-70% threshold is the single most useful decision trigger in AI infrastructure planning. Deloitte’s Tech Trends 2026 research puts it clearly: when your cloud AI costs reach 60-70% of what equivalent on-premises hardware would cost over a comparable period, the economics of on-premises begin to compete — even after accounting for CapEx and operational overhead.
This is a ratio, not an absolute dollar figure. A 100-person company and a 5,000-person company can both hit the threshold at very different spend levels.
Here is how to calculate your threshold ratio:
- Establish your monthly cloud AI spend using reserved instance pricing, not on-demand rates.
- Add your monthly egress charges from your cloud billing dashboard.
- Price equivalent on-premises hardware amortised over 3-5 years. Lenovo’s reference data: for the 8× H100 configuration, the 5-year on-premises total is $871,912 vs $2,362,811 on 3-year reserved cloud pricing.
- Add the staffing overhead delta: on-premises adds 0.5-1.5 FTE in DevOps and ML infrastructure ($60,000–$180,000/year at $120,000 fully loaded).
- Divide your cloud cost (steps 1+2) by the on-premises equivalent (steps 3+4). If the ratio exceeds 0.60, run the full TCO analysis.
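The five steps above can be sketched as a short calculation. All inputs below are hypothetical placeholders except the $833,806 CapEx figure, which is the 8× H100 reference configuration cited earlier; substitute your own billing data and hardware quotes.

```python
# Threshold-ratio sketch. Inputs are hypothetical except the
# $833,806 CapEx figure (the article's 8x H100 reference config).

def threshold_ratio(
    cloud_monthly: float,           # step 1: cloud AI spend, reserved pricing
    egress_monthly: float,          # step 2: egress from the billing dashboard
    onprem_capex: float,            # step 3: hardware quote
    amortisation_years: float,      # step 3: 3-5 year horizon
    staffing_delta_yearly: float,   # step 4: 0.5-1.5 FTE delta
) -> float:
    onprem_monthly = (
        onprem_capex / (amortisation_years * 12)
        + staffing_delta_yearly / 12
    )
    return (cloud_monthly + egress_monthly) / onprem_monthly  # step 5

ratio = threshold_ratio(15_000, 3_000, 833_806, 5, 120_000)
if ratio > 0.60:
    print(f"Ratio {ratio:.2f} exceeds 0.60: run the full TCO analysis")
```

With these placeholder figures the ratio lands around 0.75, comfortably past the 0.60 trigger.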
One thing to watch: agentic AI workloads will push you toward the threshold faster than you expect. Token consumption per task has jumped 10x-100x since December 2023. A single agentic workflow may make 10-50 API calls per user request versus 1-2 for a simple chatbot. If agentic AI is on your roadmap within the next 12-18 months, model the threshold with 5-10x your current token volumes. If you are arriving at this analysis because your costs unexpectedly surged, see our breakdown of how the PoC-to-production cost explosion happens — understanding the cause clarifies which infrastructure path makes sense.
A useful mid-market rule of thumb: at approximately 10-50 million tokens/day with consistent workload patterns, run the calculation. Below 10 million tokens/day, cloud APIs remain cost-competitive.
What does TCO really mean for AI inference — and what costs are companies missing?
Most cloud vs on-premises comparisons undercount the true cost on at least one side. ICONIQ’s 2026 State of AI report found that inference costs average 23% of revenue at scaling-stage AI companies — a figure that holds from pre-launch through scale. If you are underestimating AI infrastructure costs, you are underestimating a 23% slice of your revenue.
A complete TCO analysis requires six cost categories:
1. Compute costs: Cloud — GPU-hours at premium rates (2-3x wholesale) or per-token API pricing. On-premises — hardware amortised over 3-5 years.
2. Storage costs: Model weights, KV cache, vector stores, and data pipelines. Commonly underestimated on-premises.
3. Egress costs: Cloud adds 15-30% of total AI spend. On-premises: zero. This is the most commonly omitted cost in cloud comparisons.
4. GPU premium pricing: Cloud providers charge 2-3x wholesale GPU rates on every GPU-hour, indefinitely.
5. Staffing delta: On-premises inference adds 0.5-1.5 FTE ($60,000-$180,000/year at $120,000 fully loaded). Omitting this is the single most common error in on-premises business cases.
6. Hardware refresh cycles: GPU servers have a 3-5 year economic lifespan. Refresh cycles add approximately 20-30% to the 5-year on-premises cost.
To put numbers on it (8× H100 configuration, Lenovo reference data): cloud on-demand at 5 years costs $4,306,416; cloud 3-year reserved costs $2,362,811; on-premises costs $871,912 total plus a staffing delta of $300k-$900k. Even at 3-year reserved pricing, and after adding staffing costs, on-premises is cheaper for sustained 24/7 workloads at enterprise scale.
The 3-5 year horizon is standard for TCO comparison. Comparing cloud vs on-premises over 12 months produces a misleading analysis that always favours cloud.
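As a worked sketch, the six categories reduce to two five-year totals. Only the CapEx and the $0.87/hour operational figure below come from the reference data quoted earlier; the cloud rate, egress fraction, refresh fraction, and FTE cost are labelled assumptions, and storage is folded into the operational figures rather than modelled separately.

```python
# Five-year TCO sketch. Only capex and opex_per_hour come from the
# article's reference data; every other input is an assumption.

def cloud_tco(rate_per_gpu_hour: float, gpus: int,
              egress_fraction: float = 0.20, years: int = 5) -> float:
    """Categories 1, 3, 4: premium compute plus 15-30% egress."""
    compute = rate_per_gpu_hour * gpus * 24 * 365 * years
    return compute * (1 + egress_fraction)

def onprem_tco(capex: float, opex_per_hour: float, staffing_fte: float,
               fte_cost: float = 120_000, refresh_fraction: float = 0.25,
               years: int = 5) -> float:
    """Categories 1, 2, 5, 6: amortised hardware, ops, staffing, refresh."""
    opex = opex_per_hour * 24 * 365 * years
    staffing = staffing_fte * fte_cost * years
    refresh = capex * refresh_fraction      # 20-30% of CapEx over 5 years
    return capex + opex + staffing + refresh

cloud = cloud_tco(10.0, 8)                  # assumed $10/GPU-hour
onprem = onprem_tco(833_806, 0.87, 1.0)     # 8x H100 reference, 1.0 FTE delta
```

Even with a full FTE and a 25% refresh reserve included, the on-premises total under these assumptions comes in well below the cloud total for a sustained 24/7 workload.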
Why do GPU clusters operate at only 30-50% utilisation — and why does this matter for the on-premises decision?
Before you evaluate on-premises infrastructure, there is a step zero: understanding GPU utilisation. The 30-50% industry average is a real threat to the on-premises cost case.
For a 64-GPU H100 cluster at $3.50/GPU-hour, 40% utilisation means 60% of your capacity is generating no productive output: annual financial waste exceeding $1.1 million per cluster. At 35% MFU on an $833,806 H100 server, your effective cost-per-inference is nearly 3× what the hardware specification suggests.
There is a catch with how most teams measure GPU performance. nvidia-smi reports kernel scheduling activity, not actual Tensor Core computational efficiency. A GPU showing 95% in nvidia-smi may be achieving only 30-40% Model FLOP Utilisation (MFU). Your TCO calculation must use realistic projected MFU — not peak hardware capacity.
vLLM addresses this directly. vLLM is an open-source LLM inference serving framework implementing continuous batching and PagedAttention. Continuous batching dynamically groups concurrent requests to maximise throughput, eliminating the sequential idle time that produces the 30-50% MFU problem. At scale, vLLM achieves 793 tokens/second versus Ollama’s 41 — MFU can reach 60-80%, effectively halving your per-inference cost.
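A small sketch makes the utilisation arithmetic concrete. The peak-throughput and hourly-cost figures are hypothetical placeholders; the point is that effective cost per token scales inversely with MFU, so doubling MFU halves the cost.

```python
# Effective cost per 1,000 tokens as a function of MFU.
# hourly_cost and peak_tokens_per_sec are hypothetical figures.

def cost_per_1k_tokens(hourly_cost: float,
                       peak_tokens_per_sec: float,
                       mfu: float) -> float:
    tokens_per_hour = peak_tokens_per_sec * mfu * 3600
    return hourly_cost / tokens_per_hour * 1000

at_35_mfu = cost_per_1k_tokens(20.0, 2_000, 0.35)   # unbatched serving
at_70_mfu = cost_per_1k_tokens(20.0, 2_000, 0.70)   # vLLM-style batching

# doubling MFU exactly halves the effective per-token cost
assert abs(at_35_mfu / at_70_mfu - 2.0) < 1e-9
```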
For a 50-300 person company running 2-4 GPUs on-premises, this matters a lot. The three-tier hybrid architecture is the practical solution: run only consistent, high-volume workloads on-premises, and route variable or experimental workloads to cloud.
For deeper coverage, see our guide to optimisation techniques for your chosen AI inference infrastructure.
What is the three-tier hybrid AI architecture and why do most enterprises end up here?
The three-tier hybrid architecture routes workloads to the infrastructure tier where the unit economics are best. Per Deloitte’s Tech Trends 2026 research, it looks like this:
Tier 1 — Cloud (AWS, Azure, GCP): burst workloads, model training, new model evaluation, unpredictable or experimental inference. This is where you absorb uncertainty without committing CapEx.
Tier 2 — On-Premises (NVIDIA H100/H200 servers, served via vLLM): consistent, high-volume production inference where fixed costs get cheaper per inference at sufficient sustained volume.
Tier 3 — Edge: ultra-low-latency use cases requiring sub-50ms response — real-time fraud detection, on-device inference, industrial automation.
Hybrid is not a compromise. It is the expected architectural trajectory for organisations that have grown past early-stage experimentation.
Workload classification is the implementation task. Assign each workload to the most cost-effective tier using four dimensions:
- Volume: consistent, high → on-premises; variable, burst → cloud
- Latency: under 100ms → on-premises; under 50ms → edge; batch-tolerant → cloud
- Data sensitivity: regulated → on-premises preferred; public → cloud acceptable
- Unit economics: run the TCO comparison at each tier
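The four dimensions above can be expressed as a routing function. The thresholds, field names, and tier order below are illustrative defaults drawn from the figures in this article, not a standard.

```python
# Illustrative tier router for the three-tier hybrid architecture.
from dataclasses import dataclass

@dataclass
class Workload:
    daily_tokens: int
    consistent_volume: bool     # steady vs burst/experimental
    latency_ms_p99: float       # required p99 latency
    regulated_data: bool        # HIPAA, PCI, etc.

def route(w: Workload) -> str:
    if w.latency_ms_p99 < 50:
        return "edge"                         # sub-50ms requirement
    if w.regulated_data:
        return "on-premises"                  # compliance preference
    if w.latency_ms_p99 < 100:
        return "on-premises"                  # tight latency, no round trip
    if w.consistent_volume and w.daily_tokens >= 10_000_000:
        return "on-premises"                  # consistent high volume
    return "cloud"                            # burst, variable, experimental

assert route(Workload(50_000_000, True, 200, False)) == "on-premises"
assert route(Workload(2_000_000, False, 500, False)) == "cloud"
assert route(Workload(1_000_000, True, 30, False)) == "edge"
```

In practice the unit-economics dimension overrides the heuristics: a workload this function sends on-premises should still clear the TCO comparison before it migrates.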
When you are ready to migrate from cloud-only to hybrid, follow this sequence:
- Identify your highest-volume, most consistent production inference workloads
- Model TCO for those workloads on-premises using the six-component framework
- If the threshold ratio exceeds 0.60, migrate those workloads first
- Retain cloud for everything else
- Expand on-premises as volume grows
Organisations that implement hybrid workload routing correctly have documented 40-70% cost reductions versus all-API approaches.
For governance structures that manage multi-tier hybrid infrastructure cost over time, see our guide to AI infrastructure cost governance.
When does on-premises AI inference make sense for a 50-500 person company?
Most published TCO analyses serve either small research setups or Fortune 500 configurations. The 50-500 person SaaS, FinTech, or HealthTech company is underserved. So here is what the numbers actually look like for you.
Minimum viable scale heuristics:
- Token volume: 10 million+ tokens/day with consistent patterns. At 10M tokens/day, GPT-4o Mini (approximately $300/month) beats a self-hosted 7B model (approximately $850/month). At 50M tokens/day, self-hosted wins by a wide margin.
- GPU hours: 12+ GPU-hours/day of sustained inference — sufficient to achieve 60%+ MFU with vLLM batching.
- Time horizon: 3+ year product roadmap with stable model architecture.
- Team capacity: 0.5 FTE of DevOps/ML infrastructure already allocated. If it does not exist, add it to the TCO.
The open-source model breakeven is a compelling calculation. A self-hosted 7B model on a single H100 at 70% utilisation costs approximately $0.013 per 1,000 tokens. GPT-4o Mini is $0.15-$0.60 per 1,000 tokens — that is 10-46× more expensive at volume. At production volumes, breakeven arrives in 3-6 months.
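A hedged sketch of that breakeven calculation. The per-1,000-token rates come from the paragraph above; the CapEx figure for a single-H100 deployment is a hypothetical placeholder, and staffing is excluded for simplicity.

```python
# Months to break even on self-hosting vs API pricing.
# capex is a hypothetical single-H100 deployment figure.

def breakeven_months(capex: float, daily_tokens: float,
                     api_per_1k: float, selfhost_per_1k: float) -> float:
    monthly_savings = daily_tokens / 1000 * (api_per_1k - selfhost_per_1k) * 30
    return capex / monthly_savings

months = breakeven_months(
    capex=150_000,           # hypothetical hardware + deployment cost
    daily_tokens=10_000_000, # the minimum viable volume above
    api_per_1k=0.15,         # GPT-4o Mini lower bound
    selfhost_per_1k=0.013,   # self-hosted 7B at 70% utilisation
)
print(f"breakeven in {months:.1f} months")   # lands inside the 3-6 month range
```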
Data sensitivity can accelerate the decision. For regulated HealthTech and FinTech companies, on-premises inference avoids egress AND compliance risk simultaneously. One telehealth company cut monthly AI costs from $48,000 to $32,000 by moving chat triage to a self-hosted LLM, while simplifying its HIPAA compliance posture at the same time.
How do you build a business case for an AI infrastructure decision?
An infrastructure decision involving on-premises GPU hardware or a shift in cloud commitment requires board or CFO-level approval. Your job is to translate the technical and economic analysis into financial language: dollar figures, CapEx schedules, and breakeven timelines, not MFU percentages.
Here is a six-part structure that works:
1. Current state cost baseline: Monthly cloud AI spend at reserved instance pricing, egress charges, AI infrastructure cost as a percentage of engineering budget. ICONIQ’s 2026 benchmark puts inference at 23% of revenue at scaling-stage AI companies.
2. Threshold analysis: Cloud cost ÷ on-premises equivalent = threshold ratio. If it exceeds 0.60, proceed to full TCO.
3. TCO comparison: Cloud vs on-premises (or hybrid) over 3-year and 5-year horizons. Lenovo reference for 8× H100: breakeven at on-demand pricing is approximately 11.9 months; at 3-year reserved, approximately 21.8 months.
4. Risk and sensitivity analysis: What happens if token volume grows 3×? If agentic AI is on the roadmap, model the TCO with 5-10× current volumes.
5. Operational requirements: Staff cost delta in dollar terms. Translate “0.5 FTE” into: “approximately $60,000 in additional annual staffing cost, included in the TCO.”
6. Recommendation with trigger criteria: Tied to the threshold calculation, with explicit criteria that would change it. For example: “We recommend on-premises for workload X. If monthly token volume drops below 8M tokens/day, we will revisit.”
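The sensitivity step (point 4) can be sketched by re-running the threshold ratio at scaled token volumes. Every figure below is a hypothetical placeholder.

```python
# Sensitivity sketch: how the threshold ratio moves as token
# volume grows. All figures are placeholders.

onprem_monthly = 25_000        # amortised hardware + staffing delta
cost_per_1m_tokens = 40.0      # blended cloud cost per 1M tokens
base_daily_tokens = 10_000_000

for multiplier in (1, 3, 5, 10):          # today, 3x growth, agentic 5-10x
    daily = base_daily_tokens * multiplier
    cloud_monthly = daily / 1_000_000 * cost_per_1m_tokens * 30
    ratio = cloud_monthly / onprem_monthly
    print(f"{multiplier:>2}x volume: threshold ratio {ratio:.2f}")
```

Under these placeholder numbers the workload sits below the 0.60 trigger today but crosses it well before even a 3× volume increase, which is the board-level point of the sensitivity slide.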
You will also need to answer these objections:
“Cloud is always more flexible” — Correct for burst workloads. The hybrid architecture preserves that flexibility where it matters, while eliminating cloud costs on predictable production workloads where elasticity provides no benefit.
“We don’t have the staff” — The staffing cost is quantified in the TCO: 0.5-1.0 FTE at $X versus $Y in annual cloud savings. If the payback is unacceptable, narrow the hybrid scope to the highest-volume workloads only.
“What if GPU prices keep dropping?” — Per-token pricing has fallen roughly 10× year over year, yet total inference spending grew 320% over the same period: volume growth outpaces unit-price declines. Nor is cloud capacity uniformly getting cheaper; AWS raised capacity pricing in January 2026.
For ongoing governance of AI infrastructure costs post-decision, see our guide to AI infrastructure cost governance.
Frequently Asked Questions
What is the 60-70% cloud threshold and how do I measure it?
It is a ratio: cloud AI costs ÷ on-premises equivalent costs. When this ratio reaches 0.60-0.70, on-premises or hybrid economics become competitive. To measure it: (1) calculate monthly cloud AI spend at reserved instance pricing; (2) add egress costs; (3) price equivalent on-premises hardware amortised over 3 years plus staffing delta; (4) divide step 1+2 total by step 3 total. Source: Deloitte Tech Trends 2026, based on research with 60+ global technology leaders.
Does on-premises AI inference require dedicated IT staff?
Yes. The realistic requirement is 0.5-1.5 FTE depending on scale. For a 50-150 person company with an existing DevOps function (2-4 GPUs), 0.5 FTE is realistic. For a 200-500 person company with multiple GPU servers, plan for 1.0-1.5 FTE. Include this at $120,000+ fully loaded per FTE in your TCO — omitting it is the most common error in on-premises business cases.
What is the minimum inference volume that justifies evaluating on-premises?
The evaluation trigger is 10 million+ tokens/day with consistent patterns, or 12+ GPU-hours/day of sustained inference. Below these volumes, cloud reserved instances almost always produce better TCO when staffing costs are included. At volumes above 50 million tokens/day, the on-premises or hybrid case is almost always financially superior.
What is vLLM and why does it matter for the on-premises decision?
vLLM is an open-source LLM inference serving framework implementing continuous batching and PagedAttention — the primary techniques for improving GPU utilisation. Without continuous batching, sequential requests leave GPUs idle, producing the 30-50% MFU industry average. With vLLM, MFU can reach 60-80%, effectively halving per-inference cost. It is the de facto standard for self-hosted open-source model serving (Llama, Qwen, Mistral).
How is GPU utilisation different from what nvidia-smi reports?
nvidia-smi reports kernel scheduling activity, not actual Tensor Core efficiency. Model FLOP Utilisation (MFU) measures how much of the GPU’s theoretical throughput is used for productive work. A GPU can show 75-85% in nvidia-smi while achieving only 30-40% MFU, because memory fetches, attention overhead, and scheduling latency register as “active” without contributing to throughput.
Is it cheaper to run your own AI models or use OpenAI and Anthropic APIs?
At low volume (under 10 million tokens/day): API pricing almost always wins when staffing costs are included. At high volume (50 million+ tokens/day) with consistent workload patterns: self-hosting open-source models via vLLM can break even against API costs in 3-6 months, then produce 60-80% lower per-token costs. Data sensitivity can force the decision regardless of cost: regulated industries that cannot send data to third-party APIs must self-host.
How does agentic AI change the infrastructure decision?
Agentic AI has caused token consumption per task to jump 10-100× since December 2023. A single agentic workflow may make 10-50 API calls per user request versus 1-2 for a simple chatbot. If agentic AI is on your roadmap within 12-18 months, model the TCO with 5-10× higher token volumes — the threshold may be considerably closer than your current spending suggests.
What are data egress costs and why do they matter for cloud AI decisions?
Data egress charges are fees imposed by cloud providers when data moves out of their infrastructure. For data-intensive AI applications, egress typically adds 15-30% to total cloud AI spend; for high-bandwidth applications, it can reach 70%. On-premises inference avoids egress entirely for workloads where data remains within your network. Estimate your monthly data movement in GB, multiply by your provider’s egress rate, and add it to the cloud cost baseline before computing the threshold ratio.
What is the breakeven point for on-premises AI infrastructure?
Formula: (CapEx + cumulative operational costs) ÷ monthly cloud savings = breakeven in months. Lenovo reference data for an 8× H100 server: breakeven at on-demand pricing is approximately 11.9 months; at 3-year reserved pricing, approximately 21.8 months. Breakeven is highly sensitive to GPU utilisation: at 35% MFU it extends well beyond these figures; at 70%+ MFU (achievable with vLLM) it accelerates toward the 12-month end of the range.
What hardware should I evaluate for on-premises AI inference?
Current-generation: NVIDIA H100 (80GB HBM3) and H200 (141GB HBM3e). The Lenovo ThinkSystem SR675 V3 with 8× H100 GPUs is the enterprise reference at approximately $833,806. For mid-market: 2-4 NVIDIA A100 GPUs — $120,000-$300,000, appropriate for 10-30M tokens/day workloads. AMD MI300X is competitive for memory-bound inference but has a less mature software ecosystem. Always model hardware refresh cycles (3-5 year lifespan) with zero recovery value.
How do I classify AI workloads for a three-tier hybrid architecture?
Four dimensions: (1) Volume — consistent, high → on-premises; variable or burst → cloud; (2) Latency — real-time under 100ms → edge or on-premises; batch-tolerant → cloud; (3) Data sensitivity — regulated → on-premises preferred; public → cloud acceptable; (4) Cost per inference — run the unit economics at each tier and compare. On-premises candidates: high-volume consistent APIs (document processing, fraud scoring). Cloud candidates: model training, new model evaluation, burst demand.
The infrastructure decision is one component of managing AI inference costs at scale. For a complete overview of AI inference economics — from why the cost crisis exists through to pricing strategy and governance — see our overview of AI inference economics and the forces driving this crisis.