Inference now accounts for 80–90% of total AI compute costs across a model’s production lifetime, yet most guides throw every optimisation technique at you in random order and leave you to work out where to start. If the PoC bill shock has already hit, you don’t need a catalogue — you need a sequence.
This is that sequence. Three tiers, ordered by effort-to-impact ratio. Tier 1: zero-infrastructure API-level changes you can make today. Tier 2: configuration changes that take days to a week. Tier 3: structural engineering work for this quarter. Each tier tells you whether the next one is worth the investment.
One note before we get into it: if you’re building or running agentic AI, the same techniques apply — but at a 5–20x cost multiplier per user action. Optimising at the individual call level misses the point. You need to optimise the chain. More on that at the end.
For the broader economic context behind why inference costs are eating AI budgets, the AI inference cost crisis guide covers the full picture. This article is the action layer on top of that foundation.
Why Does the Order of AI Inference Optimisation Matter as Much as the Technique?
Most resources catalogue inference optimisation techniques without telling you which to do first. You end up with a list of equally weighted options when what you actually need is a priority stack.
The sequencing axis that matters is effort-to-impact ratio — not technique prestige or theoretical maximum savings. A team that enables prompt caching this week (zero infrastructure change, 50–90% cost reduction for qualifying workloads) gets a faster return than one that spends two months deploying a self-hosted vLLM stack with quantization. The data backs this up: teams that implement caching, routing, and batching before touching their serving infrastructure consistently outperform teams that go straight to structural interventions.
Here’s the effort spectrum: API configuration changes take hours. Serving configuration changes take days to weeks. Quantization pipelines take weeks to months. And the cost reduction spectrum doesn’t map to effort in the direction you’d expect. The lowest-effort interventions often produce the largest percentage reductions for API-heavy workloads.
Reversibility matters too. Tier 1 changes are reversible with a config flag. Tier 3 changes — quantization, model replacement — require rigorous accuracy validation before production. Higher tiers carry higher rollback risk, which matters when you’re touching production systems.
ICONIQ’s analysis of scaling-stage AI companies found that model inference averages 23% of total AI product costs — nearly as expensive as the entire AI team. The pressure to optimise is real. The question is just where to aim first.
What Are the Fastest AI Inference Cost Wins You Can Implement Today?
Three actions qualify as Tier 1 zero-infrastructure wins: enable API-level prompt caching, audit model routing, and shift latency-tolerant workloads to async batch processing. None of these require standing up new infrastructure, hiring ML engineers, or touching model weights.
The starting point for all three is a workload audit. Categorise requests by: (a) prompt repetition rate, (b) complexity requirements, and (c) latency sensitivity. That audit maps directly to which Tier 1 wins apply to your situation.
Prompt caching applies if your workload includes repeated system prompts or shared contexts. Model routing applies if you’re sending simple queries to expensive frontier models. Async batch processing applies if you have analytics jobs, nightly content moderation passes, or embedding generation that doesn’t need a real-time response.
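The audit-to-action mapping above can be sketched as a small classifier. This is an illustrative sketch, not a prescribed implementation: the field names (`prefix_repeat_rate`, `complexity`, `latency_sensitive`) and the 0.5 threshold are assumptions you would replace with whatever your request logs actually capture.

```python
# Sketch of a Tier 1 workload audit. Field names and thresholds are
# illustrative; adapt them to your own request logs.

def tier1_actions(request):
    """Map one logged request to the Tier 1 optimisations it qualifies for."""
    actions = []
    if request["prefix_repeat_rate"] > 0.5:    # shared system prompt / RAG context
        actions.append("prompt_caching")
    if request["complexity"] == "simple":      # FAQ, classification, extraction
        actions.append("route_to_cheaper_model")
    if not request["latency_sensitive"]:       # nightly jobs, embeddings
        actions.append("async_batch")
    return actions

print(tier1_actions({
    "prefix_repeat_rate": 0.8,
    "complexity": "simple",
    "latency_sensitive": False,
}))  # → ['prompt_caching', 'route_to_cheaper_model', 'async_batch']
```

Running this over a week of logged traffic gives you the distribution that decides which of the three wins to implement first.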
The wins also compound. A workload with 60% cacheable prompts and a 50/50 routing split between simple and complex queries can see 40–60% total cost reduction from configuration changes alone — before touching a single line of infrastructure.
How Does Prompt Caching Reduce LLM Inference Costs by 50–90%?
Prompt caching works by storing the computed key-value (KV) attention representations of repeated prompt prefixes. When the same prefix appears again, the API skips recomputation and charges only for the new tokens — reducing cost on cached tokens by 50–90%.
There are two distinct layers of caching to understand.
API-level prompt caching is exposed by Anthropic and OpenAI as a managed feature. No infrastructure change required. You restructure your prompts and the cost reduction happens automatically. Google’s Gemini API calls the same thing “context caching.” This is the Tier 1 entry point.
Infrastructure-level KV cache is the GPU memory mechanism in self-hosted inference engines. It serves the same purpose but lives at the infrastructure level — storing intermediate attention computations to avoid recomputing them for already-processed tokens. Storing the KV cache for a 500B parameter model over a 20,000-token context requires about 126GB of memory — which gives you a sense of the scale involved. This is a Tier 2 concern for teams running self-hosted inference.
For teams on managed APIs, the only implementation action is prompt structure. Stable, reusable content goes at the beginning — the prefix position where caching applies. Variable content, like the user’s actual query, goes at the end.
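Here is what that prompt structure looks like as an Anthropic Messages API request body. The `cache_control` block marker is Anthropic's documented mechanism for flagging a cacheable prefix; the model name and prompt text below are placeholders.

```python
# Stable, reusable content (system prompt, tool descriptions, policies)
# goes first and carries the cache_control marker; the variable user
# query goes last. Model name and prompt text are placeholders.

STABLE_SYSTEM_PROMPT = "You are a support assistant. <tool descriptions, policies>"

def build_request(user_query: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",          # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,   # stable prefix: cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # variable content goes after the cacheable prefix
            {"role": "user", "content": user_query}
        ],
    }

req = build_request("How do I reset my password?")
```

The structure is the whole trick: every request that shares the same prefix up to the `cache_control` marker pays the discounted cached-token rate on that prefix.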
How do you know if your workload qualifies? High qualification signals include static system prompts longer than 1,024 tokens, RAG retrieval contexts that repeat across users, few-shot examples embedded in every request, and multi-turn conversations where the system prompt is always present.
Cache hit rate is the key metric to monitor. Tensormesh’s LMCache CacheBlend reports 85% cache hit rates for agentic AI workloads with repeated tool descriptions in system prompts — at that hit rate, effective cost per request drops to near the cost of the variable suffix only.
As a rough illustration: a $10,000/month API spend on a RAG system with 70% cache-eligible requests could realistically land at $3,000–$4,500/month after caching. Your workload will vary, but the mechanism is consistent.
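The back-of-envelope model behind that illustration, as a sketch: it assumes the cached-token discount applies uniformly to the cache-eligible share of spend and ignores cache-write premiums, which is a simplification.

```python
# Simplified caching savings model: spend * (1 - eligible_share * discount).
# Ignores cache-write premiums and partial prefix matches.

def monthly_after_caching(spend, eligible_share, cached_discount):
    return spend * (1 - eligible_share * cached_discount)

spend = 10_000
for discount in (0.8, 0.9):   # top of the 50-90% discount range
    print(round(monthly_after_caching(spend, 0.70, discount)))
# → 4400, then 3700: inside the $3,000-$4,500 band quoted above
```

Note that landing in that band assumes the cached tokens earn a discount toward the top of the 50–90% range; at a 50% discount the same workload would land nearer $6,500.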
How Does Model Routing Stop You Overpaying Frontier Model Prices for Simple Queries?
Model routing inserts a lightweight classification layer between your application and your LLM. Simple queries go to smaller, cheaper models. Complex requests escalate to frontier models only when genuinely needed. ICONIQ identifies this as table-stakes cost management at scaling-stage companies — not an optional nicety.
The core problem is straightforward: most production inference workloads contain a significant fraction of requests that don’t require frontier model capability. FAQs, classification tasks, simple data extraction, templated responses — all priced at frontier rates because routing logic doesn’t exist.
Routing a query to Claude Haiku ($0.25/$1.25 per million input/output tokens) instead of Claude Opus ($15/$75) represents a 60x cost reduction for that request. Even routing 30% of your queries to cheaper models produces meaningful savings at any volume.
LiteLLM and Portkey are the primary open-source routing tools. Portkey supports conditional routing — “use cheaper model for summarisation, premium model for reasoning” — with fine-grained conditions and a unified API across multiple providers.
Routing strategies range from simple to sophisticated: rule-based (prompt length, keyword detection), classifier-based (a small model scores complexity), or threshold-based (use the cheaper model’s confidence score to decide escalation). Start simple. Rule-based routing based on prompt length and task type gets you most of the savings with minimal configuration overhead.
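A minimal rule-based router in the spirit described above might look like this. The tier names, keyword list, and 400-word length threshold are all illustrative assumptions, not production values.

```python
# Rule-based router: cheap tier by default, frontier tier for long or
# reasoning-heavy prompts. Keywords and threshold are illustrative.

COMPLEX_KEYWORDS = {"analyse", "compare", "plan", "reason", "prove"}

def route(prompt: str, task_type: str) -> str:
    """Return the model tier for a request, cheapest first."""
    if task_type in {"classification", "extraction", "faq"}:
        return "cheap-model"                    # e.g. a Haiku-class model
    if len(prompt.split()) > 400:               # long prompts: frontier tier
        return "frontier-model"
    if COMPLEX_KEYWORDS & set(prompt.lower().split()):
        return "frontier-model"
    return "cheap-model"

print(route("Which plan am I on?", "faq"))   # → cheap-model
print(route("Compare these two contracts", "reasoning"))   # → frontier-model
```

In practice you would put this behind a gateway like LiteLLM or Portkey rather than inline in application code, but the decision logic is the same.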
One thing to get right: validate routing decisions against a benchmark of accepted output quality for each task type before deploying to production. Route based on measured output quality, not expected output quality.
Multi-provider routing also adds resilience beyond cost. Enterprises processing millions of requests daily almost always hedge across two or more providers — routing across Anthropic, OpenAI, and a self-hosted endpoint gives you both price arbitrage and failover capacity.
How Do You Improve GPU Utilisation from 30–40% to 70–80% and Why Does It Matter?
This is Tier 2 territory — it requires self-hosted infrastructure. If you’re exclusively on managed APIs, your provider handles this. Focus Tier 2 effort on your own inference deployments. (If you haven’t yet settled your deployment model, the cloud vs on-premises deployment decision framework covers that decision before you commit to self-hosted infrastructure investment.)
Enterprise GPU clusters typically operate at only 30–40% utilisation. The reason is static batching: the serving engine waits for a fixed batch size before processing, then works through the whole batch before accepting new requests. The idle time while waiting is paid-for capacity doing nothing.
A 64 H100 GPU cluster at 40% utilisation at $3.50/GPU-hour costs $161,280/month total — of which roughly 60% is waste. Raising utilisation from 40% to 80% effectively doubles compute capacity without spending another cent on hardware.
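The arithmetic behind that cluster example, assuming a 720-hour billing month:

```python
# 64 GPUs * $3.50/GPU-hour * 720 hours/month.
gpus, rate, hours = 64, 3.50, 720
monthly = gpus * rate * hours
print(monthly)                                        # → 161280.0

def effective_rate(utilisation):
    """Cost per *useful* GPU-hour: the hourly rate spread over busy time."""
    return rate / utilisation

# Going from 40% to 80% utilisation halves the effective cost per GPU-hour.
print(effective_rate(0.40) / effective_rate(0.80))    # → 2.0
```

The effective-rate framing is why utilisation work beats hardware upgrades: the same invoice buys twice the useful compute.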
Continuous batching eliminates the wait. As each sequence in a batch completes, a new request slots in immediately — keeping the GPU continuously occupied. TensorRT-LLM calls this “in-flight batching”; the mechanics are the same. TGI and vLLM both implement it natively. Ollama does not.
vLLM pairs continuous batching with PagedAttention — its KV cache memory management system that treats GPU memory as virtual pages, eliminating fragmentation and enabling efficient memory sharing across concurrent requests. Three vLLM parameters determine whether a GPU saturates or wastes: --max-num-seqs, --gpu-memory-utilization, and --tensor-parallel-size.
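As a rough sketch, a vLLM launch exercising those three parameters might look like the following. The flags are real vLLM engine arguments, but the model name and values are placeholders to tune against your own workload and hardware.

```shell
# Hypothetical vLLM launch; model name and values are placeholders.
# --max-num-seqs: max concurrent sequences per batch
# --gpu-memory-utilization: fraction of VRAM vLLM may claim for weights + KV cache
# --tensor-parallel-size: number of GPUs to shard the model across
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 4
```

Raising `--max-num-seqs` and `--gpu-memory-utilization` trades per-request latency for throughput; measure TTFT under load before and after any change.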
One trade-off to monitor: time-to-first-token (TTFT) may increase slightly with continuous batching under high load. Track TTFT alongside GPU utilisation % and throughput (tokens/sec) — expose these through Prometheus/Grafana or your inference engine’s built-in metrics endpoint.
What Are the Best Structural Interventions for Reducing AI Inference Costs Long-Term?
Tier 3 requires dedicated engineering time, accuracy validation pipelines, and staging environments before production deployment. Don’t skip Tier 1 and Tier 2 before committing to this tier — the ROI justification should be based on validated earlier-tier data.
Here are the three structural interventions, in order of how frequently they’ll apply.
Model quantization reduces the numerical precision of model weights, shrinking model size and GPU memory footprint substantially. Quantization is the single biggest optimisation you can apply before touching your serving engine — it cuts VRAM requirements by 50–75% and lifts throughput by removing memory bandwidth bottlenecks.
The format decision maps to your hardware:
- FP8: Best on NVIDIA H100 hardware. Native Tensor Core support, under 1% perplexity delta versus FP16, 50% VRAM reduction. Start here if you have H100s.
- AWQ (Activation-aware Weight Quantization): Best for INT4 (4-bit) deployment on Ada Lovelace hardware (RTX 4090, RTX 6000 Ada). ~3% perplexity delta. Consistently outperforms GPTQ at the same bit-width.
- GPTQ: The mature fallback for Ampere and older GPUs (A100, V100). ~6% perplexity delta at 4-bit. Second choice to AWQ.
- GGUF: Only if running on Ollama or llama.cpp — CPU offload environments and local development.
Post-training quantization (PTQ) is the correct entry point — calibrate on a representative dataset, convert to the target format, validate accuracy, deploy. Quantization-aware training (QAT) adds training cost that is usually not justified.
One important caveat: benchmark accuracy drops and task-specific accuracy drops are different things. INT4 formats require task-specific validation — don’t assume sub-1% loss without measuring on your actual workload, particularly for legal, medical, or financial reasoning tasks.
vLLM with PagedAttention is the production serving standard for teams running self-hosted inference at scale. PagedAttention manages KV cache memory as virtual memory pages, dramatically reducing memory fragmentation and enabling efficient memory sharing across concurrent requests. vLLM supports over 100 model architectures and runs on NVIDIA GPUs from V100 through the current generation, AMD MI200/MI300, Google TPUs, AWS Inferentia, and Intel Gaudi — this breadth prevents vendor lock-in.
Speculative decoding is a latency optimisation, not primarily a cost optimisation. It pairs a small draft model with the large target model — the draft model generates 4–5 candidate tokens; the target model verifies them in a single forward pass. When draft tokens are correct (70–80% of the time for chat workloads), this delivers 1.8–2.2× speedup on generation throughput. Use it for latency-sensitive applications — real-time chat, voice interfaces, interactive coding assistants. Don’t prioritise it as a cost reduction measure.
vLLM vs Ollama vs LocalAI: Which Inference Serving Stack Is Right for Production?
Here’s an honest breakdown.
vLLM: Production standard for teams running self-hosted inference at scale. Continuous batching, quantization (FP8, AWQ), and speculative decoding in a single stack. Supports high-concurrency multi-user workloads.
The honest limitation: vLLM requires CUDA-compatible GPU hardware, Python inference stack knowledge, and ongoing monitoring configuration. Without a dedicated ML engineering resource, the operational burden is significant. Managed inference APIs have a lower operational cost even if per-token pricing is higher — and at moderate traffic, serverless costs 77% less than a 24/7 dedicated pod. The break-even point occurs when utilisation consistently exceeds 65–70% of an always-on deployment.
Ollama: Best for local development, single-developer environments, and testing model behaviour before production deployment. Uses GGUF format via llama.cpp; supports CPU offload. Does not implement continuous batching natively. Not suitable for multi-user production API backends. In practice, Ollama is frequently deployed in multi-user scenarios where it underperforms significantly.
LocalAI: Best for OpenAI API-compatible self-hosting where provider lock-in is the primary concern — niche regulatory or air-gapped environments. Production-readiness is lower than vLLM. Not the first choice for pure inference cost optimisation.
SGLang (for agentic workloads): optimised for structured generation and multi-call agentic pipelines. SGLang’s RadixAttention caches and reuses KV states across requests sharing common prefixes — in agentic pipelines where every tool-call response starts with the same system prompt, that prefix sharing reduces TTFT by 30–60% compared to naive per-request serving. The guidance from production teams is clear: use vLLM for chat, completions, and RAG; use SGLang when your workflow runs multiple sequential LLM calls.
The decision tree: continuous batching + quantization + speculative decoding at high concurrency → vLLM; local dev and testing → Ollama; OpenAI API compatibility in an air-gapped environment → LocalAI; agentic pipelines with structured output → evaluate SGLang.
Before committing to self-hosted vLLM, evaluate managed inference providers (Together AI, Baseten, RunPod). They provide vLLM-like performance without the operational overhead — and for many teams at the 50–500 person scale, that trade-off is worth it.
How Do Agentic AI Workloads Multiply Inference Costs and How Do You Manage Them?
Agentic AI systems multiply inference costs 5–20x per user action compared to single-call interactions. Each agent step is a separate inference call with its own context, tool descriptions, and reasoning output. A customer support agent that looks efficient at 100 tokens per interaction can easily use 2,000–5,000 tokens when a scenario requires multiple tool calls, context retrieval, and multi-step reasoning.
The implication for optimisation: think at the chain level, not the call level. Optimising individual call cost by 20% in a 10-call chain yields 20% savings across the chain. Restructuring the chain to eliminate three redundant calls yields roughly 30%, and that structural saving stacks with every call-level win you layer on top.
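The arithmetic, on a hypothetical 10-call chain at an illustrative $0.02 per call:

```python
# Call-level vs chain-level optimisation on a hypothetical 10-call chain.
calls, cost_per_call = 10, 0.02
baseline = calls * cost_per_call              # $0.20 per user action

per_call_opt = calls * cost_per_call * 0.80   # every call made 20% cheaper
chain_opt = (calls - 3) * cost_per_call       # three redundant calls removed

print(round(per_call_opt, 3))   # → 0.16  (20% saved)
print(round(chain_opt, 3))      # → 0.14  (30% saved)
```

And the two multiply: applying the 20% per-call saving to the restructured 7-call chain lands at $0.112 per action, a 44% total reduction.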
Prompt caching is disproportionately valuable for agentic workloads specifically because agent system prompts include tool descriptions, reasoning instructions, and context that repeat across every step. These are exactly the high-repetition prefixes that caching eliminates — and the same high cache hit rates apply here with even greater effect.
Model routing within the chain is an underexplored lever. Tool selection and simple classification steps can route to a smaller, cheaper model tier. Synthesis and reasoning steps escalate to the full model. Apply the same routing logic you set up in Tier 1, but at each step of the chain.
Track cost per agent execution — the full chain cost per user action — not cost per call. Dollar-per-decision is a better ROI metric for agentic systems than cost-per-inference because it captures both the cost and the business value of each autonomous decision.
For a full cost governance framework covering monitoring, alerting, and FinOps practice — including how to sustain the gains you make applying this playbook — the AI FinOps governance guide covers the next layer. This playbook is the optimisation layer; governance is what keeps it working at scale.
Frequently Asked Questions
Can I apply prompt caching without self-hosting my own models?
Yes. API-level prompt caching is a managed feature from Anthropic (Claude 3.x) and OpenAI (cached input tokens) — no infrastructure change required. Google’s equivalent is called “context caching” in the Gemini API. The only implementation action is restructuring prompts so stable content appears at the start. No GPU server, no model deployment, no DevOps overhead.
How much accuracy do I lose with INT8 quantization?
On standard benchmarks, INT8 quantization produces less than 1% accuracy degradation. FP8 (native on NVIDIA H100) offers higher accuracy with similar compression benefits — prefer FP8 if H100 hardware is available. Always validate on a representative sample of production queries before deploying quantized models, particularly for tasks sensitive to output precision.
Is vLLM suitable for a company with no dedicated ML infrastructure team?
Honestly, it is operationally demanding. Without a dedicated ML engineering resource, the operational burden is significant. Managed inference providers (Together AI, Baseten, RunPod) provide vLLM-like performance without the overhead. The self-hosting break-even typically becomes compelling when monthly managed API spend exceeds approximately $20,000–$50,000. Complete all Tier 1 optimisations first — some teams find that prompt caching and model routing bring managed API costs low enough that self-hosting is never justified.
What is the difference between KV cache and prompt caching?
KV cache is the GPU memory mechanism in inference engines that stores intermediate attention computations to avoid recomputing them — this operates at the infrastructure level in self-hosted serving stacks. Prompt caching is the API-level feature from Anthropic and OpenAI that exposes the same underlying mechanism as a managed service. Both serve the same purpose at different layers.
What is continuous batching and why does it matter more than a GPU hardware upgrade?
Continuous batching dynamically slots new requests into a running batch as completed sequences free up GPU capacity. Typical enterprise GPU utilisation without it: 30–40%. With it: 70–80%+. That improvement effectively halves cost per request on the same hardware — more impactful than a GPU hardware upgrade costing tens of thousands of dollars. Available natively in vLLM, TGI, and TensorRT-LLM. Not available in Ollama.
Which quantization format should I choose — AWQ, GPTQ, or FP8?
FP8 if running NVIDIA H100 hardware — native hardware support, highest accuracy, under 1% perplexity delta. AWQ for INT4 (4-bit) on Ada Lovelace hardware — ~3% perplexity delta, outperforms GPTQ at the same bit-width. GPTQ for Ampere and older GPUs — ~6% perplexity delta at 4-bit. GGUF only for Ollama or llama.cpp in CPU offload or local development environments.
How do I know when to move from managed APIs to self-hosted inference?
The primary signal is when monthly managed API spend reaches the point where self-hosted infrastructure TCO becomes cheaper over a 12–24 month horizon. At moderate traffic, serverless GPU costs 77% less than a 24/7 dedicated pod — the break-even occurs when utilisation consistently exceeds 65–70% of an always-on deployment. Run all Tier 1 optimisations first. Some teams find prompt caching and model routing bring costs low enough that self-hosting is never justified.
What is speculative decoding and when should I use it?
Speculative decoding pairs a small draft model with the large target model. The draft model generates 4–5 candidate tokens; the target model verifies them in a single forward pass. The primary benefit is latency reduction, not cost reduction. Use it for latency-sensitive applications: real-time chat, voice interfaces, interactive coding assistants. Don’t prioritise it as a cost reduction measure.
What is the difference between async batch processing and continuous batching?
Async batch processing (Tier 1): grouping latency-insensitive workloads — document analysis, embeddings, content moderation, nightly jobs — and submitting them to a deferred batch API. OpenAI’s Batch API offers a flat 50% discount for this class of request. No infrastructure change required. Continuous batching (Tier 2): a real-time serving strategy that groups concurrent incoming requests dynamically to maximise GPU utilisation. One is a scheduling strategy; the other is a serving strategy.
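The input format for OpenAI's Batch API is one JSON request object per line of a JSONL file. The field names and endpoint path below follow the documented format; the model name, `custom_id` scheme, and prompts are placeholders.

```python
# Building a Batch API input file: one JSON request per line.
import json

docs = ["Review A ...", "Review B ..."]

lines = [
    json.dumps({
        "custom_id": f"moderation-{i}",           # your own correlation id
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",               # placeholder model
            "messages": [{"role": "user", "content": f"Moderate: {doc}"}],
        },
    })
    for i, doc in enumerate(docs)
]

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))
# Upload the file, then create a batch with completion_window="24h"
# for the flat 50% discount.
```

Because results arrive asynchronously (within the completion window), this only fits the latency-insensitive workloads named above: nightly moderation passes, embeddings, bulk document analysis.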
How do I track whether my optimisations are actually working?
The primary metric is cost per million tokens, measured before and after each optimisation tier. For prompt caching, cache hit rate is the key leading indicator — most API providers expose this in usage dashboards. For model routing, track cost distribution by model tier alongside quality metrics by task type. For GPU utilisation work, monitor GPU utilisation %, TTFT, and throughput through Prometheus/Grafana or your inference engine’s built-in metrics endpoint. For a complete cost governance framework, see the AI infrastructure cost governance guide.
What is the agentic AI cost multiplier and why does it change the optimisation calculus?
Agentic AI systems make multiple sequential LLM calls per user action — tool selection, execution, result interpretation, response generation. A user action in an agent pipeline costs 5–20x more than an equivalent single-call interaction because each pipeline step accumulates token costs including repeated tool descriptions and reasoning context. Optimisations must be evaluated at the chain level. Prompt caching is disproportionately valuable here because the tool description system prompts that repeat across every chain step are exactly the high-repetition prefixes that caching eliminates.
This playbook is one part of a broader series on AI inference economics. For a complete AI inference cost guide covering the financial reality, infrastructure decisions, pricing strategy, and governance practice, see What the AI Inference Cost Crisis Means for Growing Software Companies.