Jun 17, 2026

Why AI Agents Are Bringing CPUs Back to the Centre of the Data Centre

For the last three years the story about AI infrastructure has been simple: you need GPUs. Lots of them. The CPU was the boring bit that booted the server and handed data off to the real silicon. A commodity afterthought.

That story broke in 2026. At Computex that year, Arm CEO Rene Haas coined the term “CPU renaissance” to describe something the industry was only beginning to admit: agentic AI is pulling CPUs back to the centre of the data centre.

The numbers tell you why. During the training era, CPU-to-GPU ratios settled at 1:4 to 1:8. Now those ratios are climbing toward 1:1 and beyond, and Arm estimates agentic-era demand at 120 million CPU cores per gigawatt, a fourfold increase. Research from Georgia Tech and Intel has quantified what is driving the shift, and hyperscalers are already provisioning for it. Here is how the picture fits together.

What is the CPU renaissance and why is it happening now?

The CPU renaissance describes a structural pivot where CPU provisioning has moved from commodity afterthought to core infrastructure strategy. It is not a cyclical bump or a marketing slogan, even if the name sounds like one.

The headline metric is the ratio. Counterpoint Research and TrendForce have both tracked the shift from training-era 1:8 toward 1:1 for agentic deployments. Arm’s 120-million-cores-per-gigawatt estimate may be conservative. Haas later told investors “we probably have undercalled the CPU demand” and suggested the number could go higher.

The timing is not accidental. 2025 and 2026 are when agentic AI crossed from research prototype to hyperscale production. As soon as AI systems began executing multi-step tool chains, the compute profile inverted. Every additional tool call multiplies CPU demand, and production agents routinely make five, ten, or more calls per request.

The supply side has responded. Intel and AMD raised prices across select CPU lines in early 2026. NVIDIA began selling its Vera CPU as a standalone product. Arm announced its first in-house CPU. A GPU company and an IP licensing firm both entering the CPU market in the same month is not a coincidence. It is a signal.

How does agentic AI differ from chatbot-style AI in terms of compute requirements?

Chatbot AI, the 2022 to 2024 paradigm, runs a single inference pass. The model receives a prompt, generates tokens, and returns a response. Stateless, single-turn, no external tools. The compute profile is overwhelmingly GPU-dominated because there is nothing happening between inference calls that requires a general-purpose processor.

Agentic AI breaks that model. A single user request can trigger a chain of actions: the agent plans execution on GPU, dispatches tool calls to APIs and databases on CPU, parses structured outputs on CPU, evaluates whether goals have been met on CPU, and potentially spawns sub-agents that repeat the entire loop, each adding its own CPU work. Every step between GPU inference calls is a CPU-bound operation.

A customer support chatbot processes one prompt and returns one answer. GPU time is the whole story. A coding agent that reads a repository, runs tests, fixes errors, and iterates across multiple tool calls has a distinct CPU-bound execution window between each GPU inference step. The GPU still matters, but the CPU determines end-to-end responsiveness.

Agentic systems also maintain persistent context windows and conversation state across multi-turn sessions. Serialising, storing, and reloading KV cache data is a CPU-side memory operation that becomes measurable under sustained long-context loads. The structural consequence is that agentic AI introduces entire workload categories that require general-purpose processors and did not exist in chatbot deployments.

What role do CPUs play in agentic AI orchestration and tool calling?

The CPU is the operating system of the agent loop. If the GPU does the thinking, the CPU runs the show. Orchestration is the CPU’s domain: planning the next action, dispatching tool calls, parsing structured outputs, managing memory and state between steps, and coordinating sub-agents. Intel describes it as the CPU serving as the control plane for increasingly agentic workloads.

Tool calling is the mechanism that makes this structural. Every time an agent invokes an external API, runs a code snippet, queries a database, reads a file, or spawns a sub-agent, that operation executes on CPU cores. In a multi-step agent chain, a single user request can trigger five, ten, or more tool calls, each compounding CPU load. A 10-agent pipeline making three tool calls per step needs roughly 30 simultaneous CPU threads serviced per inference round.
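The thread count in that example is simple multiplication, and it can be sketched in a few lines. The helper below is a hypothetical illustration, not part of any agent framework. It assumes each tool call holds one CPU thread and that all agents issue their calls at the same moment, so it gives an upper bound rather than a measurement.

```python
def concurrent_tool_threads(agents: int, tool_calls_per_step: int) -> int:
    """Upper bound on CPU threads busy in one inference round.

    Assumption for illustration: every tool call occupies one thread
    and all agents dispatch their calls simultaneously.
    """
    if agents < 0 or tool_calls_per_step < 0:
        raise ValueError("counts must be non-negative")
    return agents * tool_calls_per_step


# The 10-agent, three-calls-per-step pipeline described above.
print(concurrent_tool_threads(agents=10, tool_calls_per_step=3))  # 30
```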

Framework overhead is also CPU-bound. LangGraph’s state graph evaluation, the logic that determines which node in an agent workflow executes next, ranges from 5 to 50 milliseconds per agent step. Under high concurrency, that overhead compounds and becomes the dominant infrastructure cost. Protocols like MCP amplify this further. Each tool call triggers data retrieval, transformation, and formatting that runs entirely on CPU. Arm estimates multi-agent systems could drive a 15X increase in tokens per user from the compounding effect of tool calls alone.
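To see how a few milliseconds per step turns into an infrastructure cost, Little's law gives a rough estimate: cores kept busy equals step rate times time per step. The sketch below applies it to the 5 to 50 millisecond range quoted above. The fleet-wide rate of 1,000 agent steps per second is an assumed figure chosen for illustration, not a number from the research.

```python
def cores_busy(steps_per_second: float, overhead_ms: float) -> float:
    """Average cores occupied by per-step framework overhead.

    Little's law: busy cores = arrival rate x service time.
    """
    return steps_per_second * (overhead_ms / 1000.0)


# Hypothetical load: 1,000 agent steps per second across the fleet.
RATE = 1_000
low = cores_busy(RATE, 5)    # 5 ms per step
high = cores_busy(RATE, 50)  # 50 ms per step
print(f"{low:.0f} to {high:.0f} cores busy on graph evaluation alone")
```

Under those assumptions the framework overhead occupies between 5 and 50 cores before a single tool call has run.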

Why do tool-dominated agentic workloads consume up to 90% of end-to-end latency on the CPU?

The compounding effect is clear in principle. Georgia Tech and Intel have quantified exactly how large the gap is in practice. Their research profiled five representative agentic workloads and found that CPU-side tool processing accounted for 50 to 90 percent of total end-to-end latency.

Here is what that looks like. A user submits a complex query. The GPU infers the execution plan in about 200 milliseconds. The CPU dispatches an API query to a database, roughly 500 milliseconds including round-trip latency and response parsing. The GPU evaluates the result and decides the next step, another 200 milliseconds. The CPU dispatches a file read from object storage, roughly 400 milliseconds. This pattern repeats across five tool calls. Total GPU time: about one second. Total CPU-side time: roughly 2.5 seconds. The CPU consumes more than 70 percent of end-to-end latency with fast APIs. With slower tool calls, it pushes past 90 percent.
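The split in that walkthrough can be checked with a short calculation. The sketch below uses only the totals stated above (five GPU steps of about 200 milliseconds and 2.5 seconds of CPU-side tool time). The function names are illustrative. The second helper inverts the formula to show how slow the tool calls would need to be for the CPU share to reach 90 percent.

```python
def cpu_share(gpu_ms: float, cpu_ms: float) -> float:
    """Fraction of end-to-end latency spent on CPU-side work."""
    return cpu_ms / (gpu_ms + cpu_ms)


def cpu_ms_for_share(gpu_ms: float, share: float) -> float:
    """CPU-side time at which the CPU fraction reaches `share`."""
    return gpu_ms * share / (1.0 - share)


GPU_MS = 5 * 200  # five inference steps at about 200 ms each
CPU_MS = 2_500    # total tool time across five calls, from the text

print(f"CPU share: {cpu_share(GPU_MS, CPU_MS):.1%}")  # CPU share: 71.4%

needed = cpu_ms_for_share(GPU_MS, 0.90)
print(f"90% needs {needed:.0f} ms of tool time, {needed / 5:.0f} ms per call")
```

With the same one second of GPU time, the CPU share reaches 90 percent once the five tool calls average about 1.8 seconds each.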

Not all tool calls are equal. A local database query might add 50 milliseconds. A cloud API call with cold start might add 800. A sub-agent spawn with its own multi-step reasoning loop might add seconds. The variance in tool call latency, and the fact the agent cannot proceed until each tool returns, makes CPU-side processing the primary bottleneck for user-perceived responsiveness. Meta’s deployment of Graviton5 at tens of millions of cores for agentic workloads validates that this dynamic is a production concern at hyperscale, not a theoretical edge case.

How is reinforcement learning amplifying the demand for CPU cores?

The latency data is decisive for inference. But the CPU dependency extends further, into how these models are trained.

Unlike pretraining, where models learn from static datasets on GPUs, reinforcement learning training uses algorithms like PPO and GRPO to have models learn by interacting with environments. The model generates an action via GPU inference, executes it in a sandbox on CPU, receives a reward scored on CPU, and updates its policy via GPU gradient computation. A single training run can spawn hundreds or thousands of agent trajectories in parallel, each generating dozens of tool calls that execute on CPU cores.

Microsoft’s Fairwater data centres for OpenAI give you a sense of the scale involved. The configuration, one of the clearest real-world examples, allocates 48 megawatts of CPU and storage infrastructure to support 295 megawatts of GPU compute, a 1:6 CPU-to-GPU power ratio driven by RL workload requirements. On a standard eight-GPU node with 64 vCPUs, a GRPO rollout job calling a code-execution sandbox saturates all CPUs at around 32 concurrent rollout workers. GPUs sit idle waiting for scored trajectories while CPUs are pegged.
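Both figures in that paragraph reduce to simple ratios. The sketch below reproduces the power ratio and derives the vCPU cost per rollout worker implied by the eight-GPU node example. The linear scaling in `vcpus_needed` is an assumption for illustration. Real sandboxes will vary with the code they execute.

```python
CPU_STORAGE_MW = 48
GPU_MW = 295
print(f"GPU MW per CPU MW: {GPU_MW / CPU_STORAGE_MW:.1f}")  # 6.1

NODE_VCPUS = 64
SATURATING_WORKERS = 32
# Implied by the example above, not a measured value.
VCPUS_PER_WORKER = NODE_VCPUS / SATURATING_WORKERS


def vcpus_needed(rollout_workers: int,
                 per_worker: float = VCPUS_PER_WORKER) -> float:
    """vCPUs required for a rollout pool, assuming linear scaling."""
    return rollout_workers * per_worker


print(VCPUS_PER_WORKER)   # 2.0
print(vcpus_needed(128))  # 256.0
```

On those assumptions, a pool of 128 concurrent rollout workers needs about 256 vCPUs, four times what the standard node provides.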

Frameworks like veRL and OpenRLHF use Ray to decouple rollout workers on CPU from learner nodes on GPU, allowing independent scaling. When rollout workers cannot keep pace with learner nodes, GPU utilisation drops and training throughput stalls. The hardware response has been a new infrastructure category: dedicated CPU racks. NVIDIA’s standalone Vera rack packs 256 CPUs and over 22,500 cores. Arm’s AGI CPU rack puts 336 CPUs and 45,696 cores in a single rack. CPU-first compute for AI training. Two years ago that would have sounded absurd.
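The decoupling pattern can be illustrated without any distributed framework. The sketch below is a standard-library stand-in, not Ray's API and not how veRL or OpenRLHF are implemented: CPU-side rollout workers push scored trajectories onto a queue, and a learner consumes them in batches. All names and the placeholder reward are hypothetical.

```python
import queue
import threading


def rollout_worker(worker_id: int, n_trajectories: int, out: queue.Queue) -> None:
    """CPU-side stand-in for sandbox execution and reward scoring."""
    for step in range(n_trajectories):
        reward = float((worker_id + step) % 2)  # placeholder reward
        out.put((worker_id, step, reward))


def learner(inbox: queue.Queue, total: int, batch_size: int) -> list:
    """GPU-side stand-in: blocks on the queue and groups items into batches."""
    batches, batch = [], []
    for _ in range(total):
        batch.append(inbox.get())  # waits here if rollouts fall behind
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches


def run(n_workers: int = 4, per_worker: int = 8, batch_size: int = 8) -> list:
    q: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=rollout_worker, args=(i, per_worker, q))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    batches = learner(q, n_workers * per_worker, batch_size)
    for t in threads:
        t.join()
    return batches


batches = run()
print(len(batches), sum(len(b) for b in batches))  # 4 32
```

The blocking `inbox.get()` call is where the stall described above shows up: if the workers cannot fill the queue fast enough, the learner waits. Because the worker count is independent of the learner, the rollout pool can be scaled without touching the consumer.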

What CPU architecture features matter most for agentic AI workloads?

If you are provisioning for agentic workloads, the features that matter are different from what mattered in the training era. Core count and thread density are the headline metrics because every tool call, sub-agent, and state graph evaluation runs as an independent thread.

The current generation reflects this. AMD EPYC Venice offers 256 Zen 6c cores and 512 threads. Intel’s Clearwater Forest pushes 288 E-cores. NVIDIA Vera uses 88 custom Olympus cores with Spatial Multithreading, which physically partitions core resources between threads rather than time-slicing them. This provides per-thread latency isolation, the property you want when packing thousands of agent sandboxes onto one socket.

Memory bandwidth is the agent-state bottleneck. Conversation history, KV cache data, and tool call results must be held in memory and accessed at low latency. NVIDIA Vera’s LPDDR5X memory subsystem hits 1.2 terabytes per second, roughly twice the bandwidth of DDR5 at a fraction of the power. AMD Venice’s MRDIMM configuration pushes to 1.64 terabytes per second.

On-die matrix accelerators also matter. Intel’s AMX, built into every Xeon 6 core, accelerates INT8 and BF16 matrix operations natively. This makes it viable to run smaller models like Llama-3.1-8B on CPU-only infrastructure at roughly 10 tokens per second for a single user, scaling to hundreds of tokens per second under concurrency. It is production-viable for smaller models and hybrid workloads alongside GPU-based inference for large models.

Interconnect bandwidth determines how tightly CPUs can couple with GPUs. NVIDIA’s NVLink-C2C provides 1.8 terabytes per second bi-directional between Vera CPU and Rubin GPU. PCIe Gen6 and CXL 3.0 support memory disaggregation across racks. Some customer designs are now requesting a hundred PCIe lanes, compared to 16 lanes for AI training.

What CPU-to-GPU ratio should data centres target for agentic AI?

There is no universal answer because the right ratio depends on what your agents are doing. But the industry consensus has shifted from the training-era default.

For simple single-model inference with no tool calling, 4 to 8 vCPUs per GPU is still adequate. Add retrieval-augmented generation and you are looking at 8 to 16. Multi-agent pipelines with tool calling push to 16 to 24. Large-scale RL rollouts need 32 to 64 vCPUs per GPU, and at that point you should be decoupling CPU rollout pools from GPU inference nodes entirely.
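Those ranges can be turned into a small lookup for sizing a node or cluster. The sketch below encodes the four tiers exactly as stated above. The workload keys and the function name are made up for illustration.

```python
# vCPUs per GPU by workload class, taken from the ranges above.
VCPUS_PER_GPU = {
    "single_model_inference": (4, 8),
    "rag": (8, 16),
    "multi_agent_tool_calling": (16, 24),
    "rl_rollouts": (32, 64),
}


def vcpu_range(workload: str, gpus: int) -> tuple[int, int]:
    """Return the (low, high) vCPU count for a given number of GPUs."""
    low, high = VCPUS_PER_GPU[workload]
    return low * gpus, high * gpus


# An eight-GPU node running a multi-agent pipeline with tool calling.
print(vcpu_range("multi_agent_tool_calling", gpus=8))  # (128, 192)
```

For comparison, the same eight-GPU node with 64 vCPUs sits at 8 vCPUs per GPU, the top of the single-model tier.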

Finding your number is straightforward. Measure the ratio of CPU time to GPU time per request. When GPU utilisation drops while CPU cores are saturated, your ratio is too GPU-heavy. A CPU-to-GPU time ratio above 0.5 signals future bottlenecks. Above 1.0 means CPU is already the limiting factor.
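The two thresholds translate directly into a check you can run against per-request timings. The sketch below is one possible encoding. The labels are illustrative, and the input is assumed to be CPU-side and GPU-side seconds for the same request.

```python
def classify(cpu_seconds: float, gpu_seconds: float) -> str:
    """Classify a request by its CPU-to-GPU time ratio.

    Thresholds follow the text: above 0.5 is an early warning,
    above 1.0 means the CPU is already the limiting factor.
    """
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    ratio = cpu_seconds / gpu_seconds
    if ratio > 1.0:
        return "cpu_bound"
    if ratio > 0.5:
        return "at_risk"
    return "ok"


print(classify(cpu_seconds=0.3, gpu_seconds=1.0))  # ok
print(classify(cpu_seconds=0.8, gpu_seconds=1.0))  # at_risk
print(classify(cpu_seconds=2.5, gpu_seconds=1.0))  # cpu_bound
```

The last call uses the five-tool-call example from earlier in the article, which lands well inside CPU-bound territory.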

The trend line matters more than any single number. Arm estimates a fourfold increase in CPU cores per gigawatt. TrendForce projects ratios continuing toward 1:1 and beyond. As one analyst put it, if the GPU is waiting on the CPU tier to prepare the next step, the most expensive part of the cluster is underutilised because the cheaper part is underprovisioned. That is a poor trade even before you get to power and capex allocation.

Hyperscale operators have already internalised this. Meta is running tens of millions of Graviton5 cores and co-developing the Arm AGI CPU. OpenAI’s AWS deal explicitly covers tens of millions of CPUs for agentic workloads. If you are provisioning today using yesterday’s ratio assumptions, you will undershoot within a single procurement cycle.

What this means for data centre planning

Agentic AI has rewritten the compute profile. The work between inference calls (orchestration, tool calling, state management) has become the dominant cost, and that work runs on CPUs. Georgia Tech and Intel’s finding that CPU-side processing consumes 50 to 90 percent of end-to-end latency is not a worst case. It is the expected profile for multi-tool agent chains, and it means user-perceived responsiveness now depends on CPU performance more than GPU throughput.

Data centre procurement cycles run three to five years. Ratios provisioned for yesterday’s chatbot workloads will starve tomorrow’s agentic deployments of CPU cores, leaving expensive GPU clusters idle. The evidence is already in the market: dedicated CPU racks from NVIDIA and Arm, tens of millions of CPU cores contracted by Meta and OpenAI, and a 2026 server CPU market more contested than any in decades. The question now is which CPU architecture and ratio will let your agentic workloads succeed.

Frequently Asked Questions

Are GPUs becoming obsolete because of this shift?

No, GPUs remain essential. They do the thinking. The shift is not about replacing GPUs but about fixing the balance. Agentic AI creates work between GPU inference calls (orchestration, tool calling, state management) that structurally requires CPUs. Without CPUs to handle that work, GPUs sit idle waiting for tool results. The two processor types are becoming partners, not competitors. A well-provisioned agentic cluster needs both working in balance rather than a GPU monopoly.

Does every AI agent deployment need a higher CPU-to-GPU ratio?

Not uniformly. Simple single-model inference agents that answer questions without tools need ratios closer to the traditional 4 to 8 vCPUs per GPU. The ratio spikes when agents chain multiple tool calls, maintain long conversation state, or spawn sub-agents. A customer service agent answering FAQ-style questions might need 8 vCPUs per GPU. A coding agent that reads repositories, runs tests, and iterates across corrections might need 24 or more. Profile your workload before you invest.

If CPUs are back at the centre, can I skip GPU investment?

No. CPUs with on-die matrix accelerators like Intel AMX can run smaller models, roughly 10 tokens per second single-user for something like Llama-3.1-8B, but they cannot replace GPUs for the large-scale inference that agentic systems depend on. The CPU renaissance means provisioning enough general-purpose compute to prevent GPU idle time while GPUs continue to handle model inference. Both processor types are non-negotiable for production agentic deployments.

How does this shift affect someone running a small AI deployment, not a hyperscale data centre?

The architectural pattern scales down. Even a single server running an agentic application can hit the same bottleneck: tool calls stacking up while GPU utilisation drops. The fix is the same regardless of scale. Instrument your pipeline, measure CPU time versus GPU time per request, and right-size before you over-invest in GPUs that sit idle. Cloud providers are beginning to offer CPU-optimised instance types tuned for agentic workloads, making these ratios accessible without building your own data centre.
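Instrumenting a pipeline can start very small. The sketch below is a minimal, hypothetical phase timer, not a feature of any agent framework: wrap model calls in a "gpu" phase and tool calls in a "cpu" phase, then read off the ratio. It measures wall-clock time per phase, so time spent waiting on a remote tool counts toward the CPU side, matching how this article attributes tool latency. The `clock` parameter is an assumption added so the timer can be driven by a fake clock in tests.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped call raises.
            self.totals[name] += self._clock() - start

    def cpu_to_gpu_ratio(self) -> float:
        gpu = self.totals.get("gpu", 0.0)
        if gpu <= 0:
            raise ValueError("no GPU time recorded")
        return self.totals.get("cpu", 0.0) / gpu


timer = PhaseTimer()
with timer.phase("gpu"):
    time.sleep(0.01)  # stand-in for a model inference call
with timer.phase("cpu"):
    time.sleep(0.02)  # stand-in for a tool call
print(f"CPU-to-GPU time ratio: {timer.cpu_to_gpu_ratio():.2f}")
```

In a real deployment the two `time.sleep` calls would be replaced by the actual inference and tool invocations, and the totals would be exported to whatever metrics system is already in place.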

What happens if I keep my training-era CPU-to-GPU ratio for agentic workloads?

Your GPUs will spend measurable time idle waiting for CPU-bound operations to complete. User-perceived latency will degrade because each tool call in an agent chain adds hundreds of milliseconds of CPU-side processing that cannot be parallelised on the GPU. At scale, this means you are paying for GPU compute you cannot use. The diagnostic is straightforward: if GPU utilisation drops while CPU cores are saturated, your ratio is too GPU-heavy and you are bottlenecked on CPUs.

Are NPUs and TPUs affected by the same dynamic?

The dynamic is the same. NPUs and TPUs are specialised for matrix operations, much like GPUs, so they face the same structural problem: between every inference call sits CPU-bound orchestration work they cannot efficiently execute. While NPUs excel at inference throughput for certain model architectures, they lack the general-purpose execution capabilities needed for tool dispatch, structured output parsing, and state management. The CPU renaissance applies regardless of which accelerator does the inference. The orchestration layer still needs general-purpose cores.

When will this CPU-to-GPU ratio shift be complete, or is it ongoing?

It is ongoing and likely to keep shifting for years. Agentic AI is still in early production deployment. As agent systems grow more complex (longer reasoning chains, more sophisticated tools, richer environments for reinforcement learning), the CPU demand per GPU will continue rising. TrendForce projects the ratio will keep climbing toward and beyond 1:1. The practical implication: if you provision for today’s workload patterns using yesterday’s ratio assumptions, you will undershoot within a single procurement cycle. Continuous monitoring is essential.

What software changes are needed to take advantage of more CPU cores?

Most agentic frameworks, including LangGraph, CrewAI, and AutoGen, already use CPU cores for orchestration, tool execution, and state management. The software challenge is not about rewriting code but about configuring concurrency correctly. Distributed training frameworks like veRL and OpenRLHF use Ray to decouple CPU rollout workers from GPU learner nodes, allowing independent scaling. For inference serving, the key change is ensuring your agent runtime can dispatch tool calls across the full CPU core pool without serialising on a single thread. Framework-level parallelism settings become infrastructure-critical.
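As a rough illustration of dispatching tool calls without serialising on one thread, the sketch below fans independent calls out over a thread pool from Python's standard library. It is not taken from any of the frameworks named above. The two tools are hypothetical placeholders, and sizing the pool to `os.cpu_count()` is an assumption rather than a recommendation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def dispatch_tools(tool_calls, max_workers=None):
    """Run independent tool calls concurrently, preserving input order.

    `tool_calls` is a list of (function, args_tuple) pairs.
    """
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, *args) for fn, args in tool_calls]
        return [f.result() for f in futures]


# Hypothetical tools standing in for a file read and a database lookup.
def word_count(text: str) -> int:
    return len(text.split())


def lookup(key: str):
    return {"a": 1, "b": 2}.get(key)


results = dispatch_tools([
    (word_count, ("read the repo",)),
    (lookup, ("a",)),
])
print(results)  # [3, 1]
```

Threads suit tool calls that mostly wait on network or disk. In standard CPython, tools that do heavy computation in pure Python are limited by the global interpreter lock, so a `ProcessPoolExecutor` is the usual choice for spreading that work across cores.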

Why were CPUs dismissed as commodity components in the first place?

The training era, 2022 to 2024, genuinely made CPUs look like afterthoughts. Large-scale pretraining is overwhelmingly GPU-dominated: the model ingests static datasets and computes matrix multiplications, while CPUs primarily handle data loading and preprocessing. A 1:8 CPU-to-GPU ratio made economic sense because CPUs were never the bottleneck. Nobody was wrong about training-era ratios. The workload simply changed. Agentic AI introduced entirely new CPU-bound workload categories (orchestration, tool calling, state management, RL environment stepping) that did not exist at meaningful scale during the pretraining boom.

Is this CPU renaissance specific to AI, or does it affect other data centre workloads?

The renaissance is centred on AI infrastructure but the effects go wider. The same server platforms being designed for agentic AI (NVIDIA Vera, AMD EPYC Venice, Intel Clearwater Forest, Arm AGI CPU) also serve traditional cloud workloads. Higher core counts, better memory bandwidth, and improved interconnects benefit database servers, web hosting, and virtualised environments. AI demand is pulling the entire server CPU market forward, making 2026 the most competitive and innovative CPU market in decades. Non-AI workloads will inherit better processors as a side effect.
