How Bandwidth and Latency Constraints Are Killing AI Projects at Scale

Dec 29, 2025

AUTHOR

James A. Wondrasek

You’ve got a shiny new GPU cluster humming away in your data centre. You’ve hired the ML talent. You’ve got your datasets ready. And then… nothing. Your GPUs are sitting at 60% utilisation because they’re starving for data.

Here’s the problem: the share of organisations citing bandwidth as an infrastructure limitation jumped from 43% to 59%, while latency challenges surged from 32% to 53% in a single year. You’re hitting what people are calling the “networking wall.”

AI workloads generate fundamentally different network demands than traditional applications, creating bottlenecks that have nothing to do with your internet speed. We’re going to break down why AI stresses networks differently, give you a diagnostic framework, and lay out budget-appropriate solutions from SMB to enterprise scale. Because these constraints are a major contributor to the ROI gap problem facing AI infrastructure investments.

What Is Causing Bandwidth Constraints to Surge from 43% to 59% for AI Infrastructure?

AI workloads generate fundamentally different traffic patterns than traditional web applications. Training a single large language model can require moving petabytes of data between GPU clusters, with bandwidth demands growing 330% year-over-year.

Most enterprise networks were designed for north-south traffic – client to server. Not the east-west GPU-to-GPU communication that AI training demands.

Training modern LLMs requires orchestrating work across tens or hundreds of thousands of GPUs. Each training step involves synchronous gradient updates where every GPU exchanges gigabytes of data with every other GPU. This is completely different from traditional workloads.

Web applications primarily use request-response patterns with kilobyte to megabyte payloads. AI training? Continuous gigabyte to terabyte data streams flowing constantly between GPUs. It’s sustained, high-throughput bidirectional communication.

39% of organisations cite legacy systems lacking the capabilities required by modern AI workloads. And model parameter counts are doubling every 6-9 months, directly multiplying bandwidth requirements.

What you see in practice is underutilised GPUs – not from compute limits but from data starvation. Your expensive hardware is waiting on network transfers. The shift to “AI factory” architecture is exposing these inadequacies. And while 67% of CIOs/CTOs emphasise compute limitations, compute is only the visible bottleneck – the underlying data infrastructure is a compounding factor.

How Do AI Workloads Differ from Traditional Applications in Terms of Network Demands?

AI inference is computation-constrained rather than network-constrained. Unlike traditional web apps, where a 3-second load time counts as a failure, AI response generation takes 20+ seconds. That fundamentally changes what matters.

Network latency of approximately 150ms between New York and Tokyo becomes negligible when ChatGPT response generation takes roughly 20 seconds anyway. Traditional web apps are network-bound. AI shifts the constraint to GPU processing.

Training workloads behave differently. They require continuous bidirectional data streams between GPUs – east-west traffic – while traditional apps use intermittent client-server patterns – north-south traffic. Not bursty HTTP requests but sustained GPU-to-GPU synchronisation streams.

Bandwidth requirements are asymmetric. Training demands sustained high throughput for dataset movement. Inference needs burst capacity for real-time requests. And AI workloads allow pause and resume around network availability, unlike always-on traditional services.

For chat applications, users perceive responsiveness through Time-to-First-Token (TTFT). A 50ms TTFT with moderate tokens-per-second beats a 500ms TTFT with high-speed generation. This is a completely different user experience metric than traditional page load time.

Why Did Latency Challenges Surge from 32% to 53% in One Year?

Real-time AI applications proliferated. 49% of organisations now consider real-time AI performance important, with latency tolerances under 100ms. But only 31% have deployed edge infrastructure despite 49% requiring real-time responses. That 18 percentage point gap is forcing latency-sensitive workloads through inadequate central infrastructure.

AI is moving from experimental batch workloads to production systems serving real-time customer requests. 23% rated real-time AI performance as extremely important. When you’re in production, latency matters.

GPU-to-GPU communication requirements have tightened. Distributed training at scale requires under 10 microsecond inter-GPU latency to prevent compute stalls.

Hybrid cloud complexity makes it worse. 150ms+ latency between on-premises training and cloud inference creates architectural bottlenecks.

While only 15% cite network latency as a performance bottleneck, 49% have latency-sensitive requirements. That gap represents infrastructure that wasn’t built for the workloads it’s now running. And infrastructure placement choices become the determining factor in whether you can meet latency requirements.

How Do You Diagnose if Bandwidth or Latency Is Your Actual Bottleneck?

Monitor GPU utilisation first. If it’s consistently below 80% during training, you likely have a data pipeline bandwidth issue. Your compute is idle because it’s waiting on the network.
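A quick way to get that first signal, assuming NVIDIA GPUs with the standard driver tools installed, is to sample utilisation with nvidia-smi during a training run. A minimal sketch:

```python
import subprocess, time

# Sample GPU utilisation via nvidia-smi (assumes NVIDIA driver tools are on the PATH).
def gpu_utilisation():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]  # one value per GPU

samples = []
for _ in range(30):                       # roughly 5 minutes of samples during training
    samples.extend(gpu_utilisation())
    time.sleep(10)

avg = sum(samples) / len(samples)
print(f"average GPU utilisation: {avg:.0f}%")
if avg < 80:
    print("GPUs look data-starved - check the data pipeline and network before buying more compute")
```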

Measure Time-to-First-Token versus token generation rate for inference. High TTFT with normal generation indicates a network latency issue. Low TTFT with slow generation suggests a compute bottleneck.
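Here’s a minimal sketch of measuring TTFT and post-first-token throughput against a streaming inference endpoint. The URL, payload, and OpenAI-compatible streaming shape are assumptions – adapt them to whatever your serving stack actually exposes:

```python
import time
import requests

# Hypothetical OpenAI-compatible streaming endpoint - adjust URL, auth and payload to your stack.
URL = "http://localhost:8000/v1/completions"
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 128, "stream": True}

start = time.perf_counter()
first_chunk_at = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()   # Time-to-First-Token, approximated by first chunk
        chunks += 1

elapsed = time.perf_counter() - start
ttft = first_chunk_at - start
rate = chunks / max(elapsed - ttft, 1e-9)          # rough chunks-per-second after the first token
print(f"TTFT: {ttft * 1000:.0f} ms, ~{rate:.1f} chunks/s after first token")
```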

The problem is visibility. 33% face visibility gaps because traditional monitoring doesn’t capture AI traffic patterns. Your APM tools are watching north-south traffic while your AI workloads are generating east-west traffic between GPUs that you’re not monitoring.

Compare performance metrics across network paths. Test inference through different routes. If performance varies significantly by path, latency is a factor. If it’s consistently slow everywhere, look at compute or storage.

Baseline storage I/O first. Only 9% cite storage IOPS as the bottleneck, so if storage metrics are healthy, focus on the network.

As a practical threshold guide: GPU utilisation consistently under 80% during training points to a data pipeline bottleneck; high TTFT with normal token generation points to network latency; distributed training wants inter-GPU latency under 10 microseconds; and healthy storage IOPS rules storage out.

Deploy AI-specific network monitoring. You need tools that understand GPU telemetry, track east-west traffic, and provide real-time dashboards on GPU utilisation. This connects back to data readiness assessment methodologies – your diagnostics need to account for both data quality and the network infrastructure delivering it.

What Network Architecture Changes Are Needed to Support AI Workloads?

Shift from north-south to an east-west optimised network fabric for GPU cluster interconnection. Traditional Ethernet was designed for single-server workloads, not for distributed AI.

42% are already using high-performance networking, including dedicated high-bandwidth links. InfiniBand, NVLink, or 400G/1.6T optical interconnects deliver under 1 microsecond latency. That’s the class of interconnect you need.

InfiniBand is the gold standard for HPC supercomputers and AI factories, with collective operations running inside the network itself. It uses Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology, doubling data bandwidth for reductions.

NVLink takes it further. NVLink and NVLink Switch extend GPU memory and bandwidth across nodes, with the GB300 NVL72 system offering 130 TB/s of GPU bandwidth. With NVLink, your entire rack becomes one large GPU.

Implement AI-specific load balancing. 38% use application delivery controllers tuned for AI traffic patterns – batch processing characteristics instead of request-response patterns.

Separate AI traffic from general enterprise networks. Dedicated network fabric prevents AI workloads from saturating business services. You can use VLANs, dedicated physical networks, or software-defined networking. The key is isolation.

The hybrid architecture question: place latency-sensitive inference on-premises, flexible training workloads in cloud where bandwidth is abundant. This is where architecture decisions for cloud versus on-premises become determining factors.

What Are Budget-Appropriate Solutions for Different Organisation Sizes?

For SMBs and startups: leverage cloud provider infrastructure. They’ve already built GPU clusters with dedicated high-performance networking. Use quantisation to reduce bandwidth needs by 75% – 8-bit instead of 32-bit. Deploy CDN for inference distribution.

Mid-market ($100K-$1M range): hybrid approach with on-premises inference plus cloud training. Implement load balancing. Deploy edge computing for latency-sensitive use cases. Add AI-specific monitoring.

Enterprise ($1M+ budgets): build AI factory infrastructure with high-performance networking. Deploy distributed training clusters with InfiniBand or NVLink. Implement optical networking – silicon photonics switches offer 3.5x more power efficiency.

However, before any infrastructure investment, apply an optimisation-first approach. Quantisation, speculative decoding, and batch processing can reduce bandwidth requirements 30-50% without infrastructure investment. Start there.

The incremental upgrade path: start with monitoring and visibility, optimise workloads, then selectively upgrade network bottlenecks rather than wholesale replacement. 37% plan network capacity upgrades as highest priority.

ROI calculation is straightforward. If a $1M GPU cluster runs at 60% utilisation due to network bottlenecks, you’re wasting $400K annually in compute capacity. That justifies network investment if it recovers that utilisation.
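The arithmetic, using the figures above and a hypothetical upgrade quote, looks like this:

```python
# Back-of-envelope ROI check. The $250K upgrade figure is a hypothetical quote.
cluster_cost_per_year = 1_000_000
current_utilisation = 0.60
target_utilisation = 1.00
network_upgrade_cost = 250_000

wasted_per_year = cluster_cost_per_year * (target_utilisation - current_utilisation)  # $400K
payback_years = network_upgrade_cost / wasted_per_year

print(f"compute capacity wasted per year: ${wasted_per_year:,.0f}")
print(f"payback period if utilisation recovers: {payback_years:.1f} years")
```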

How Can You Optimise Existing Infrastructure Before Major Investment?

Apply quantisation first. Reduce model precision from 32-bit to 8-bit and you decrease bandwidth requirements 75% with minimal accuracy loss. Quickest win with near-zero cost.
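As one concrete illustration – a sketch, not a prescription – PyTorch dynamic quantisation converts Linear-layer weights from 32-bit floats to 8-bit integers on a stand-in model. Serving-stack-specific options (8-bit loading, GPTQ, AWQ and friends) achieve the same bandwidth effect by different means:

```python
import io
import torch
import torch.nn as nn

# Stand-in model; substitute whatever you actually serve.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()

# Dynamic quantisation: Linear weights become int8, roughly a 4x (75%) size reduction.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)        # bytes that would have to move over the wire
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model):.1f} MB, int8: {serialized_mb(quantized):.1f} MB")
```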

Implement speculative decoding. Use smaller draft models to reduce network round-trips. Speculative decoding introduces a lightweight draft model that achieves 60-80% acceptance rates, translating to 1.5-3x speedups.

How it works: the draft model generates several candidate tokens quickly, and the larger target model verifies them in parallel. It shines with high-predictability workloads like code completion.
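A toy sketch of that accept/verify loop, with trivial stand-in functions in place of real draft and target models:

```python
import random

def draft_next(tokens):
    # Cheap draft model: guesses the next token quickly.
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    # Expensive target model: agrees with the draft ~70% of the time in this toy.
    return (tokens[-1] + 1) % 50 if random.random() < 0.7 else (tokens[-1] + 2) % 50

def speculative_step(tokens, k=4):
    # Draft proposes k tokens sequentially (fast, cheap).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    draft_tokens = proposal[len(tokens):]

    # Target verifies the proposals; in practice this is one batched forward pass.
    accepted = []
    for tok in draft_tokens:
        expected = target_next(tokens + accepted)
        if tok == expected:
            accepted.append(tok)        # draft agreed with target: keep it for free
        else:
            accepted.append(expected)   # first disagreement: take the target's token and stop
            break
    return tokens + accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```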

Optimise batch processing. Group inference requests to maximise GPU utilisation and amortise network overhead across multiple requests.
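A minimal sketch of the difference with a stand-in PyTorch model – one forward pass per request versus a single pass over the stacked batch:

```python
import torch
import torch.nn as nn

# Stand-in inference model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

requests_in = [torch.randn(128) for _ in range(32)]   # 32 individual inference requests

with torch.no_grad():
    # Unbatched: one forward pass per request (and, in a serving setup, one round-trip each).
    unbatched = [model(x.unsqueeze(0)) for x in requests_in]

    # Batched: stack the requests and amortise the per-call overhead across all of them.
    batched = model(torch.stack(requests_in))

print(batched.shape)   # torch.Size([32, 10])
```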

Deploy checkpoint-based training. Schedule bandwidth-intensive training around off-peak network availability. AI training can pause and resume using checkpoints during planned downtime.
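In PyTorch terms that’s just persisting and restoring model and optimiser state at a known step – a minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # stand-in training job
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def save_checkpoint(step, path="ckpt.pt"):
    # Everything needed to resume: weights, optimiser state, and a progress marker.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(path="ckpt.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

# Pause when the off-peak bandwidth window closes, resume when it reopens.
save_checkpoint(step=1000)
print(f"resuming from step {load_checkpoint()}")
```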

Add network visibility tools before spending on upgrades. 33% lack AI-specific monitoring. Deploy tools capturing east-west traffic between GPUs. Without visibility, you might be solving the wrong problem.

Implementation difficulty varies. Quick wins: CDN deployment, quantisation – hours to days. Complex: distributed training optimisation – weeks to months. Start with the quick wins.

What Does the Future of AI Network Infrastructure Look Like?

AI factory architecture becomes standard. Integrated ecosystems purpose-built for AI rather than retrofitted general-purpose infrastructure.

Optical networking displaces copper for GPU-to-GPU connections. MOSAIC, a novel optical link technology, simultaneously provides low power and cost, high reliability, and long reach up to 50 metres. MOSAIC can save up to 68% of power while reducing failure rates by up to 100x.

Edge computing proliferates. Adoption sits at 31% and is growing rapidly as real-time AI applications require under-10ms local processing.

Network and compute co-design becomes the norm. AI accelerators with integrated high-bandwidth networking rather than separate components. The boundaries between network and compute blur.

Hybrid becomes default architecture. On-premises for latency-sensitive production, cloud for flexible development and training. The choice is based on data sovereignty and bandwidth considerations.

The technology roadmap for the next 1-3 years: optical networking protocols mature, GPU interconnects evolve, and the AI factory vision becomes accessible to mid-market organisations, not just hyperscalers.

The “network as accelerator” concept is emerging – network infrastructure becoming active part of AI computation rather than passive transport.

For strategic infrastructure investment, this all feeds back to the ROI gap problem. Network constraints are solvable, but you need to invest in the right places at the right time.

Conclusion

The 59% bandwidth and 53% latency constraint rates reflect a fundamental mismatch between AI workload patterns and traditional network architecture. Your GPUs are starving because your network wasn’t built for east-west traffic.

Measure before investing to confirm network is the actual bottleneck. 33% face visibility gaps, so deploy AI-specific monitoring before you spend on infrastructure upgrades.

From quantisation and CDN for startups to AI factories for enterprises, you have options. The incremental path works: visibility, optimisation, selective upgrades, integrated infrastructure.

The networking wall is real, but we have clear technical paths forward. Addressing bandwidth and latency is needed to close the ROI gap in AI infrastructure investments. Your expensive GPU compute only delivers ROI if it can actually access the data it needs to process.

FAQ Section

How much bandwidth do typical AI training workloads require?

Large language model training requires 100-400 Gbps sustained bandwidth per GPU. Training requires tens or hundreds of thousands of GPUs orchestrating massive calculations. A 1000-GPU training run generates 40-160 TB per hour of inter-GPU traffic. By comparison, a busy web application might use 10 Gbps peak bandwidth for thousands of concurrent users.

Can edge computing solve latency issues for all AI applications?

Edge computing effectively solves latency for inference workloads requiring under 100ms response times – IoT, manufacturing, autonomous systems. However, it doesn’t address training workloads requiring massive inter-GPU bandwidth, which remain centralised in data centres or cloud GPU clusters.

Why don’t cloud providers’ networks have these bandwidth problems?

Major cloud providers built GPU clusters with dedicated high-performance networking specifically for AI workloads. Hyperscalers were responsible for 57% of metro dark fibre installations between 2020-2024. Organisations citing 59% bandwidth constraints are predominantly running on-premises or hybrid infrastructure not originally designed for AI’s east-west traffic patterns.

Is latency or bandwidth more important for AI production deployments?

It depends on workload type. Inference workloads are latency-sensitive – TTFT matters – while training workloads are bandwidth-constrained – moving datasets. Production systems typically run inference – latency-critical – but require periodic training – bandwidth-critical – so you need infrastructure that addresses both.

How do you calculate ROI for network infrastructure upgrades?

Calculate the cost of underutilised GPU compute versus the network upgrade cost. If a $1M GPU cluster runs at 60% utilisation due to network bottlenecks, you’re wasting $400K annually – justifying network investment if it recovers that utilisation. Right-sizing GPU clusters with autoscaling tied to actual workload demand further reduces waste.

What is the difference between InfiniBand and NVLink?

InfiniBand is a network protocol providing under 1 microsecond latency for connecting separate servers in GPU clusters, and it is the gold standard for HPC supercomputers. NVLink is NVIDIA’s proprietary GPU-to-GPU interconnect within a single server, providing even lower latency but shorter reach. Large AI training uses both: NVLink within servers, InfiniBand between servers.

Can you retrofit existing data centre networking for AI?

Partial retrofits are possible: add high-performance switches for GPU clusters, implement separate VLANs for AI traffic, deploy AI-specific load balancers. However, many organisations find that building dedicated AI infrastructure delivers faster results than retrofitting legacy systems.

How does 5G affect AI latency requirements?

5G’s under 10ms latency enables new edge AI applications. However, 5G is the access network – your AI infrastructure still needs low-latency backhaul and processing at the edge or nearby data centres to meet end-to-end latency budgets.

Why did bandwidth issues surge from 43% to 59% in one year?

Model sizes doubled, organisations moved from experimental to production-scale deployments, and hybrid architectures exposed bandwidth limitations. Bandwidth purchased for data centre connectivity surged 330% between 2020 and 2024.

What monitoring tools detect AI-specific network bottlenecks?

AI workload monitoring requires tools capturing east-west traffic between GPUs. Solutions include GPU telemetry (nvidia-smi, DCGM), network performance monitoring (Prometheus plus Grafana), inference profiling (NVIDIA Triton metrics), and custom solutions for distributed training patterns.
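As a rough sketch, assuming a Prometheus server scraping the DCGM exporter, per-GPU utilisation can be pulled over Prometheus’s standard HTTP query API. The server URL and the metric/label names here are assumptions – check what your exporter actually publishes:

```python
import requests

PROM = "http://prometheus.internal:9090"               # hypothetical Prometheus server
query = "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"          # commonly published DCGM exporter metric

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "?")
    value = float(series["value"][1])
    print(f"GPU {gpu}: {value:.0f}% utilised")
```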

Is optical networking necessary for AI or just enterprise scale?

Optical networking becomes necessary at 100+ GPU scale or when GPU-to-GPU traffic exceeds 400 Gbps. Smaller deployments – 10-50 GPUs – can use high-end Ethernet switching. However, optical networking provides future-proofing as AI workloads grow, making it cost-effective for mid-market organisations planning 2+ year infrastructure lifecycles.

How does data locality affect bandwidth requirements?

Training on data stored locally – same data centre as GPUs – eliminates WAN bandwidth constraints. Cloud training with on-premises data requires continuous data transfer. Best practice: collocate data with training infrastructure or use a hybrid architecture placing training where data resides. This ties back to data readiness considerations for data pipeline planning.
