GPU Underutilisation, HetCCL, and the Emerging Cracks in Nvidia’s Hardware Moat

Apr 27, 2026

AUTHOR

James A. Wondrasek

Your GPU bill is probably larger than it needs to be. Research cited by Spectrocloud found that nearly half of enterprises are wasting millions on GPU capacity they are not using — and a Fujitsu study puts 75% of organisations running GPUs below 70% utilisation on average. At H100 prices, that is a material line item.

The two problems — GPU underutilisation and CUDA vendor lock-in — share the same root cause. Workloads tightly coupled to specific Nvidia hardware cannot move between GPU pools. So you over-provision to cover for that inflexibility, and you cannot use cheaper alternatives even when they would do the job.

HetCCL, described in arXiv paper 2601.22585, changes the calculus. For the first time, a library enables simultaneous cross-vendor GPU operation without code changes. It is not production-ready today, but it signals clearly where the industry is heading.

In this article we cover the GPU waste problem, the NCCL homogeneity trap, what HetCCL is and how it works, an honest assessment of AMD ROCm maturity in 2025–2026, a practical GPU utilisation audit methodology, and what a mixed Nvidia and AMD cluster looks like in practice. For the broader strategic picture, see our piece on Nvidia’s monopoly and the forces starting to challenge it.

Why are nearly half of enterprises paying for GPU capacity they are not using?

GPU underutilisation means your provisioned hardware is running well below its theoretical compute capacity for most of its operational life. The metric that matters is SM (Streaming Multiprocessor) utilisation, as reported by nvidia-smi or rocm-smi. Sub-70% average SM utilisation is the industry benchmark for underutilised. Below 50% is acute waste.

Three causes compound each other.

The first is defensive buying. Organisations purchase GPUs at anticipated peak demand as insurance against scarcity. The H100 market has normalised — H100s are available at $25,000–$40,000 and cloud rates as low as $1.49/hour — so the scarcity justification no longer holds. The over-provisioned clusters remain anyway.

The second is single-tenant cluster models. If teams cannot time-share capacity, one team’s idle period is wasted spend for the whole organisation.

The third is workload scheduling. AI jobs are bursty. Typical AI workflows spend 30%–50% of runtime in CPU-only stages, with GPUs sitting idle but locked to the job. CUDA dependency amplifies all three: workloads tightly coupled to specific GPU generations or driver versions cannot move between pools to fill idle capacity.

Before buying more hardware, measure what you have. For more on the cost impact of GPU underutilisation in your inference economics, see our dedicated piece.

How does NCCL create the homogeneity trap — and why is it harder to escape than CUDA itself?

NCCL (Nvidia Collective Communications Library) is the default backend for distributed training in PyTorch on Nvidia hardware. It handles collective operations — All-Reduce, All-Gather, Broadcast — that synchronise gradients across GPU nodes. Add a single AMD node and the communication layer breaks. NCCL will not talk to AMD GPUs.
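
To make the dependency concrete, here is a minimal sketch of the collective operation at the heart of distributed training, using standard PyTorch APIs. The single argument backend="nccl" is the homogeneity trap: every rank in the process group must be an Nvidia GPU.

```python
# Minimal PyTorch distributed all-reduce over NCCL, the collective that
# breaks the moment a non-Nvidia node joins the process group.
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and MASTER_ADDR/PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a different tensor, standing in for local gradients.
    grads = torch.full((4,), float(dist.get_rank() + 1), device="cuda")

    # All-Reduce: every rank ends up with the element-wise sum. With
    # backend="nccl", every participant must be an Nvidia GPU.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: {grads.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```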

This is the layer of lock-in most organisations miss. Lock-in operates at three layers: code, communications, and toolchain. Of the three, communications is often the harder constraint. You can port CUDA code to AMD’s HIP, but if NCCL is handling collective operations, the communication fabric cannot be swapped without re-engineering the distributed training job itself.

AMD’s equivalent, RCCL, has the same constraint in reverse: it works only on AMD-homogeneous clusters. MSCCL++ (Microsoft) and TorchComms (Meta) claim multi-vendor support, but both rely on compile-time selection of a single backend, so you still pick one vendor at build time. True simultaneous cross-vendor execution was not possible before HetCCL.

Even an organisation that has invested in ROCm and ported workloads to AMD cannot run a mixed cluster for distributed training. You end up maintaining two completely separate infrastructure stacks, doubling operational complexity without gaining any hardware flexibility. For more on the moat HetCCL is beginning to crack, see our breakdown of Nvidia’s networking infrastructure.

What is HetCCL and how does it enable mixed Nvidia and AMD GPU clusters?

HetCCL is the first cross-vendor collective communications library to enable deep learning training on heterogeneous clusters with both Nvidia and AMD GPUs — without source code modifications at the driver, runtime, compiler, or application level.

Rather than building a new communications library from scratch, HetCCL acts as an orchestration layer. It invokes NCCL for Nvidia-local collective operations and RCCL for AMD-local ones, then handles cross-vendor coordination via RDMA (Remote Direct Memory Access). Vendor-native optimisations are preserved on each side; HetCCL coordinates between them.

RDMA is the enabling technology. Standard inter-node communication routes data through the CPU at both ends. RDMA eliminates the CPU from the data path entirely, allowing GPU memory on one node to transfer directly to GPU memory on another via the network card. The receiving GPU does not need to know the sending GPU’s vendor.
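
The paper does not publish a public API, so the following is illustrative pseudocode only, sketching the two-level pattern described above. Every name in it (HetCCLGroup, rdma_link, and so on) is hypothetical, not HetCCL’s actual interface.

```python
# Illustrative pseudocode only; HetCCL's actual interface is not published.
# Sketch of the two-level all-reduce the paper describes:
#   1) reduce within each vendor island using the native library,
#   2) exchange partial results across vendors over RDMA,
#   3) broadcast the combined result back within each island.

class HetCCLGroup:                      # hypothetical name
    def __init__(self, local_lib, rdma_link, peers):
        self.local_lib = local_lib      # NCCL on Nvidia nodes, RCCL on AMD nodes
        self.rdma_link = rdma_link      # hypothetical GPU-to-GPU RDMA transport
        self.peers = peers              # leader ranks in the other vendor island

    def all_reduce(self, tensor):
        # Step 1: vendor-local reduction keeps NCCL/RCCL optimisations intact.
        partial = self.local_lib.all_reduce(tensor)

        # Step 2: island leaders exchange partials GPU-memory to GPU-memory,
        # with no CPU hop and no vendor check on either side.
        for peer in self.peers:
            self.rdma_link.send(partial, to=peer)
        combined = partial + sum(self.rdma_link.recv(frm=p) for p in self.peers)

        # Step 3: redistribute the global result inside each island.
        return self.local_lib.broadcast(combined, root=0)
```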

The straggler effect is the core performance challenge in heterogeneous clusters. Collective operations cannot complete until the slowest GPU finishes. HetCCL addresses this through GPU-aware micro-batch size adjustment: faster GPUs receive proportionally larger micro-batches based on profiled throughput in tokens per second.
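
In code, the adjustment is simple proportional arithmetic. A minimal sketch, assuming profiled throughput in tokens per second per GPU (the function name and rounding policy are illustrative, not the paper’s implementation):

```python
def partition_microbatches(global_batch: int,
                           tok_per_sec: dict[str, float]) -> dict[str, int]:
    """Split a global micro-batch across GPUs in proportion to profiled
    throughput, so fast and slow GPUs finish each step at roughly the
    same time and no one waits on a straggler."""
    total = sum(tok_per_sec.values())
    shares = {gpu: round(global_batch * tps / total)
              for gpu, tps in tok_per_sec.items()}
    # Hand any rounding remainder to the fastest GPU.
    fastest = max(tok_per_sec, key=tok_per_sec.get)
    shares[fastest] += global_batch - sum(shares.values())
    return shares

# Example: Nvidia GPUs profiled at 3x the tokens/sec of the AMD GPUs.
print(partition_microbatches(64, {"nvidia-0": 30_000, "nvidia-1": 30_000,
                                  "amd-0": 10_000, "amd-1": 10_000}))
# -> {'nvidia-0': 24, 'nvidia-1': 24, 'amd-0': 8, 'amd-1': 8}
```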

Benchmark results from the paper: on a 16-GPU, four-node mixed cluster, HetCCL achieved speedups of up to 1.48× over Nvidia-only training and 2.97× over AMD-only training using LLaMA-1B and LLaMA-3B with DeepSpeed ZeRO.

The caveats are real. Test hardware is older than current enterprise deployments (V100 and AMD W7800, not H100 and MI300X) and the cluster size is small. There is no commercial distribution, no enterprise support, no large-scale validation. Deploy it today and you are on your own. But it demonstrates vendor-agnostic collective communications are technically viable — and that changes the infrastructure planning horizon for anyone doing 3-year capex planning.

How mature is AMD ROCm for enterprise AI workloads in 2025–2026?

AMD ROCm is AMD’s open-source GPU computing stack — HIP programming model, hipcc compiler, RCCL, and native support in PyTorch and TensorFlow. ROCm 7.2 adds GEMM kernel tuning across FP8, BF16, and FP16 on MI300X, SR-IOV for multi-tenant deployments, and RCCL improvements for distributed training.
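
One consequence of that native support is worth showing: ROCm builds of PyTorch expose AMD GPUs through the existing torch.cuda namespace, so most CUDA-targeted Python code runs unmodified. A quick check of which stack you are actually running on:

```python
import torch

# ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API, so
# CUDA-targeted training code usually runs without modification.
if torch.cuda.is_available():
    # torch.version.hip is set on ROCm builds; torch.version.cuda on CUDA builds.
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"backend: {backend}")
    print(f"device:  {torch.cuda.get_device_name(0)}")
else:
    print("no GPU visible to PyTorch")
```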

Where it is viable: PyTorch integration is first-class. CUDA typically outperforms ROCm by 10%–30% in compute-intensive workloads, but AMD hardware is 15%–40% cheaper. For large model inference, SemiAnalysis benchmarks from May 2025 found MI300X beats H100 in absolute performance and performance per dollar for Llama3 405B and DeepSeekV3 670B on directly owned infrastructure.

Where it still trails: DevOps complexity. ROCm setup requires more Linux expertise and manual intervention than CUDA. The toolchain is less mature, some specialised CUDA libraries have ROCm ports that lag, and enterprise documentation is thinner. The code migration is not the hard part; keeping the environment running is.

AMD’s HIPIFY tool translates CUDA source to HIP-compatible code, with 80%–95% automated for standard workloads. SCALE and ZLUDA run CUDA binaries directly on AMD hardware without code modification — a lower-barrier path if you want to experiment before committing to a full migration.

ROCm is viable for organisations with Linux-native infrastructure, strong DevOps capability, and standard workloads. For a deeper look at the software-layer complement to HetCCL — CUDA lock-in and ROCm — see our dedicated piece.

How do you run a GPU utilisation audit before making any new capex decisions?

A GPU utilisation audit measures actual versus provisioned compute capacity, identifies idle periods, diagnoses root causes, and establishes a baseline before you commit to additional hardware spend. It costs nothing and should happen before any procurement decision.

Step 1 — Measure. Use nvidia-smi (Nvidia) or rocm-smi (AMD) to capture SM utilisation, memory utilisation, and power draw at regular intervals over 2–4 weeks. Prometheus and Grafana are the standard open-source stack for time-series collection and visualisation.
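
If you want a zero-dependency starting point before wiring up Prometheus, a small poller like the sketch below works (the 60-second interval and output path are arbitrary choices; rocm-smi needs different flags on AMD nodes):

```python
# Minimal utilisation sampler: polls nvidia-smi once a minute and appends
# one CSV row per GPU.
import subprocess
import time
from datetime import datetime, timezone

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory,power.draw",
    "--format=csv,noheader,nounits",
]

with open("gpu_samples.csv", "a") as out:
    while True:
        stamp = datetime.now(timezone.utc).isoformat()
        for row in subprocess.check_output(QUERY, text=True).strip().splitlines():
            out.write(f"{stamp},{row}\n")  # ts, gpu index, SM %, mem %, watts
        out.flush()
        time.sleep(60)
```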

Step 2 — Establish a threshold. Sub-70% average SM utilisation is the industry benchmark for underutilised — MLPerf uses exactly this threshold. Build a heatmap by hour-of-day and day-of-week to make structural idle periods visible to non-technical stakeholders.
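
From the samples collected in step 1, a pandas pivot produces that hour-by-day view directly (the column names follow the sampler sketch above and are an assumption, not a standard):

```python
import pandas as pd

# Column names follow the sampler above; adjust to your own collector.
cols = ["ts", "gpu", "sm_util", "mem_util", "power_w"]
df = pd.read_csv("gpu_samples.csv", names=cols, parse_dates=["ts"])

# Mean SM utilisation per (day-of-week, hour-of-day) cell. Structural idle
# periods, like nights and weekends, show up as cold rows and columns.
heatmap = df.pivot_table(
    index=df["ts"].dt.day_name(),
    columns=df["ts"].dt.hour,
    values="sm_util",
    aggfunc="mean",
)
print(heatmap.round(1))
```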

Step 3 — Diagnose root causes. Distinguish genuine workload absence (no jobs scheduled) from infrastructure inefficiency (jobs scheduled but not running due to scheduling contention, memory fragmentation, or queue starvation). Different causes, different fixes.

Step 4 — Calculate the waste cost. Multiply idle GPU-hours by your effective cost per GPU-hour — blended capex amortisation plus opex. Break-even analysis suggests purchasing only makes sense at 60%–70%+ continuous utilisation. This number becomes your financial justification for investment in better scheduling or hardware diversification.
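
The arithmetic is worth writing down once. All figures below are placeholders; substitute your own fleet size, blended rate, and measured utilisation:

```python
# Placeholder figures: substitute your fleet size, blended $/GPU-hour
# (capex amortisation plus power, hosting, and support), and measured
# utilisation from step 1.
gpus = 64
hours_per_month = 730
cost_per_gpu_hour = 3.00
measured_utilisation = 0.55

idle_hours = gpus * hours_per_month * (1 - measured_utilisation)
monthly_waste = idle_hours * cost_per_gpu_hour

print(f"idle GPU-hours per month: {idle_hours:,.0f}")
print(f"waste cost per month:     ${monthly_waste:,.0f}")
# 64 GPUs at 55% utilisation and $3/GPU-hour leak roughly $63,000 a month.
```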

Step 5 — Set a target. Define a utilisation target — 80% average SM utilisation is reasonable — and re-measure after each infrastructure change.

If existing utilisation is below 70%, fix scheduling and cluster architecture first. Adding hardware to a poorly scheduled cluster does not solve the problem.

How do you build a heterogeneous cluster using both Nvidia and AMD GPUs?

A heterogeneous AI cluster mixes GPU nodes from different vendors in the same distributed workloads. HetCCL makes it technically possible. A few hard requirements apply.

Hardware layer. Nvidia H100 or A100 nodes and AMD Instinct MI300X nodes connected via a high-bandwidth, low-latency network. RDMA capability is mandatory — InfiniBand or 400GbE with RoCE support. Without an RDMA-capable fabric, cross-vendor collective communications are not viable.

Software stack. Each vendor’s nodes run their respective stacks — CUDA/NCCL on Nvidia, ROCm/RCCL on AMD. HetCCL sits above both as the orchestrating layer, routing collective operations across vendor boundaries via RDMA.

Workload scheduling. Kubernetes with GPU operator plugins or Slurm with hardware-aware job affinity handles initial workload placement. Your scheduler needs to understand vendor boundaries to place distributed training jobs correctly.

Where to start. Put AMD Instinct nodes in the inference tier and keep Nvidia for training. MI300X and MI325X outperform H100 and H200 on performance per dollar for large model inference at sub-40-second latencies. This sidesteps the straggler effect entirely, builds your team’s ROCm operational knowledge incrementally, and delivers immediate cost savings.

The DevOps overhead is real: two driver stacks, two profiler toolchains, two sets of performance tuning knowledge. Budget it explicitly. At scale, the hardware cost differential usually exceeds the extra engineering hours. If you rent cloud capacity rather than own it, AMD’s limited availability there is a current constraint.

Why do GPU waste reduction and heterogeneous compute adoption converge at this moment?

GPU underutilisation and the heterogeneous compute opportunity share the same root cause. Vendor dependency enforced by CUDA creates both the inflexibility that drives underutilisation and the barrier to adopting cheaper AMD hardware. Treat them as separate problems and you miss the leverage point.

An organisation that finds 50%–70% utilisation rates in its GPU audit actually has two opportunities at once: fix scheduling to extract more value from existing hardware, and evaluate AMD Instinct as a partial replacement when the next hardware expansion cycle arrives — rather than defaulting to another round of Nvidia procurement.

HetCCL changes the planning horizon even at research stage. It signals that the research community and major framework maintainers are investing in a world where NCCL homogeneity is not a permanent constraint. AMD’s datacentre AI GPU market share has grown steadily since Q1 2023. AMD, Intel, Broadcom, and Meta are converging around open standards including PyTorch 2.0, OpenAI’s Triton, and the Ultra Ethernet Consortium as an alternative to Nvidia’s proprietary stack.

Here is the action sequence, ordered by effort and risk:

1. Run the utilisation audit now. It is zero-cost and produces the financial justification for any subsequent infrastructure change.

2. Evaluate ROCm for your specific workload profile. Start with inference, where AMD Instinct GPUs are strongest on a performance-per-dollar basis.

3. Monitor HetCCL development. Watch for an enterprise distribution or commercial support arrangement. The concept works. The question is when, not if, it reaches production readiness.

4. Include AMD Instinct in vendor evaluation for any hardware expansion planned for 2026–2027. The hardware is competitive, the ecosystem is improving, and the cost differential is meaningful at scale.

For the full procurement decision framework, see our piece on vendor diversification as an actionable step in your procurement strategy. For the full strategic picture of Nvidia’s hardware empire and what it means for your organisation, see our comprehensive overview.

Frequently Asked Questions

What is GPU underutilisation and how is it measured?

GPU underutilisation is when provisioned hardware runs below its theoretical compute capacity for a significant portion of operational time. The primary metric is SM utilisation, reported by nvidia-smi or rocm-smi. MLPerf benchmark standards use 70% as the threshold for adequate utilisation. Below that is underutilised; below 50% is acute waste.

What causes GPU underutilisation in enterprise AI infrastructure?

Three primary causes: defensive over-provisioning, single-tenant cluster architectures that prevent time-sharing, and bursty workload scheduling. Typical AI workflows spend 30%–50% of runtime in CPU-only stages while GPUs sit locked. CUDA version pinning amplifies the problem by preventing workload mobility across GPU pools.

What is HetCCL and is it ready for production use?

HetCCL is the first cross-vendor collective communications library enabling distributed training on heterogeneous clusters with both Nvidia and AMD GPUs, without source code modifications at any level. As of 2025–2026 it is research-stage technology tested on a 16-GPU, four-node cluster. Not production-ready, but a credible proof of concept.

What is the difference between NCCL and HetCCL?

NCCL supports only Nvidia-homogeneous GPU clusters — add a single AMD node and distributed training breaks. MSCCL++ and TorchComms support multiple vendors but require compile-time backend selection, preventing simultaneous cross-vendor execution. HetCCL removes that constraint using RDMA to enable collective operations across a mixed Nvidia and AMD cluster simultaneously, without code changes.

Can I mix Nvidia and AMD GPUs in the same training cluster today?

In limited configurations, yes. HetCCL achieves speedups of up to 1.48× over Nvidia-only training and 2.97× over AMD-only training in a 16-GPU test cluster. Enterprise production deployment requires RDMA-capable networking, ROCm expertise, and tolerance for research-stage maturity. The practical near-term approach is AMD GPUs in the inference tier and Nvidia GPUs for training.

How mature is AMD ROCm for production AI workloads in 2026?

ROCm 7.2 is production-viable for standard training and inference workloads on Linux with PyTorch. Performance typically runs 10%–30% behind equivalent Nvidia H100 hardware. Hardware pricing is 15%–40% cheaper. The main barriers are DevOps complexity and weaker toolchain support for specialised CUDA libraries.

What is RDMA and why does it matter for heterogeneous GPU clusters?

RDMA (Remote Direct Memory Access) allows GPU memory on different nodes to be accessed directly over the network without CPU involvement. GPUDirect RDMA achieves a direct transfer path between GPU nodes, eliminating the CPU hop for higher bandwidth and lower latency. HetCCL depends on RDMA. Without an RDMA-capable fabric — InfiniBand or RoCE — heterogeneous cluster deployment is not viable.

What is the straggler effect and how does HetCCL address it?

In distributed training, collective operations cannot complete until the slowest GPU finishes. HetCCL mitigates this through GPU-aware micro-batch size adjustment: faster GPUs receive proportionally larger micro-batches based on profiled throughput.

How do I migrate CUDA code to AMD ROCm?

AMD’s HIPIFY tool translates CUDA function calls to HIP equivalents. For most standard workloads, 80%–95% of the translation is automated. Remaining manual work concentrates in CUDA-specific intrinsics, custom kernels, and CUDA-native libraries without HIP equivalents. SCALE and ZLUDA run CUDA binaries directly on AMD hardware without code modification — a lower-barrier path to experimentation.

What are the DevOps implications of running a mixed Nvidia and AMD GPU cluster?

Two GPU driver stacks, two profiler and debugger toolchains, and two sets of performance tuning knowledge. Budget it explicitly in your TCO model. The 15%–40% hardware cost savings on AMD need to exceed the additional engineering hours. For organisations that own their infrastructure at scale, the economics are often favourable.

How does CUDA lock-in extend beyond just the programming model?

CUDA lock-in operates at three layers: code (training pipelines require porting), communications (NCCL dependency enforces Nvidia-homogeneous clusters), and toolchain (CUDA-native profilers, debuggers, and optimisation libraries). Most organisations focus on layer one. Layer two — NCCL — is often the harder constraint. Nvidia’s acquisition of SchedMD in December 2025 (makers of Slurm, running on 65% of TOP500 supercomputers) extends lock-in further into workload scheduling.

Is AMD a credible enterprise GPU alternative to Nvidia in 2026?

AMD Instinct MI300X is a credible alternative for organisations with Linux-native infrastructure, strong DevOps capability, and standard workloads. For directly owned clusters, MI300X beats H100 in absolute performance and performance per dollar for Llama3 405B and DeepSeekV3 670B. The gap is in software ecosystem depth — tooling, library support, documentation — rather than raw hardware performance. AMD’s datacentre AI GPU market share has been growing steadily since Q1 2023.
