Business | SaaS | Technology
Apr 27, 2026

CUDA Lock-in Unpacked — The Software Moat, the Real Switching Costs, and How They Are Changing

AUTHOR

James A. Wondrasek

Most enterprise AI teams know they have a CUDA dependency. Fewer have actually mapped how deep it goes, or recognised that it is really two distinct dependencies that need two different fixes.

Here’s the thing that trips people up: CUDA lock-in operates at a software/framework layer (libraries, toolchain, developer ecosystem) and separately at a networking/collective communications layer (NCCL, NVLink). If you conflate them, your remediation will be incomplete. A team can fully migrate their model code to ROCm and still be locked at the networking layer the moment they run distributed training using NCCL for inter-node communication.

Conditions are shifting in 2025–2026. ROCm 7.x has narrowed the performance gap. HetCCL (January 2026) introduced the first viable candidate for abstracting the networking layer. Triton adoption inside PyTorch 2.x is already providing software-layer portability for teams that haven’t explicitly sought it. “CUDA lock-in is eroding” is simultaneously correct and insufficient — how fast it’s eroding depends entirely on which layer you’re stuck on and what workloads you run.

This article gives you four concrete things: a switching cost breakdown, an honest 2026 ROCm assessment by workload type, a six-dimension CUDA dependency audit framework, and three portability paths with decision criteria. For the broader strategic context, see the full picture of Nvidia’s competitive moat.

What are the two layers of CUDA lock-in — and why do they need different solutions?

CUDA lock-in operates at two structurally distinct layers. They require different solutions.

The software/framework layer covers everything your code touches directly. That’s the CUDA libraries (cuBLAS for linear algebra, cuDNN for deep neural network primitives, cuSPARSE for sparse matrix operations, TensorRT for inference optimisation), the CUDA toolchain (nvcc compiler, Nsight profiler), and two decades of accumulated developer knowledge, framework integrations, and CI/CD infrastructure.

The networking/collective communications layer sits below the framework layer. Nvidia’s NCCL handles multi-GPU, multi-node communication primitives — AllReduce, Broadcast, AllGather — and is deeply integrated with NVLink and InfiniBand fabrics. AMD has its own equivalent in RCCL. These libraries are not interoperable. You cannot run parallel training operations across GPUs from different vendors simultaneously — or at least you couldn’t until January 2026.

Why this distinction matters: software-layer solutions (ROCm, HIP, Triton, PyTorch hardware abstraction) address the code and library dependency. They do nothing for the networking layer. A team that completes a full ROCm migration still faces NCCL dependency the moment they run distributed training across a mixed-vendor cluster.
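To make the networking-layer dependency concrete, here is a minimal sketch of where NCCL enters a typical PyTorch distributed training loop. The function names are illustrative, and the snippet assumes launch via torchrun, which supplies the rank and world-size environment variables.

```python
# Minimal sketch: the backend string passed to init_process_group is where the
# vendor collective library is baked in. On Nvidia this resolves to NCCL;
# ROCm builds of PyTorch expose RCCL under the same "nccl" backend name.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> torch.device:
    dist.init_process_group(backend="nccl")      # networking-layer dependency lives here
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)

def average_gradients(model: torch.nn.Module) -> None:
    # Every AllReduce below executes inside the collective library chosen at
    # init time; migrating model code to ROCm does not remove this coupling.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

The point of the sketch is that a full ROCm port leaves the all_reduce calls untouched; only the collective library underneath changes, and only within a single vendor's fabric.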

Software/framework lock-in (cuBLAS, cuDNN, cuSPARSE, TensorRT, nvcc, Nsight) is addressed by ROCm/HIP, Triton, PyTorch hardware abstraction layer, and HIPIFY. Networking/communications lock-in (NCCL, NVLink, InfiniBand) has only one current candidate: HetCCL, published January 2026, still at the research stage.

For the full technical treatment of the networking layer, see HetCCL and the networking-layer complement to software portability.

What are the actual switching costs when migrating a production AI workload off CUDA?

Switching costs are real, quantifiable, and highly workload-dependent. The numbers differ by an order of magnitude between an inference-only deployment on standard PyTorch and a distributed training stack with custom CUDA kernels and NCCL-dependent communication.

Engineering time: Inference-only workloads on standard PyTorch take weeks to months. Training-heavy workloads with custom CUDA kernels and NCCL dependency take 6–12 months. That upper end reflects the cumulative cost of kernel porting, library gap-filling, toolchain rebuild, performance regression testing, and developer ramp-up. It’s not a trivial project — it’s a meaningful engineering commitment.

Performance regression: Thunder Compute's April 2026 benchmarks put CUDA ahead of ROCm by 10–30% for most compute-intensive ML tasks. Custom kernels see the upper end of that range because AMD’s memory access patterns differ from Nvidia’s — CUDA kernels don’t achieve parity without explicit rewriting, even after automated HIP conversion.

Library replacement work: Low for inference-only stacks using standard ops — rocBLAS covers cuBLAS, and MIOpen covers most cuDNN primitives. High where TensorRT is in the pipeline. TensorRT has no direct ROCm equivalent, and SemiAnalysis benchmarks show H200 with TensorRT-LLM achieving up to 1.5x throughput advantage over AMD when using equivalent serving frameworks. There is no drop-in fix for that.

The HIPIFY shortcut: AMD’s HIPIFY tool automates CUDA-to-HIP translation, reducing required manual changes to under 5% of the code for many codebases. Useful — but it doesn’t solve library gaps or toolchain migration. HIPIFY handles mechanical translation. Everything else is separate work.

For a detailed look at how performance regression affects your planning, see how CUDA dependency affects your inference cost modelling.

How mature is ROCm in 2026 — and which workloads are actually production-ready on AMD?

ROCm has matured substantially. The 2–5x performance gap of earlier versions has narrowed to 10–30% for most ML tasks as of ROCm 7.x (Thunder Compute, April 2026). That’s a meaningful shift in the competitive picture.

Where ROCm is production-ready: Large dense model inference is AMD’s clearest competitive case. The MI300X carries 192GB HBM3 memory versus 80GB for H100 SXM — and that difference is decisive for memory-bound large-model serving. SemiAnalysis benchmarks show MI300X beating H100 in absolute performance and performance per dollar for Llama 3 405B and DeepSeek v3 670B inference. Standard PyTorch transformer training with standard ops is also viable; rocBLAS is a solid cuBLAS replacement, and ROCm 7.2 added extensive GEMM tuning across FP8, BF16, and FP16.
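A rough illustration of why that memory difference is decisive: the weights-only footprint of a large dense model determines the minimum GPU count before any KV cache or activation overhead. The arithmetic below assumes FP8 weights (one byte per parameter) and ignores serving overhead, so the counts are illustrative rather than benchmark results.

```python
# Back-of-envelope sketch: minimum GPUs needed just to hold model weights.
# Assumes FP8 (1 byte/parameter); real deployments need extra headroom for
# KV cache, activations, and framework overhead.
import math

def min_gpus_for_weights(params_billion: float, bytes_per_param: float,
                         hbm_gb_per_gpu: float) -> int:
    weight_gb = params_billion * bytes_per_param   # 1e9 params * bytes = GB
    return math.ceil(weight_gb / hbm_gb_per_gpu)

# Llama 3 405B at FP8 is roughly 405 GB of weights.
print(min_gpus_for_weights(405, 1.0, 192))   # MI300X, 192 GB HBM3 -> 3 GPUs
print(min_gpus_for_weights(405, 1.0, 80))    # H100 SXM, 80 GB     -> 6 GPUs
```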

Where ROCm still trails: TensorRT has no direct ROCm equivalent — AMD’s inference stack is credible for large-model serving but won’t match TensorRT-LLM throughput on H200. Custom kernel development requires more engineering expertise on AMD, and profiling tooling is less mature. ROCm installation still requires kernel parameter modifications and manual dependency resolution — overhead a hyperscaler absorbs easily but a 100-person ML team definitely feels.

One important nuance on the numbers: AMD’s published 3.5x inference improvements are version-over-version, not versus CUDA. Use the Thunder Compute 10–30% CUDA-lead figure for your planning. Hardware pricing runs 15–40% cheaper than comparable Nvidia, though cloud rental prices have risen recently.

What can OpenCL, Triton, and PyTorch abstraction actually do — and where does CUDA still leak through?

Vendor-agnostic frameworks address the software-layer problem at different levels of abstraction. Knowing what each one covers — and where CUDA semantics still leak through — prevents you from over-estimating your portability.

OpenCL provided hardware-agnostic GPU programming but failed because Nvidia gave CUDA developers a far better experience — superior tooling, richer documentation, a compounding ecosystem. The lesson for 2026: abstraction alone is not enough. This is why Triton succeeds where OpenCL did not — it works within the PyTorch developer experience rather than asking developers to step outside it.

Triton is OpenAI’s Python-based compiler for GPU kernel development. Since PyTorch 2.0, it is the default backend for torch.compile() — which means Triton is already generating kernels on CUDA or ROCm for your standard PyTorch models without any action on your part. Custom kernels written in Triton’s DSL compile to both CUDA PTX and ROCm/HIP backends. What Triton does not cover: library-level dependencies (cuDNN, TensorRT, cuBLAS still require separate handling). The networking layer is entirely outside its scope.
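For teams writing custom kernels, the portability claim looks like this in practice. The kernel below is a minimal vector-add sketch in Triton's DSL; production kernels need autotuning and more careful block-size choices, but the same source compiles to CUDA PTX on Nvidia or to the ROCm backend on AMD.

```python
# Minimal Triton kernel sketch: one source, two backends.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)           # one program instance per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```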

PyTorch hardware abstraction layer: standard model code using only standard PyTorch ops can run on CUDA or ROCm by changing the device argument. CUDA leaks through in specific places: custom CUDA extensions (torch.cuda.* calls, .cu files), direct cuDNN calls, and TensorRT integration. If your codebase passes a clean audit of those — no .cu files, no torch.cuda.* calls, no TensorRT — your software-layer portability is largely already there, and the switching cost is performance regression testing, not code rewriting.
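What a "clean" codebase looks like in practice is worth spelling out. The sketch below uses only standard ops and takes the device as configuration; the function names are illustrative rather than a prescribed API.

```python
# Sketch of the portable pattern: the device string is the only
# hardware-specific element, and nothing below standard PyTorch ops is used.
import torch
import torch.nn as nn

def load_for_inference(model: nn.Module, device_str: str) -> nn.Module:
    device = torch.device(device_str)        # e.g. "cuda:0"
    return model.to(device).eval()

@torch.no_grad()
def predict(model: nn.Module, batch: torch.Tensor, device_str: str) -> torch.Tensor:
    return model(batch.to(device_str))

# By contrast, loading a .cu extension module, calling torch.cuda.* beyond
# device management, or routing inference through a TensorRT engine are the
# leaks the audit framework in the next section is designed to catch.
```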

How do you audit your organisation’s actual CUDA dependency level?

Before making migration or procurement decisions, you need to know which tier of dependency you are actually in. Most organisations haven’t done this explicitly. Here is a six-dimension framework that produces a concrete risk profile.

Run through each dimension and assign a risk tier: Low (weeks), Medium (months), High (6–12 month engineering project). A minimal sketch of a first-pass repository scan follows this framework.

Dimension 1: Custom kernel code. Search for .cu files and torch.cuda.* calls. Zero hits beyond device management is Low. A substantial custom kernel library is High.

Dimension 2: Library dependencies. cuBLAS → rocBLAS: Low, mature equivalent. cuDNN → MIOpen: Medium, most primitives covered but gaps on some custom transformer ops. TensorRT → nothing: High, no drop-in replacement on AMD.

Dimension 3: Framework usage patterns. Standard PyTorch ops with torch.compile already in use is Low. Extensive CUDA extension libraries with direct CUDA API calls is High.

Dimension 4: Distributed training and communication. This is a networking-layer dependency — separate from the rest, and it needs to be treated that way. Single-GPU inference only is Low. Multi-node distributed training with NCCL inter-node is High. And migrating the rest of your stack does not resolve this.

Dimension 5: Toolchain and CI/CD. Framework-standard Docker images with no nvcc invocations is Low. Nsight-integrated performance regression pipelines with extensive nvcc-compiled components is High.

Dimension 6: Team knowledge. ML team working primarily at the PyTorch level is Low. Multiple engineers whose primary value is CUDA-specific kernel expertise is High.

Scoring: 0–1 High-risk dimensions — migration feasible in weeks to months. Two High-risk dimensions — 3–6 month engineering project. Three or more — full 6–12 month dedicated effort requiring management commitment.

If Dimension 4 scores High, your networking-layer dependency is the constraint. That requires HetCCL reaching production, not just ROCm maturity — track it separately. If Dimension 2 scores High due to TensorRT, your inference pipeline needs specific re-engineering before AMD hardware is viable.
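To make the first pass concrete, here is a minimal sketch of a repository scan covering the file- and call-level signals behind Dimensions 1, 2, and 4. The signal patterns are illustrative starting points, not the framework's scoring rules; interpreting the counts against the Low/Medium/High tiers is still a manual step.

```python
# First-pass CUDA dependency scan (sketch). Counts raw signal occurrences;
# mapping counts to risk tiers is left to the reviewer.
from pathlib import Path
import re

SIGNALS = {
    "torch_cuda_calls": re.compile(r"torch\.cuda\.\w+"),           # Dimension 1
    "cuda_kernels":     re.compile(r"__global__"),                  # Dimension 1
    "tensorrt":         re.compile(r"\btensorrt\b", re.IGNORECASE), # Dimension 2
    "cudnn_direct":     re.compile(r"\bcudnn\w*", re.IGNORECASE),   # Dimension 2
    "nccl_backend":     re.compile(r"backend\s*=\s*['\"]nccl"),     # Dimension 4
}
SOURCE_SUFFIXES = {".py", ".cpp", ".cc", ".cu", ".cuh", ".h"}

def audit(repo_root: str) -> dict:
    hits = {name: 0 for name in SIGNALS}
    hits["cu_files"] = 0
    for path in Path(repo_root).rglob("*"):
        if path.suffix == ".cu":
            hits["cu_files"] += 1
        if path.suffix not in SOURCE_SUFFIXES:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in SIGNALS.items():
            hits[name] += len(pattern.findall(text))
    return hits

if __name__ == "__main__":
    print(audit("."))   # zero hits beyond device management suggests Low risk
```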

For integrating audit results into procurement decisions, see integrating a CUDA lock-in assessment into your procurement decisions.

How do you use HetCCL adoption as a leading indicator of when networking-layer switching costs drop?

HetCCL is a vendor-agnostic collective communications library published on arXiv in January 2026. It is the first system to enable a single collective operation — AllReduce, Broadcast, AllGather — to execute simultaneously across Nvidia and AMD GPUs via RDMA, without requiring code modifications to existing deep learning frameworks.

Prior work (MSCCL++, TorchComms) required compile-time selection of a single backend — you compiled for NCCL or RCCL, not both. HetCCL acts as an orchestration layer invoking vendor-native NCCL and RCCL for intra-vendor communication while handling cross-vendor coordination separately. Native optimisations are preserved.

Why this is the right leading indicator: teams that have addressed the software layer but run distributed training at scale face one remaining structural barrier — NCCL inter-node communication. When that barrier drops via production-grade HetCCL, the last lock-in becomes a procurement decision rather than a technical constraint. That’s the shift worth watching for.

Current status (April 2026): research and early-adoption stage. The arXiv paper verified training convergence with LLaMA-1B on heterogeneous Nvidia + AMD systems. Tom’s Hardware covered it in February 2026. It is not yet in widespread enterprise production use. Signals to watch: a stable GitHub release, integration into PyTorch Distributed or DeepSpeed, and hyperscaler deployment announcements.

If your audit shows Dimension 4 as your only High-risk dimension, HetCCL production adoption is the specific event to wait for. If your software layer still has multiple High-risk dimensions, HetCCL’s status is premature to track — sort out the software layer first.

For the full technical treatment of HetCCL, see HetCCL and the networking-layer complement to software portability.

What are the three practical paths to portability — and how do you choose between them?

No single portability strategy is optimal for all stacks. The right path depends on your audit results, workload type, and engineering capacity. And these paths are not mutually exclusive — combinations are common and often the right answer.

Path 1: Gradual ROCm Migration. Best for low-to-medium audit scores; inference-dominated workloads; standard PyTorch without custom kernels. Run MI300X hardware in parallel for inference, benchmark against your production baseline, resolve library gaps, then expand to training using HIPIFY for automated translation. Expected timeline: 3–9 months with a clean audit profile. AMD hardware costs 15–40% less than comparable Nvidia — factor migration engineering cost against that saving.

Path 2: Triton / PyTorch Abstraction Layer Adoption. Best for teams building new workloads or willing to refactor custom kernels. Adopt torch.compile as the standard compilation path; rewrite custom kernels in Triton starting with the most-used ones (a minimal sketch of the torch.compile step appears at the end of this section). This path reduces future switching costs regardless of whether you retain Nvidia hardware — it’s a portability investment that pays off when hardware decisions become necessary. Expected timeline: weeks for new projects; months for kernel refactoring in existing codebases.

Path 3: HetCCL + AMD Supplemental Cluster. Best for teams where the primary remaining lock-in after software-layer work is NCCL-dependent distributed training. Deploy MI300X for inference workloads now and monitor HetCCL production signals before expanding to training. Path 3 defers rather than eliminates the software migration cost for training — it buys time and immediate inference savings, but it is not a substitute for the software-layer work.

How to choose: 0–1 High dimensions: Path 1. 1–2 High dimensions driven by custom kernels: Path 2. Dimension 4 only High (NCCL): Path 3. Three or more High dimensions: Path 2 now, revisit Paths 1 and 3 after software-layer work is complete. Greenfield new workload: Path 2 by default.

Paths 2 + 3 is a common and coherent combination: adopt Triton/PyTorch abstraction for portability while adding MI300X hardware for large model inference now. Immediate hardware savings, positioned for full mixed-vendor capability when HetCCL matures.
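As a concrete illustration of Path 2's first step, here is a minimal sketch of making torch.compile the standard compilation path. The model, optimiser settings, and loss are placeholders; the relevant line is the single torch.compile call.

```python
# Path 2 sketch: route training through torch.compile so Triton generates the
# kernels, whichever vendor backend sits underneath.
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=512, nhead=8).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
compiled_model = torch.compile(model)         # Triton-backed since PyTorch 2.0

def training_step(batch: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    out = compiled_model(batch)
    loss = out.float().pow(2).mean()          # stand-in loss for illustration
    loss.backward()
    optimizer.step()
    return loss.item()
```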

For integrating your portability strategy into procurement decisions, see integrating a CUDA lock-in assessment into your procurement decisions. And for the full strategic picture, see Nvidia’s AI hardware empire.

Frequently Asked Questions

Is CUDA lock-in still a real problem in 2026?

Yes — but it’s increasingly quantifiable and workload-dependent. Software-layer tools (ROCm, Triton, PyTorch abstraction) are maturing faster than networking-layer alternatives (HetCCL is pre-production). What has changed is that you can now measure your exposure precisely rather than treating it as an undifferentiated barrier.

What is the difference between CUDA lock-in and vendor lock-in?

CUDA lock-in refers to dependency on Nvidia’s proprietary software ecosystem — libraries, toolchain, developer ecosystem. Vendor lock-in is broader and includes hardware procurement dependencies. CUDA lock-in is the deeper structural risk because it persists even if you have pricing leverage — you still cannot run the software on alternative hardware without a migration project.

Can I run my existing PyTorch models on AMD GPUs without rewriting anything?

If your models use only standard PyTorch ops — no custom CUDA extensions, no torch.cuda.* API calls, no TensorRT dependency — you can switch to AMD ROCm by changing the device argument with minimal rework. If your stack includes custom .cu files or TensorRT integration, you will need explicit porting work first.

How hard is it to actually switch from CUDA to AMD in 2026?

For inference-only workloads on standard PyTorch: weeks to months. For distributed training workloads with custom CUDA kernels and NCCL dependency: 6–12 months of dedicated engineering effort. Expect a 10–30% steady-state performance gap in Nvidia’s favour after stabilisation, with regressions toward the upper end of that range during the transition, particularly for custom kernels (Thunder Compute, April 2026).

What is HetCCL and why does it matter for enterprise GPU strategy?

HetCCL (arXiv, January 2026) is the first system enabling a single collective operation to execute across Nvidia and AMD GPUs simultaneously via RDMA. It matters because NCCL dependency is the last networking-layer barrier for teams that have already addressed software portability. HetCCL’s production adoption is the threshold event that makes mixed-vendor distributed training straightforward rather than a research project.

Why did OpenCL fail to replace CUDA, and what does that tell us?

OpenCL failed despite genuine hardware agnosticism because Nvidia gave CUDA developers a better experience — superior tooling, richer documentation, faster ecosystem growth. The lesson for 2026: abstraction alone is insufficient. This is why Triton succeeds where OpenCL did not — it works within the PyTorch developer experience.

What is the performance gap between CUDA and ROCm today?

CUDA outperforms ROCm by 10–30% for most compute-intensive ML workloads as of April 2026 (Thunder Compute). The exception is memory-bound large model inference: AMD Instinct MI300X and MI325X beat H100 for large dense model serving due to substantially higher memory capacity and bandwidth.

What does Triton actually abstract, and what does it leave exposed?

Triton abstracts custom GPU kernel code — kernels written in Triton’s Python DSL compile to both CUDA PTX and ROCm backends. Triton does not abstract library-level dependencies: cuDNN, TensorRT, and cuBLAS still require separate handling. It is a tool for new kernel development portability, not a migration tool for existing CUDA library dependencies.

How much does GPU vendor lock-in actually cost in engineering time?

A full production migration from CUDA to ROCm costs 6–12 months for a training-heavy workload with custom CUDA kernels. For inference-only workloads with standard PyTorch, the cost drops to weeks to months. Developer productivity overhead during the ramp period adds further cost due to ROCm’s thinner ecosystem compared to nearly two decades of CUDA community resources.

Should I build my new AI product on CUDA or try to be hardware-agnostic?

Default to standard PyTorch ops with torch.compile as the compilation path. This gives you CUDA performance now and Triton-backed portability for future hardware flexibility. Avoid writing custom CUDA kernels unless performance requirements specifically demand it. If custom kernels are necessary, use Triton rather than CUDA C++.

Is AMD’s MI300X a credible alternative to Nvidia H100 for inference workloads?

For large dense model serving: yes — MI300X outperforms H100 in absolute performance and performance per dollar. Its 192GB HBM3 memory capacity and high memory bandwidth are decisive where memory is the bottleneck. For compute-bound training workloads, H100 retains a meaningful lead. The 15–40% hardware cost advantage makes MI300X viable for inference-dominated deployments once software-layer compatibility is confirmed.

What specific CUDA libraries are hardest to replace when migrating to ROCm?

From hardest to easiest: TensorRT (no direct ROCm equivalent — hardest gap), cuDNN (partially replaced by MIOpen — most ops covered, gaps on custom transformer operations), cuSPARSE (partially replaced by rocSPARSE — workload-specific validation required), cuBLAS (cleanly replaced by rocBLAS/hipBLASLt — mature equivalent with broad GEMM coverage as of ROCm 7.2).
