Business | SaaS | Technology
Apr 27, 2026

DGX SuperPOD vs CloudMatrix 384 — Inside the Global AI Infrastructure Arms Race

AUTHOR

James A. Wondrasek

The global AI infrastructure market has split into two camps: an NVIDIA-dominated Western supply chain on one side, and a Huawei Ascend-based ecosystem operating almost entirely within China on the other. US export controls drew that line — and now it shapes every enterprise AI procurement decision with any international exposure.

Huawei’s CloudMatrix 384 has been getting attention for its headline claim of 1.7× the compute of NVIDIA’s GB200 NVL72. That number needs unpacking. It comes from deploying 384 chips against 72. It tells you nothing about per-chip performance, software ecosystem maturity, or the fact that Huawei’s total addressable supply is a fraction of NVIDIA’s output.

So here’s an honest look at what each system actually delivers, where the data gets contested, and what this bifurcated market means if you’re making a multi-year capex decision.

For the broader context — NVIDIA’s vertical integration strategy, the CUDA ecosystem moat, and the competitive dynamics that produced this split — see our comprehensive analysis of Nvidia’s hardware empire.

What is the NVIDIA DGX SuperPOD — and how does its Scalable Unit architecture work?

DGX SuperPOD is NVIDIA’s rack-scale AI supercomputing platform. It combines DGX compute nodes, InfiniBand networking, storage arrays, and management software into a single integrated system. You don’t deal with infrastructure integration — NVIDIA has already done that for you.

The building block is the Scalable Unit (SU), each containing 32 DGX systems. Racks within an SU draw more than 40 kW each. The system scales from 4 SUs (1,024 GPUs) to 64+ SUs (16,000+ GPUs), so you can start at a defined scale and expand without redesigning the cluster. That matters when you can’t reliably forecast your AI compute requirements two or three years out.
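To make the scaling arithmetic concrete, here is a minimal sketch. The article quotes SU counts and GPU totals; the 8-GPUs-per-node figure is an assumption consistent with those totals and with standard DGX H100/H200 configurations.

```python
# Minimal sketch of the Scalable Unit arithmetic.
# Assumes 8 GPUs per DGX node (standard for DGX H100/H200 systems).
DGX_PER_SU = 32
GPUS_PER_DGX = 8
GPUS_PER_SU = DGX_PER_SU * GPUS_PER_DGX  # 256 GPUs per Scalable Unit

for sus in (4, 16, 64):
    print(f"{sus:>2} SUs -> {sus * GPUS_PER_SU:,} GPUs")
# 4 SUs -> 1,024 GPUs; 64 SUs -> 16,384 GPUs
```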

Current deployments are built around H100 and H200 GPUs (Hopper generation), with Blackwell-based GB200 NVL72 systems shipping now and Vera Rubin NVL72 as the next upgrade tier. The full software stack — NVIDIA Base Command, CUDA, Magnum IO, and AI Enterprise — creates real switching costs for teams that have standardised on CUDA.

And those switching costs are worth being honest about. CUDA has over 15 years of ecosystem development, 4 million developers across 40,000 companies, and deep integration with PyTorch and TensorFlow. Switching is possible. But it means retraining engineers, rewriting optimised kernels, revalidating performance pipelines, and absorbing a lot of operational uncertainty. The cost isn’t just technical — it’s organisational.

For more on how this lock-in sustains NVIDIA’s competitive position, see our analysis of DGX SuperPOD as the product expression of Nvidia’s vertical stack.

What does 28.8 exaflops actually mean for LLM training and inference workloads?

The 28.8 exaflops figure is aggregate FP8 compute throughput across eight Vera Rubin NVL72 systems — 576 GPUs total. It’s a theoretical peak. Not a sustained real-world throughput number. That distinction matters for workload planning.

A full DGX SuperPOD with 14 NVL72 systems (1,008 GPUs) delivers 50.4 exaflops of FP4 performance, with 1,046 TB of fast memory and 260 TB/s of NVLink throughput. Enough bandwidth to eliminate the model partitioning overhead that affects multi-node inference on previous generations. The full memory and compute space operates as a single coherent engine.
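As a rough sanity check, the per-accelerator arithmetic implied by those aggregate figures looks like this. These are vendor peak numbers, not sustained throughput.

```python
# Rough per-GPU arithmetic implied by the aggregate SuperPOD figures above.
# Vendor peak numbers, not sustained real-world throughput.
GPUS = 1008            # 14 NVL72 systems
FP4_EXAFLOPS = 50.4    # aggregate peak FP4 compute
FAST_MEMORY_TB = 1046  # aggregate fast memory

print(f"Peak FP4 per GPU:    {FP4_EXAFLOPS * 1000 / GPUS:.0f} PFLOPS")  # ~50 PFLOPS
print(f"Fast memory per GPU: {FAST_MEMORY_TB / GPUS:.2f} TB")           # ~1.04 TB
```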

NVIDIA’s stated inference throughput improvement is approximately 10× per watt compared to Blackwell. For inference-heavy production deployments — which is most enterprise AI workloads — that translates directly into lower cost-per-token. The usual caveats apply: peak figures are theoretical maximums, and real-world efficiency depends on model architecture, batch size, and interconnect latency. Get workload-specific benchmark data before you commit.
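To see how a throughput-per-watt gain flows into cost per token, here is a minimal sketch of the energy component only. The power draw, throughput, and electricity price below are illustrative assumptions, not vendor figures, and energy is only one part of total cost per token.

```python
# Illustrative sketch: the energy component of cost-per-token.
# All inputs are assumptions, not vendor or benchmark figures.

def energy_cost_per_million_tokens(tokens_per_sec: float,
                                   power_watts: float,
                                   usd_per_kwh: float) -> float:
    """Electricity cost (USD) to serve one million tokens."""
    seconds = 1_000_000 / tokens_per_sec
    kwh = power_watts * seconds / 3_600_000
    return kwh * usd_per_kwh

# Hypothetical rack serving 50k tokens/s at 120 kW, $0.10/kWh
baseline = energy_cost_per_million_tokens(50_000, 120_000, 0.10)
# The same power envelope with a 10x tokens-per-watt improvement
improved = energy_cost_per_million_tokens(500_000, 120_000, 0.10)

print(f"baseline: ${baseline:.3f}/M tokens, improved: ${improved:.3f}/M tokens")
# The energy cost per token falls by the same 10x, all else held constant.
```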

For what these throughput improvements mean for cost per token, see our analysis of Vera Rubin’s performance numbers.

How did Huawei build CloudMatrix 384 under sanctions — and what can it actually do?

CloudMatrix 384 houses 384 Ascend 910C chips on Huawei’s proprietary high-speed fabric. The “384” is the chip count, not a model number — and understanding the system means understanding why Huawei needs that many chips to hit its headline claims.

The Ascend 910C is a dual-die package with a silicon footprint roughly 60% larger than NVIDIA’s H100. That means lower performance per square millimetre and per watt. It’s fabricated at SMIC on a nominally 7nm process without EUV lithography, with restricted HBM access. The performance ceiling is structural, not a fixable engineering problem.

Huawei’s response: build scale rather than chase single-chip performance. Bind more chips with high-speed interconnects. That’s the 1.7× headline — 384 chips vs. 72. The trade-off: CloudMatrix 384’s performance per watt is roughly 2.5× worse than top NVIDIA Blackwell servers. Independent benchmarks from DeepSeek researchers put the Ascend 910C at roughly 60% of H100 inference performance per chip.
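A back-of-envelope comparison shows how chip count can mask the per-chip deficit. The Ascend ratio below is the ~60% figure cited above; the Blackwell-to-H100 ratio is an assumption for illustration only, not a benchmarked number.

```python
# Back-of-envelope system-vs-chip comparison.
ASCEND_910C_VS_H100 = 0.60   # per-chip inference, per DeepSeek benchmarks
BLACKWELL_VS_H100 = 2.0      # illustrative assumption, not a benchmarked figure

cloudmatrix = 384 * ASCEND_910C_VS_H100   # ~230 H100-equivalents
nvl72 = 72 * BLACKWELL_VS_H100            # ~144 H100-equivalents

print(f"system-level ratio: {cloudmatrix / nvl72:.1f}x")  # ~1.6x
# A system-level lead of roughly this order, achieved with more than 5x the
# chips and correspondingly worse performance per watt.
```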

On production volumes: five analyst estimates put Huawei’s 2025 AI chip output at 40,000 to 146,000 B300-equivalents. NVIDIA is projected to ship 3.67 million the same year.

The software position is the same story. CANN is Huawei’s CUDA equivalent; MindSpore is its AI framework. Both work inside China. Neither has meaningful adoption outside it — MindSpore returns 684 results on Gitee versus 10,000+ for PyTorch and TensorFlow each, even within China.

One more layer worth knowing: Huawei’s Kunpeng CPU (ARM-based) extends the alternative stack to the CPU tier. Full infrastructure independence from Western vendors requires controlling compute end-to-end — the Kunpeng is Huawei’s attempt to close that loop.

How does Ascend 910C actually compare to H100 and H200 in real AI workloads?

The clearest independent data comes from DeepSeek researchers: the Ascend 910C delivers approximately 60% of NVIDIA H100 inference performance in practice. On paper it sits at roughly 80% of H100’s Total Processing Performance. The 20% paper gap becomes a 40% real-world gap. That difference matters when you’re choosing infrastructure.

The sources of the gap are structural.

Process node ceiling: SMIC’s 7nm process produces lower transistor density and memory bandwidth than TSMC’s 4–5nm nodes. The Ascend 910C’s silicon footprint is 60% larger than the H100 — that’s lower manufacturing efficiency, not greater capability.

HBM restrictions: Memory bandwidth, not raw compute, is frequently the binding constraint in LLM inference. Huawei’s restricted access to HBM directly caps what the chip can deliver.

Software immaturity: CANN is less optimised than CUDA across diverse model architectures. The absence of native FP8 or FP4 support — while H200 has FP8 and Blackwell has both — widens the effective gap further.

Benchmark coverage gap: MLPerf extensively covers NVIDIA hardware. Independent Ascend data is minimal, so enterprises have limited third-party validation to work with.

Huawei’s roadmap shows real progress. The Ascend 960 (Q4 2027) is expected to match H200 performance — roughly matching NVIDIA’s current chips two years after they shipped. Broad deployment is unlikely before 2028.

How have US export controls split the global AI infrastructure market into two tiers?

US AI chip export controls started in October 2022, banning exports to China of chips at or above A100 capability. Controls have expanded since. Biden added H100, H200, and Blackwell to restricted tiers. The Trump administration briefly floated H200 access with a 25% export fee in January 2026, with safeguard provisions that analysts at CNAS called “almost entirely unenforceable.” A China-specific Blackwell variant (B30A) was declined. The practical upshot: China gets the H20 (approximately 1/6th H200 performance) and domestic Ascend chips.

The result is two distinct infrastructure tiers.

Western tier: Full access to NVIDIA’s product line via DGX SuperPOD or hyperscaler cloud (AWS, Azure, GCP), plus a mature, internationally competitive AI software ecosystem.

Chinese domestic tier: Limited to H20 and Ascend. GPU cloud consolidating around Baidu and Huawei. Competing on integration depth and domestic supply security rather than raw compute.

Here’s the implication that pure performance comparisons miss. If you have Chinese subsidiaries or partners, your counterparts there cannot access NVIDIA frontier compute — and you cannot legally access Huawei infrastructure. Training pipelines, inference infrastructure, and cloud relationships may need to be architected separately for China vs. non-China operations. That carries real cost.

For a framework on how geopolitical AI infrastructure risk should factor into your procurement decisions, see our GPU procurement strategy guide.

What does the Blackwell-to-Vera Rubin upgrade cycle mean for enterprise capex planning?

Blackwell (GB200 NVL72) is the current shipping product. Vera Rubin NVL72 is the next generation. NVIDIA’s stated figure: up to 10× reduction in inference token cost compared with Blackwell.

A DGX SuperPOD purchase is a 3–5 year infrastructure commitment. AWS has already shortened server useful life estimates from six years to five as AI hardware cycles accelerate. Buying Blackwell now means locking into a generation that will be superseded within the depreciation cycle — and that’s material if you’re benchmarking on per-token cost.
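A minimal amortisation sketch shows why the depreciation window, and the generation you lock into, both move per-token economics. Every number below is a hypothetical placeholder, not a quoted price, throughput, or depreciation policy.

```python
# Minimal amortisation sketch. All inputs are hypothetical placeholders.

def capex_per_million_tokens(capex_usd: float,
                             life_years: float,
                             tokens_per_sec: float,
                             utilisation: float = 0.6) -> float:
    """Hardware capex (USD) attributable to each million tokens served."""
    active_seconds = life_years * 365 * 24 * 3600 * utilisation
    total_tokens = tokens_per_sec * active_seconds
    return capex_usd / total_tokens * 1_000_000

# A hypothetical $60M cluster serving 500k tokens/s, 6-year vs 5-year life
print(f"6-year life: ${capex_per_million_tokens(60e6, 6, 500_000):.2f}/M tokens")
print(f"5-year life: ${capex_per_million_tokens(60e6, 5, 500_000):.2f}/M tokens")
# A shorter useful life raises the capex charge per token; a generation with
# higher throughput per dollar pushes it back down.
```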

Here’s how the decision breaks down by workload.

Inference-heavy workloads (most production AI deployments): Vera Rubin’s 10× throughput-per-watt improvement delivers meaningfully lower cost-per-token. If hyperscaler GPU rental covers your near-term needs, deferring purchase may be worth it.

Training-heavy workloads or immediate capacity requirements: Blackwell now. Waiting for Vera Rubin carries lead-time and availability risk.

Not yet committed to on-premises infrastructure: Cloud access via AWS, Azure, or GCP gives you a capital-light path to both generations. Hyperscalers absorb the generational capex risk.

One invisible cost worth flagging: CUDA lock-in. The Blackwell-to-Vera Rubin upgrade is seamless for CUDA-standardised teams — Base Command, AI Enterprise, and NIM carry forward. Switching to AMD Instinct during a hardware refresh adds migration cost on top of hardware transition.

For per-token cost modelling across Blackwell, Vera Rubin, and alternatives, see our inference economics analysis.

Where is the performance gap likely to narrow — and where will it persist?

The Council on Foreign Relations projects the performance gap between leading US and Chinese AI chips will grow from approximately 5× today to approximately 17× by the second half of 2027 — driven by NVIDIA’s Vera Rubin ramp and Huawei’s SMIC fabrication ceiling. That’s not a snapshot. It’s a trend line.

The volume gap reinforces it. US manufacturers are projected to produce 3.67 million B300-equivalent chips in 2025, growing to 6.89 million in 2026. Huawei’s 2025 output sits at 40,000 to 146,000 B300-equivalents — 1 to 4% of US production. The CFR describes the gap as “effectively impossible to close.” Yield data makes it worse: Ascend chip yields at 5–20% versus NVIDIA Blackwell’s 60–80%, giving the US approximately 170–180× the effective logic wafer manufacturing capacity.
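The percentages behind that volume gap are straightforward to reproduce from the figures above:

```python
# Reproducing the volume-share figures cited above.
huawei_2025 = (40_000, 146_000)   # analyst estimates, B300-equivalents
us_2025 = 3_670_000               # projected US output, B300-equivalents

low, high = (h / us_2025 * 100 for h in huawei_2025)
print(f"Huawei share of US-scale production: {low:.1f}% to {high:.1f}%")
# ~1.1% to 4.0%
```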

Where Huawei shows genuine progress: interconnect engineering, CANN software optimisation, and the Ascend 960 roadmap (Q4 2027 target for H200-class performance). That’s real. But broad deployment of the 960 is unlikely before 2028, by which time NVIDIA’s next generations will have pushed the frontier further out.

Where the gap persists: per-chip compute density, HBM access restrictions, production volume, and CUDA’s developer network effects. All structural. None closing in a normal enterprise planning horizon.

The strategic question isn’t which vendor wins. It’s how you manage your exposure in a market that will remain structurally split.

On non-NVIDIA Western alternatives: AMD Instinct (ROCm ecosystem) is the most mature option, with PyTorch integration and HIP for CUDA code porting — though developer tooling remains less mature than CUDA. Groq’s LPUs deliver approximately 2× faster inference in independent tests for latency-sensitive workloads. Google TPUs offer a 30–50% cost advantage for inference in specific use cases. None are a like-for-like DGX SuperPOD replacement for general-purpose enterprise AI infrastructure.

For a complete procurement decision framework incorporating geopolitical risk, see our GPU procurement strategy guide and Nvidia’s full competitive position.

Frequently Asked Questions

What is the Huawei CloudMatrix 384 and why does it use so many chips?

CloudMatrix 384 is Huawei’s rack-scale AI computing system housing 384 Ascend 910C chips. The high chip count is Huawei’s “quantity-over-quality” response to the Ascend 910C delivering approximately 60% of NVIDIA H100 inference performance — a per-chip deficit imposed by SMIC’s 7nm fabrication ceiling and HBM constraints.

How does the Ascend 910C compare to the NVIDIA H100 in practice?

DeepSeek benchmark data shows approximately 60% of H100 inference performance in practice, against paper specs at roughly 80% of H100 TPP. The 20% paper gap becomes a 40% real-world gap — driven by SMIC’s fabrication process, HBM constraints, and CANN immaturity.

Can Western enterprises buy Huawei CloudMatrix 384 or Ascend 910C chips?

No. US export controls prevent it. If you have Chinese subsidiaries or partners, they can’t access NVIDIA’s most capable chips either — an operational asymmetry that requires separate infrastructure planning on both sides.

What is the difference between chip-level and system-level AI performance benchmarks?

Chip-level benchmarks measure single-accelerator performance; system-level measures aggregate cluster throughput. CloudMatrix 384’s 1.7× headline over GB200 NVL72 is system-level — 384 chips vs. 72. It does not imply per-chip parity.

What did the US export control changes in late 2025 mean for enterprises?

The Trump administration briefly floated H200 exports to China with a 25% fee in January 2026, with safeguard provisions analysts called “almost entirely unenforceable.” A China-specific Blackwell variant (B30A) was declined. China remains limited to H20 and Ascend chips. The full NVIDIA product line remains available to Western enterprises globally, except China-destined deployments.

Should enterprises buy Blackwell now or wait for Vera Rubin?

Inference-heavy workloads: consider waiting — Vera Rubin’s 10× inference token cost reduction is material. Training-heavy or urgent capacity needs: Blackwell now. Cloud access via AWS, Azure, or GCP gives a capital-light path to both without a multi-year hardware commitment.

What is the software maturity gap between NVIDIA CUDA and Huawei CANN?

CUDA has 15+ years of development, 4 million developers, 40,000 integrated companies, and deep PyTorch and TensorFlow integration. CANN works within China but has negligible adoption outside it — MindSpore returns 684 results on Gitee vs. 10,000+ for PyTorch and TensorFlow each. Limited third-party support and minimal MLPerf coverage create real operational risk.

What are the realistic AI compute alternatives to NVIDIA for Western enterprises?

AMD Instinct (ROCm) is the most mature non-NVIDIA option, with PyTorch integration and HIP for CUDA porting — though developer tooling lags CUDA. Google TPU serves specific large-scale training use cases. Groq is optimised for low-latency inference. None are a like-for-like DGX SuperPOD replacement today.

How does the US-China AI infrastructure split affect enterprise data residency?

Chinese entities can’t access NVIDIA frontier compute; Western entities can’t legally access Huawei Ascend infrastructure. International enterprises may need to architect training pipelines, inference infrastructure, and cloud relationships separately for China vs. non-China operations.

What is Huawei’s AI chip production volume compared to NVIDIA?

Five analyst estimates place Huawei’s 2025 output at 40,000–146,000 B300-equivalents — 1–4% of US production. NVIDIA is projected to ship 3.67 million in 2025, growing to 6.89 million in 2026. Yield differentials compound the gap: Ascend chips at 5–20% yield vs. NVIDIA Blackwell at 60–80%.

Will Huawei’s Ascend 960 close the gap with NVIDIA?

The Ascend 960 (Q4 2027) is expected to match H200 performance — roughly matching NVIDIA’s current-generation chips two years after they shipped. Broad deployment is unlikely before 2028, when NVIDIA’s Vera Rubin and subsequent generations will have further extended the frontier.

What does the AI infrastructure bifurcation mean for enterprise vendor risk management?

Assess three things: whether key customers, partners, or subsidiaries operate inside China on a different infrastructure tier; whether your AI deployment has data residency requirements that interact with the China/non-China split; and how CUDA standardisation affects your optionality as NVIDIA’s pricing power grows. The question isn’t which vendor wins — it’s how you manage your exposure in a market that will remain split.

