A model with 80 billion parameters can cost less to run per token than a model with 10 billion. That is not a typo. It is a direct consequence of Mixture-of-Experts (MoE) architecture, and it is exactly why MoE matters for your AI infrastructure decisions.
The frontier of open-weight AI — DeepSeek V3, Qwen3-Coder-Next, Llama 4 Maverick — is built almost entirely on MoE. The core insight is this: inference cost is determined by active parameters per token, not total parameter count. That one distinction will change every cost estimate you make. For broader context on the open-weight landscape, see our overview of open-source AI models and enterprise strategy.
Let’s get into it.
What is Mixture-of-Experts and why has it become the dominant architecture for frontier AI models?
A MoE model divides its parameters into multiple specialised subnetworks called experts. Each expert is a feedforward block within the full model — not a separate standalone model. A typical frontier MoE contains dozens to hundreds of these experts.
When a token arrives at inference time, a lightweight component called the router — or gating network — picks a small subset of experts to process it. Standard configurations activate two to eight experts per token; the rest sit idle. This is sparse activation: only a fraction of the total parameters fire for any given token.
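To make the routing step concrete, here is a minimal sketch of top-k gating in NumPy. The function names and shapes are illustrative, not any production implementation; real MoE layers add load-balancing terms, expert-capacity limits, and fused kernels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, gate_w, experts, k=2):
    """Route one token through the top-k of len(experts) feedforward blocks.

    token:   (d,) hidden vector for this token
    gate_w:  (d, n_experts) router weights, one logit per expert
    experts: list of callables, each one expert FFN
    """
    logits = token @ gate_w                 # router scores, one per expert
    top_k = np.argsort(logits)[-k:]         # pick the k highest-scoring experts
    weights = softmax(logits[top_k])        # renormalise over the chosen k
    # Only the k selected experts execute; the rest contribute zero FLOPs.
    return sum(w * experts[i](token) for w, i in zip(weights, top_k))
```

Only `k` of the expert blocks run per token, which is exactly why compute scales with active rather than total parameters.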
Experts become specialised through training — not through hand-coding. This emergent specialisation is why a MoE model can match or exceed a dense model’s quality at a fraction of the compute cost. DeepSeek reported a final training-run cost of approximately $5.6 million for V3; comparable dense Western models reportedly ran to $80–100 million. Much of that gap is a direct consequence of MoE’s sparse activation.
The leaderboard evidence is unambiguous. DeepSeek V3, Qwen3-Coder-Next, Llama 4 Maverick, gpt-oss-120B — all MoE architectures. See how Chinese open-weight AI labs overtook US proprietary models for the competitive context behind why MoE became the dominant architecture — part of the broader open-weight AI landscape shift reshaping enterprise AI strategy.
What is the difference between active parameters and total parameters, and why does it change what you pay to run a model?
Total parameters is the full count of weights stored in a model — the number that dominates benchmark headlines. Active parameters is the subset that actually fires to process a single token. Active parameters determine compute cost and what you pay.
In a dense transformer, total and active parameters are effectively the same. Every weight participates in every token. In a MoE model they diverge dramatically, and that is the whole point.
Qwen3-Coder-Next: 80 billion total parameters, approximately 3 billion active per token. Running a token through this model costs roughly the same compute as a 3B dense model — around 26 times fewer FLOPs per token than a dense 80B equivalent.
DeepSeek V3: 671 billion total parameters, approximately 37 billion active per token. GPT-4-class performance at a fraction of what a dense model at that scale would cost.
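A quick back-of-envelope makes both ratios concrete, using the common approximation of roughly two FLOPs per active parameter per token (the constant is rough; the ratios are what matter):

```python
# Rough per-token compute: ~2 FLOPs per *active* parameter.
def flops_per_token(active_params):
    return 2 * active_params

qwen_moe   = flops_per_token(3e9)     # Qwen3-Coder-Next: ~3B active of 80B total
dense_80b  = flops_per_token(80e9)    # hypothetical dense 80B model
dsv3_moe   = flops_per_token(37e9)    # DeepSeek V3: ~37B active of 671B total
dense_671b = flops_per_token(671e9)   # hypothetical dense 671B model

print(dense_80b / qwen_moe)      # ~26.7: the "26 times fewer FLOPs" above
print(dense_671b / dsv3_moe)     # ~18.1: DeepSeek V3 vs a dense 671B
```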
The cost implication is direct. API pricing, self-hosted GPU utilisation, cloud inference cost — all track active parameters, not the headline figure. The “671B model” label tells you about storage and capability. It tells you nothing about what it costs to run.
This is the paradox that catches people out: an 80B MoE model can be cheaper per token than a 10B dense model. Always check both numbers — total parameters tell you what the model knows; active parameters tell you what it costs to ask.
Why does a MoE model require more memory than its active parameter count suggests?
Here is the memory paradox that catches infrastructure planners off guard. Qwen3-Coder-Next activates only ~3B parameters per token. But you cannot store just 3B parameters in VRAM. You must load all 80B so the router can select from every expert at inference time.
There are two distinct resource dimensions you need to track separately (a rough estimator in code follows the list):
- VRAM (GPU memory): scales with total parameters — all experts must be resident
- Compute (FLOPs per token): scales with active parameters — only selected experts run
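A minimal estimator showing how the two dimensions diverge. FP16 weights (2 bytes per parameter) and the two-FLOPs-per-active-parameter approximation are assumptions here; real deployments also budget VRAM for KV cache, activations, and framework overhead:

```python
def moe_resources(total_params, active_params, bytes_per_param=2):
    """Rough resource estimate: VRAM tracks total, compute tracks active."""
    vram_gb = total_params * bytes_per_param / 1e9    # all experts resident
    flops = 2 * active_params                         # only selected experts run
    return vram_gb, flops

# Qwen3-Coder-Next at FP16: 80B total, ~3B active
vram_gb, flops = moe_resources(80e9, 3e9)
print(f"~{vram_gb:.0f} GB VRAM, ~{flops:.1e} FLOPs per token")  # ~160 GB, ~6e9
```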
For DeepSeek V3 (671B total), no single GPU — or even a standard 8-GPU server — can hold the full model. It must be distributed via expert parallelism, where different experts physically reside on different GPUs. This introduces all-to-all communication overhead that dense deployments simply do not have. At scale, networking bandwidth becomes the performance ceiling, not raw compute. This is why high-bandwidth interconnects matter specifically for large MoE deployments.
Can you run a frontier MoE model on local hardware — and what does it actually require?
For some MoE models, yes. Quantisation reduces the numerical precision of stored weights, cutting memory requirements substantially at a modest accuracy cost. GGUF is the format used with llama.cpp — the most widely used local inference runtime for Apple Silicon, consumer GPUs, and prosumer workstations.
Concrete numbers for Qwen3-Coder-Next:
- 4-bit quantisation (INT4 GGUF): approximately 46 GB — fits on a Mac Studio with 64 GB unified memory or a workstation with two 24 GB GPUs
- 8-bit quantisation (INT8 GGUF): approximately 85 GB — requires a high-memory workstation or small multi-GPU rig, closer to full-precision output quality
DeepSeek V3 is a different story. At FP8, the weights alone come to approximately 671 GB. Production configurations require eight H200 141GB GPUs or equivalent distributed infrastructure. The practical rule of thumb: models up to ~80–100B total parameters are candidates for local deployment on well-specced developer hardware; anything approaching frontier scale (300B+) requires cloud or dedicated infrastructure.
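Those figures follow almost directly from bits per weight. A hedged estimator (weights only; real GGUF files run a few GB larger because some tensors stay at higher precision and each quantised block carries scale metadata):

```python
def quantised_weight_gb(total_params, bits_per_weight):
    """Approximate size of the quantised weights alone, in GB."""
    return total_params * bits_per_weight / 8 / 1e9

print(quantised_weight_gb(80e9, 4))    # 40.0 -> ~46 GB in practice (INT4 GGUF)
print(quantised_weight_gb(80e9, 8))    # 80.0 -> ~85 GB in practice (INT8 GGUF)
print(quantised_weight_gb(671e9, 8))   # 671.0 -> DeepSeek V3 weights at FP8
```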
For more on Qwen3-Coder-Next as a coding agent, see which open-weight model wins for coding agents.
How does teacher-student distillation explain why there are thousands of small deployable models derived from large MoE bases?
Teacher-student distillation is a training methodology, not an architecture. A large “teacher” model — often a frontier MoE like DeepSeek R1 or a Qwen3 variant — generates outputs across a wide range of tasks. A much smaller “student” model is then trained to approximate the teacher’s behaviour using those outputs as training data. The student is typically a compact dense model — 1.5B to 32B parameters — that inherits much of the teacher’s capability without inheriting its infrastructure requirements.
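For intuition, here is the textbook soft-label formulation of distillation in PyTorch. This is a sketch of the general technique, not DeepSeek's actual recipe (their R1-Distill models were trained by supervised fine-tuning on teacher-generated outputs rather than on logits):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Push the student's output distribution towards the teacher's,
    softened by a temperature so small probabilities still carry signal."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence, scaled by t^2 to keep gradient magnitudes stable
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

# Toy shapes: a batch of 4 tokens over a 32k-entry vocabulary
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)   # teacher runs without gradients
distillation_loss(student_logits, teacher_logits).backward()
```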
This is why HuggingFace hosts thousands of small open-weight models bearing the DeepSeek or Qwen lineage. DeepSeek’s R1-Distill variants (1.5B, 7B, 14B, 32B) are all dense models distilled from the MoE teacher. Alibaba reports more than 170,000 derivative models built on Qwen.
So if you’re asking “how do I get frontier-quality reasoning in a model I can actually deploy?” — the answer is not to self-host the 671B teacher. It is to use a 7B or 14B student trained on the teacher’s outputs. The quality trade-off is real but bounded: coding and structured output tasks distil well; nuanced open-ended reasoning less so.
MoE at the frontier, dense at the edge. That is the pattern across the whole ecosystem.
What is NVIDIA NVL72 and why does it matter specifically for MoE model deployments?
NVIDIA GB200 NVL72 is a rack-scale system connecting 72 Blackwell GPUs in a single NVLink domain with 130 TB/s of interconnect bandwidth — an order of magnitude more than a standard 8-GPU server.
Its relevance to MoE is architectural. For large MoE inference, the binding constraint is all-to-all communication traffic — not raw compute. Standard 8-GPU servers hit a communication ceiling under heavy MoE routing load. NVL72’s unified NVLink Switch fabric effectively removes that ceiling.
NVIDIA Dynamo, the orchestration software, implements disaggregated prefill/decode phases and intelligent token routing — optimisations that amplify throughput specifically for the variable expert activation patterns inherent to MoE inference.
On DeepSeek-R1 workloads, GB200 NVL72 delivers up to 28x the performance of AMD’s MI355X platform. AMD competes more closely on dense models; the gap widens on frontier MoE because of the communication architecture difference.
On pricing: CoreWeave lists GB200 NVL72 at $10.50/GPU-hour versus H200 at $6.31/GPU-hour. The price premium is outweighed by roughly a 20x performance delta on frontier MoE workloads. Performance per dollar works out to approximately 12x better on NVL72 for these workloads.
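The arithmetic behind that claim, so you can rerun it with your own quotes (the 20x delta is the workload assumption from above):

```python
nvl72_rate, h200_rate = 10.50, 6.31   # CoreWeave list prices, $/GPU-hour
perf_delta = 20                       # assumed NVL72-vs-H200 MoE throughput gain

price_premium = nvl72_rate / h200_rate          # ~1.66x more per GPU-hour
perf_per_dollar = perf_delta / price_premium    # ~12x better per dollar
print(f"{price_premium:.2f}x price, {perf_per_dollar:.1f}x perf-per-dollar")
```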
What is the delegation threshold and how does MoE architecture make it achievable?
The delegation threshold is the point at which an AI agent’s reliability, cost, and speed make it appropriate to run multi-hour tasks unsupervised — without a human in the loop.
MoE architecture is what makes this economically viable at frontier capability. Without MoE, running a frontier-capable agent for two to four hours continuously would be prohibitively expensive for most applications. With MoE, the per-token cost is low enough that extended multi-step task execution becomes a cost-justified operational mode. NVL72 amplifies this further by removing the communication bottleneck that would otherwise introduce latency under sustained load.
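A back-of-envelope shows the shape of the economics. Every number here is an illustrative assumption rather than a quote; substitute your own throughput and API pricing:

```python
# All figures below are illustrative assumptions, not real quotes.
tokens_per_hour = 50_000               # sustained per-agent throughput
agents, hours_per_day = 100, 8         # a fleet of unsupervised agents
moe_price, dense_price = 1.00, 15.00   # $/1M tokens: MoE vs dense frontier API

daily_tokens = agents * hours_per_day * tokens_per_hour
print(f"MoE fleet:   ${daily_tokens / 1e6 * moe_price:,.0f}/day")    # ~$40
print(f"Dense fleet: ${daily_tokens / 1e6 * dense_price:,.0f}/day")  # ~$600
```

At fleet scale, that per-token gap is the difference between a rounding error and a budget line.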
The full treatment — how to evaluate whether your workload has crossed this threshold — is in The Delegation Threshold: When Infrastructure Makes AI Agents Reliable Enough to Run Unsupervised.
FAQ
Is an 80B MoE model cheaper to run than an 80B dense model?
Yes, significantly. Qwen3-Coder-Next (80B total, ~3B active) costs roughly the same compute per token as a 3B dense model — around 26 times fewer FLOPs than a dense 80B equivalent. The trade-off is VRAM: the MoE model still needs all 80B parameters loaded in memory. Always check the active parameter count, not the total, when estimating running cost.
Can I run Qwen3-Coder-Next on a Mac?
Yes, with 64 GB or more of unified memory. The 4-bit GGUF version requires approximately 46 GB — within reach of a Mac Studio M2 Ultra or M4 Max. Use llama.cpp or Ollama with the GGUF model file from HuggingFace. Slower than a GPU server, but adequate for developer evaluation and agentic pipeline testing.
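A minimal evaluation sketch using the llama-cpp-python bindings; the model filename is a placeholder for whichever Qwen3-Coder-Next GGUF you download from HuggingFace:

```python
from llama_cpp import Llama   # pip install llama-cpp-python

# Placeholder filename: substitute the 4-bit GGUF you actually downloaded.
llm = Llama(model_path="qwen3-coder-next-Q4_K_M.gguf", n_ctx=8192)

out = llm("Write a Python function that parses ISO 8601 dates.", max_tokens=256)
print(out["choices"][0]["text"])
```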
What is sparse activation?
Only a small fraction of a MoE model’s parameters actually compute for any given token. In Qwen3-Coder-Next, approximately 3 billion of 80 billion parameters activate per token — roughly 3.75% of the total. You get a large model’s capability (trained across all experts); you pay the compute cost of only the small subset that activates.
Why do MoE models need so much GPU memory if they only use a fraction of their parameters?
All experts must be resident in VRAM so the router can dispatch tokens to any of them at inference time — you cannot pre-select which will be needed. VRAM scales with total parameters; FLOPs per token scale with active parameters. These are separate resource dimensions and need separate calculations.
What is expert collapse and should I worry about it in production?
Expert collapse is when the router learns to route most tokens to a small subset of experts, leaving the majority idle. In well-trained production models like DeepSeek V3 and Qwen3-Coder-Next, it is mitigated through load-balancing loss functions during training. For teams self-hosting, the practical concern is monitoring expert utilisation over time — managed APIs absorb this risk on your behalf.
Is self-hosting a MoE model like DeepSeek V3 practical for an SMB?
For DeepSeek V3 (671B total), no. At FP8, weights alone require approximately 671 GB of VRAM, demanding eight H200 141GB GPUs or equivalent. Your practical paths: (a) use the DeepSeek V3 API, priced on active parameters; or (b) use a distilled student model (7B–14B) for workloads that tolerate some quality reduction.
What is the difference between MoE and teacher-student distillation?
MoE is an architecture — the large model is what you run, with only a subset of experts activating per token. Teacher-student distillation is a training methodology — the large model generates training data that teaches a smaller model to approximate its outputs; the small model is what you deploy. They are complementary: frontier MoE models are the teachers; small deployable dense models on HuggingFace are the downstream product.
What inference framework should I use to deploy a MoE model?
For NVIDIA hardware, TensorRT-LLM delivers the highest throughput ceiling. vLLM supports MoE natively on both NVIDIA and AMD hardware — the practical starting point for most teams. For local deployment on Apple Silicon or consumer GPUs, llama.cpp with GGUF-format models is the standard. SGLang is worth evaluating for DeepSeek-family models at scale.
What is the difference between INT4 and FP8 quantisation for MoE models?
Both reduce VRAM by lowering the numerical precision of stored weights. For Qwen3-Coder-Next: INT4 GGUF requires approximately 46 GB (Mac and prosumer workstation deployment, some accuracy degradation); FP8 requires approximately 80–85 GB (closer to full-precision quality). Use INT4 for local evaluation; FP8 or FP16 for production workloads where quality is business-critical.
How does expert parallelism differ from tensor parallelism?
Tensor parallelism splits a single layer’s weight matrices across GPUs — all GPUs participate in every token. Expert parallelism places different experts on different GPUs — only GPUs holding the selected experts are active per token. The practical result: expert parallelism reduces per-token communication overhead compared to splitting weight matrices, but requires all-to-all routing traffic across the cluster.
When does the cost advantage of MoE over a dense model break down?
At very low batch sizes — single user, single token stream — the VRAM cost of hosting a large MoE model can outweigh compute savings versus a smaller dense alternative. For intermittent, low-volume inference, a smaller dense model on a single GPU may deliver better cost-per-token. The MoE advantage is most pronounced at scale and under concurrent load.
MoE architecture is one piece of a larger picture. For a complete overview of how open-weight models from DeepSeek, Qwen, and others are reshaping enterprise AI strategy — from build-vs-buy decisions to governance requirements — see our open-source AI model strategy guide.