Three open-weight coding agent models are bunched up at the top of SWE-Bench Verified right now: GLM-4.7 at 74.2, Qwen3-Coder-Next at 70.6, and DeepSeek-V3.2 at 70.2. Those scores would have looked like frontier-level performance twelve months ago. But the numbers alone don’t tell you which model to actually deploy. Availability constraints, infrastructure requirements, and real-world performance gaps mean the highest scorer isn’t automatically the right choice for your team.
This article walks through all three — benchmark performance (with some important caveats), deployment options, integration with coding agent tools, and licensing. It’s part of the broader analysis of the open-weight AI model landscape and goes deep on the coding agent use case. Three deployment paths — managed (AWS Bedrock), self-hosted (vLLM/SGLang), and local (llama.cpp/GGUF) — are mapped to help you make the call.
How do the three leading open-weight coding models compare on benchmarks?
Here are the key numbers. All SWE-Bench scores use the Verified variant — the human-reviewed subset — evaluated with the SWE-Agent scaffold.
| Model | SWE-Bench Verified | SWE-Bench Pro | SWE-Bench Multilingual | Terminal-Bench 2.0 | Aider | Context window | Parameters | Licence | Deployment |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next | 70.6 | 44.3 | 62.8 | 36.2 | 66.2 | 256K tokens | 80B total / 3B active | Apache 2.0 | Local, self-hosted, managed |
| DeepSeek-V3.2 | 70.2 | 40.9 | 62.3 | — | — | 128K tokens | 671B total / 37B active | MIT | Managed, API (multi-GPU for self-host) |
| GLM-4.7 | 74.2 | 40.6 | 63.7 | — | — | 204K tokens | ~360B MoE | MIT | API (access-restricted) |
Three things worth unpacking here.
The Qwen3-Coder-Next / DeepSeek-V3.2 gap is basically nothing. At 0.4 points on SWE-Bench Verified, these two are in practical parity. The more meaningful difference is architecture: Qwen3-Coder-Next runs 80B total parameters with only 3B active per token (MoE design), while DeepSeek-V3.2 activates 37B of its 671B total. That’s a very different deployment footprint for nearly identical benchmark performance.
The ranking flips on the harder benchmark. On SWE-Bench Pro — a tougher variant curated by Scale AI in September 2025 — Qwen3-Coder-Next scores 44.3, ahead of DeepSeek-V3.2 (40.9) and GLM-4.7 (40.6). GLM-4.7’s lead on Verified doesn’t hold up when the tasks get harder.
GLM-4.7’s thinking mode is a qualitative differentiator. GLM-4.7 supports interleaved reasoning before tool calls and a toggleable deep reasoning mode to balance latency against accuracy. These capabilities matter in multi-step agentic sessions, but they don’t show up in benchmark snapshots.
All three models carry permissive licences — Apache 2.0 or MIT — so weights are inspectable and commercially usable. AWS Bedrock managed deployment partially addresses data-sovereignty concerns for teams that need it.
What do SWE-Bench scores actually measure — and what do they miss?
SWE-Bench evaluates a model’s ability to resolve real GitHub issues from open-source Python repositories. Success is binary: the model produces a patch, and either it passes the existing test suite or it doesn’t. SWE-Bench Verified is a human-reviewed subset with validated difficulty ratings — the most credible publicly available coding agent benchmark. Early models scored below 5%, so scores in the 70s represent genuine multi-step reasoning capability.
What SWE-Bench Verified measures well:
- Single-task issue resolution on well-scoped, bounded problems
- Performance on established codebases with existing test coverage
- Multi-step reasoning on tasks where success is unambiguous
What it misses:
- Sustained multi-day sessions across large proprietary codebases — tasks are bounded and reset between evaluations
- Cross-file refactoring requiring coordinated changes across many modules
- Enterprise codebases with incomplete or absent test coverage
Benchmark contamination is worth understanding too. Many of the GitHub issues in SWE-Bench were filed and resolved before the training cutoffs of the models being tested. If a model saw those repositories — the discussions, commits, and merged patches — during training, it may have effectively seen the answers. This doesn’t invalidate the benchmark, but a high score is better read as evidence of genuine capability than as a precise ranking instrument. Practitioners increasingly cross-reference multiple scores — SWE-Bench, SWE-Bench Pro, Terminal-Bench, Aider — rather than rely on any single number.
The 74.2 vs 70.6 vs 70.2 spread is useful as a filter for identifying models worth evaluating. But with scores clustered this tightly, a gap of a few points rarely produces an observable difference in real-world task completion. The scores tell you which models are in range; they don’t tell you which will perform better in your codebase.
A note on SWE-Bench Pro: this harder variant, released by Scale AI in September 2025, uses a closed test set, which has drawn criticism because results can’t be independently verified. Verified and Pro scores are not directly comparable; wherever this article cites a SWE-Bench number without qualification, it’s the Verified figure, and Pro scores are labelled explicitly.
With those caveats in place, here’s how the three models map to specific use cases.
Which model is best for which coding workload?
Qwen3-Coder-Next is where most teams should start. Apache 2.0 removes legal friction, the 80B/3B MoE architecture makes local deployment feasible on consumer hardware, and its 256K context window is the largest of the three. On the harder SWE-Bench Pro it leads at 44.3. For a full treatment of what the 80B/3B MoE architecture means for your deployment budget, see MoE architecture and why 80B/3B works locally.
DeepSeek-V3.2 is the strongest option if managed API access is your deployment model. At 671B/37B parameters, self-hosting requires a multi-GPU cluster — 4–8x H100 equivalent, estimated $8–28/hr in cloud costs — which isn’t practical for most teams. Its DeepSeek Sparse Attention mechanism accelerates long-context reasoning paths by up to 3x. It’s also the only model of the three with explicit Huawei Ascend chip support, which matters for organisations in regions where NVIDIA hardware is export-restricted.
GLM-4.7 scores highest (74.2) but getting access is the problem. Restricted sign-ups at Z.ai mean the model may not be available to you regardless of your hardware. Even where access is granted, GLM-4.7’s ~360B parameter architecture puts local deployment well out of practical reach. For near-term production decisions, Qwen3-Coder-Next or DeepSeek-V3.2 are the more reliable choices.
Workload mapping:
- Long-context repository analysis (100K+ tokens): Qwen3-Coder-Next (256K context)
- Air-gapped or zero-API-cost deployment: Qwen3-Coder-Next via llama.cpp/GGUF
- Managed API, maximum available benchmark performance: DeepSeek-V3.2 API or GLM-4.7 (if access is granted)
- Huawei Ascend infrastructure: DeepSeek-V3.2
Why RLVR training matters. Qwen3-Coder-Next was trained on approximately 800K verifiable coding tasks using Reinforcement Learning from Verifiable Rewards (RLVR). Rather than relying on human preferences, the model receives correctness signals from automated graders — does the code compile, does it pass the tests. The result is a model capable of multi-step reasoning and failure recovery, which is qualitatively different from an instruction-tuned coding assistant.
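To make that training signal concrete, here is a minimal sketch of a verifiable-rewards grader: apply the candidate patch, run the project's test suite, and return a binary reward. The file layout and function name are illustrative assumptions, not Qwen's actual training harness.

```python
import subprocess
import tempfile
from pathlib import Path


def verifiable_reward(repo_dir: str, candidate_patch: str) -> float:
    """Illustrative RLVR-style grader: reward is 1.0 only if the patched
    repository's test suite passes. (Sketch only, not Qwen's training pipeline.)"""
    repo = Path(repo_dir)

    # Write the model's proposed patch to a temp file so git can apply it.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(candidate_patch)
        patch_file = f.name

    # Reward 0 if the patch doesn't even apply cleanly.
    check = subprocess.run(["git", "apply", "--check", patch_file],
                           cwd=repo, capture_output=True)
    if check.returncode != 0:
        return 0.0
    subprocess.run(["git", "apply", patch_file], cwd=repo, capture_output=True)

    # The correctness signal comes from the existing test suite, not a human rater.
    tests = subprocess.run(["python", "-m", "pytest", "-q"],
                           cwd=repo, capture_output=True, timeout=300)
    return 1.0 if tests.returncode == 0 else 0.0
```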
DeepSeek-V3.2 uses a similar approach, but with a specific architectural extension: its “Thinking in Tool-Use” design retains chain-of-thought and tool-call history throughout a session rather than discarding the reasoning trace after each tool invocation. In practice, this means the model maintains context across a long sequence of tool calls — genuinely useful in coding agents that orchestrate many actions before completing a task.
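A rough sketch of that pattern in an agent loop, against a generic OpenAI-compatible endpoint: the assistant turn (including any reasoning text the server returns) and every tool result are appended to the message history and re-sent each turn, rather than pruned after each call. The endpoint URL, model ID, and tool definition are placeholders; this illustrates the retention pattern, not DeepSeek's internal implementation.

```python
import json
from openai import OpenAI

# Placeholders: a self-hosted OpenAI-compatible endpoint and model id.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local")
MODEL = "deepseek-v3.2"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the repo and return its output",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]


def run_tool(name: str, arguments: str) -> str:
    """Stub executor; wire this to your sandboxed tool runner."""
    args = json.loads(arguments)
    return f"(pretend output of: {args.get('command', '')})"


# The key pattern: nothing is pruned. The assistant turns and tool results
# accumulate in `messages` for the whole session, so later steps can refer
# back to earlier reasoning and outputs.
messages = [{"role": "user", "content": "Fix the failing test in utils/dates.py"}]

while True:
    response = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    messages.append(msg)  # keep the full assistant turn in history

    if not msg.tool_calls:
        break  # model considers the task complete

    for call in msg.tool_calls:
        result = run_tool(call.function.name, call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```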
How do you integrate an open-weight model with existing coding agent tools?
Claude Code is the dominant terminal-based coding agent tool. It uses Anthropic models by default, but Qwen3-Coder-Next can function as the backend via its OpenAI-compatible API. This isn’t a native feature. It requires deploying Qwen3-Coder-Next via vLLM or SGLang — both expose an OpenAI-compatible endpoint — then configuring Claude Code to use that endpoint as its API base. The model becomes a drop-in backend within Claude Code’s UX. One constraint: Qwen3-Coder-Next doesn’t generate think blocks, so Anthropic-specific extended thinking is unavailable via this path. The delegation threshold for coding tasks, and when to use synchronous vs. delegated agent modes, is explored further in the companion article.
Three integration paths:
Claude Code + Qwen3-Coder-Next via OpenAI-compatible API. Deploy Qwen3-Coder-Next via vLLM or SGLang, then configure Claude Code’s API base to point at that server. You keep Claude Code’s UX while replacing the proprietary model with a self-hosted open-weight backend.
Cline + local Qwen3-Coder-Next. Cline is an open-source VS Code extension that connects to any model via a configurable API endpoint. It’s a good option for individual developers who want local inference without leaving VS Code.
Qwen Code (native CLI). QwenLM/qwen-code on GitHub is Alibaba’s coding agent CLI built specifically for Qwen3-Coder-Next. Less configuration overhead than the OpenAI-compatible path, but a less mature ecosystem than Claude Code.
Tools built on the OpenAI client library — including Aider and Continue.dev — use the same mechanism. Configure the endpoint, set the API base URL, and the model becomes the backend. The pattern is consistent across tools; only the configuration file location differs.
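As a concrete illustration of that mechanism, here is the OpenAI client pointed at a self-hosted endpoint; the base URL and model ID are placeholders for whatever your vLLM or SGLang server exposes.

```python
from openai import OpenAI

# Placeholder endpoint: substitute your vLLM/SGLang server's address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local")

reply = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",  # placeholder id; check /v1/models on your server
    messages=[{"role": "user", "content": "Write a pytest for parse_date()."}],
)
print(reply.choices[0].message.content)
```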
What are the three deployment paths for open-weight coding agents?
Path 1 — Local (llama.cpp/GGUF)
Qwen3-Coder-Next at 4-bit quantisation requires approximately 46GB of unified memory; at 8-bit, approximately 85GB. GGUF quantisations are available from Unsloth. This is within reach of a MacBook Pro M3 Max with 128GB RAM, an RTX 5090, or an AMD Radeon 7900 XTX. Zero ongoing API costs. On memory-constrained systems, reduce the active context window from the 256K maximum to avoid out-of-memory errors. Local deployment doesn’t scale to team-wide shared access — there’s no access control, audit logging, or centralised monitoring.
The reason local deployment is feasible at all: 80B/3B MoE means VRAM requirements are driven by active parameters (3B), not total (80B). See MoE architecture and why 80B/3B works locally for the full explanation.
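For the memory-constrained case, a minimal local-inference sketch with llama-cpp-python, with the context window dialled down from the 256K maximum; the GGUF filename is a placeholder for whichever Unsloth quantisation you downloaded.

```python
from llama_cpp import Llama

# Placeholder filename: substitute the Unsloth GGUF quantisation you downloaded.
# At 4-bit the 80B weights sit around 46GB resident, but per-token compute only
# touches the ~3B active experts, which is why this is tractable locally at all.
llm = Llama(
    model_path="qwen3-coder-next-80b-a3b-q4_k_m.gguf",
    n_ctx=32768,       # well below the 256K maximum, to keep the KV cache affordable
    n_gpu_layers=-1,   # offload everything that fits; lower this if you hit OOM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```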
Path 2 — Self-Hosted (vLLM/SGLang)
A production inference server on private cloud or on-premises GPU infrastructure. You get shared team access, full control, and an OpenAI-compatible API endpoint. vLLM handles continuous batching; SGLang is optimised for structured generation and tool-calling. Total Cost of Ownership includes GPU capex, ML engineering headcount, and downtime risk — not just per-token pricing. DeepSeek-V3.2 self-hosting requires a multi-GPU cluster, which makes managed API or Bedrock the realistic path for that model for most teams.
Path 3 — Managed (AWS Bedrock)
AWS Bedrock hosts Qwen3-Coder-Next via Project Mantle alongside DeepSeek-V3.2, GLM-4.7, and other open-weight models, all served via an OpenAI-compatible endpoint. Enterprise governance features — Guardrails, Automated Reasoning Checks, access controls, audit logging — are available without any additional engineering. AWS PrivateLink is available in 14 regions, keeping traffic off the public internet and reducing network latency in multi-step agentic workflows. For a detailed Bedrock deployment walkthrough, see running Qwen3-Coder-Next without infrastructure overhead.
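For a sense of what calling a Bedrock-hosted model looks like, here is a minimal sketch using boto3's Converse API, one of the access paths Bedrock exposes; the model ID is a hypothetical placeholder, so check the Bedrock console for the real identifier and region availability.

```python
import boto3

# Placeholder region; Bedrock model availability varies by region.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="qwen.qwen3-coder-next-v1:0",  # hypothetical ID, for illustration only
    messages=[{"role": "user", "content": [{"text": "Summarise the failing CI job log below: ..."}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```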
Decision heuristic: If your team can’t dedicate at least a part-time ML engineer to infrastructure, managed deployment is the lower-risk path. Self-hosting makes sense when control, cost at scale, or compliance requirements make it necessary. For the strategic framing, see the broader build-vs-buy decision and where these models fit in enterprise strategy.
FAQ
Does Qwen3-Coder-Next work with Claude Code?
Yes, but not natively. Claude Code uses Anthropic models by default. To use Qwen3-Coder-Next as the backend, deploy it with vLLM or SGLang — both expose an OpenAI-compatible endpoint — then configure Claude Code’s API base to point at that server. Qwen3-Coder-Next operates in non-thinking mode only and doesn’t generate think blocks, so Anthropic-specific extended thinking isn’t available via this path.
Can I run GLM-4.7 locally?
Not practically. GLM-4.7’s ~360B parameter architecture puts it well beyond typical hardware budgets, and compute-constrained access at Z.ai — with restricted sign-ups — means the model may not be available to new users regardless of their hardware. GLM-4.7 is best evaluated via API if access is granted. GLM-5 (744B, February 2026, 77.8% SWE-Bench Verified) is the successor and post-dates this evaluation window.
Is a 70+ SWE-Bench Verified score actually reliable for production coding tasks?
A 70+ score indicates genuine capability on well-scoped, bounded coding tasks from real GitHub repositories. It’s a solid filter for identifying models worth evaluating — but it doesn’t reliably predict performance on multi-day sessions, proprietary codebases without test coverage, or multi-file refactoring. Treat it as a capability floor, not a performance ceiling.
What is the difference between SWE-Bench Verified and SWE-Bench Pro?
Verified is a human-reviewed subset with validated difficulty ratings. SWE-Bench Pro (Scale AI, September 2025) is harder, with curated tasks and a closed test set that can’t be independently verified, which has drawn criticism. Headline scores in this article use Verified; Pro scores are always labelled. On Pro the ranking inverts: Qwen3-Coder-Next (44.3) leads DeepSeek-V3.2 (40.9) and GLM-4.7 (40.6).
What is RLVR and why does it matter for coding agents?
RLVR — Reinforcement Learning from Verifiable Rewards — trains models using automated correctness signals (does the code compile, does it pass the tests). Qwen3-Coder-Next was trained on approximately 800K verifiable coding tasks this way, producing a model capable of multi-step reasoning, tool sequencing, and failure recovery rather than single-turn code completion.
What licences do these models use for commercial deployment?
Qwen3-Coder-Next: Apache 2.0 (commercial use permitted, attribution required in distributed derivatives). GLM-4.7 and DeepSeek-V3.2: MIT. All three allow model weights to be inspected and deployed commercially without licence fees to the originating lab.
For broader context on where open-weight coding models sit in the current competitive landscape — and how to think about the build-vs-buy question across RAG, agentic, and coding workloads — see the complete open-source AI model enterprise strategy guide.