Business | SaaS | Technology
Apr 26, 2026

AWS Bedrock Open-Weight Models — Running Qwen, Kimi K2, and MiniMax Without Infrastructure Overhead

AUTHOR

James A. Wondrasek

Open-weight AI models have reached parity with proprietary APIs on most benchmarks. The problem has always been the infrastructure tax — GPU fleet management, capacity planning, failover engineering, and a dedicated person whose job is keeping it all running. AWS Bedrock has taken that off your plate for a growing catalogue of open-weight models: 18 fully managed open-weight models landed at re:Invent 2025, with six more following in February 2026. It’s the deployment option that sits between proprietary APIs and self-hosted EC2 — and it’s part of the open-weight AI model landscape every engineering team needs to understand right now.

This article covers what Bedrock’s managed open-weight offering actually delivers: which models are available, which workloads each one suits, what it costs compared to running them yourself, and where the compliance tooling helps — and where it falls short. If you want the full build-vs-buy decision framework — including how Bedrock sits as one of three strategic options — start there before going deeper here.

What open-weight models are now available on AWS Bedrock?

At re:Invent 2025, AWS pulled in models from Google, MiniMax AI, Mistral AI, Moonshot AI, NVIDIA, OpenAI, and Qwen. The February 2026 additions — DeepSeek V3.2, MiniMax M2.1, GLM 4.7, GLM 4.7 Flash, Kimi K2.5, and Qwen3 Coder Next — brought the managed open-weight catalogue to roughly 24 models, sitting alongside nearly 100 serverless models overall.

All of them run through Project Mantle, AWS’s distributed inference engine built specifically for large-scale model serving. The bedrock-mantle endpoint is separate from the bedrock-runtime endpoint used for proprietary models like Claude and Nova. Project Mantle handles GPU provisioning, capacity scaling, and failover. Your operational surface is an API endpoint and a per-token billing meter. That’s it.

The bedrock-mantle endpoint exposes an OpenAI-compatible API, including support for the Responses API. If your application already uses OpenAI clients, redirecting to bedrock-mantle is mostly a base URL swap and a credential update. For models not yet in the standard catalogue, Amazon Bedrock Custom Model Import lets you upload model weights to S3 and AWS provisions the GPU infrastructure from there. It’s the escape hatch for edge cases.
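If your application already talks to the OpenAI API, the switch can be sketched in a few lines. The base URL shape and model identifier below are illustrative assumptions, not published values; check the Bedrock console for the real ones in your region.

```python
# Minimal sketch: pointing an existing OpenAI client at Bedrock's
# OpenAI-compatible endpoint. The base URL and model ID below are
# illustrative assumptions -- verify both in the Bedrock console.
from openai import OpenAI

client = OpenAI(
    base_url="https://bedrock-mantle.ap-southeast-2.amazonaws.com/v1",  # assumed URL shape
    api_key="<your-bedrock-api-key>",  # a Bedrock API key, not an OpenAI key
)

response = client.chat.completions.create(
    model="qwen.qwen3-next-80b-a3b",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarise this ticket in one line."}],
)
print(response.choices[0].message.content)
```

Everything else in the request and response shapes stays as your OpenAI client already expects, which is the point of the compatibility layer.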

Which model should you choose for which workload?

Model selection on Bedrock should be driven by what your pipeline actually does — benchmarks describe general capability, not your specific use case.

RAG and retrieval workloads: Start with Qwen3-Next-80B-A3B. Its instruction-following accuracy and tool-use tuning make it well-suited for retrieval-augmented generation pipelines where precision matters more than creative generation.

Multimodal workloads: Qwen3-VL-235B-A22B handles vision-language tasks — document parsing, image-grounded Q&A, multimodal agent chains. OCR support spans 32 languages, from hand-drawn sketches through to complex GUI screenshots.

Complex reasoning and agentic workflows: Kimi K2 Thinking from Moonshot AI is optimised for multi-step reasoning and agentic tool chains. Its Mixture-of-Experts architecture has 1 trillion total parameters with 32 billion active per token. MoE models activate only the relevant expert networks for each input, so inference cost is proportional to active parameters, not total parameters — that’s what makes it viable on serverless pricing. Community benchmarks put it at a Chatbot Arena Elo of 1438, with reported runs of 200 to 300 sequential tool calls without losing coherence.

Coding agents and multi-file editing: MiniMax M2 excels at coding automation — multi-file edits, terminal operations, long tool-calling chains. Qwen3-Coder-Next is the alternative if you prefer Alibaba’s Apache 2.0 licensing or want consistency across a Qwen-based stack — it achieves over 70% on SWE-Bench Verified. For a proper head-to-head, see the benchmark comparison of open-weight coding agents.

European enterprise trust requirements: Mistral Large 3 is the European-origin open-weight option on Bedrock — relevant for organisations with supply chain or regulatory requirements around model provenance.

Lightweight inference at scale: NVIDIA Nemotron Nano 2 (9B and 12B VL variants) and Gemma 3 (4B, 12B, 27B) are your small-model options for cost-sensitive, high-throughput tasks where a frontier model is complete overkill.

What does managed Bedrock deployment actually cost compared to self-hosting?

The cost case for Bedrock’s serverless inference is strongest at low-to-medium throughput volumes where fixed GPU infrastructure costs are hard to amortise.

Using gpt-oss as a pricing proxy, Bedrock serverless pricing runs approximately $0.11 per million input tokens and $0.47 per million output tokens for the 20B variant. Self-hosted GPU infrastructure on EC2 runs to roughly $4,264/month for a g6.12xlarge — that’s 4x L4 GPUs on on-demand pricing, before engineering time, on-call coverage, failover tooling, or capacity planning.
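A quick back-of-envelope calculation makes the break-even point concrete. The sketch below uses the figures above and assumes a 3:1 input-to-output token ratio; your own mix will shift the numbers.

```python
# Back-of-envelope break-even between serverless per-token pricing and
# a dedicated g6.12xlarge, using the figures quoted above.
INPUT_PER_M = 0.11    # $ per million input tokens (gpt-oss 20B proxy)
OUTPUT_PER_M = 0.47   # $ per million output tokens
EC2_MONTHLY = 4264.0  # g6.12xlarge on-demand, per month

# Assumed 3:1 input:output token ratio -- adjust for your workload.
blended_per_m = (3 * INPUT_PER_M + 1 * OUTPUT_PER_M) / 4  # $/M tokens

breakeven_tokens_m = EC2_MONTHLY / blended_per_m
print(f"Blended cost: ${blended_per_m:.3f}/M tokens")
print(f"Break-even: ~{breakeven_tokens_m:,.0f}M tokens/month")
```

At that mix it works out to roughly 21 billion blended tokens a month before the dedicated instance wins on raw compute alone.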

That engineering overhead is the cost pricing tables always leave out. Maintaining a production vLLM or SGLang serving stack requires dedicated ML infrastructure expertise. At a 50–500 person SMB, that typically means pulling a senior engineer away from product work.

Bedrock’s billing model simplifies budget management: cost scales with usage, no stranded reservation costs, no cold-start GPU spin-up. AWS doesn’t publish latency SLAs for open-weight serverless endpoints, so benchmark your specific workload before committing to production.

Provisioned Throughput — Bedrock’s reserved capacity tier — is the right choice when your workload is predictable enough to justify the commitment. Third-party inference APIs like Groq and Together AI offer competitive pricing, but lack Bedrock’s AWS-native IAM integration, PrivateLink support, and compliance tooling. The full provider comparison is in the build-vs-buy decision framework.

What does Amazon Bedrock AgentCore offer for agentic workloads?

Amazon Bedrock AgentCore is AWS’s managed runtime for deploying production agentic applications — not just individual model calls, but the full agent lifecycle: memory management, tool execution, session state, and observability.

For teams building on open-weight models, AgentCore eliminates the need to build your own orchestration infrastructure. The split is practical: Strands Agents gives you the code-level primitives (tool registration, agent loop, state management), while AgentCore gives you the production runtime (scaling, logging, session persistence).
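As a minimal sketch of that split, assuming the open-source strands Python package and a placeholder Bedrock model ID:

```python
# Minimal Strands Agents sketch. The model ID is a placeholder --
# substitute the identifier from the Bedrock catalogue.
from strands import Agent, tool

@tool
def lookup_order(order_id: str) -> str:
    """Return the status of an order (stubbed for illustration)."""
    return f"Order {order_id}: shipped"

agent = Agent(
    model="moonshot.kimi-k2-thinking",  # hypothetical Bedrock model ID
    tools=[lookup_order],
)

# The agent loop (tool selection, state, retries) is handled by the SDK;
# AgentCore supplies the production runtime around it.
result = agent("Where is order 8841?")
print(result)
```

The code-level surface stays this small because scaling, session persistence, and observability are AgentCore’s job, not yours.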

Kiro — an AWS-backed spec-driven AI development tool — integrates MiniMax M2.1, Qwen3 Coder Next, and DeepSeek V3.2 from Bedrock, giving you a concrete picture of what an AgentCore-compatible agentic product looks like in practice.

It’s worth being direct about the lock-in consideration: AgentCore and Strands Agents reduce time-to-production, but they introduce AWS dependency at the orchestration layer, not just the inference layer. Whether that trade-off makes sense is a question for the build-vs-buy framework.

How do Bedrock Guardrails address compliance requirements for open-weight models?

Amazon Bedrock Guardrails is a configurable safety and policy layer that sits between your application and any Bedrock model endpoint — including open-weight models served via Project Mantle. It applies dual-layer moderation that screens both inputs and responses, and handles PII redaction, topic blocking, grounding checks for RAG outputs, and word-level content filtering.
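Guardrails can also be applied to arbitrary text directly through the ApplyGuardrail API, which is useful for screening prompts before they ever reach a model. A minimal boto3 sketch, with placeholder guardrail ID and version:

```python
# Screening an input with a pre-configured guardrail via boto3.
# Guardrail ID and version are placeholders for your own resources.
import boto3

runtime = boto3.client("bedrock-runtime", region_name="ap-southeast-2")

resp = runtime.apply_guardrail(
    guardrailIdentifier="gr-EXAMPLE123",   # placeholder
    guardrailVersion="1",
    source="INPUT",                        # screen the prompt; use OUTPUT for responses
    content=[{"text": {"text": "Customer email: jane@example.com, card 4111..."}}],
)

if resp["action"] == "GUARDRAIL_INTERVENED":
    print("Blocked or redacted:", resp["outputs"])
else:
    print("Clean -- forward to the model")
```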

For open-weight models, Guardrails provides compensating controls for safety gaps some models carry. DeepSeek V3.2 is the clearest example. Australia banned DeepSeek from all government devices in February 2025, citing data routing to Chinese servers and a 100% attack success rate in Cisco’s safety testing. Running DeepSeek V3.2 via Bedrock in AWS Asia Pacific (Sydney) keeps all inference within AWS infrastructure — nothing touches DeepSeek’s systems. Layering Bedrock Guardrails on top mitigates those documented vulnerabilities, making DeepSeek V3.2’s reasoning capability available within Australian infrastructure at roughly 27x lower cost than OpenAI o1, with enterprise controls applied.

What Guardrails does not cover: weight-level vulnerabilities, data residency guarantees on their own, or a substitute for a full enterprise AI governance programme. For the complete governance picture, see governance and compliance for open-weight AI.

Cross-Region Inference (CRIS) routes inference within a configured geographic boundary — within Australia using Sydney and Melbourne — preventing traffic from leaving sovereign territory. AWS PrivateLink support for bedrock-mantle keeps inference within your VPC entirely. AWS announced open-weight model support in Asia Pacific (Sydney) on 12 February 2026, which makes the case for open-weight models in an Australian enterprise AI strategy considerably more direct.
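Wiring that up is a standard interface endpoint. The service name in the sketch below follows AWS’s usual naming pattern but is an assumption here; confirm it before relying on it.

```python
# Creating an interface VPC endpoint so inference traffic never leaves
# the VPC. The service name follows AWS's usual pattern but is an
# assumption -- confirm via `aws ec2 describe-vpc-endpoint-services`.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0abc123",                                        # placeholder
    ServiceName="com.amazonaws.ap-southeast-2.bedrock-mantle",  # assumed name
    SubnetIds=["subnet-0abc123"],                               # placeholder
    SecurityGroupIds=["sg-0abc123"],                            # placeholder
    PrivateDnsEnabled=True,
)
```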

When does Bedrock make sense and when does self-hosting win?

Bedrock is the right call when: throughput is low-to-medium and unpredictable; the team lacks dedicated ML infrastructure expertise; AWS-native IAM, logging, and compliance tooling add genuine value; and time-to-production matters more than maximum model control.

Self-hosting — vLLM or SGLang on EC2 or on-premise GPU — wins when: throughput is high and predictable enough to amortise dedicated GPU costs; the organisation requires full model weight custody for air-gapped deployment; or fine-tuning and model modification are core requirements.

Third-party inference APIs sit in the middle — potentially lower latency for specific models, no AWS dependency, but weaker compliance tooling. The three-option framework covers this formally. Briefly: Groq and Together AI are worth benchmarking if latency is your primary constraint.

Licensing matters for regulated industries. Qwen3 models are Apache 2.0 — broad commercial rights, no royalties. Kimi K2 Thinking uses a Modified MIT licence — permissive for commercial use with attribution. Neither creates significant legal exposure for standard SaaS deployment, but regulated industries should still run a formal licence review.

The practical heuristic is simple: if your team cannot justify a dedicated ML infrastructure role, Bedrock’s managed open-weight offering is the correct default. Self-hosting a production inference stack is a full-time job — capacity planning, failover, upgrades, monitoring. Bedrock converts all of that into a billing line item and lets your team get back to building the product. For a complete overview of where Bedrock sits in the broader landscape, see the enterprise AI strategy guide for open-weight models.

FAQ

Is Qwen on AWS Bedrock enterprise-ready?

Yes, for most enterprise workloads. Qwen3 models run under AWS’s SLA framework, with IAM-controlled access, PrivateLink support, and Guardrails compatibility. Apache 2.0 permits commercial deployment without royalty obligations. The key caveat: AWS does not publish latency SLAs for open-weight serverless endpoints — benchmark via Bedrock Evaluations before committing if sub-100ms first-token time matters.

What is the difference between AWS Bedrock serverless inference and provisioned throughput?

Serverless inference is pay-per-token with no instance reservation — AWS allocates capacity dynamically. Provisioned Throughput reserves dedicated model capacity for a committed period, guaranteeing consistent throughput at a fixed cost. Serverless suits variable, unpredictable workloads. Provisioned throughput suits high-volume production workloads where predictability matters.

Does Bedrock support OpenAI-compatible API endpoints for open-weight models?

Yes. The bedrock-mantle endpoint exposes an OpenAI API-compatible interface, including support for the Responses API. Existing OpenAI client applications can be redirected with minimal code changes — typically a base URL swap and API key update. This applies to open-weight models only. The bedrock-runtime endpoint for Claude, Nova, and similar models uses a different API schema.

What is Project Mantle and how does it relate to Amazon Bedrock?

Project Mantle is AWS’s distributed inference engine that powers open-weight model serving on Amazon Bedrock. It handles GPU provisioning, capacity scaling, failover, and the OpenAI API compatibility layer. You don’t interact with it directly — you call the bedrock-mantle endpoint and Mantle handles everything underneath.

Can I run DeepSeek models on AWS Bedrock despite the government ban?

Yes. DeepSeek V3.2 is available on Bedrock as an open-weight model. The Australian government ban applies to the DeepSeek API — calling DeepSeek’s own servers — not to running the published open weights. Running via Bedrock keeps inference within AWS infrastructure, with no data sent to DeepSeek’s systems. Adding Bedrock Guardrails mitigates the safety vulnerabilities that contributed to the ban. Legal exposure in other jurisdictions should be reviewed independently.

Which AWS regions support open-weight models on Bedrock?

Open-weight models via Project Mantle are available across multiple AWS regions. Asia Pacific (Sydney) and Asia Pacific (Melbourne) gained open-weight model availability in February 2026, enabling Australian organisations to run open-weight inference within Australian borders. Specific model availability by region varies — check the Bedrock model catalogue for current region support.

How does Kimi K2’s Mixture-of-Experts architecture affect inference cost on Bedrock?

Kimi K2 Thinking has 1 trillion total parameters but only 32 billion active per inference pass. Inference cost is proportional to active parameters, not total parameters — so Kimi K2 delivers reasoning quality comparable to frontier models at a per-token cost closer to a 32B dense model than a 1T dense model. That’s what makes it cost-viable for agentic workloads on serverless pricing.
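The arithmetic is worth spelling out. Under the simplifying assumption that per-token compute scales linearly with active parameter count:

```python
# Simplified: per-token compute scales with *active* parameters.
total_params = 1_000e9   # 1T total (MoE)
active_params = 32e9     # 32B active per token

print(f"Active fraction: {active_params / total_params:.1%}")
# -> 3.2% of the compute a 1T dense model would need per token,
#    i.e. roughly the serving-cost profile of a 32B dense model.
```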

Can I import a custom open-weight model into Amazon Bedrock if it is not in the catalogue?

Yes, via Amazon Bedrock’s Custom Model Import feature. It supports OpenAI-compatible open-weight models not yet in the standard catalogue. Weights are uploaded to S3 via the Bedrock console and AWS then manages GPU provisioning and scaling. Once imported, the model behaves like any other catalogue model.
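As a hedged sketch of what kicking off an import job looks like with boto3 (every name and ARN below is a placeholder):

```python
# Starting a custom model import job. All names and ARNs are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="ap-southeast-2")

bedrock.create_model_import_job(
    jobName="import-my-finetuned-qwen",
    importedModelName="my-finetuned-qwen",
    roleArn="arn:aws:iam::123456789012:role/BedrockImportRole",  # placeholder
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://my-bucket/qwen-weights/"}  # placeholder
    },
)
```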

What is Amazon Bedrock AgentCore?

Amazon Bedrock AgentCore is AWS’s managed runtime for deploying production agentic applications. It provides memory management, tool execution, session state persistence, and observability for agent-based workloads — infrastructure you’d otherwise build and maintain yourself. It works with open-weight models served via Project Mantle and integrates with Strands Agents, AWS’s open-source agentic SDK.

How do I migrate from Groq or Together AI to AWS Bedrock for open-weight inference?

The bedrock-mantle endpoint’s OpenAI API compatibility means migration is primarily a configuration change: update the base URL, swap in AWS credentials, adjust region-specific parameters. The main reasons to migrate are AWS-native IAM integration, PrivateLink for VPC isolation, and Guardrails for compliance — not latency or throughput, where Groq currently leads.

What is the serverless cold-start behaviour for open-weight models on Bedrock?

AWS does not publish cold-start latency specifications for open-weight models on bedrock-mantle. For latency-sensitive workloads where first-token time matters, benchmark your specific model and workload using Bedrock Evaluations before committing to serverless inference. If cold starts are unacceptable, Provisioned Throughput provides reserved capacity with no cold-start exposure.
