You’ve got your MCP server running. Your AI agents are calling tools, accessing resources, maybe even kicking off a few workflows. And then you hit the wall—timeouts on operations that run for hours, context windows that blow past 100K tokens, multi-agent systems that cost 15x what you expected. Welcome to the gap between prototype and production.
Here’s what happens: basic MCP implementations fall apart when you need hours-long execution, complex agent coordination, or workflows that need a human to make a call at key decision points. Standard tool calling patterns weren’t built for this.
That’s what advanced MCP patterns solve. Tasks for asynchronous operations. Sampling for server-side intelligence. Context engineering for token efficiency. Multi-agent architectures for specialisation. These patterns separate systems that work in demos from systems that actually work at scale.
This guide builds on our comprehensive Model Context Protocol resource, which covers the core MCP primitives. It assumes you’ve built basic MCP servers and understand implementing tools and resources. We’re focusing on production implementation patterns, including how to optimise performance and keep things reliable.
What Are MCP Tasks and How Do They Differ from Regular MCP Tools?
Regular MCP tools are synchronous: you make a request, wait for the response, and continue. If the operation takes longer than your timeout window, it fails.
MCP Tasks break this pattern. They enable asynchronous, long-running operations that blow past typical request-response timeouts. You initiate the task and get back a task ID immediately. The actual work happens in the background. The server sends you progress notifications via JSON-RPC 2.0. When you’re ready, you poll for results using the task ID.
Unlike regular tools, which must complete within seconds, Tasks support multiple states: working, input_required, completed, failed, and cancelled. They provide progress tracking through notifications, cancellation support, and result polling via the task ID.
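From the client's side, the lifecycle looks roughly like the sketch below. The `mcp_call` helper and the `tasks/create`, `tasks/get`, and `tasks/result` method names are hypothetical placeholders; the exact endpoints depend on the protocol revision and SDK you're using.

```python
import asyncio

# Hypothetical JSON-RPC helper and method names; adapt to your SDK.
async def run_long_tool(mcp_call, tool_name: str, arguments: dict) -> dict:
    # Initiate the task: the server returns a task ID immediately
    # instead of blocking until the work finishes.
    task = await mcp_call("tasks/create", {"tool": tool_name, "arguments": arguments})
    task_id = task["taskId"]

    # Poll until the task leaves its in-flight states.
    while True:
        status = await mcp_call("tasks/get", {"taskId": task_id})
        if status["state"] in ("completed", "failed", "cancelled"):
            break
        await asyncio.sleep(2)  # back off between polls

    if status["state"] != "completed":
        raise RuntimeError(f"Task {task_id} ended in state {status['state']}")
    return await mcp_call("tasks/result", {"taskId": task_id})
```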
They’re essential when you’re processing large datasets, running extended research operations, or orchestrating multi-step workflows that need hours of execution. Healthcare data analysis, enterprise automation, code migration, and test execution all benefit from this pattern.
You can also integrate MCP Tasks with workflow orchestrators like Temporal. This gives you durable execution, workflow versioning, replay, and scheduling.
How Do You Implement Long-Running MCP Tools That Won’t Timeout?
Convert your operations to MCP Tasks. Return a task ID immediately, execute work asynchronously, and send progress notifications via JSON-RPC. Clients poll for results using the task ID. Implement cancellation handlers for graceful interruption.
Start by defining your task schema with a unique ID, status tracking, and result structure. When a client calls the tool, return the task ID immediately while queueing the work for background execution.
Set up a notification channel for progress updates. As work progresses, emit notifications with percentages or status messages. Create a result polling endpoint. The protocol supports active polling so clients can check status anytime.
Add a cancellation handler for cleanup—save partial progress and update task status when clients request cancellation.
For state management, persist task status across server restarts using a database or file system. Checkpoint your progress regularly. Design for idempotency so partial completions can be safely retried if something goes wrong.
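Putting those pieces together, here's a plain-Python sketch of the server side, with an in-memory task store, a hypothetical `notify` callback standing in for your notification channel, and a stubbed unit of work. Swap the dictionary for a database when you need persistence across restarts.

```python
import asyncio
import uuid
from dataclasses import dataclass, field

@dataclass
class TaskRecord:
    status: str = "working"
    progress: float = 0.0
    result: dict | None = None
    cancelled: bool = False
    checkpoint: dict = field(default_factory=dict)

TASKS: dict[str, TaskRecord] = {}  # in-memory store; use a database in production

async def process_batch(batch) -> None:
    await asyncio.sleep(1)  # stand-in for the real long-running work

async def start_migration(args: dict, notify) -> dict:
    """Tool entry point: returns a task ID immediately and queues the work."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = TaskRecord()
    asyncio.create_task(_run_migration(task_id, args, notify))
    return {"taskId": task_id, "status": "working"}

async def _run_migration(task_id: str, args: dict, notify) -> None:
    record = TASKS[task_id]
    try:
        for i, batch in enumerate(args["batches"]):
            if record.cancelled:
                record.status = "cancelled"  # partial progress already checkpointed
                return
            await process_batch(batch)
            record.progress = (i + 1) / len(args["batches"])
            record.checkpoint = {"last_batch": i}       # checkpoint for safe retries
            await notify(task_id, record.progress)      # progress notification
        record.status, record.result = "completed", {"migrated": len(args["batches"])}
    except Exception as exc:
        record.status, record.result = "failed", {"error": str(exc)}

def cancel(task_id: str) -> None:
    """Cancellation handler: flag the task; the worker saves state and stops."""
    TASKS[task_id].cancelled = True
```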
Temporal Workflows provide an integration path that's worth looking at. A trigger tool initiates workflows, a signal tool sends commands to running workflows, and a query tool retrieves status. This enables human-in-the-loop patterns where tools pause waiting for human decisions, then resume automatically.
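A rough sketch of those three tools using the temporalio Python SDK is below. The workflow name, signal and query names, task queue, and server address are all placeholders, and it assumes a worker is already hosting the workflow.

```python
from temporalio.client import Client

async def trigger_tool(order_id: str) -> str:
    """Start a workflow and hand back its ID as the MCP task ID."""
    client = await Client.connect("localhost:7233")
    handle = await client.start_workflow(
        "OrderMigrationWorkflow", order_id,          # placeholder workflow name
        id=f"migration-{order_id}", task_queue="mcp-tasks",
    )
    return handle.id

async def signal_tool(workflow_id: str, decision: str) -> None:
    """Send a command (e.g. a human decision) into the running workflow."""
    client = await Client.connect("localhost:7233")
    await client.get_workflow_handle(workflow_id).signal("human_decision", decision)

async def query_tool(workflow_id: str) -> str:
    """Read the workflow's current status without interrupting it."""
    client = await Client.connect("localhost:7233")
    return await client.get_workflow_handle(workflow_id).query("status")
```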
What Is MCP Sampling and When Should You Use It?
Traditional pattern: Server returns data, client LLM reasons, client calls next tool. This creates substantial context handoff overhead with high token costs.
MCP sampling flips this. It allows servers to request the connected AI client to generate completions on behalf of the server. The server requests an LLM completion, gets an intelligent response, and continues execution. Lower token costs. It enables server-side agentic loops without you having to manage separate LLM connections.
Sampling with tools extends this further: servers can include tool definitions in their sampling requests, enabling server-side agent loops, parallel tool calls, multi-step reasoning within a single tool invocation, and better control over what enters the context.
Here’s how it works: Clients must advertise the sampling capability during the initialization handshake. Servers can then make sampling requests with system prompts, user messages, tool definitions, and model preferences.
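Here's a minimal sketch using the official MCP Python SDK's FastMCP server. The server name, tool, and prompt are illustrative, and the call only succeeds if the connected client advertised the sampling capability.

```python
from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("log-analyzer")  # illustrative server name

@mcp.tool()
async def classify_log(entry: str, ctx: Context) -> str:
    """Ask the connected client's LLM to classify a log line via sampling."""
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Classify severity: {entry}"),
            )
        ],
        system_prompt="Reply with one word: info, warning, or error.",
        max_tokens=10,
    )
    content = result.content
    return content.text if isinstance(content, TextContent) else "unknown"
```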
Use it when your servers need intelligent decision-making, data transformation requiring inference, or recursive problem-solving. Use cases include server-side data interpretation, intelligent filtering and routing, recursive research operations, and code generation within tools.
Trade-offs: Complexity increases—debugging server-side reasoning is harder. But token efficiency improves. Security considerations for sampling with tools matter—you need token allowance limits to prevent runaway costs and careful handling of sensitive prompt exposure.
How Does Context Engineering Apply to MCP Implementations?
Context is a finite resource. LLMs have an attention budget. Every new token depletes this budget. As tokens increase, the model’s ability to recall information decreases. This is context rot.
Transformers enable every token to attend across the entire context, resulting in n² pairwise relationships. Tool definitions add 200-2000 tokens each. When you’re building multi-tool workflows that exceed 50K-100K token contexts, you need strategies for managing your token budget across tool definitions, system prompts, conversation history, and results to prevent context window exhaustion.
Context engineering covers all of that. Techniques include tool description optimisation, context compaction, sub-agent architectures, and code execution patterns where agents write code to call tools instead of making direct invocations.
Optimisation strategies start with minimising tool descriptions. Remove verbose examples. Use concise parameter descriptions. Avoid redundant documentation. Every word counts against your budget.
The code execution pattern offers substantial savings. Instead of direct tool invocations, the agent writes Python or JavaScript that calls tools. This saves context on repeated calls.
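A hypothetical example of agent-authored code against a thin `tickets` wrapper module: the intermediate results stay inside the execution sandbox, and only the final condensed summary re-enters the model's context.

```python
# Agent-written script; the tickets module is a hypothetical wrapper
# around the underlying MCP tools.
from tickets import search, fetch

open_ids = search(status="open", older_than_days=30)
stale = [fetch(ticket_id) for ticket_id in open_ids]

by_team: dict[str, int] = {}
for ticket in stale:
    by_team[ticket["team"]] = by_team.get(ticket["team"], 0) + 1

# Only this condensed summary is returned to the model,
# not the dozens of raw fetch results above.
print({"stale_total": len(stale), "by_team": by_team})
```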
Context compaction distils the context window while keeping high fidelity. Claude Code implements this by summarising and compressing message history, preserving architectural decisions and implementation details while discarding redundant tool outputs.
Sub-agent architectures provide another approach. Specialised sub-agents handle focused tasks with clean context windows while the main agent coordinates. Each sub-agent explores extensively using tens of thousands of tokens but returns only a condensed summary of 1,000-2,000 tokens.
Waiting for larger context windows won’t solve these problems. Context windows of all sizes suffer from context pollution. Treating context as a finite resource remains important even as models improve.
How Do You Coordinate Multiple AI Agents Using MCP?
Multi-agent coordination via MCP uses a coordinator agent managing specialised worker agents. Each worker connects to dedicated MCP servers exposing domain-specific tools. Context handoff involves extracting relevant information from one agent’s context and injecting it into another’s, minimising token overhead while preserving state. This enables parallel execution and concentrates domain expertise where you need it.
Architecture patterns include coordinator-worker (hub-and-spoke), pipeline (sequential chaining), and swarm (peer coordination). The coordinator handles task decomposition, agent selection, context handoff management, and result aggregation. Worker agents have narrow domain expertise, specialised tool sets, and focused system prompts.
Context handoff techniques matter for performance. Structured data extraction pulls specific fields from one agent’s results. Semantic summarisation compresses worker outputs. Reference passing shares IDs instead of full content.
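A small illustrative sketch of a coordinator preparing a handoff that combines all three techniques; the field names and resource URI scheme are made up.

```python
def build_handoff(research_output: dict) -> dict:
    """Shape one agent's output into the minimal payload the next agent needs."""
    return {
        # Structured extraction: only the fields the next agent needs.
        "customer_id": research_output["customer_id"],
        "risk_score": research_output["risk_score"],
        # Semantic summarisation: compressed summary instead of the full transcript.
        "summary": research_output["summary"][:2000],
        # Reference passing: the worker fetches the full report on demand
        # through an access-controlled MCP resource rather than via context.
        "full_report_uri": f"resource://reports/{research_output['report_id']}",
    }
```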
Communication patterns vary—synchronous handoffs when one agent must complete first, asynchronous message passing for concurrent work, or shared memory via MCP resources for coordination through a common data layer.
For parallel execution, distribute independent tasks to multiple workers at the same time. Synchronise results when all complete. Handle errors—if one worker fails, decide whether to retry, skip, or abort everything.
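A minimal sketch with asyncio, assuming each worker is wrapped in its own coroutine (the agent names are placeholders):

```python
import asyncio

async def run_workers(task: str) -> dict:
    # Fan out to independent workers concurrently; each wraps a
    # specialised agent with its own MCP servers (placeholder names).
    results = await asyncio.gather(
        legal_review_agent(task),
        security_review_agent(task),
        cost_estimate_agent(task),
        return_exceptions=True,  # one failed worker shouldn't sink the rest
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    if failed and not ok:
        raise RuntimeError("all workers failed")  # abort; or retry per your policy
    return {"results": ok, "failures": [str(e) for e in failed]}
```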
But here’s the thing about token costs: they increase significantly. Multi-agent systems can consume 15x more tokens than single-agent approaches. You need a cost-benefit analysis. Does specialisation justify the overhead? Sometimes yes—multi-agent research systems showed substantial improvement on complex research tasks.
When to avoid: Simple workflows achievable by a single agent, tight token budgets, latency-sensitive operations, or cases where debugging complexity outweighs benefits.
What Is URL Mode Elicitation and When Should You Use It?
Agents can’t complete interactive web flows requiring human input: OAuth, two-factor authentication, legal agreements, payment authorisation. URL mode solves this by letting servers direct users to the appropriate flow in their own browser, where they can authenticate securely without the client ever seeing credentials.
URL mode elicitation allows MCP servers to request users complete web-based interactions (OAuth flows, form submissions) by redirecting to a URL, then resuming the workflow after completion. The server provides a URL, the client presents it to the user, the user completes the interaction, and the server detects completion and continues.
It’s essential for authentication, payment, and approval workflows.
The mechanism: Server pauses workflow. Returns a URL to client. Client presents URL to user. Server monitors for completion signal (OAuth callback, webhook, polling endpoint). When detected, server resumes with results.
Implementation starts with generating a unique session URL tied to the paused workflow state. Store workflow state persistently. Present the URL to the user. Poll for completion or set up a callback. When the user completes the interaction, validate the state and resume.
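Here's an illustrative sketch of that flow with an in-memory session store; the endpoint shape, 15-minute expiry, and helper names are assumptions.

```python
import secrets
import time

PENDING: dict[str, dict] = {}  # stand-in for persistent workflow state

def create_elicitation_url(workflow_id: str, base_url: str) -> str:
    """Mint a single-use, time-limited URL tied to the paused workflow."""
    token = secrets.token_urlsafe(32)
    PENDING[token] = {
        "workflow_id": workflow_id,
        "expires": time.time() + 900,  # 15-minute window (assumption)
        "completed": False,
        "result": None,
    }
    return f"{base_url}/elicit/{token}"

def handle_callback(token: str, result: dict) -> None:
    """OAuth callback / webhook handler: validate state, record completion."""
    session = PENDING.get(token)
    if session is None or time.time() > session["expires"] or session["completed"]:
        raise PermissionError("invalid, expired, or reused elicitation token")
    session.update(completed=True, result=result)

def poll_completion(token: str) -> dict | None:
    """Checked by the server before resuming the paused workflow."""
    session = PENDING.get(token)
    return session["result"] if session and session["completed"] else None
```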
Use cases extend beyond OAuth—payment processing, terms acceptance, CAPTCHA solving, document signing.
Security considerations include session token management (URLs must be single-use or time-limited), state validation, and timeout handling.
How Do You Implement Human-in-the-Loop Patterns with MCP?
Human-in-the-loop patterns use MCP Tasks for long-running approval workflows combined with elicitation for capturing human input. This lets agents pause execution waiting for user decisions. Implement by creating approval tasks that emit notifications requesting decisions, then polling for user responses before continuing. It balances automation with governance.
The approval workflow pattern: Agent reaches a decision point. Pauses execution. Requests human input through a notification. Waits for response. Continues based on the decision.
Implementation architecture needs several components—a task-based approval queue, a notification system (email, Slack, in-app), a response collection mechanism, and a workflow resumption system.
Types of approvals vary by timing. Pre-execution approvals happen before action—reviewing a plan. Mid-execution approvals pause at decision points—choosing between options. Post-execution approvals validate results before committing.
Timeout handling requires default decisions for expired approvals. Set escalation paths if approvers don’t respond. Implement cleanup for expired tasks.
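A minimal sketch of the approval queue with a timeout and default decision, assuming a `notify_approver` callback for whatever notification channel you use:

```python
import asyncio
import uuid

APPROVALS: dict[str, asyncio.Future] = {}  # in-memory approval queue

async def request_approval(summary: str, notify_approver, timeout_s: int = 3600) -> str:
    approval_id = str(uuid.uuid4())
    APPROVALS[approval_id] = asyncio.get_running_loop().create_future()
    await notify_approver(approval_id, summary)  # email / Slack / in-app
    try:
        # The agent pauses here until a human responds or the window expires.
        return await asyncio.wait_for(APPROVALS[approval_id], timeout=timeout_s)
    except asyncio.TimeoutError:
        return "rejected"  # default decision for expired approvals (your policy)
    finally:
        APPROVALS.pop(approval_id, None)

def record_decision(approval_id: str, decision: str) -> None:
    """Called by the response-collection endpoint when the approver replies."""
    future = APPROVALS.get(approval_id)
    if future and not future.done():
        future.set_result(decision)
```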
Audit trail is necessary for compliance. Log every decision with timestamp, approver identity, and justification. Enable compliance reporting from approval logs.
Use cases include high-stakes decisions where automation proposes but humans approve, regulatory compliance, budget approvals, and sensitive data access.
What Are the Key Performance Optimisation Strategies for MCP Workflows?
Performance improves when you identify independent tool calls and execute them concurrently. Aggregate results when all complete. The gain is significant when you’re waiting on multiple external APIs or slow operations.
Optimise MCP performance through parallel tool execution, context minimisation, caching frequently accessed resources, efficient transport selection (stdio vs HTTP), and monitoring with the golden signals framework (latency, traffic, errors, saturation). Profile token consumption to identify optimisation opportunities. Implement retry logic with exponential backoff for transient failures.
Caching strategies operate at multiple levels—resource caching at the server level, result memoisation for expensive computations, and client-side caching with invalidation policies.
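Two small illustrative examples: result memoisation for a deterministic computation, and a simple TTL cache for resource reads (the fetch function and the 60-second window are assumptions).

```python
import time
from functools import lru_cache

# Result memoisation for a pure, expensive computation keyed by a stable input.
@lru_cache(maxsize=256)
def classify_schema(schema_fingerprint: str):
    ...  # placeholder for the expensive, deterministic analysis

# Simple TTL cache for server-side resource reads.
_CACHE: dict[str, tuple[float, bytes]] = {}

def cached_resource(uri: str, fetch, ttl_s: float = 60.0) -> bytes:
    hit = _CACHE.get(uri)
    if hit and time.monotonic() - hit[0] < ttl_s:
        return hit[1]                       # fresh enough, skip the fetch
    data = fetch(uri)                       # your actual resource read
    _CACHE[uri] = (time.monotonic(), data)  # invalidated by the TTL window
    return data
```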
Transport selection affects performance and reliability. stdio offers lower latency for local servers but lacks network resilience. HTTP with Server-Sent Events enables remote servers, load balancing, and better observability. Choose based on deployment topology—stdio for same-machine, HTTP for remote or distributed systems.
Token optimisation ties back to context engineering. Minimise tool descriptions. Use the code execution pattern. Implement context compaction. These reduce per-request token consumption, directly impacting latency and cost.
Monitoring implementation follows a three-layer observability architecture. The transport and protocol layer monitors JSON-RPC 2.0 protocol health. The tool execution layer applies the golden signals: calls per tool, error rates, execution latency, and token usage per invocation. The agentic performance layer measures whether users achieve their goals: task success rate (target 85-95%), turns to completion (ideally 2-5), tool hallucination rate (2-8%), and self-correction rate (target 70-80%). For comprehensive guidance on performance monitoring for MCP servers and scaling MCP infrastructure, see our production deployment guide.
Error handling patterns include retry with exponential backoff for transient failures, circuit breakers to prevent cascade failures, and fallback strategies for degraded functionality.
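For example, a generic retry helper with exponential backoff and jitter; which exceptions actually count as transient depends on your transport, so the tuple here is an assumption.

```python
import asyncio
import random

TRANSIENT = (TimeoutError, ConnectionError)  # assumption; match your transport

async def call_with_retry(fn, *args, attempts: int = 4, base_delay: float = 0.5):
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # let the circuit breaker / fallback layer take over
            # Exponential backoff with jitter: 0.5s, 1s, 2s, ... plus noise.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.2))
```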
FAQ
What’s the difference between MCP Tasks and regular async operations?
MCP Tasks are protocol-level primitives with standardised progress notification, cancellation, and result polling APIs. Regular async operations are implementation-specific. Tasks provide a consistent client-side experience across different MCP servers, no matter who built them.
Can sampling work with any AI model or just Claude?
Sampling requires the MCP client to support the sampling capability and have access to an LLM. While Anthropic designed it, the protocol is model-agnostic. Clients using OpenAI, local models, or other LLMs can implement sampling support.
How do I prevent token cost explosion in multi-agent systems?
Implement context engineering (tool description optimisation, code execution pattern), use sub-agent handoffs to limit context growth, cache common results, and monitor token consumption per agent. Consider whether multi-agent complexity justifies the 15x token increase versus just using a single agent. Sometimes the answer is no.
What happens when an MCP server crashes during a long-running task?
Implement state persistence for tasks (database, file system) so tasks can resume after server restart. Include the task ID in all state, checkpoint your progress regularly, and design for idempotency so partial completions can be safely retried.
How do I debug multi-agent workflows where agents call each other?
Use MCP Inspector for protocol-level debugging, implement comprehensive logging with correlation IDs across agents, emit structured events for distributed tracing, and consider LLM-as-a-Judge evaluation for agentic outcome quality.
Should I use stdio or HTTP transport for production MCP servers?
Choose stdio for local deployments (lower latency) or HTTP with Server-Sent Events for remote servers (better resilience and observability). Base your decision on deployment topology and reliability requirements. If you’re running everything on the same machine, stdio is fine. If you need to distribute your servers or want better monitoring, go HTTP.
Can MCP work with existing workflow orchestrators like Temporal?
Yes. Task creation maps to Temporal activity start, progress notifications update state, and results complete activities. This enables workflow versioning, replay, and scheduling. It’s a solid integration path if you’re already using Temporal.
How do I secure sensitive data in multi-agent context handoffs?
Implement reference passing instead of value passing for sensitive data (pass IDs not content), use MCP resources for access-controlled data retrieval, apply encryption in transit, and audit context contents before handoff to prevent leakage. Don’t pass credit card numbers or passwords directly through context—pass references and let the receiving agent retrieve them through proper access controls.
What’s the difference between MCP and Agent2Agent protocol?
MCP focuses on connecting agents to tools and resources with rich primitives (Tasks, sampling, resources). A2A focuses on agent-to-agent communication with negotiation and coordination. They’re complementary—use MCP for tool integration, A2A for peer agent communication.
How do I measure the success of agentic workflows?
Implement task completion tracking, goal achievement metrics (LLM-as-a-Judge evaluation), user satisfaction scores, token cost per workflow, latency metrics, and error rates. Combine quantitative metrics with qualitative review of outputs. Numbers tell you part of the story, but you need to actually look at what the agents produced.
When should I avoid using multi-agent architectures?
Avoid multi-agent when workflows are simple enough for a single agent, token budgets are constrained, latency requirements are tight, or debugging complexity outweighs benefits. Multi-agent adds 15x token overhead and coordination complexity—use it only when specialisation genuinely improves outcomes. Don’t build a multi-agent system just because it sounds impressive.
How do I handle schema version compatibility as my MCP servers evolve?
Implement semantic versioning for tool schemas, use capability negotiation to detect client support, maintain backwards compatibility for N-1 versions, provide deprecation warnings, and use feature flags for gradual rollouts. This way you can evolve your servers without breaking existing clients.