ChatGPT Health is handling around 40 million health queries a day. Claude for Healthcare shipped to production in January 2026. Mayo Clinic and Microsoft just announced they are training a frontier model on 54 million de-identified patient records. Clinical AI is not a pilot program anymore. Yet the governance infrastructure meant to catch degradation, assign accountability, and validate these systems before they reach patients remains voluntary, institution-dependent, and largely unbuilt. That is a structural gap the broader clinical AI governance architecture must bridge.
Financial services went through this already. In 2011 the Federal Reserve and the Office of the Comptroller of the Currency issued SR 11-7, a supervisory guidance that required US banks to maintain a complete model inventory, run independent validation, and continuously monitor every model in production. It assigned named roles: model owner, model validator, model risk committee. It was not optional. It was not aspirational. It was the architecture that emerged after the 2008 crisis made clear what happens when models operate without governance.
Healthcare is assembling its own frameworks in real time, through voluntary efforts like HAIRA and CHAI, while financial services has 15 years of mandatory infrastructure sitting one industry away. Healthcare can focus on what already works: practices that transfer without waiting for a healthcare-specific regulatory mandate.
How Does AI Governance in Healthcare Compare to Model Risk Management in Financial Services?
SR 11-7 rests on three pillars: a mandatory model inventory, independent validation, and ongoing monitoring. Every model a bank deploys gets validated before production and re-examined on a regular cycle. Goldman Sachs ran Claude through exactly this process when it embedded Anthropic engineers over six months to co-develop autonomous agents for accounting and compliance. The governance was not an afterthought. It was the precondition for deployment.
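To make the three pillars concrete, here is a minimal sketch of what a single model inventory entry might look like, with a check for the gaps described above. The class name, field names, the 365-day review cycle, and the example models are all assumptions for illustration; none of them come from SR 11-7 itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ModelRecord:
    """One entry in a hypothetical model inventory."""
    name: str
    owner: str                       # named model owner
    validator: str                   # named independent validator
    last_validated: Optional[date]   # None means never validated
    revalidation_days: int = 365     # assumed review cycle, for illustration only
    monitored: bool = False          # is ongoing monitoring in place?

    def governance_gaps(self, today: date) -> List[str]:
        """List which of the three pillars this model currently fails."""
        gaps = []
        if self.owner == self.validator:
            gaps.append("validator is not independent of owner")
        if self.last_validated is None:
            gaps.append("never validated")
        elif today - self.last_validated > timedelta(days=self.revalidation_days):
            gaps.append("validation overdue")
        if not self.monitored:
            gaps.append("no ongoing monitoring")
        return gaps


# Hypothetical inventory entries.
inventory = [
    ModelRecord("sepsis-risk-v2", "ICU clinical lead", "AI governance office",
                date(2025, 1, 10), monitored=True),
    ModelRecord("discharge-planner", "Surgery dept", "Surgery dept", None),
]

today = date(2026, 6, 1)
for record in inventory:
    print(record.name, record.governance_gaps(today))
```

The point of the sketch is that an inventory is only useful if every entry carries a named owner, a named validator, and a date you can check against. A spreadsheet with the same columns would serve the same purpose.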
Your health system operates without a comparable regulatory mandate. The Coalition for Health AI and The Joint Commission published joint guidance in September 2025, the first formal framework from a US healthcare accreditation body. The HAIRA maturity model, published in Nature in 2026, spans five levels across seven governance domains. But both are voluntary. You decide your own depth, and the result is wide variation in safety posture: 63% of healthcare organisations have no AI-governance policies in place, and shadow AI is present in 40% of hospitals.
The difference comes down to regulatory architecture, and that architecture transfers. Financial services governance is mandatory and model-level. Clinical AI governance is voluntary and institution-level, creating accountability gaps that emerging liability frameworks for clinical AI are only beginning to address. That produces different safety outcomes.
You can adopt several practices without waiting for a healthcare-specific regulatory mandate. Champion-challenger testing is the most directly transferable (more on that below). The Horizon Scan 001 report on AI governance in regulated industries provides the cross-industry analysis.
On 17 April 2026 the Federal Reserve issued SR 26-2, superseding SR 11-7 and modernising model risk management. It shifts toward materiality-based validation and explicitly places generative and agentic AI outside its scope, signalling future regulatory treatment elsewhere — a gap the FDA’s static-device regulatory model is not yet equipped to fill. The architecture evolves. You can borrow the architecture without waiting for the regulatory trigger.
Are Agentic AI Systems Worth the Cost for Clinical Tasks?
Agentic AI means multi-agent architectures where LLMs are augmented with tool use: database queries, calculation, guideline retrieval, structured workflows. The pitch is that these systems address the “last mile” problem of clinical AI. Here is what that problem looks like in practice: an LLM might output “patient appears to have early sepsis,” but your EHR needs a structured alert with a risk score, relevant vitals, and a suggested protocol. Raw LLM output is unstructured text. Clinical decision support needs structured, coded data. That gap is the last mile.
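As a rough illustration of the last mile, the sketch below shows the kind of structured record an EHR integration might expect in place of free text. The SepsisAlert class, its field names, and the example values are hypothetical; a real integration would follow your institution's interface standards (for example HL7 FHIR) and coded terminologies rather than this ad hoc JSON.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SepsisAlert:
    """Hypothetical structured alert; not a real EHR or FHIR schema."""
    patient_id: str
    risk_score: float                            # expected range 0.0 to 1.0
    vitals: dict = field(default_factory=dict)   # relevant vitals at alert time
    suggested_protocol: str = ""

    def __post_init__(self):
        # Reject values a downstream system could not interpret.
        if not 0.0 <= self.risk_score <= 1.0:
            raise ValueError("risk_score must be between 0 and 1")


def to_ehr_payload(alert: SepsisAlert) -> str:
    """Serialise the alert to JSON with stable key order."""
    return json.dumps(asdict(alert), sort_keys=True)


# Free text such as "patient appears to have early sepsis" carries none of
# the fields below; something has to produce and validate them.
alert = SepsisAlert(
    patient_id="example-001",
    risk_score=0.72,
    vitals={"heart_rate": 118, "resp_rate": 24, "temp_c": 38.6},
    suggested_protocol="sepsis-bundle-review",
)
payload = to_ehr_payload(alert)
print(payload)
```

Whatever fills in those fields, whether an agent, a rules engine, or a specialised pipeline, is the component your governance process needs to validate.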
The evidence says agentic AI closes some of that gap, and the improvement is real but small. A Nature benchmarking study found that OpenManus, using a Llama-4 backbone, reached 60.3% on the AgentClinic MedQA benchmark at more than six times the token usage of the baseline model. Across agentic systems, absolute accuracy gains ranged from 7 to 8.9 percentage points. On multimodal tasks, accuracy remained low at 15.5%. Response times more than doubled.
Token consumption tells the cost story in numbers anyone can understand: OpenManus used 92,954 tokens per scenario, the MedAssist variant used 168,957 tokens, while the Llama-4 baseline used 14,576. That is roughly 6.4 times the baseline’s token usage for OpenManus and 11.6 times for the MedAssist variant, in exchange for a single-digit percentage-point improvement.
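The arithmetic behind that comparison is simple enough to check directly. The token counts below are the figures quoted above; the variable names are just for illustration.

```python
# Tokens per scenario, as quoted in the text.
BASELINE_TOKENS = 14_576
agentic_tokens = {
    "OpenManus": 92_954,
    "MedAssist variant": 168_957,
}

# Ratio of each agentic system's token usage to the Llama-4 baseline.
ratios = {name: tokens / BASELINE_TOKENS for name, tokens in agentic_tokens.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x baseline tokens")
# OpenManus: 6.4x baseline tokens
# MedAssist variant: 11.6x baseline tokens
```

Running the same calculation against your own vendor quotes, with your own per-token pricing, turns the evidence question into a budget line.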
There is an alternative. The four-step clinical NLP architecture, named entity recognition through to relation extraction, normalisation, and inference, is a specialised pipeline that for specific tasks outperforms general-purpose LLMs at lower computational cost. Healthcare NLP achieved 96% F1-score for PHI detection, compared to GPT-4o’s 79%. GPT-4o completely missed 14.6% of entities. The specialised system missed 0.9%.
Agentic systems also introduce failure modes that single-model evaluations do not capture: tool-use errors, inter-agent coordination failures, and emergent behaviours that arise from component interaction rather than any individual model’s output. These compound the benchmark-to-bedside accuracy gap already documented in clinical AI deployment. Gray Swan AI’s benchmark of 22 frontier AI agents observed nearly two million prompt injection attempts, with over 60,000 successful.
The governance question is what level of evidence you should require before deploying a more complex, more expensive system for a marginal gain. Open-source frameworks like OpenManus and Manus lower the barrier to clinical experimentation, making the question pressing because deployment can happen outside institutional controls. Your governance framework should require evidence of superiority before approving architectural complexity. That is the same standard the accuracy problem driving governance reform demands of clinical AI evaluation as a whole.
The evidence-standard question gets sharper when you consider the scale of what is already under construction. Which brings us to Mayo Clinic and Microsoft.
What Does the Mayo Clinic–Microsoft Partnership Mean for Clinical AI Governance?
On 2 June 2026 Mayo Clinic and Microsoft announced a collaboration to develop and deploy a frontier AI model for healthcare. The model will be trained on Mayo’s de-identified clinical health data and owned by the clinic. Microsoft described it as “building something healthcare has never seen before.” The scale alone makes it the largest known clinical AI data governance experiment.
The partnership tests every governance dimension simultaneously. First, data governance at scale: 54 million de-identified patient records. De-identification is a risk mitigation, not a guarantee. Re-identification risk increases with dataset size and linkage potential. Researchers have re-identified individuals from supposedly anonymised health datasets by cross-referencing with public records. What consent model applies to retrospectively used records? Can patients opt out? How do the Common Rule, HIPAA, and state privacy laws interact when AI model training is the use case?
Second, the regulatory question. The FDA released updated CDS guidance in January 2026 that relaxed key medical device requirements. Many generative AI tools that would have required FDA sign-off can now reach clinics without FDA vetting. What we know: the FDA’s CDS guidance carves out space for clinical decision support software that is not device-level. What is unresolved: whether a frontier model trained on 54 million population-level records falls inside or outside that carve-out. Does the current regulatory framework accommodate frontier-scale models trained on population-level clinical data?
The production context sharpens the question. ChatGPT Health, which tailors responses to users’ uploaded medical records, and Claude for Healthcare are already deployed. The governance frameworks described across this cluster, institutional committees, voluntary frameworks, episodic validation, are being asked to regulate systems that are not prospective. They are live.
The partnership is a stress test. It asks whether voluntary institutional governance can scale to 54 million records, and whether the architecture healthcare has built so far can catch what a frontier model might get wrong.
How Does the Three-Lines-of-Defence Model Apply to Healthcare AI Accountability?
The three-lines-of-defence model separates frontline business units that own AI risk from independent oversight functions and from internal audit. In financial services, the first line is the business units that operate models day to day. The second line is independent risk oversight: validation, monitoring, and challenge. The third line is internal audit, providing assurance that lines one and two are functioning. Every line has a named owner. McKinsey’s 2026 AI Trust Maturity Survey found that only about a third of organisations reach governance maturity level three or higher.
Healthcare has a diffuse ownership problem. When an AI-influenced clinical decision causes harm, who is accountable? The clinician who used it? The IT department that deployed it? The vendor who built it? Without named owners, each team points elsewhere. The Horizon Scan 001 report describes it bluntly: product owns the model, engineering owns the infrastructure, compliance owns the policy, and when a decision is challenged the accountability dissolves.
Mapping the three lines to your health system is straightforward. Clinical departments are the first line: they use the AI, they own the risk. A centralised AI governance function is the second line: independent validation, champion-challenger testing, drift monitoring. Your board risk committee is the third line: assurance that the governance system works. For smaller health systems that cannot staff a dedicated second line, the HAIRA maturity model provides a tiered readiness assessment. The model scales down as well as up. Shared services and regional collaboratives are a practical starting point.
What Is Champion-Challenger Testing and How Would It Work in a Hospital Setting?
Champion-challenger testing is a core SR 11-7 technique. You run a candidate model (the challenger) in parallel with your production model (the champion) on live data, compare performance, and switch only when the challenger demonstrably outperforms. It is the most directly transferable governance practice clinical AI has not adopted, and it requires only monitoring infrastructure, not regulatory change.
Healthcare already has the building blocks. Shadow deployments, sometimes called silent evaluation, are used across the clinical AI literature as a pre-deployment risk evaluation method. The infrastructure and the mindset already exist. What is missing is systematisation: turning an ad hoc evaluation step into a governance decision point with defined comparison metrics and a formal sign-off.
Here is what it looks like in practice. Say your hospital runs a sepsis prediction model. You develop a new version and deploy it in silent mode alongside the incumbent for a defined evaluation period, three months, for example. You compare alert accuracy, false-positive rates, and clinician override patterns. Only if the challenger outperforms across your predefined metrics do you switch. No regulatory approval is required. Just operational discipline.
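The switching rule in that example can be written down as a few lines of code, which is useful mainly because it forces you to define the metrics and margin before the evaluation starts. This is a sketch under assumptions: the metric names, the example values, the 0.01 margin, and the premise that a lower override rate is better are all illustrative choices, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    """Metrics collected over the silent evaluation period (hypothetical names)."""
    alert_accuracy: float        # higher is better
    false_positive_rate: float   # lower is better
    override_rate: float         # share of alerts overridden; lower assumed better


def should_switch(champion: EvalMetrics, challenger: EvalMetrics,
                  margin: float = 0.01) -> bool:
    """Switch only if the challenger wins on every predefined metric by the margin."""
    return (
        challenger.alert_accuracy - champion.alert_accuracy > margin
        and champion.false_positive_rate - challenger.false_positive_rate > margin
        and champion.override_rate - challenger.override_rate > margin
    )


# Made-up numbers for an incumbent and two candidate sepsis models.
champion = EvalMetrics(alert_accuracy=0.81, false_positive_rate=0.22, override_rate=0.35)
better = EvalMetrics(alert_accuracy=0.86, false_positive_rate=0.17, override_rate=0.28)
mixed = EvalMetrics(alert_accuracy=0.88, false_positive_rate=0.25, override_rate=0.30)

print(should_switch(champion, better))  # True: wins on all three metrics
print(should_switch(champion, mixed))   # False: more accurate, but more false positives
```

Requiring a win on every metric is deliberately conservative. A governance committee could just as reasonably weight the metrics, but the weighting should be agreed before the data comes in.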
The limitation is that champion-challenger testing assumes model stability during the comparison period. Agentic systems that recalibrate autonomously challenge that assumption, which is why shorter evaluation windows or continuous comparison may be necessary. The SR 26-2 modernisation moves toward materiality-based validation rather than default annual cycles, a direction worth following.
What Role Does Continuous Monitoring Play in Clinical AI Safety?
Continuous monitoring catches degradation that point-in-time validation misses. The Erasmus MC study by van der Vorst and colleagues, published in BMJ Digital Health, proved this: an AI surgical-discharge tool was monitored across 6,800 admissions over 33 months. The AUC remained stable at 0.82. The performance metric looked fine. But input variables had silently drifted. New data-pipeline errors had appeared. Average respiratory rate had increased at one hospital. The model was generating predictions for patient profiles it had never seen during training.
Three types of drift can degrade your models. Data drift: the distribution of inputs shifts. Accuracy drift: the outputs degrade. Concept drift: the relationship between inputs and outputs changes. Each requires different monitoring, and the Erasmus MC study showed that tracking performance metrics alone is not enough. You need statistical drift dashboards that flag distributional shifts long before accuracy drops.
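One simple way to flag data drift on a single input is the population stability index (PSI), a measure commonly used in credit-risk model monitoring. It is offered here as one possible technique, not as the method the Erasmus MC team used. The sketch below uses only the standard library; the bin count, the small floor for empty bins, and the simulated respiratory-rate data are assumptions for illustration.

```python
import math
import random


def psi(reference, current, bins=10):
    """Population stability index between two samples of one numeric input."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-4) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Simulated respiratory rates: training-time data, a stable month, a shifted month.
rng = random.Random(0)
training = [rng.gauss(16, 3) for _ in range(5000)]
stable = [rng.gauss(16, 3) for _ in range(5000)]
shifted = [rng.gauss(19, 3) for _ in range(5000)]

print("stable: ", round(psi(training, stable), 3))
print("shifted:", round(psi(training, shifted), 3))
```

The shifted sample scores far higher than the stable one even though nothing here measures model accuracy, which is the point: input drift is visible before any outcome data arrives. A univariate check like this will not catch shifts in the relationships between variables, so it complements multivariate monitoring rather than replacing it.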
The FDA’s Total Product Lifecycle framework already requires post-market surveillance for AI-enabled medical devices. But most health system IT functions are not structured to deliver it. The gap is operational, not regulatory. You can implement combined univariate and multivariate drift dashboards with open-source tooling. Predefined thresholds keep the response proportionate: minor drift triggers increased monitoring, moderate drift schedules retraining, severe drift pauses clinical use pending investigation.
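The tiered response can be encoded so that the escalation decision is made once, in advance, and not renegotiated each time an alert fires. The sketch below assumes a single generic drift score where higher means more drift; the cut-off values and the names are illustrative and do not come from any standard.

```python
from enum import Enum


class Action(Enum):
    NONE = "continue routine monitoring"
    INCREASE_MONITORING = "increase monitoring frequency"
    SCHEDULE_RETRAINING = "schedule retraining"
    PAUSE = "pause clinical use pending investigation"


# Illustrative cut-offs, checked from most to least severe.
THRESHOLDS = [
    (0.30, Action.PAUSE),                # severe drift
    (0.20, Action.SCHEDULE_RETRAINING),  # moderate drift
    (0.10, Action.INCREASE_MONITORING),  # minor drift
]


def drift_response(score: float) -> Action:
    """Map a drift score to the predefined governance response."""
    for cutoff, action in THRESHOLDS:
        if score >= cutoff:
            return action
    return Action.NONE


for score in (0.05, 0.12, 0.22, 0.50):
    print(score, "->", drift_response(score).value)
```

In practice the thresholds would be set per model and per input by the second-line function, and each action would map to a named owner who is expected to carry it out.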
Continuous monitoring is the prerequisite for everything else. Champion-challenger testing, tiered validation cadences, and agentic AI safety controls all depend on monitoring infrastructure. Financial services built this infrastructure because regulation required it. You can build it because the evidence says you need it.
Conclusion
The architecture clinical AI governance needs already exists, and the regulatory architecture and accountability systems to institutionalise it are the subject of ongoing governance reform. Financial services built it over 15 years in the wake of the 2008 crisis and operationalised it through specific practices you can adopt today.
The three-lines-of-defence model addresses the diffuse ownership problem by naming accountable owners at every level. Champion-challenger testing gives you a decision framework for model switching that requires only operational discipline. Continuous monitoring is your safety net, the mechanism that can catch degradation before patient harm occurs. Together they form a governance stack any health system can begin implementing.
Acknowledge what does not transfer. SR 11-7’s regulatory mandate does not extend to healthcare, and a voluntary adaptation of financial services practices is not the same as mandatory governance. But it is better than the current patchwork. And the highest-priority investment is clear: continuous monitoring infrastructure, because it enables everything else, and the Erasmus MC evidence proves that without it, degradation is invisible.
Financial services built mandatory governance infrastructure after the 2008 crisis demonstrated what happens without it. Healthcare can adopt that infrastructure before its own catastrophe forces the same outcome. ChatGPT Health, Claude for Healthcare, and the Mayo-Microsoft partnership are not waiting for governance to catch up. The question is whether you will adopt what already works, or learn the same lessons the hard way.
Frequently Asked Questions
What is SR 11-7 and why does it matter for healthcare?
SR 11-7 is the Federal Reserve’s 2011 supervisory guidance that requires US banks to maintain a comprehensive model risk management framework for every model in production. It mandates a complete model inventory, independent validation, ongoing monitoring, and defined accountability roles including model owner, validator, and risk committee. It matters for healthcare because it demonstrates that mandatory, model-level governance is operationally feasible, and its architecture transfers without waiting for a healthcare-specific regulatory mandate.
Why should healthcare learn from financial services when banks caused the 2008 crisis?
The 2008 crisis is precisely why SR 11-7 exists. Financial services governance was not built on superior foresight; mandatory model risk management emerged from catastrophic failure. The lesson for healthcare is not that banking got it right but that governance infrastructure was built after disaster, at enormous cost. Healthcare has the opportunity to build governance before, rather than after, a patient-harm crisis forces regulatory intervention.
How much does champion-challenger testing cost to implement?
Champion-challenger testing is among the lowest-cost governance practices available. It requires monitoring dashboards, defined comparison metrics, and a governance decision point, not new regulatory approval or expensive software. Most hospitals already run shadow deployments of clinical AI tools before full adoption. The incremental cost is primarily operational: staff time to define metrics, review comparison reports, and make switching decisions. No capital investment is required.
What happens when continuous monitoring detects model drift?
When drift is detected, the response depends on severity. Minor data drift may trigger increased monitoring frequency. Significant accuracy drift should pause the model’s clinical use pending investigation and recalibration. The Erasmus MC study proved that input variables can shift while aggregate performance appears stable, so drift detection must trigger a structured governance response, not just an alert. Without a pre-defined escalation pathway, monitoring adds noise without improving safety.
Is de-identified patient data truly anonymous?
No. De-identification reduces but does not eliminate re-identification risk, particularly at scale. The Mayo Clinic partnership involves 54 million records, and re-identification risk increases with dataset size and linkage potential. Researchers have re-identified individuals from supposedly anonymised health datasets by cross-referencing with public records. De-identification is a risk mitigation, not a guarantee, and governance frameworks must account for residual re-identification risk.
Do smaller hospitals need the same AI governance as large academic centres?
The governance principles are the same, but implementation can be tiered. Smaller health systems may not staff a dedicated second-line validation function, but they can share services through regional collaboratives, adopt phased timelines, and prioritise the highest-value practices first, starting with champion-challenger testing and continuous monitoring. The question is not whether to govern clinical AI but how to govern it proportionately to the organisation’s AI deployment footprint.
Can healthcare organisations wait for regulation before implementing AI governance?
No. Production clinical AI systems, including ChatGPT Health and Claude for Healthcare, are already deployed while regulatory frameworks remain under construction. Waiting for regulation means operating AI without safety controls in the interim. The Mayo Clinic partnership, operating at unprecedented data scale, demonstrates that clinical AI deployment is accelerating faster than regulatory development. Voluntary governance implemented now is better than mandatory governance imposed after patient harm.
What is the difference between model validation and model monitoring?
Model validation is a point-in-time assessment confirming a model meets its intended performance specifications before deployment. Model monitoring is the continuous observation of a model’s behaviour in production to detect degradation, drift, and unexpected outputs as they occur. Validation confirms a model is safe before use; monitoring confirms it remains safe during use. Both are necessary because validation alone cannot predict how a model will perform across changing clinical populations over months or years.
How do AI agents fail differently from single-model AI systems?
AI agents introduce failure modes that single-model systems do not exhibit. These include tool-use errors, where an agent queries the wrong database or misinterprets retrieved information; inter-agent coordination failures, where multiple agents produce conflicting recommendations; and emergent behaviours, where the combined system behaves in ways no single component was tested for. These failures are harder to detect because they arise from component interaction, not from any individual model’s output.
What can clinicians do today to improve AI governance in their hospital?
Clinicians can start by asking three questions about any AI system used in their practice: who is accountable if it produces a harmful recommendation, how is its ongoing performance monitored, and what evidence supports its superiority over existing decision support tools. These questions require no regulatory change or budget to ask. They surface governance gaps that clinical teams may not have identified and create institutional pressure for accountability structures.
How does the EU AI Act affect clinical AI deployed outside Europe?
The EU AI Act classifies many clinical AI systems as high-risk, imposing mandatory requirements for risk management, data governance, transparency, and human oversight. Non-European health systems using AI tools from EU-based vendors, or operating European clinical trial sites, may need to comply. More broadly, the Act is shaping global regulatory development, with its risk-classification framework being studied by regulators in multiple jurisdictions as a model for clinical AI governance.