TL;DR: In March 2026, NIST published the first major government framework for post-deployment AI monitoring. It identifies six monitoring categories, two orientations, and a core problem most organizations haven’t solved: monitoring generates signals, but those signals only drive governance if the program is watching broadly enough to catch the ones that matter. Download our AI Monitoring Guide for more info.
Post-deployment AI monitoring is an explicit requirement under the EU AI Act (Article 72), ISO 42001, the NIST AI Risk Management Framework, and OMB M-25-21. The regulatory consensus is clear: deploying an AI system isn’t a one-time event. It requires ongoing oversight.
In March 2026, NIST published AI 800-4, “Challenges to the Monitoring of Deployed AI Systems.” Drawing on three workshops with over 250 practitioners and a review of 87 published papers, it’s the first major government report to systematically catalog what post-deployment AI monitoring requires and where the field falls short. The report identifies six categories of monitoring and surfaces dozens of open challenges.
The core argument is that monitoring generates signals, signals become governance triggers, and triggers initiate governance activities. That pipeline is the operational backbone of ongoing AI oversight. It only works if governance teams are monitoring broadly enough to catch the signals that matter. Most organizations aren’t.
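That pipeline can be sketched as a minimal data model. This is an illustrative sketch, not anything specified in the NIST report: the names `Signal`, `Trigger`, and `evaluate`, and the rule-per-category shape, are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Signal:
    """A raw observation from any monitoring source (hypothetical shape)."""
    source: str    # e.g. "drift_detector", "regulatory_feed"
    category: str  # one of the six NIST AI 800-4 categories
    payload: dict

@dataclass
class Trigger:
    """A signal that crossed a governance-relevant threshold."""
    signal: Signal
    action: str    # e.g. "reassess_risk", "notify_legal"

def evaluate(signal: Signal,
             rules: dict[str, Callable[[Signal], Optional[str]]]) -> Optional[Trigger]:
    """Map a signal to a governance trigger, if any rule fires."""
    rule = rules.get(signal.category)
    if rule is None:
        return None  # no rule for this category: an uncovered monitoring gap
    action = rule(signal)
    return Trigger(signal, action) if action else None
```

The point of the sketch is the failure mode: a signal in a category with no rule silently evaluates to nothing, which is exactly what "not watching broadly enough" looks like in practice.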
What “AI Monitoring” Means
AI monitoring means different things to different teams. To a machine learning engineer, it means tracking model drift and inference latency. To a CISO, it means watching for adversarial attacks and unauthorized access. To a lawyer, it means staying ahead of regulatory changes that could shift the legal exposure of a deployed system. To a business leader, it means knowing whether an AI investment is still delivering value.
All of these are valid, and all of them are necessary. The problem is that most organizations treat AI monitoring as a single technical discipline when it’s actually a cross-functional responsibility that spans engineering, security, legal, and business teams. Teams invest in the monitoring they know best and leave gaps in the areas they don’t.
NIST AI 800-4 proposes six categories of post-deployment monitoring:

- Functionality: is the system still working as intended?
- Operational: is it maintaining consistent service?
- Human Factors: is it transparent to users and producing quality interactions?
- Security: is it secure against attacks and misuse?
- Compliance: does it adhere to relevant regulations?
- Large-Scale Impacts: is it promoting human flourishing or causing harm at scale?

Most organizations have reasonable coverage on the first two. Categories three through six are where the gaps are. Human Factors is, per NIST’s own data, the most underserved: workshop practitioners raised human factors issues far more frequently than the published literature addresses them. That gap between what teams experience on the ground and what the research community has actually solved is a structural blind spot.
Model-Level vs. Use Case-Level Monitoring
The AI monitoring conversation typically defaults to model-level concerns. Is this endpoint performing within spec? Has output distribution shifted? Those are important questions, but they aren’t the questions governance teams need answered.
Organizations govern use cases. A single use case might involve multiple model endpoints, a retrieval-augmented generation pipeline, a guardrail layer, a human escalation path, and a vendor SaaS interface. Model monitoring reports on each component in isolation. Use case monitoring reports on the whole system and whether it’s meeting its goals. A model can be performing perfectly on its own metrics while the use case it supports is failing its users.
The two levels connect through the monitoring pipeline. Drift detected on a model endpoint is a data point. Whether that drift is significant enough to change the risk profile of the use case, and what to do about it, is a governance question. That connection requires context established before monitoring begins, including documented goals, risk assessments, and defined thresholds. Organizations that skip this work end up with dashboards nobody can interpret and alerts nobody knows how to act on.
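One way to make that pre-established context concrete is a documented threshold table per use case, so a model-level data point can be checked against governance-relevant limits. The use case name, metric names, and threshold values below are entirely hypothetical:

```python
# Hypothetical thresholds, agreed and documented before deployment.
USE_CASE_THRESHOLDS = {
    "claims_triage": {"max_psi_drift": 0.2, "min_accuracy": 0.9},
}

def drift_changes_risk_profile(use_case: str, psi_drift: float, accuracy: float) -> bool:
    """A model-level measurement only becomes a governance question
    when judged against thresholds defined before monitoring began."""
    t = USE_CASE_THRESHOLDS[use_case]
    return psi_drift > t["max_psi_drift"] or accuracy < t["min_accuracy"]
```

Without the table, the same drift number is just a dashboard curiosity; with it, crossing a line is an unambiguous trigger someone can act on.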
Internal vs. External Monitoring
Regardless of monitoring level, activities fall into two orientations based on where the signal originates.
Internal monitoring looks inward at the system itself. Data sources include system logs, inference logs, input/output data streams, and software performance metrics. It provides the strongest coverage for Functionality and Operational monitoring. For most organizations, it’s the more familiar discipline. The tooling is mature, the workflows are established, and the teams responsible have been doing analogous work in DevOps and cybersecurity for years.
External monitoring looks outward from the system. Data sources include AI incident reports, structured user feedback, academic research, regulatory updates, and vendor product changelogs. It’s strongest for Compliance, Large-Scale Impacts, and Human Factors, and it’s also where most organizations have the biggest gap. It’s less familiar, less tooled, and often nobody’s explicit responsibility. But it’s where many of the highest-impact governance triggers originate, including new regulations, vendor deprecations, and published incidents affecting models an organization depends on.
There’s also a practical advantage to external monitoring that internal approaches can’t match: it applies equally across all deployment patterns. Internal monitoring is constrained by how AI is deployed. A self-hosted model offers full visibility while a vendor SaaS tool may offer almost none. External monitoring works regardless of whether AI runs on an organization’s own infrastructure or behind a vendor’s API. For organizations managing a diverse AI portfolio, that consistency matters a lot.
The Agent Dimension
AI agents complicate both orientations in ways existing tooling hasn’t caught up with.
Internally, agent monitoring is harder because agents chain multiple model calls, invoke tools, and operate across distributed infrastructure. There’s no universal standard for logging agent actions or identifying agent-initiated requests. Protocols like MCP and Google’s A2A are still being standardized, and monitoring frameworks haven’t adapted yet.
Agents also blur the internal/external boundary in a meaningful way. When an agent books a meeting, submits a form, or delegates to another agent, it’s taking actions with real-world consequences beyond model inference. Monitoring the model call is internal, but monitoring whether the agent’s action was appropriate, authorized, and correct is a governance question. In multi-agent systems where Agent A delegates to Agent B, which then calls Agent C’s API, there may not be a clear right answer for what the optimal chain of actions should have been.
Shadow agents are the emerging blind spot. Browser-based agent tools, personal API keys connected to automation platforms, and autonomous workflows running outside IT’s visibility create monitoring gaps that purely internal approaches can’t address. Shadow agents are becoming the next generation of shadow AI.

Where to Go From Here
NIST AI 800-4 is a diagnostic framework, not an implementation roadmap. It identifies what should be monitored and where the field’s collective gaps are, but it doesn’t prescribe how to build the program.
That work starts with an honest audit of which of the six categories a current monitoring program actually covers, where the signal pipeline breaks down, and which external monitoring activities have no clear owner. Most organizations will find more gaps than expected, particularly in the categories most likely to generate the governance triggers that matter.
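That audit can start as something as simple as a category-to-owner mapping. The team names below are hypothetical; the category list is the one from AI 800-4:

```python
NIST_800_4_CATEGORIES = [
    "Functionality", "Operational", "Human Factors",
    "Security", "Compliance", "Large-Scale Impacts",
]

def unowned_categories(program: dict[str, str]) -> list[str]:
    """Return the NIST AI 800-4 categories with no named owner --
    the gaps an honest audit is meant to surface."""
    return [c for c in NIST_800_4_CATEGORIES if not program.get(c)]
```

Running it against a typical program, where only engineering-adjacent monitoring has an owner, makes the pattern described above visible immediately.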
The organizations that get AI monitoring right won’t be the ones with the most advanced MLOps pipelines. They’ll be the ones that connect technical signals to governance decisions, track the external environment as carefully as they track their own systems, and build the organizational discipline to act on what they find.
The Trustible AI Monitoring white paper maps each NIST category to specific monitoring activities, governance triggers, and regulatory requirements across the EU AI Act, ISO 42001, NIST AI RMF, and OMB M-25-21. It covers four deployment patterns, 12 concrete monitoring user stories, and 7 recommendations for governance teams building or auditing an AI monitoring program.