If you're building AI that will go through FDA review, your audit trail architecture isn't optional. It's foundational. I've seen companies rebuild their entire logging infrastructure months before submission because they didn't get this right from the start.
Here's what FDA actually wants to see, and how to build for it.
Why Audit Trails Matter for FDA AI
FDA's interest in AI/ML audit trails comes down to one question: did your AI behave as intended, and can you prove it?
This means capturing:
- What the model was trained on
- What version produced what output
- What inputs led to what decisions
- How performance changed over time
- What happened when something went wrong
The FDA's guidance on AI/ML-based Software as a Medical Device (SaMD) is clear: you need "transparency and real-world performance monitoring." Audit trails are how you demonstrate both.
The Four Components of an FDA-Ready Audit Trail
1. Training Data Provenance
You need to prove what data your model was trained on:
- Data source documentation: where did the training data come from?
- Data version control: what specific dataset was used for each model version?
- Data quality records: what cleaning, filtering, or preprocessing was applied?
- Ground truth methodology: how was the training data labeled, and by whom?
FDA wants to know that your training data is representative and that you can reproduce your training process. If you can't say "Model v2.3 was trained on Dataset v1.7 with these specific characteristics," you have a problem.
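One lightweight way to make that statement provable is a dataset manifest: a record of the source, the preprocessing steps, and a content hash of the exact files used. Here's a minimal sketch; the function name, field names, and the sample data are illustrative, not a standard:

```python
import hashlib
import json

def build_dataset_manifest(name, version, source, files, preprocessing):
    """Record what a model version was trained on, with a content hash
    so the exact dataset can be verified byte-for-byte later."""
    hasher = hashlib.sha256()
    for path, content in sorted(files.items()):  # sort for a stable hash
        hasher.update(path.encode())
        hasher.update(content)
    return {
        "dataset": name,
        "version": version,
        "source": source,                # where the data came from
        "preprocessing": preprocessing,  # ordered cleaning/filter steps
        "content_sha256": hasher.hexdigest(),
        "file_count": len(files),
    }

manifest = build_dataset_manifest(
    name="cardiac-training",
    version="1.7",
    source="site-A EHR export, 2024-Q3",
    files={"patients.csv": b"id,age\n1,54\n", "labels.csv": b"id,risk\n1,1\n"},
    preprocessing=["dedupe by patient id", "drop records missing age"],
)
print(json.dumps(manifest, indent=2))
```

The point of hashing file contents (not just names) is that "Model v2.3 was trained on Dataset v1.7" becomes a verifiable claim rather than a label.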
2. Model Version Control
Every deployed model needs to be versioned and traceable:
- Model versioning: unique identifier for each deployed model
- Deployment history: when was each version deployed and decommissioned?
- Configuration management: what hyperparameters and settings were used?
- Validation records: what testing was done before deployment?
When FDA asks "what model was running on March 15th?", you need to answer precisely.
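A deployment registry that answers exactly that question can be very simple. This is a sketch, assuming an in-memory store; in practice the records would live in your audit database:

```python
from datetime import date

class ModelRegistry:
    """Append-only record of deployments; answers 'what was running on day X?'"""

    def __init__(self):
        # each entry: (version, deployed_on, decommissioned_on, config, validation_ref)
        self._deployments = []

    def deploy(self, version, deployed_on, config, validation_ref):
        # Close out the previous version, then record the new one.
        if self._deployments:
            v, start, _, cfg, ref = self._deployments[-1]
            self._deployments[-1] = (v, start, deployed_on, cfg, ref)
        self._deployments.append((version, deployed_on, None, config, validation_ref))

    def version_on(self, day):
        for version, start, end, _, _ in self._deployments:
            if start <= day and (end is None or day < end):
                return version
        return None  # nothing was deployed on that date

registry = ModelRegistry()
registry.deploy("v2.2.0", date(2026, 1, 10), {"threshold": 0.5}, "VAL-0042")
registry.deploy("v2.3.1", date(2026, 3, 1), {"threshold": 0.45}, "VAL-0057")
print(registry.version_on(date(2026, 3, 15)))  # the version live on March 15th
```

Note that each deployment carries its configuration and a validation record reference, so the registry answers all four bullets above from one table.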
3. Inference Logging
Every prediction your model makes in production should be logged:
- Input capture: what data was fed to the model?
- Output capture: what did the model predict?
- Timestamp: when did this prediction happen?
- Model version: which model version made this prediction?
- User context: who requested the prediction (for access control audit)?
This is where many teams underinvest. Logging model outputs is easy. Logging inputs in a way that preserves context and enables reproduction is harder, but it's what FDA needs.
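A per-prediction event that captures all five fields might look like this. It's a sketch, assuming a simple list-backed log and JSON-serializable inputs; the helper name and payload fields are illustrative:

```python
import hashlib
import json
from datetime import datetime, timezone

def log_prediction(log, model_version, input_payload, output, user_id, request_id):
    """Capture input, output, timestamp, model version, and requester for one
    inference. The input is stored both raw and as a hash, so a later audit
    can verify the stored payload matches what the model actually saw."""
    raw = json.dumps(input_payload, sort_keys=True).encode()
    event = {
        "event_type": "prediction",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_hash": "sha256:" + hashlib.sha256(raw).hexdigest(),
        "input": input_payload,
        "output": output,
        "metadata": {"request_id": request_id, "user_id": user_id},
    }
    log.append(event)
    return event

audit_log = []
event = log_prediction(
    audit_log, "cardiac-risk-v2.3.1",
    {"age": 64, "bp_systolic": 148},
    {"risk_score": 0.73, "risk_category": "elevated"},
    user_id="clinician-123", request_id="req-001",
)
```

If the raw input is too sensitive or too large to store inline, keep the hash in the event and the payload in access-controlled storage keyed by that hash.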
4. Performance Monitoring Records
You need to demonstrate ongoing model performance:
- Performance metrics over time: accuracy, sensitivity, specificity, etc.
- Distribution monitoring: is input data drifting from training data?
- Outcome tracking: when you can get ground truth, how did predictions compare?
- Alert records: when did performance metrics cross thresholds?
The FDA's Predetermined Change Control Plan (PCCP) framework assumes you're monitoring performance. Without these records, you can't demonstrate that your model stayed within acceptable bounds.
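The alert-record bullet above is worth making concrete. A minimal sketch, assuming paired (prediction, ground truth) outcomes and a rolling-accuracy floor as the monitored metric; in practice you'd track sensitivity and specificity the same way:

```python
def monitor_metric(outcomes, window=100, floor=0.80):
    """Compute rolling accuracy over (prediction, ground_truth) pairs and
    record an alert whenever a full window drops below the floor."""
    alerts = []
    history = []
    for i, (pred, truth) in enumerate(outcomes):
        history.append(pred == truth)
        recent = history[-window:]
        accuracy = sum(recent) / len(recent)
        if len(recent) == window and accuracy < floor:
            alerts.append({"index": i, "accuracy": round(accuracy, 3)})
    return alerts

# 100 correct predictions, then 30 wrong ones: rolling accuracy decays below 0.80
outcomes = [(1, 1)] * 100 + [(1, 0)] * 30
alerts = monitor_metric(outcomes)
```

The alerts themselves are audit records: they demonstrate both that monitoring ran and when thresholds were crossed.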
Implementation Patterns
Pattern 1: Immutable Event Logs
The gold standard is an immutable, append-only log of all model activity. Every training run, every deployment, every prediction is recorded and can't be changed.
```json
{
  "event_type": "prediction",
  "timestamp": "2026-02-15T14:23:17Z",
  "model_version": "cardiac-risk-v2.3.1",
  "input_hash": "sha256:abc123...",
  "output": {
    "risk_score": 0.73,
    "risk_category": "elevated"
  },
  "metadata": {
    "request_id": "uuid-here",
    "user_id": "clinician-123",
    "facility_id": "hospital-456"
  }
}
```
Store these in a system that prevents modification: cloud storage with object lock, a blockchain-based ledger, or a purpose-built audit system.
Pattern 2: Reproducibility by Design
Design your system so any prediction can be reproduced:
- Store the exact model artifact (not just weights, the full inference code)
- Store the exact input data (or a hash that can be matched to stored data)
- Ensure your inference pipeline is deterministic
When FDA asks "why did the model output X?", you should be able to re-run the exact prediction and get the same result.
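A sketch of what reproducibility-by-design looks like at the inference boundary. The inference itself is a stand-in (a fixed linear score), and the function names are illustrative; the point is the shape: pin every RNG source, derive the output purely from (weights, input), and emit a replay hash an auditor can compare against the logged one:

```python
import hashlib
import json
import random

def reproducible_predict(model_version, input_payload, weights, seed=0):
    """Deterministic inference wrapper: seed any stochastic steps and derive
    the output purely from (weights, input), so re-running a logged input
    yields byte-identical output."""
    random.seed(seed)  # pin any stochastic steps (dropout, sampling, etc.)
    # stand-in for real inference: a fixed linear score over sorted fields
    score = sum(weights.get(k, 0.0) * v for k, v in sorted(input_payload.items()))
    output = {"model_version": model_version, "risk_score": round(score, 6)}
    # hash of (input, output) lets an auditor confirm a replay matched exactly
    record = json.dumps({"in": input_payload, "out": output}, sort_keys=True)
    output["replay_hash"] = hashlib.sha256(record.encode()).hexdigest()
    return output

first = reproducible_predict("v2.3.1", {"age": 64, "bp": 148}, {"age": 0.004, "bp": 0.001})
replay = reproducible_predict("v2.3.1", {"age": 64, "bp": 148}, {"age": 0.004, "bp": 0.001})
```

With real models, this also means pinning library versions and disabling non-deterministic GPU kernels, which is why the model artifact needs to include the full inference code, not just the weights.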
Pattern 3: Layered Retention
Different audit data has different retention needs:
- Training artifacts: Forever (or at least the life of the device)
- Model versions: Forever
- Inference logs: 6-10 years minimum (depends on clinical context)
- Performance metrics: Forever
Plan your storage architecture accordingly. Don't rely on ephemeral container logs.
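Encoding the retention tiers as data, rather than tribal knowledge, makes the policy auditable too. A minimal sketch, with the retention periods above expressed as a config (the ten-year figure is the conservative end of the range, chosen here as an assumption):

```python
RETENTION_POLICY = {
    # retention in years; None means keep for the life of the device
    "training_artifacts": None,
    "model_versions": None,
    "inference_logs": 10,   # conservative end of the 6-10 year range
    "performance_metrics": None,
}

def is_expired(record_type, age_years):
    """True only if the record type has a finite retention period and the
    record is older than it; indefinite types never expire."""
    limit = RETENTION_POLICY[record_type]
    return limit is not None and age_years > limit
```

Deletion jobs should consult this one table, and should themselves write audit events when they purge anything.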
Common Mistakes
Mistake 1: Logging Outputs but Not Inputs
I see this constantly. Teams log what the model predicted but not what data it saw. When something goes wrong, they can't diagnose whether it was a model issue or a data issue.
Mistake 2: Non-Deterministic Inference
If your model uses random number generation or non-deterministic GPU operations, the same input might give different outputs. This makes audit reproduction impossible. Fix this before you scale.
Mistake 3: Mutable Logs
If anyone can edit your audit logs, they're worthless. Logs need to be append-only and tamper-evident. This is a technical control, not a policy.
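Tamper evidence is usually implemented as hash chaining: each entry embeds the hash of the previous one, so editing any past entry breaks every hash after it. A minimal sketch (the class and field names are illustrative; production systems typically add signing and external anchoring):

```python
import hashlib
import json

class TamperEvidentLog:
    """Append-only log where each entry carries the previous entry's hash;
    altering any past entry invalidates the chain."""

    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis value

    def append(self, event):
        entry = {"event": event, "prev_hash": self._prev_hash}
        serialized = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(serialized).hexdigest()
        self._prev_hash = entry["hash"]
        self.entries.append(entry)

    def verify(self):
        """Recompute every hash; returns False if any entry was modified."""
        prev = "0" * 64
        for entry in self.entries:
            body = {"event": entry["event"], "prev_hash": entry["prev_hash"]}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if entry["prev_hash"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

log = TamperEvidentLog()
log.append({"event_type": "prediction", "risk_score": 0.73})
log.append({"event_type": "prediction", "risk_score": 0.41})
```

Verification is what makes this a technical control: any reviewer can recompute the chain without trusting whoever operates the log.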
Mistake 4: No Input Validation Logging
What happens when your model receives invalid or out-of-distribution input? If you silently fail or return a default, you need to log that. Edge cases matter in FDA review.
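Rejection events deserve the same rigor as predictions. A sketch of a validation gate that logs the refusal instead of silently defaulting; the schema and helper names are illustrative:

```python
def validated_predict(log, input_payload, predict_fn, schema):
    """Validate input against per-field ranges before inference; log any
    rejection as an auditable event instead of silently returning a default."""
    for field, (lo, hi) in schema.items():
        value = input_payload.get(field)
        if value is None or not (lo <= value <= hi):
            log.append({
                "event_type": "input_rejected",
                "field": field,
                "value": value,
                "expected_range": [lo, hi],
            })
            return None  # refuse to predict; the refusal itself is on the record
    return predict_fn(input_payload)

schema = {"age": (0, 120), "bp_systolic": (50, 300)}
log = []
# out-of-range age: no prediction, one rejection event
rejected = validated_predict(log, {"age": 210, "bp_systolic": 120},
                             lambda p: {"risk_score": 0.1}, schema)
# valid input: prediction proceeds normally
accepted = validated_predict(log, {"age": 64, "bp_systolic": 120},
                             lambda p: {"risk_score": 0.1}, schema)
```

During review, these rejection records are your evidence that the edge cases were handled deliberately rather than papered over.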
Mistake 5: Orphaned Model Versions
When you deploy a new model version, can you still run the old one? You need to preserve the ability to reproduce predictions from previous versions, even after they're decommissioned.
What FDA Review Looks Like
In a regulatory submission, your audit trail documentation typically includes:
- System architecture description: how do the components interact?
- Data flow diagram: where is audit data captured, stored, accessed?
- Retention policies: how long is data kept, how is it protected?
- Sample audit records: examples of actual logged events
- Reproducibility demonstration: evidence that predictions can be reconstructed
The FDA may ask you to demonstrate reproduction of a specific prediction during review. If you can't do this, you're not ready to submit.
Building for the PCCP
If you're planning to use a Predetermined Change Control Plan (PCCP) for continuous AI improvement, your audit trail requirements expand:
- Retraining triggers: what caused each retraining event?
- Validation against reference: how did the new model compare to the reference?
- Change classification: was this change within the PCCP envelope?
- Automatic vs. manual approval: who or what approved the change?
The PCCP framework assumes robust performance monitoring and change documentation. Your audit system needs to support this.
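The change-classification step above can be sketched as a comparison of the candidate model against both the PCCP envelope and the reference model. The metric floors and field names here are hypothetical, not values from any guidance:

```python
def classify_change(new_metrics, reference_metrics, pccp_envelope):
    """Decide whether a retrained model stays within the PCCP envelope
    (pre-authorized pathway) or falls outside it (needs a new submission),
    and record the delta against the reference model either way."""
    deviations = {}
    for metric, floor in pccp_envelope.items():
        value = new_metrics[metric]
        if value < floor:
            deviations[metric] = {"value": value, "floor": floor}
    return {
        "within_envelope": not deviations,
        "deviations": deviations,
        "reference_delta": {
            m: round(new_metrics[m] - reference_metrics[m], 4)
            for m in pccp_envelope
        },
    }

envelope = {"sensitivity": 0.90, "specificity": 0.85}   # hypothetical floors
reference = {"sensitivity": 0.93, "specificity": 0.88}
candidate = {"sensitivity": 0.94, "specificity": 0.84}  # specificity regressed
decision = classify_change(candidate, reference, envelope)
```

The returned record, stored alongside the retraining trigger and approval decision, is exactly the documentation trail the PCCP framework expects.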
Start Now
If there's one piece of advice I give to every health tech company building AI: design your audit trail architecture now, not later.
Retrofitting audit trails is painful and expensive. Building them in from day one is relatively easy. The difference between these two paths can be months of delay and hundreds of thousands of dollars in rework.
If you're building AI for FDA-regulated applications and need help getting your audit architecture right, let's talk. I've built these systems from the inside at companies shipping FDA-cleared AI at scale.