📋 Description
AI systems may be developed outside of a formalized software development lifecycle, resulting in critical assets (e.g., model versions, training data, pipelines) being poorly documented or untraceable. This impairs reproducibility, makes debugging difficult, and increases audit and compliance risks.
In many industries, accountability is tied to the ability to recreate past model versions and their decisions. A lack of traceability can prevent organizations from understanding how a decision was made or reusing a well-performing model. It may also result in violations of regulatory requirements mandating version control and transparency in AI deployment.
At a minimum, deployed models should be linked to the following:
- The training dataset, or a frozen snapshot of it, to ensure reproducibility.
- The data processing pipeline, including scripts and transformations.
- The model architecture and training code, with explicit dependency tracking.
- The hyperparameters, stored separately if not already contained in the code, so the full configuration is documented.
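One lightweight way to establish these links is a machine-readable lineage manifest written at deployment time. The sketch below (file names, field names, and the manifest format are illustrative, not a prescribed standard) hashes each artifact so that any later change is detectable, and stores the hyperparameters alongside the hashes:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash of an artifact; changes to the file change the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(dataset: Path, pipeline: Path, training_code: Path,
                   hyperparams: dict, model_version: str) -> dict:
    """Link a model version to the four artifact types listed above."""
    return {
        "model_version": model_version,
        "dataset_snapshot": {"path": str(dataset), "sha256": sha256_of(dataset)},
        "pipeline": {"path": str(pipeline), "sha256": sha256_of(pipeline)},
        "training_code": {"path": str(training_code), "sha256": sha256_of(training_code)},
        "hyperparameters": hyperparams,
    }


# Illustrative stand-in artifacts; in practice these are the real files.
for name, content in [("data_snapshot.csv", "a,b\n1,2\n"),
                      ("pipeline.py", "# cleaning and feature steps\n"),
                      ("train.py", "# training loop\n")]:
    Path(name).write_text(content)

manifest = build_manifest(Path("data_snapshot.csv"), Path("pipeline.py"),
                          Path("train.py"),
                          hyperparams={"lr": 0.001, "epochs": 10},
                          model_version="2024-06-01-rc1")
Path("model_manifest.json").write_text(json.dumps(manifest, indent=2))
```

Checking the manifest into version control next to the deployed model gives auditors a single document answering "which data, which code, which settings" for every release.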
🔍 Public Examples and Common Patterns
Low model traceability often emerges from the iterative nature of the data science process. Researchers and developers may experiment with many data processing methods and model setups without clear tracking. When a version with superior performance emerges, it may be shipped, but without records of how it was produced it can later be impossible to recreate the training process, set up retraining, or apply explainability tools.
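Even without a full experiment-tracking platform, this failure mode can be mitigated by logging every run to an append-only record. The sketch below (the log file name, field names, and `code_rev` convention are illustrative assumptions) shows the idea: each experiment records its parameters, metrics, and code revision, so the winning configuration remains recoverable after the fact:

```python
import json
import time
import uuid
from pathlib import Path

RUN_LOG = Path("runs.jsonl")  # append-only experiment log (illustrative name)


def log_run(params: dict, metrics: dict, code_rev: str) -> str:
    """Record one experiment so it can be identified and recreated later."""
    run = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "code_rev": code_rev,  # e.g. a git commit hash
        "params": params,
        "metrics": metrics,
    }
    with RUN_LOG.open("a") as f:
        f.write(json.dumps(run) + "\n")
    return run["run_id"]


def best_run(metric: str) -> dict:
    """Return the run with the best score, including its full configuration."""
    runs = [json.loads(line) for line in RUN_LOG.read_text().splitlines()]
    return max(runs, key=lambda r: r["metrics"][metric])


# Two illustrative experiments from an iteration session
log_run({"lr": 0.01}, {"accuracy": 0.91}, code_rev="abc123")
log_run({"lr": 0.001}, {"accuracy": 0.94}, code_rev="abc123")
print(best_run("accuracy")["params"])  # the winning setup is recoverable
```

Purpose-built tools (experiment trackers, data version control) provide the same guarantee with less custom code; the essential discipline is that no shipped model exists without a corresponding logged run.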