Training data should be sanitized prior to use to remove inappropriate content and protect against data poisoning attacks. Inappropriate content may include toxic language, harmful stereotypes, or malformed documents, while "poisoned" content refers to data intentionally crafted to manipulate or degrade model performance.
Sanitization should be tailored to the specific AI system's use case and may involve both automated tools and manual reviews. Various off-the-shelf tools and models exist for filtering inappropriate or low-quality content. However, detecting poisoned data is a more complex task, and no standardized method exists.
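As a minimal illustration of what an automated filtering pass might look like, the sketch below drops documents that are suspiciously short (likely malformed) or that contain blocklisted terms. The blocklist and length threshold are placeholder assumptions; a production pipeline would typically combine trained toxicity classifiers with rule-based checks like these.

```python
def basic_sanitize(docs, blocklist=("badterm1", "badterm2"), min_len=20):
    """Toy sanitization filter: drop malformed or inappropriate documents.

    `blocklist` and `min_len` are illustrative placeholders, not
    recommended production values.
    """
    clean = []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_len:  # likely malformed or empty document
            continue
        lowered = text.lower()
        if any(term in lowered for term in blocklist):  # inappropriate content
            continue
        clean.append(doc)
    return clean
```

In practice such rule-based filters are usually one early stage of a pipeline, with flagged borderline cases routed to manual review rather than silently discarded.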
Common approaches include:
- Using outlier detection to flag potentially malicious entries for manual review.
- Capping the number of data points accepted from any single source (e.g. an individual user), so that no one contributor can dominate or deliberately corrupt the dataset.
- Training a secondary model to distinguish clean from poisoned data points and using it as a custom filter.
- Implementing an advanced detector using the Adversarial Robustness Toolbox.
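The first two approaches above can be sketched with standard-library Python. This is an illustrative toy, not a hardened defence: it uses document length as the outlier feature, whereas a real pipeline would score embeddings or model activations, and the thresholds are assumed values.

```python
import statistics
from collections import defaultdict

def cap_per_source(records, max_per_source=100):
    """Keep at most `max_per_source` records from each contributor,
    limiting how much any single source can skew the dataset.

    `records` is an iterable of (source_id, text) pairs.
    """
    counts = defaultdict(int)
    kept = []
    for source, text in records:
        if counts[source] < max_per_source:
            counts[source] += 1
            kept.append((source, text))
    return kept

def flag_outliers(texts, z_threshold=3.0):
    """Flag texts whose length is a statistical outlier (|z| > threshold)
    for manual review. Length is a stand-in feature for illustration."""
    lengths = [len(t) for t in texts]
    mean = statistics.mean(lengths)
    stdev = statistics.pstdev(lengths) or 1.0  # avoid division by zero
    return [t for t, n in zip(texts, lengths)
            if abs(n - mean) / stdev > z_threshold]
```

Flagged entries would go to a manual review queue rather than being deleted outright, since legitimate but unusual data can also trip simple statistical detectors.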