📋 Description
Input Checks—also known as guardrails or firewalls—are mechanisms applied to system inputs to detect and block inappropriate or harmful prompts before they are processed by the AI model. These checks are critical in defending against prompt injection, jailbreak attempts, off-topic use, and other forms of misuse.
Checks can range in complexity:
- Simple rules (e.g., blocking keywords or patterns)
- Topic classifiers that detect sensitive or out-of-scope queries
- LLM-based detectors that evaluate user prompts for harmful behaviors
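The simplest tier above can be sketched as a deny-list check. This is a minimal illustration, not a production filter: the pattern list, function name, and threshold behavior are all illustrative assumptions, and real deployments maintain far larger, regularly updated pattern sets.

```python
import re

# Illustrative deny-list of patterns associated with common injection attempts.
# A real deployment would use a much larger, regularly updated set.
BLOCKED_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal your system prompt",
    r"pretend you have no restrictions",
]

def rule_based_check(prompt: str) -> bool:
    """Return True if the prompt passes the simple rule-based check."""
    return not any(
        re.search(pattern, prompt, re.IGNORECASE) for pattern in BLOCKED_PATTERNS
    )
```

Rule-based checks like this are fast and transparent, but brittle against paraphrased attacks, which is why they are usually combined with the classifier- or LLM-based tiers listed above.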
These systems can be built in-house or integrated using off-the-shelf tools. The choice of approach should balance performance with practical constraints. Consider the following when implementing input checks:
- Accuracy: Ensure the check system reliably detects malicious or harmful inputs across diverse adversarial examples.
- Latency: More accurate systems often introduce greater delay—choose an implementation that preserves acceptable user experience.
The following libraries provide out-of-the-box tools for implementing robust input validation for LLMs:
- Guardrails AI: A Python library for validating, correcting, and enforcing structure on LLM inputs and outputs.
- NeMo Guardrails (NVIDIA): Enables safe and controlled LLM applications by guiding the flow of user interactions.
- LLM Guard: Specialized in scanning for prompt injection, toxicity, jailbreaking, and more.
📉 How It Reduces Risks
- Blocks Prompt Injection: Prevents users from manipulating the system with crafted prompts designed to bypass restrictions.
- Reduces Model Exploits: Filters out jailbreaks and adversarial prompts that could trigger undesired behavior.
- Maintains Output Integrity: Stops off-topic or irrelevant prompts from wasting compute or returning unpredictable results.
- Improves Trust and Safety: Ensures that input behavior aligns with usage policies and ethical standards.
📎 Suggested Evidence
- Input Filter Logs: Records of flagged or rejected prompts with associated categories (e.g., injection, jailbreak, unsafe content).
- Guardrail Performance Benchmarks: Accuracy metrics or evaluation results from red-teaming tests showing detection rates across known attack categories.
- Latency Impact Assessments: Metrics showing trade-offs between detection performance and system response times.
- Audit Reports of Prompt-Based Attacks: Historical data showing prompt manipulation cases and whether input checks would have prevented them.
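To make input filter logs usable as evidence, each flagged prompt can be recorded as a structured entry capturing the category, the action taken, and which detector fired. The schema below is a hedged sketch: the field names are assumptions, and storing a hash rather than the raw prompt is one possible design for limiting retention of sensitive user input.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class InputFilterLogEntry:
    """One flagged-prompt record; field names are illustrative."""
    timestamp: str
    category: str     # e.g., "injection", "jailbreak", "unsafe_content"
    action: str       # e.g., "blocked", "flagged"
    prompt_hash: str  # store a hash, not the raw prompt, to limit data exposure
    detector: str     # which check fired, e.g., "rule_based", "llm_detector"

def log_flagged_prompt(category: str, action: str,
                       prompt_hash: str, detector: str) -> str:
    """Serialize a flagged-prompt event as a JSON log line."""
    entry = InputFilterLogEntry(
        timestamp=datetime.now(timezone.utc).isoformat(),
        category=category,
        action=action,
        prompt_hash=prompt_hash,
        detector=detector,
    )
    return json.dumps(asdict(entry))
```

Structured records like these support the other evidence types as well: they can be aggregated into detection-rate benchmarks and replayed against historical attack cases.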