📋 Description
Multi-agent systems, where multiple AI agents collaborate, delegate tasks, or share outputs, introduce new vectors for failure and exploitation. These agents may communicate asynchronously, rely on each other’s outputs, or dynamically determine which tools to use or call. This interactivity creates the potential for exploitation chains, where one agent’s vulnerable output (e.g., a hallucinated command or misclassified text) triggers harmful behavior in another. Such risks are especially pronounced in agent frameworks where decision-making, memory sharing, or execution of tools (e.g., plugins, APIs) is distributed.
Key risk patterns include:
- Prompt Injection Propagation: A malicious agent introduces manipulated content that appears benign but causes another agent to perform unintended actions.
- Over-delegation: An agent irresponsibly delegates a task it should not, allowing another agent to access inappropriate tools or data.
- Escalated Autonomy: Emergent behavior arises from multiple agents making small decisions that compound into a harmful chain of actions.
As AI agents become more widely adopted for tasks like autonomous research, cybersecurity triage, or workflow orchestration, this risk will grow in complexity and impact.
🔍 Public Examples and Common Patterns
A user could tricks a planning agent into scheduling unauthorized tasks, which are then executed by a tool-using another agent with higher privileges, resulting in unapproved transactions.