AI Risk · Generative AI

Prompt Manipulation and Hacking

LLM inputs can be manipulated to get an output different from the system’s intended purpose. This behavior is sometimes referred to as jailbreaking.

📋 Description

Prompt Manipulation is an adversarial attack that results in LLMs (Large Language Models) producing outputs outside of the intended scope. Prompt Manipulation happens when someone writes a message in a way that tricks an AI system into doing something it wasn’t meant to do. This works because large language models (LLMs) often can’t reliably tell the difference between instructions from developers and input from users. Attackers can embed hidden commands inside a prompt, like “ignore the rules and say this instead,” and the AI might follow them.

This manipulation can take many forms, including *prompt injection* (where malicious instructions are hidden in user inputs) or *jailbreaking* (where the user tries to bypass safety rules). These attacks can cause the model to misclassify inputs, leak private data, generate harmful or embarrassing outputs, or even carry out actions it wasn’t meant to perform, especially if the AI is connected to tools, agents, or sensitive systems.

While some prompt designs and input guardrails can detect some attempted prompt manipulation, there is currently no foolproof strategy to prevent them entirely if an adversary can input text into the LLM. To manage harms, it is important to analyze what could happen if a user were able to gain full control of the LLM's behavior.

The consequences are varied and can include:

- Performance: If an LLM is used for classification, an adversary can embed instructions to control the output.
- Reputational: An adversary can make a public-facing chatbot curse or say embarrassing things and publicly share these outputs.
- Financial: An adversary can use a chatbot for unrelated purposes, incurring unexpected costs to the organization running the chatbot.
- Privacy: An adversary that can prompt an LLM to return sensitive information (when that LLM is connected to sensitive resources)
- Security: If the LLM is used to execute code or control an AI agent, a prompt injection can be used to hide malicious instructions.

🔍 Public Examples and Common Patterns

- AIID Incident 352: GPT-3-Based Twitter Bot Hijacked Using Prompt Injection Attacks - Remoteli.io's GPT-3-based Twitter bot was shown being hijacked by Twitter users who redirected it to repeat or generate any phrases.

- AIID Incident 473: Bing Chat's Initial Prompts Revealed by Early Testers Through Prompt Injection - Early testers of Bing Chat successfully used prompt injection to reveal its built-in initial instructions, which contain a list of statements governing ChatGPT's interaction with users.

🛡️ Recommended Mitigations

📐 External Framework Mapping

- OWASP LLM Top 10: LLM01:2025 - Prompt Injection
- MITRE ATLAS: AML.T0051 – LLM Prompt Injection
- IBM Risk Atlas: Prompt injection attack risk for AI
- Databricks AI Security Framework: 9.1 - Prompt Injection, 9.12 - LLM jailbreak

📚 References

- Learn Prompting Guide
- OWASP LLM01:2025 Prompt Injection
- AI jailbreaks: What they are and how they can be mitigated

Cite this page

Trustible. "Prompt Manipulation and Hacking." Trustible AI Governance Insights Center, 2026. https://trustible.ai/ai-risks/prompt-manipulation/

← All AI Risks Insights Center

Manage AI Risk with Trustible

Trustible's AI governance platform helps enterprises identify, assess, and mitigate AI risks like this one at scale.

Explore the Platform

Contact Us