LLM Attack Taxonomy

SN	Attack	Category	Description
1	Agentic Multi-Agent Exploitation	Agentic AI	Exploiting inter-agent trust boundaries so that a malicious payload, initially rejected by one LLM agent, is processed if delivered via another trusted agent, including privilege escalation and cross-agent command execution.
2	RAG/Embedding Backdoor Attacks	Agentic AI	Attacking LLMs with manipulated embedded documents retrieved during RAG, including poisoning vector DBs to force undesirable completions or disclosures.
3	System Prompt Leakage & Reverse Engineering	Prompt-Based	Forcing disclosure or deducing proprietary system prompts to subvert guardrails and expose internal instructions.
4	LLM Tooling/Plugin Supply Chain Attacks	Supply Chain	Compromising the ecosystem via malicious plugins, infected models from public repos, or tainted integrations.
5	Excessive Agency/Autonomy Attacks	Agentic AI	Exploiting/abusing LLM agent autonomy to perform unintended actions, escalate privileges, or cause persistent automated damage in agentic workflows.
6	Unbounded Resource Consumption ("Denial of Wallet")	Resource Exhaustion	Manipulating LLM behavior to consume excessive external/cloud resources, raising costs or disrupting operations.
7	Cross-Context Federation Leaks	Data Exfiltration	Leveraging federated information contexts or cross-source retrievals to exfiltrate data by manipulating the model's knowledge context.
8	Vector Database Poisoning	Foundational	Polluting indexing/embedding layers to disrupt or manipulate downstream LLM generations or leak/hallucinate info.
9	Adversarial Examples	Input Manipulation	Crafty manipulations of input data that trick models into making incorrect predictions, potentially leading to harmful decisions.
10	Data Poisoning	Foundational	Malicious data injections into the training set that corrupt the model's performance, causing biased or incorrect behavior.
11	Model Inversion Attacks	Privacy	Inferring the input values used to train the model, exposing sensitive information.
12	Membership Inference Attacks	Privacy	Determining whether specific data points were part of the model's training set, leading to privacy breaches.
13	Query Manipulation Attacks	Prompt-Based	Crafting malicious queries that cause the model to reveal unintended information or behave undesirably.
14	Model Extraction Attacks	IP Theft	Reverse-engineering the model by querying it to construct a copy, resulting in intellectual property theft.
15	Transfer Learning Attacks	Foundational	Exploiting vulnerabilities in the transfer learning process to manipulate model performance on new tasks.
16	Federated Learning Attacks	Foundational	Compromising client devices or server-side data in federated learning setups to corrupt the global model or extract sensitive information.
17	Edge AI Attacks	Hardware / Deployment	Targeting edge devices running AI models to exfiltrate data or manipulate behavior.
18	IoT AI Attacks	Hardware / Deployment	Attacking IoT devices using AI, potentially leading to data breaches or unauthorized control.
19	Prompt Injection Attacks	Prompt-Based	Manipulating input prompts in conversational AI to bypass safety measures or extract confidential information.
20	Indirect Prompt Injection	Prompt-Based	Exploiting vulnerabilities in systems integrating LLMs to inject malicious prompts indirectly.
21	Model Fairness Attacks	Foundational / Bias	Intentionally biasing the model by manipulating input data, affecting fairness and equity.
22	Model Explainability Attacks	Evasion	Designing inputs that make model decisions difficult to interpret, hindering transparency.
23	Robustness Attacks	Evasion	Testing the model's resilience by subjecting it to various perturbations to find weaknesses.
24	Security Attacks	General	Compromising the confidentiality, integrity, or availability of the model and its outputs.
25	Integrity Attacks	Foundational	Tampering with the model's architecture, weights, or biases to alter behavior without authorization.
26	Jailbreaking Attacks	Prompt-Based	Attempting to circumvent the ethical constraints or content filters in an LLM.
27	Training Data Extraction	Privacy	Inferring specific data used to train the model through carefully crafted queries.
28	Synthetic Data Generation Attacks	Foundational	Creating synthetic data designed to mislead or degrade AI model performance.
29	Model Stealing from Cloud	IP Theft	Extracting a trained model from a cloud service without direct access.
30	Model Poisoning from Edge	Foundational	Introducing malicious data at edge devices to corrupt model behavior.
31	Model Drift Detection Evasion	Evasion	Evading mechanisms that detect when a model's performance degrades over time.
32	Adversarial Example Generation with Deep Learning	Input Manipulation	Using advanced techniques to create adversarial examples that deceive the model.
33	Model Reprogramming	Foundational	Repurposing a model for a different task, potentially bypassing security measures.
34	Thermal Side-Channel Attacks	Side-Channel / Hardware	Using temperature variations in hardware during model inference to infer sensitive information.
35	Transfer Learning Attacks from Pre-Trained Models	Foundational	Poisoning pre-trained models to influence performance when transferred to new tasks.
36	Model Fairness and Bias Detection Evasion	Evasion / Bias	Designing attacks to evade detection mechanisms monitoring fairness and bias.
37	Model Explainability Attack	Evasion	Attacking the model's interpretability to prevent users from understanding its decision-making process.
38	Deepfake Attacks	Multimodal / Output Manip.	Creating realistic fake audio or video content to manipulate events or conversations.
39	Cloud-Based Model Replication	IP Theft	Replicating trained models in the cloud to develop competing products or gain unauthorized insights.
40	Confidentiality Attacks	Privacy	Extracting sensitive or proprietary information embedded within the model's parameters.
41	Quantum Attacks on LLMs	Theoretical / Cryptographic	Using quantum computing to theoretically compromise the security of LLMs or their cryptographic protections.
42	Model Stealing from Cloud with Pre-Trained Models	IP Theft	Extracting pre-trained models from the cloud without direct access.
43	Transfer Learning Attacks with Edge Devices	Foundational / Hardware	Compromising knowledge transferred to edge devices.
44	Adversarial Example Generation with Model Inversion	Input Manipulation	Creating adversarial examples using model inversion techniques.
45	Backdoor Attacks	Foundational	Embedding hidden behaviors within the model triggered by specific inputs.
46	Watermarking Attacks	Evasion / IP Theft	Removing or altering watermarks protecting intellectual property in AI models.
47	Neural Network Trojans	Foundational	Embedding malicious functionalities within the model triggered under certain conditions.
48	Model Black-Box Attacks	General	Exploiting the model using input-output queries without internal knowledge.
49	Model Update Attacks	Foundational	Manipulating the model during its update process to introduce vulnerabilities.
50	Gradient Inversion Attacks	Privacy	Reconstructing training data by exploiting gradients in federated learning.
51	Side-Channel Timing Attacks	Side-Channel / Hardware	Inferring model parameters or training data by measuring computation times during inference.
52	Adversarial Suffix	Prompt-Based	Appending a specifically crafted, often nonsensical string to a harmful prompt to cause the model to disregard its safety instructions.
53	Prefix Injection & Refusal Suppression	Prompt-Based	Forcing a model's response to start with an affirmative phrase or explicitly instructing it not to use refusal phrases to lower its defenses.
54	Encoding Obfuscation	Prompt-Based	Hiding a malicious payload in an encoded format (e.g., Base64, Hex) that the LLM is instructed to decode and then execute, bypassing text-based filters.
55	Payload Splitting	Prompt-Based	Breaking a malicious instruction into multiple, individually benign parts and asking the model to reassemble and execute them, bypassing filters that check instructions in isolation.
56	Markup Language Abuse	Prompt-Based	Using structured data formats like Markdown or HTML to create ambiguity between system instructions and user input, potentially causing the model to execute instructions with higher privilege.
57	Prompt Recursive Injection	Prompt-Based	Crafting prompts that recursively redefine instructions and cause infinite loops or privilege escalation.
58	Multi-Modal Adversarial Attacks	Multimodal	Exploiting vulnerabilities in models that process both text and images/audio by injecting adversarial perturbations across modalities.
59	Reinforcement Learning from Human Feedback (RLHF) Poisoning	Foundational	Attacking the feedback loops used for alignment to bias the model or weaken safety training.
60	Chain-of-Thought (CoT) Leakage	Prompt-Based	Forcing the model to reveal hidden reasoning traces, which may contain sensitive or filtered knowledge.
61	Model Compression/Distillation Attacks	Foundational	Exploiting vulnerabilities during model compression/distillation to introduce backdoors or reduce robustness.
62	Transferability Exploits	Foundational	Using adversarial examples crafted for one model to fool another (cross-model attacks).
63	Prompt Reset / Separator Injection	Prompt-Based	Injecting tokens or patterns that trick the model into resetting context or ignoring prior instructions.
64	Shadow Model Exploitation	IP Theft / Model Extraction	Building a parallel "shadow" model via query logging and then exploiting it to predict or exfiltrate target model behavior.
65	Retrieval Data Exfiltration	Data Exfiltration	Crafting queries that force the LLM to retrieve and output sensitive data from connected corpora or knowledge bases.
66	Long-Context Window Overload	Resource Exhaustion	Flooding the model with extremely long context input to bypass filters or degrade performance, potentially causing memory leaks or dropping safety filters.
67	Fine-Tuning Data Injection	Foundational	Poisoning during fine-tuning (instruction tuning, RLHF, or supervised fine-tuning) to inject malicious capabilities or suppress safety.
68	Semantic Perturbation Attacks	Input Manipulation	Altering benign-looking input with synonyms, typos, or semantic shifts that trick LLMs into misclassification or harmful behavior.
69	Context Switching Attacks	Prompt-Based	Tricking the model into switching "roles" or contexts mid-conversation, overriding safety policies.
70	Model Distillation IP Theft	IP Theft	Extracting distilled student models that replicate proprietary teacher model behavior, leaking IP.
71	Hybrid Supply Chain Attacks	Supply Chain	Combining poisoned datasets, compromised plugins, and adversarial fine-tunes to inject coordinated backdoors across AI pipelines.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
attacks_list		attacks_list
docs		docs
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

LLM Attack Taxonomy

About

Uh oh!

Releases

Contributors 2

Uh oh!

License

AI-Security-Research-Group/LLM-Attacks

Folders and files

Latest commit

History

Repository files navigation

LLM Attack Taxonomy

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors 2

Uh oh!