Skip to content

Comprehensive taxonomy of AI security vulnerabilities, LLM adversarial attacks, prompt injection techniques, and machine learning security research. Covers 71+ attack vectors including model poisoning, agentic AI exploits, and privacy breaches.

License

Notifications You must be signed in to change notification settings

AI-Security-Research-Group/LLM-Attacks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LLM Attack Taxonomy

SN Attack Category Description
1 Agentic Multi-Agent Exploitation Agentic AI Exploiting inter-agent trust boundaries so that a malicious payload, initially rejected by one LLM agent, is processed if delivered via another trusted agent, including privilege escalation and cross-agent command execution.
2 RAG/Embedding Backdoor Attacks Agentic AI Attacking LLMs with manipulated embedded documents retrieved during RAG, including poisoning vector DBs to force undesirable completions or disclosures.
3 System Prompt Leakage & Reverse Engineering Prompt-Based Forcing disclosure or deducing proprietary system prompts to subvert guardrails and expose internal instructions.
4 LLM Tooling/Plugin Supply Chain Attacks Supply Chain Compromising the ecosystem via malicious plugins, infected models from public repos, or tainted integrations.
5 Excessive Agency/Autonomy Attacks Agentic AI Exploiting/abusing LLM agent autonomy to perform unintended actions, escalate privileges, or cause persistent automated damage in agentic workflows.
6 Unbounded Resource Consumption ("Denial of Wallet") Resource Exhaustion Manipulating LLM behavior to consume excessive external/cloud resources, raising costs or disrupting operations.
7 Cross-Context Federation Leaks Data Exfiltration Leveraging federated information contexts or cross-source retrievals to exfiltrate data by manipulating the model's knowledge context.
8 Vector Database Poisoning Foundational Polluting indexing/embedding layers to disrupt or manipulate downstream LLM generations or leak/hallucinate info.
9 Adversarial Examples Input Manipulation Crafty manipulations of input data that trick models into making incorrect predictions, potentially leading to harmful decisions.
10 Data Poisoning Foundational Malicious data injections into the training set that corrupt the model's performance, causing biased or incorrect behavior.
11 Model Inversion Attacks Privacy Inferring the input values used to train the model, exposing sensitive information.
12 Membership Inference Attacks Privacy Determining whether specific data points were part of the model's training set, leading to privacy breaches.
13 Query Manipulation Attacks Prompt-Based Crafting malicious queries that cause the model to reveal unintended information or behave undesirably.
14 Model Extraction Attacks IP Theft Reverse-engineering the model by querying it to construct a copy, resulting in intellectual property theft.
15 Transfer Learning Attacks Foundational Exploiting vulnerabilities in the transfer learning process to manipulate model performance on new tasks.
16 Federated Learning Attacks Foundational Compromising client devices or server-side data in federated learning setups to corrupt the global model or extract sensitive information.
17 Edge AI Attacks Hardware / Deployment Targeting edge devices running AI models to exfiltrate data or manipulate behavior.
18 IoT AI Attacks Hardware / Deployment Attacking IoT devices using AI, potentially leading to data breaches or unauthorized control.
19 Prompt Injection Attacks Prompt-Based Manipulating input prompts in conversational AI to bypass safety measures or extract confidential information.
20 Indirect Prompt Injection Prompt-Based Exploiting vulnerabilities in systems integrating LLMs to inject malicious prompts indirectly.
21 Model Fairness Attacks Foundational / Bias Intentionally biasing the model by manipulating input data, affecting fairness and equity.
22 Model Explainability Attacks Evasion Designing inputs that make model decisions difficult to interpret, hindering transparency.
23 Robustness Attacks Evasion Testing the model's resilience by subjecting it to various perturbations to find weaknesses.
24 Security Attacks General Compromising the confidentiality, integrity, or availability of the model and its outputs.
25 Integrity Attacks Foundational Tampering with the model's architecture, weights, or biases to alter behavior without authorization.
26 Jailbreaking Attacks Prompt-Based Attempting to circumvent the ethical constraints or content filters in an LLM.
27 Training Data Extraction Privacy Inferring specific data used to train the model through carefully crafted queries.
28 Synthetic Data Generation Attacks Foundational Creating synthetic data designed to mislead or degrade AI model performance.
29 Model Stealing from Cloud IP Theft Extracting a trained model from a cloud service without direct access.
30 Model Poisoning from Edge Foundational Introducing malicious data at edge devices to corrupt model behavior.
31 Model Drift Detection Evasion Evasion Evading mechanisms that detect when a model's performance degrades over time.
32 Adversarial Example Generation with Deep Learning Input Manipulation Using advanced techniques to create adversarial examples that deceive the model.
33 Model Reprogramming Foundational Repurposing a model for a different task, potentially bypassing security measures.
34 Thermal Side-Channel Attacks Side-Channel / Hardware Using temperature variations in hardware during model inference to infer sensitive information.
35 Transfer Learning Attacks from Pre-Trained Models Foundational Poisoning pre-trained models to influence performance when transferred to new tasks.
36 Model Fairness and Bias Detection Evasion Evasion / Bias Designing attacks to evade detection mechanisms monitoring fairness and bias.
37 Model Explainability Attack Evasion Attacking the model's interpretability to prevent users from understanding its decision-making process.
38 Deepfake Attacks Multimodal / Output Manip. Creating realistic fake audio or video content to manipulate events or conversations.
39 Cloud-Based Model Replication IP Theft Replicating trained models in the cloud to develop competing products or gain unauthorized insights.
40 Confidentiality Attacks Privacy Extracting sensitive or proprietary information embedded within the model's parameters.
41 Quantum Attacks on LLMs Theoretical / Cryptographic Using quantum computing to theoretically compromise the security of LLMs or their cryptographic protections.
42 Model Stealing from Cloud with Pre-Trained Models IP Theft Extracting pre-trained models from the cloud without direct access.
43 Transfer Learning Attacks with Edge Devices Foundational / Hardware Compromising knowledge transferred to edge devices.
44 Adversarial Example Generation with Model Inversion Input Manipulation Creating adversarial examples using model inversion techniques.
45 Backdoor Attacks Foundational Embedding hidden behaviors within the model triggered by specific inputs.
46 Watermarking Attacks Evasion / IP Theft Removing or altering watermarks protecting intellectual property in AI models.
47 Neural Network Trojans Foundational Embedding malicious functionalities within the model triggered under certain conditions.
48 Model Black-Box Attacks General Exploiting the model using input-output queries without internal knowledge.
49 Model Update Attacks Foundational Manipulating the model during its update process to introduce vulnerabilities.
50 Gradient Inversion Attacks Privacy Reconstructing training data by exploiting gradients in federated learning.
51 Side-Channel Timing Attacks Side-Channel / Hardware Inferring model parameters or training data by measuring computation times during inference.
52 Adversarial Suffix Prompt-Based Appending a specifically crafted, often nonsensical string to a harmful prompt to cause the model to disregard its safety instructions.
53 Prefix Injection & Refusal Suppression Prompt-Based Forcing a model's response to start with an affirmative phrase or explicitly instructing it not to use refusal phrases to lower its defenses.
54 Encoding Obfuscation Prompt-Based Hiding a malicious payload in an encoded format (e.g., Base64, Hex) that the LLM is instructed to decode and then execute, bypassing text-based filters.
55 Payload Splitting Prompt-Based Breaking a malicious instruction into multiple, individually benign parts and asking the model to reassemble and execute them, bypassing filters that check instructions in isolation.
56 Markup Language Abuse Prompt-Based Using structured data formats like Markdown or HTML to create ambiguity between system instructions and user input, potentially causing the model to execute instructions with higher privilege.
57 Prompt Recursive Injection Prompt-Based Crafting prompts that recursively redefine instructions and cause infinite loops or privilege escalation.
58 Multi-Modal Adversarial Attacks Multimodal Exploiting vulnerabilities in models that process both text and images/audio by injecting adversarial perturbations across modalities.
59 Reinforcement Learning from Human Feedback (RLHF) Poisoning Foundational Attacking the feedback loops used for alignment to bias the model or weaken safety training.
60 Chain-of-Thought (CoT) Leakage Prompt-Based Forcing the model to reveal hidden reasoning traces, which may contain sensitive or filtered knowledge.
61 Model Compression/Distillation Attacks Foundational Exploiting vulnerabilities during model compression/distillation to introduce backdoors or reduce robustness.
62 Transferability Exploits Foundational Using adversarial examples crafted for one model to fool another (cross-model attacks).
63 Prompt Reset / Separator Injection Prompt-Based Injecting tokens or patterns that trick the model into resetting context or ignoring prior instructions.
64 Shadow Model Exploitation IP Theft / Model Extraction Building a parallel "shadow" model via query logging and then exploiting it to predict or exfiltrate target model behavior.
65 Retrieval Data Exfiltration Data Exfiltration Crafting queries that force the LLM to retrieve and output sensitive data from connected corpora or knowledge bases.
66 Long-Context Window Overload Resource Exhaustion Flooding the model with extremely long context input to bypass filters or degrade performance, potentially causing memory leaks or dropping safety filters.
67 Fine-Tuning Data Injection Foundational Poisoning during fine-tuning (instruction tuning, RLHF, or supervised fine-tuning) to inject malicious capabilities or suppress safety.
68 Semantic Perturbation Attacks Input Manipulation Altering benign-looking input with synonyms, typos, or semantic shifts that trick LLMs into misclassification or harmful behavior.
69 Context Switching Attacks Prompt-Based Tricking the model into switching "roles" or contexts mid-conversation, overriding safety policies.
70 Model Distillation IP Theft IP Theft Extracting distilled student models that replicate proprietary teacher model behavior, leaking IP.
71 Hybrid Supply Chain Attacks Supply Chain Combining poisoned datasets, compromised plugins, and adversarial fine-tunes to inject coordinated backdoors across AI pipelines.

About

Comprehensive taxonomy of AI security vulnerabilities, LLM adversarial attacks, prompt injection techniques, and machine learning security research. Covers 71+ attack vectors including model poisoning, agentic AI exploits, and privacy breaches.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Contributors 2

  •  
  •