SN | Attack | Category | Description |
---|---|---|---|
1 | Agentic Multi-Agent Exploitation | Agentic AI | Exploiting inter-agent trust boundaries so that a malicious payload, initially rejected by one LLM agent, is processed if delivered via another trusted agent, including privilege escalation and cross-agent command execution. |
2 | RAG/Embedding Backdoor Attacks | Agentic AI | Attacking LLMs with manipulated embedded documents retrieved during RAG, including poisoning vector DBs to force undesirable completions or disclosures. |
3 | System Prompt Leakage & Reverse Engineering | Prompt-Based | Forcing disclosure or deducing proprietary system prompts to subvert guardrails and expose internal instructions. |
4 | LLM Tooling/Plugin Supply Chain Attacks | Supply Chain | Compromising the ecosystem via malicious plugins, infected models from public repos, or tainted integrations. |
5 | Excessive Agency/Autonomy Attacks | Agentic AI | Exploiting/abusing LLM agent autonomy to perform unintended actions, escalate privileges, or cause persistent automated damage in agentic workflows. |
6 | Unbounded Resource Consumption ("Denial of Wallet") | Resource Exhaustion | Manipulating LLM behavior to consume excessive external/cloud resources, raising costs or disrupting operations. |
7 | Cross-Context Federation Leaks | Data Exfiltration | Leveraging federated information contexts or cross-source retrievals to exfiltrate data by manipulating the model's knowledge context. |
8 | Vector Database Poisoning | Foundational | Polluting indexing/embedding layers to disrupt or manipulate downstream LLM generations or leak/hallucinate info. |
9 | Adversarial Examples | Input Manipulation | Crafty manipulations of input data that trick models into making incorrect predictions, potentially leading to harmful decisions. |
10 | Data Poisoning | Foundational | Malicious data injections into the training set that corrupt the model's performance, causing biased or incorrect behavior. |
11 | Model Inversion Attacks | Privacy | Inferring the input values used to train the model, exposing sensitive information. |
12 | Membership Inference Attacks | Privacy | Determining whether specific data points were part of the model's training set, leading to privacy breaches. |
13 | Query Manipulation Attacks | Prompt-Based | Crafting malicious queries that cause the model to reveal unintended information or behave undesirably. |
14 | Model Extraction Attacks | IP Theft | Reverse-engineering the model by querying it to construct a copy, resulting in intellectual property theft. |
15 | Transfer Learning Attacks | Foundational | Exploiting vulnerabilities in the transfer learning process to manipulate model performance on new tasks. |
16 | Federated Learning Attacks | Foundational | Compromising client devices or server-side data in federated learning setups to corrupt the global model or extract sensitive information. |
17 | Edge AI Attacks | Hardware / Deployment | Targeting edge devices running AI models to exfiltrate data or manipulate behavior. |
18 | IoT AI Attacks | Hardware / Deployment | Attacking IoT devices using AI, potentially leading to data breaches or unauthorized control. |
19 | Prompt Injection Attacks | Prompt-Based | Manipulating input prompts in conversational AI to bypass safety measures or extract confidential information. |
20 | Indirect Prompt Injection | Prompt-Based | Exploiting vulnerabilities in systems integrating LLMs to inject malicious prompts indirectly. |
21 | Model Fairness Attacks | Foundational / Bias | Intentionally biasing the model by manipulating input data, affecting fairness and equity. |
22 | Model Explainability Attacks | Evasion | Designing inputs that make model decisions difficult to interpret, hindering transparency. |
23 | Robustness Attacks | Evasion | Testing the model's resilience by subjecting it to various perturbations to find weaknesses. |
24 | Security Attacks | General | Compromising the confidentiality, integrity, or availability of the model and its outputs. |
25 | Integrity Attacks | Foundational | Tampering with the model's architecture, weights, or biases to alter behavior without authorization. |
26 | Jailbreaking Attacks | Prompt-Based | Attempting to circumvent the ethical constraints or content filters in an LLM. |
27 | Training Data Extraction | Privacy | Inferring specific data used to train the model through carefully crafted queries. |
28 | Synthetic Data Generation Attacks | Foundational | Creating synthetic data designed to mislead or degrade AI model performance. |
29 | Model Stealing from Cloud | IP Theft | Extracting a trained model from a cloud service without direct access. |
30 | Model Poisoning from Edge | Foundational | Introducing malicious data at edge devices to corrupt model behavior. |
31 | Model Drift Detection Evasion | Evasion | Evading mechanisms that detect when a model's performance degrades over time. |
32 | Adversarial Example Generation with Deep Learning | Input Manipulation | Using advanced techniques to create adversarial examples that deceive the model. |
33 | Model Reprogramming | Foundational | Repurposing a model for a different task, potentially bypassing security measures. |
34 | Thermal Side-Channel Attacks | Side-Channel / Hardware | Using temperature variations in hardware during model inference to infer sensitive information. |
35 | Transfer Learning Attacks from Pre-Trained Models | Foundational | Poisoning pre-trained models to influence performance when transferred to new tasks. |
36 | Model Fairness and Bias Detection Evasion | Evasion / Bias | Designing attacks to evade detection mechanisms monitoring fairness and bias. |
37 | Model Explainability Attack | Evasion | Attacking the model's interpretability to prevent users from understanding its decision-making process. |
38 | Deepfake Attacks | Multimodal / Output Manip. | Creating realistic fake audio or video content to manipulate events or conversations. |
39 | Cloud-Based Model Replication | IP Theft | Replicating trained models in the cloud to develop competing products or gain unauthorized insights. |
40 | Confidentiality Attacks | Privacy | Extracting sensitive or proprietary information embedded within the model's parameters. |
41 | Quantum Attacks on LLMs | Theoretical / Cryptographic | Using quantum computing to theoretically compromise the security of LLMs or their cryptographic protections. |
42 | Model Stealing from Cloud with Pre-Trained Models | IP Theft | Extracting pre-trained models from the cloud without direct access. |
43 | Transfer Learning Attacks with Edge Devices | Foundational / Hardware | Compromising knowledge transferred to edge devices. |
44 | Adversarial Example Generation with Model Inversion | Input Manipulation | Creating adversarial examples using model inversion techniques. |
45 | Backdoor Attacks | Foundational | Embedding hidden behaviors within the model triggered by specific inputs. |
46 | Watermarking Attacks | Evasion / IP Theft | Removing or altering watermarks protecting intellectual property in AI models. |
47 | Neural Network Trojans | Foundational | Embedding malicious functionalities within the model triggered under certain conditions. |
48 | Model Black-Box Attacks | General | Exploiting the model using input-output queries without internal knowledge. |
49 | Model Update Attacks | Foundational | Manipulating the model during its update process to introduce vulnerabilities. |
50 | Gradient Inversion Attacks | Privacy | Reconstructing training data by exploiting gradients in federated learning. |
51 | Side-Channel Timing Attacks | Side-Channel / Hardware | Inferring model parameters or training data by measuring computation times during inference. |
52 | Adversarial Suffix | Prompt-Based | Appending a specifically crafted, often nonsensical string to a harmful prompt to cause the model to disregard its safety instructions. |
53 | Prefix Injection & Refusal Suppression | Prompt-Based | Forcing a model's response to start with an affirmative phrase or explicitly instructing it not to use refusal phrases to lower its defenses. |
54 | Encoding Obfuscation | Prompt-Based | Hiding a malicious payload in an encoded format (e.g., Base64, Hex) that the LLM is instructed to decode and then execute, bypassing text-based filters. |
55 | Payload Splitting | Prompt-Based | Breaking a malicious instruction into multiple, individually benign parts and asking the model to reassemble and execute them, bypassing filters that check instructions in isolation. |
56 | Markup Language Abuse | Prompt-Based | Using structured data formats like Markdown or HTML to create ambiguity between system instructions and user input, potentially causing the model to execute instructions with higher privilege. |
57 | Prompt Recursive Injection | Prompt-Based | Crafting prompts that recursively redefine instructions and cause infinite loops or privilege escalation. |
58 | Multi-Modal Adversarial Attacks | Multimodal | Exploiting vulnerabilities in models that process both text and images/audio by injecting adversarial perturbations across modalities. |
59 | Reinforcement Learning from Human Feedback (RLHF) Poisoning | Foundational | Attacking the feedback loops used for alignment to bias the model or weaken safety training. |
60 | Chain-of-Thought (CoT) Leakage | Prompt-Based | Forcing the model to reveal hidden reasoning traces, which may contain sensitive or filtered knowledge. |
61 | Model Compression/Distillation Attacks | Foundational | Exploiting vulnerabilities during model compression/distillation to introduce backdoors or reduce robustness. |
62 | Transferability Exploits | Foundational | Using adversarial examples crafted for one model to fool another (cross-model attacks). |
63 | Prompt Reset / Separator Injection | Prompt-Based | Injecting tokens or patterns that trick the model into resetting context or ignoring prior instructions. |
64 | Shadow Model Exploitation | IP Theft / Model Extraction | Building a parallel "shadow" model via query logging and then exploiting it to predict or exfiltrate target model behavior. |
65 | Retrieval Data Exfiltration | Data Exfiltration | Crafting queries that force the LLM to retrieve and output sensitive data from connected corpora or knowledge bases. |
66 | Long-Context Window Overload | Resource Exhaustion | Flooding the model with extremely long context input to bypass filters or degrade performance, potentially causing memory leaks or dropping safety filters. |
67 | Fine-Tuning Data Injection | Foundational | Poisoning during fine-tuning (instruction tuning, RLHF, or supervised fine-tuning) to inject malicious capabilities or suppress safety. |
68 | Semantic Perturbation Attacks | Input Manipulation | Altering benign-looking input with synonyms, typos, or semantic shifts that trick LLMs into misclassification or harmful behavior. |
69 | Context Switching Attacks | Prompt-Based | Tricking the model into switching "roles" or contexts mid-conversation, overriding safety policies. |
70 | Model Distillation IP Theft | IP Theft | Extracting distilled student models that replicate proprietary teacher model behavior, leaking IP. |
71 | Hybrid Supply Chain Attacks | Supply Chain | Combining poisoned datasets, compromised plugins, and adversarial fine-tunes to inject coordinated backdoors across AI pipelines. |
-
Notifications
You must be signed in to change notification settings - Fork 1
Comprehensive taxonomy of AI security vulnerabilities, LLM adversarial attacks, prompt injection techniques, and machine learning security research. Covers 71+ attack vectors including model poisoning, agentic AI exploits, and privacy breaches.
License
AI-Security-Research-Group/LLM-Attacks
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
About
Comprehensive taxonomy of AI security vulnerabilities, LLM adversarial attacks, prompt injection techniques, and machine learning security research. Covers 71+ attack vectors including model poisoning, agentic AI exploits, and privacy breaches.
Topics
Resources
License
Stars
Watchers
Forks
Releases
No releases published