As enterprises deploy Retrieval-Augmented Generation (RAG) pipelines, the primary attack vector has shifted from the network to the data supply chain. If an attacker injects a poisoned document into the vector database (e.g., one with altered routing numbers or embedded prompt injections), the model can retrieve it as trusted context and serve malicious output to every user whose query matches it.
This project introduces a Zero-Trust Data Provenance Pipeline. It intercepts raw documents before they are vectorized, computes a deterministic SHA-256 digest of each one, and cross-references that digest against an immutable, compliance-approved ledger.
- Prevents AI Data Poisoning: A single altered character in a 500-page PDF will completely change its cryptographic hash, triggering an immediate Hard Block at the vectorization layer.
- Audit-Ready Data Lineage: Provides cryptographically verifiable evidence that every document embedded in the RAG database matches a version explicitly approved by compliance, which supports SOC 2 and HIPAA audit requirements.
- Secures the AI Supply Chain: Guards against shadow IT or compromised CI/CD pipelines silently injecting malicious context into the enterprise LLM.
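
The hash-and-ledger check described above can be illustrated with a minimal sketch. This is not the project's `src/pipeline.py`; the names `sha256_hex`, `verify_before_embedding`, and `ProvenanceError` are hypothetical, and the ledger is modeled as an in-memory set of approved digests purely for illustration. The sample document text and its placeholder routing number are made up.

```python
import hashlib
from typing import AbstractSet


class ProvenanceError(Exception):
    """Raised when a document's digest is not on the approved ledger."""


def sha256_hex(data: bytes) -> str:
    # SHA-256 is deterministic: identical bytes always give the same digest.
    return hashlib.sha256(data).hexdigest()


def verify_before_embedding(data: bytes, approved_digests: AbstractSet[str]) -> str:
    # Hypothetical gate: return the digest if approved, otherwise hard-block.
    digest = sha256_hex(data)
    if digest not in approved_digests:
        raise ProvenanceError(f"hard block: digest {digest} is not on the ledger")
    return digest


# Illustrative documents that differ by a single character.
approved_doc = b"Wire transfers use routing number 000000001."
tampered_doc = b"Wire transfers use routing number 000000002."

# Assumption: the ledger is a read-only set of digests approved by compliance.
ledger = frozenset({sha256_hex(approved_doc)})

verify_before_embedding(approved_doc, ledger)  # passes the gate

try:
    verify_before_embedding(tampered_doc, ledger)
    blocked = False
except ProvenanceError:
    blocked = True  # the one-character change produces an unknown digest

print("tampered document blocked:", blocked)
```

A set lookup keeps the check independent of file names or metadata: only the document's bytes decide whether it may be embedded. A real ledger would presumably live in an append-only store rather than in process memory.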
1. Create the mock data files:
mkdir mock_data
echo -n "" > mock_data/clean_financial_policy.pdf
echo -n "malicious_injection" > mock_data/poisoned_financial_policy.pdf
2. Execute the Data Pipeline:
python src/pipeline.py
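
To see which digests the two mock files produce, the sketch below hashes them in fixed-size chunks, which also works for large PDFs that should not be read into memory at once. It is an illustration only: `sha256_file` is a hypothetical helper, not a function from this project, and the files are recreated in a temporary directory with the same contents the `echo -n` commands write (assuming a shell where `echo -n` suppresses the trailing newline, so the clean file is empty).

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks so memory use stays constant.
    hasher = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    mock_dir = Path(tmp) / "mock_data"
    mock_dir.mkdir()

    clean = mock_dir / "clean_financial_policy.pdf"
    poisoned = mock_dir / "poisoned_financial_policy.pdf"
    clean.write_bytes(b"")                        # same bytes as: echo -n ""
    poisoned.write_bytes(b"malicious_injection")  # same bytes as the second echo

    clean_digest = sha256_file(clean)
    poisoned_digest = sha256_file(poisoned)

print("clean:   ", clean_digest)
print("poisoned:", poisoned_digest)
```

Because the clean mock file is empty, its digest is the well-known SHA-256 of zero bytes, `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`.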