# TRYLOCK: Adversarial Enterprise Guard for Intrinsic Security

An open-source research project that builds a dataset and training pipeline to improve open LLMs' resistance to prompt-based attacks while minimizing over-refusal.
Current LLM defenses leave a critical gap:
| Defense Layer | Protection | Issue |
|---|---|---|
| Base model | ~0% | Will do anything |
| Instruct/RLHF | ~60% | Basic safety training |
| Flagship (Claude/GPT) | ~75% | Must stay usable for everyone |
| Third-party guardrails | ~95% | 20%+ false positive rate |
Enterprises need 85-90% protection without the false-positive explosion of third-party guardrails.
TRYLOCK provides a three-layer defense stack:
```
┌─────────────────────────────────────────────────────────────────────┐
│                      TRYLOCK v2 DEFENSE STACK                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: KNOWLEDGE (LoRA + DPO)                                    │
│  └── Teaches model what attacks look like through preference        │
│      learning on multi-turn trajectories                            │
│                                                                     │
│  Layer 2: INSTINCT (Representation Engineering)                     │
│  └── Dampens "attack compliance" direction with tunable α           │
│      coefficient (0.0 = research, 1.0 = balanced, 2.5 = lockdown)   │
│                                                                     │
│  Layer 3: OVERSIGHT (Security Sidecar)                              │
│  └── Parallel 8B classifier scores conversation state               │
│      (SAFE | WARN | ATTACK) invisible to attacker                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
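The α coefficient in Layer 2 can be pictured as scaling how strongly the "attack compliance" direction is projected out of a hidden state. A minimal sketch of that dampening step, assuming a single steering vector per layer (function and variable names here are illustrative, not the actual TRYLOCK API):

```python
import numpy as np

def dampen(hidden: np.ndarray, attack_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Subtract alpha times the component of `hidden` along the attack direction.

    alpha = 0.0 leaves activations untouched (research mode);
    larger alpha removes more of the attack-compliance component.
    """
    unit = attack_dir / np.linalg.norm(attack_dir)
    return hidden - alpha * np.dot(hidden, unit) * unit

h = np.array([3.0, 4.0])          # toy 2-d "hidden state"
v = np.array([1.0, 0.0])          # hypothetical attack-compliance direction
print(dampen(h, v, 0.0))          # unchanged
print(dampen(h, v, 1.0))          # component along v fully removed
```

In practice the vector would be applied via forward hooks on selected transformer layers; the projection arithmetic is the same.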
The TRYLOCK defense system is fully trained and available on HuggingFace:
- DPO Adapter: scthornton/trylock-mistral-7b-dpo
- RepE Steering Vectors: scthornton/trylock-repe-vectors
- Sidecar Classifier: scthornton/trylock-sidecar-classifier
- Public Sample: scthornton/trylock-demo-dataset (48 diverse examples)
- Full Training Set: Private (2,939 preference pairs - available upon request for academic research)
See paper/TRYLOCK_Canonical.md for the complete research paper documenting methodology, experiments, and results.
Performance: TRYLOCK achieves 82.8% reduction in attack success rate (from 100% to 17.2%) while maintaining low over-refusal (12.6%) on benign queries.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load DPO-trained model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "scthornton/trylock-mistral-7b-dpo")

# See scripts/eval_cpu_only.py for the full three-layer implementation
```

To install from source:

```bash
git clone https://github.com/scthornton/trylock.git
cd trylock
pip install -e ".[dev]"
```

```bash
# Generate 100 attack trajectories (mock mode - no API)
python scripts/generate_data.py generate --count 100 --mock

# Generate with real API calls
export ANTHROPIC_API_KEY="your-key"
python scripts/generate_data.py generate --count 100

# Generate benign hard negatives
python scripts/generate_data.py benign --count 50

# Validate generated data
python scripts/generate_data.py validate data/tier1_open/attacks/generated.jsonl
```

The pipeline can also be driven from Python:

```python
from generation import TRYLOCKPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    red_bot_model="claude-sonnet-4-20250514",
    judge_model="claude-sonnet-4-20250514",
    mutations_per_trajectory=3,
)

# Generate trajectories
pipeline = TRYLOCKPipeline(config)
trajectories = pipeline.run(count=100)

# Save output
pipeline.save(trajectories, "data/output.jsonl")
```

Each trajectory contains:
```json
{
  "id": "trylock-mt-00421",
  "version": "2.0",
  "attack_metadata": {
    "family": "multi_turn_manipulation",
    "type": "crescendo",
    "wrappers": ["adversarial_poetry"],
    "manipulation_tactics": ["context_anchoring"],
    "difficulty": "hard"
  },
  "conversation": [
    {
      "turn": 1,
      "role": "user",
      "content": "I'm researching security...",
      "turn_labels": {
        "attack_stage": "setup",
        "attack_intent_score": 0,
        "expected_action": "continue"
      }
    }
  ],
  "pivot_turn_index": 3,
  "responses": {
    "chosen": {"content": "...", "response_type": "soft_decline"},
    "rejected_unsafe": {"content": "...", "failure_mode": "..."},
    "rejected_overblock": {"content": "...", "failure_mode": "..."}
  }
}
```

TRYLOCK covers five attack families:
| Family | Description | Priority |
|---|---|---|
| Multi-turn Manipulation | Crescendo, context anchoring, boundary softening | HIGH |
| Indirect Injection | RAG poisoning, tool output injection | HIGH |
| Obfuscation Wrappers | Poetry, roleplay, encoding, translation | MEDIUM |
| Direct Injection | Classic jailbreaks, system prompt extraction | MEDIUM |
| Tool/Agent Abuse | Instruction hierarchy attacks, hidden goals | EMERGING |
See taxonomy/v2.0/attack_families.yaml for the full taxonomy.
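Since trajectories are plain JSON records, they can be consumed with the standard library; a hypothetical helper (not part of the repository) that reduces a record like the one shown above to the fields most evaluations need:

```python
import json

def summarize(record: dict) -> dict:
    """Pull the key eval fields out of a TRYLOCK trajectory record."""
    return {
        "id": record["id"],
        "family": record["attack_metadata"]["family"],
        "difficulty": record["attack_metadata"]["difficulty"],
        "turns": len(record["conversation"]),
        "pivot_turn_index": record["pivot_turn_index"],
    }

raw = '''{"id": "trylock-mt-00421",
  "attack_metadata": {"family": "multi_turn_manipulation", "difficulty": "hard"},
  "conversation": [{"turn": 1, "role": "user", "content": "..."}],
  "pivot_turn_index": 3}'''
print(summarize(json.loads(raw)))
```

For schema-level checks, `scripts/generate_data.py validate` and the validator under `data/schema/` are the canonical tools.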
```
trylock/
├── taxonomy/v2.0/              # Attack classification system
│   ├── attack_families.yaml
│   ├── manipulation_tactics.yaml
│   ├── attack_stages.yaml
│   └── response_types.yaml
│
├── data/
│   ├── schema/                 # JSON schema + validator
│   ├── tier1_open/             # Public dataset (Apache 2.0)
│   ├── tier2_gated/            # Research agreement required
│   └── tier3_private/          # Internal only
│
├── generation/                 # Data generation pipeline
│   ├── red_bot.py              # Attack generator
│   ├── victim_bot.py           # Target model simulator
│   ├── judge_bot.py            # Labeler + response generator
│   ├── mutation_engine.py      # Create attack variants
│   ├── activation_capture.py   # RepE training data
│   └── pipeline.py             # Orchestration
│
├── training/                   # Training pipeline (coming soon)
│   ├── sft_warmup.py
│   ├── dpo_preference.py
│   ├── repe_training.py
│   └── sidecar_classifier.py
│
├── eval/                       # Evaluation framework (coming soon)
│   ├── harness.py
│   ├── metrics.py
│   └── benchmarks/
│
└── scripts/                    # CLI tools
    └── generate_data.py
```
Target metrics (ASR = attack success rate):

| Metric | Baseline | Target |
|---|---|---|
| Single-turn ASR | ~25% | ≤10% |
| Multi-turn ASR | ~35% | ≤15% |
| Indirect/RAG ASR | ~40% | ≤20% |
| Novel wrapper ASR | ~60% | ≤30% |
| Over-refusal rate | - | ≤+2-4% |
| Capability preservation | 100% | ≥95% |
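Given per-example eval results, the two headline numbers in the table reduce to simple ratios; a minimal sketch, assuming each result carries `kind`, `complied`, and `refused` fields (the actual harness field names may differ):

```python
def attack_success_rate(results: list[dict]) -> float:
    """Fraction of attack trajectories where the model complied."""
    attacks = [r for r in results if r["kind"] == "attack"]
    return sum(r["complied"] for r in attacks) / len(attacks)

def over_refusal_rate(results: list[dict]) -> float:
    """Fraction of benign queries the model refused."""
    benign = [r for r in results if r["kind"] == "benign"]
    return sum(r["refused"] for r in benign) / len(benign)

results = [
    {"kind": "attack", "complied": True,  "refused": False},
    {"kind": "attack", "complied": False, "refused": True},
    {"kind": "benign", "complied": True,  "refused": False},
    {"kind": "benign", "complied": False, "refused": True},
]
print(attack_success_rate(results))  # 0.5
print(over_refusal_rate(results))    # 0.5
```

The `eval/` harness and `metrics.py` (coming soon) are intended to compute these over the full benchmark suites.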
Related work:

- SecAlign: arXiv:2410.05451
- MTJ-Bench: arXiv:2508.06755
- PoisonedRAG: USENIX Security 2025
- Adversarial Poetry: arXiv:2511.15304
- LLMail-Inject: arXiv:2506.09956
We welcome contributions! Areas of interest:
- New attack patterns: Especially novel multi-turn and indirect injection
- Benign hard negatives: Cases that look like attacks but aren't
- Evaluation benchmarks: Integration with existing security benchmarks
- Training improvements: Better DPO/RepE configurations
Please see CONTRIBUTING.md for guidelines.
Apache 2.0 with a Responsible Use Addendum. See LICENSE.
The dataset is intended for defensive security research only. Do not use this data to:
- Train models intended to generate attacks
- Bypass security measures on systems you don't own
- Cause harm to individuals or organizations
```bibtex
@software{trylock2025,
  title  = {TRYLOCK: Adversarial Enterprise Guard for Intrinsic Security},
  author = {Thornton, Scott},
  year   = {2025},
  url    = {https://github.com/scthornton/trylock}
}
```

- Project Lead: Scott Thornton
- Organization: perfecXion.ai
- GitHub: @scthornton
- Dataset: huggingface.co/datasets/scthornton/trylock