A Typed & Interpretable Framework for Cyber Threat Intelligence Reasoning
Bridging MITRE ATT&CK, Knowledge Graphs, and Large Language Models
TITAN is a typed, bidirectional knowledge graph framework for Cyber Threat Intelligence (CTI) reasoning and question answering. It integrates data from the MITRE ATT&CK STIX bundles, builds a TITAN Ontology, generates reasoning (CoT) and non-reasoning (NoCoT) datasets, and provides an end-to-end pipeline for model training, evaluation, and graph execution.
[Figures: TITAN with Chain of Thought (CoT); No Chain of Thought (Example 1); No Chain of Thought (Example 2); TITAN as a tool for a Cybersecurity Agent]
TITAN implements the full pipeline described in the paper TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence.
It comprises:
- Typed Graph Construction: builds a bidirectional knowledge graph from MITRE ATT&CK STIX data using the TITAN Ontology, where each edge is semantically typed (e.g., uses_attack_pattern, mitigates_attack_pattern).
- Dataset Generation: creates large-scale QA/navigation datasets in both CoT and NoCoT formats, with executable relational paths (<PATH>...</PATH>).
- Data Splitting: produces train/validation/test splits across CTI sections.
- Path-Planner Training: fine-tunes LLMs for path generation using LoRA adapters (Unsloth + TRL).
- Graph Execution: executes generated paths over the TITAN Graph to return grounded entities and interpretable reasoning traces.
TITAN/
├─ datasets/
│  ├─ CoT/
│  ├─ NoCoT/
│  └─ create_dataset_splits.py   # split into train/val/test
├─ utils/
│  ├─ build_graph.py             # STIX → TITAN Ontology Graph (GraphML)
│  ├─ build_dataset.py           # Graph + YAML templates → dataset JSON
│  ├─ paraphrase.py              # optional: generate target variations via LLM
│  └─ useful_cot.yaml            # question templates with <PATH>...</PATH> and target
├─ graph_algorithm.py            # deterministic path execution utilities
├─ train_titan.py                # LoRA SFT training (Unsloth + TRL)
├─ test_titan.py                 # interactive tester for path planning & execution
├─ modify_target.py              # apply paraphrased targets to YAML/JSON
└─ README.md
Notes
- paraphrase.py is optional and not used unless applied via modify_target.py.
- Update the <img src="images/..."> path if your image file name differs.
Requirements:
- Python 3.9+
- Local MITRE ATT&CK STIX JSON bundles (e.g., ../attack-stix-data/)
- (Optional) GPU for LLM steps (paraphrase.py, training)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -U pip
pip install networkx pandas pyyaml tqdm scikit-learn
# For model training and testing:
pip install torch transformers accelerate datasets trl unsloth
Script: utils/build_graph.py
Generates titan_graph.graphml (bidirectional, typed graph).
python utils/build_graph.py --base ../attack-stix-data --out titan_graph.graphml --log-file build_log.txt
The resulting graph follows the TITAN Ontology, distinguishing semantic directions (e.g., uses_attack_pattern ↔ used_by_intrusion_set) and ensuring all relations are mirrored with coherent inverse semantics.
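As a rough illustration of what "mirrored with coherent inverse semantics" can mean, the sketch below turns STIX relationship objects into a forward typed edge plus an inverse edge. It is not the logic of utils/build_graph.py. The naming rule (relationship verb plus target type), the INVERSE_VERB table, the label mitigated_by_course_of_action, and the placeholder ids are all assumptions made for the example; only the standard STIX fields relationship_type, source_ref, and target_ref are taken as given.

```python
from collections import defaultdict

# Hypothetical inverse verbs; the real TITAN Ontology may name these differently.
INVERSE_VERB = {"uses": "used_by", "mitigates": "mitigated_by"}


def stix_type(ref):
    """Return the object type encoded in a STIX id such as 'malware--<uuid>'."""
    return ref.split("--", 1)[0].replace("-", "_")


def build_typed_edges(objects):
    """Map each STIX relationship to a forward edge and a mirrored inverse edge."""
    graph = defaultdict(list)  # node id -> list of (edge_type, neighbour id)
    for obj in objects:
        if obj.get("type") != "relationship":
            continue
        verb = obj["relationship_type"]
        if verb not in INVERSE_VERB:
            continue  # skip verbs this toy mapping does not cover
        src, dst = obj["source_ref"], obj["target_ref"]
        graph[src].append((f"{verb}_{stix_type(dst)}", dst))
        graph[dst].append((f"{INVERSE_VERB[verb]}_{stix_type(src)}", src))
    return dict(graph)


# Toy objects with placeholder ids, not real ATT&CK data.
toy_objects = [
    {"type": "relationship", "relationship_type": "uses",
     "source_ref": "intrusion-set--0001", "target_ref": "attack-pattern--0002"},
    {"type": "relationship", "relationship_type": "mitigates",
     "source_ref": "course-of-action--0003", "target_ref": "attack-pattern--0002"},
]

graph = build_typed_edges(toy_objects)
for node, edges in sorted(graph.items()):
    print(node, edges)
```

Storing the inverse edge explicitly lets a path be followed in either direction with a plain neighbour lookup, at the cost of doubling the edge count.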
Script: utils/build_dataset.py
Inputs:
- titan_graph.graphml
- utils/useful_cot.yaml: templates with <PATH>...</PATH> and target
Outputs:
- datasets/CoT/NAVIGATION_DATASET.json
- datasets/CoT/NAVIGATION_QUESTION_PER_SECTION.json
Example:
python utils/build_dataset.py \
--templates utils/useful_cot.yaml \
--graph titan_graph.graphml \
--out datasets/CoT/NAVIGATION_DATASET.json \
  --out datasets/CoT/NAVIGATION_QUESTION_PER_SECTION.json
Re-run for NoCoT using the corresponding output folder: datasets/NoCoT/
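# Convert the dataset JSON to CSV
python - <<'PY'
import json
import os
import pandas as pd

inp = "datasets/CoT/NAVIGATION_DATASET.json"
out = "datasets/CoT/NAVIGATION_DATASET.csv"
# Load the generated dataset; the context manager closes the file afterwards.
with open(inp, "r", encoding="utf-8") as f:
    data = json.load(f)
df = pd.DataFrame(data)
# The splitter requires a "Question" column, so rename a lowercase one if present.
if "question" in df.columns:
    df = df.rename(columns={"question": "Question"})
os.makedirs(os.path.dirname(out), exist_ok=True)
df.to_csv(out, index=False, encoding="utf-8")
print("Saved", out)
PY
You may refine the Objective/target terms using utils/paraphrase.py.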
This creates target_variations.csv, which can be applied to YAML or JSON via modify_target.py.
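The modify_target.py commands in this section pass --pick first or --pick longest. A plausible reading of those two policies is sketched below. The function pick_variation is hypothetical and not part of the repository, and both the interpretation of each policy and the handling of empty entries are assumptions.

```python
def pick_variation(variations, policy="first"):
    """Choose one paraphrase from a list of candidates (illustrative only)."""
    # Ignore empty or whitespace-only candidates.
    candidates = [v.strip() for v in variations if v and v.strip()]
    if not candidates:
        return None
    if policy == "first":
        return candidates[0]
    if policy == "longest":
        # max() keeps the earliest candidate when several share the top length.
        return max(candidates, key=len)
    raise ValueError(f"unknown policy: {policy!r}")


# Toy candidates, not taken from target_variations.csv.
options = ["mitigations", "defensive mitigations", ""]
print(pick_variation(options, "first"))
print(pick_variation(options, "longest"))
```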
python modify_target.py --csv target_variations.csv \
  --in utils/useful_cot.yaml --out utils/useful_cot.improved.yaml --pick first
python modify_target.py --csv target_variations.csv \
  --in datasets/CoT/NAVIGATION_DATASET.json \
  --out datasets/CoT/NAVIGATION_DATASET.improved.json \
  --pick longest
Script: datasets/create_dataset_splits.py
Inputs:
- CSV dataset (Question column required)
- Section mapping JSON
Outputs:
datasets/CoT/COMPLETE/train_dataset.csv
datasets/CoT/COMPLETE/val_dataset.csv
datasets/CoT/COMPLETE/test_dataset.csv
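To make the idea of splitting across CTI sections concrete, here is a minimal per-section split written with the standard library. It is a sketch, not the code of create_dataset_splits.py. The function name, the shape of the question-to-section mapping, and the rounding rule are assumptions, and it does not reproduce the balancing of small sections that the real splitter performs.

```python
import random
from collections import defaultdict


def split_by_section(section_of, train=0.80, val=0.05, test=0.15, seed=42):
    """Shuffle questions within each section and cut them by the given ratios."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rng = random.Random(seed)  # local generator so the split is reproducible
    by_section = defaultdict(list)
    for question, section in section_of.items():
        by_section[section].append(question)

    splits = {"train": [], "val": [], "test": []}
    for section in sorted(by_section):
        questions = sorted(by_section[section])
        rng.shuffle(questions)
        n_train = round(len(questions) * train)
        n_val = round(len(questions) * val)
        splits["train"] += questions[:n_train]
        splits["val"] += questions[n_train:n_train + n_val]
        splits["test"] += questions[n_train + n_val:]  # remainder goes to test
    return splits


# Toy mapping: 40 questions spread evenly over two hypothetical sections.
toy = {f"q{i}": ("techniques" if i % 2 else "mitigations") for i in range(40)}
splits = split_by_section(toy)
print({name: len(items) for name, items in splits.items()})
```

Splitting inside each section, rather than over the whole dataset, keeps every section represented in all three files in roughly the requested proportions.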
Example:
python datasets/create_dataset_splits.py \
--csv datasets/CoT/NAVIGATION_DATASET.csv \
--json datasets/CoT/NAVIGATION_QUESTION_PER_SECTION.json \
--out datasets/CoT/COMPLETE \
  --train 0.80 --val 0.05 --test 0.15 --seed 42
Script: train_titan.py
Fine-tunes an LLM (e.g., Phi-3.5, LLaMA, Qwen) using LoRA adapters.
Dataset directory structure:
TITAN_COMPLETE_DATASET/
├─ train_dataset.csv
├─ val_dataset.csv
└─ test_dataset.csv
Example:
python train_titan.py \
--data TITAN_COMPLETE_DATASET \
--out MODELS/phi_titan \
--model unsloth/Phi-3.5-mini-instruct \
--lr 3e-4 --train-bsz 8 --eval-bsz 8 --grad-accum 2 \
  --epochs 8 --seq-len 2048 --seed 42
This script saves the LoRA adapters and tokenizer into the --out directory.
Reduce --train-bsz or increase --grad-accum if GPU memory is insufficient.
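The memory advice rests on simple arithmetic: with gradient accumulation, the optimizer steps once per group of micro-batches, so the effective batch size is the per-device batch size times the number of accumulation steps (times the number of devices). The snippet below works through the numbers from the training command. It assumes that --train-bsz is a per-device batch size, that --grad-accum is the number of accumulation steps, and that training runs on a single device.

```python
def effective_batch_size(per_device_bsz, grad_accum, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_bsz * grad_accum * num_devices


baseline = effective_batch_size(8, 2)  # --train-bsz 8 --grad-accum 2
low_mem = effective_batch_size(4, 4)   # half the batch, double the accumulation
print(baseline, low_mem)  # prints: 16 16
```

Halving the batch size while doubling the accumulation steps keeps the effective batch size unchanged and lowers peak activation memory, at the cost of more forward and backward passes per optimizer step.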
Script: test_titan.py
Loads the trained model, generates an executable <PATH>...</PATH> plan, and executes it over the TITAN Graph.
python test_titan.py \
--model MODELS/phi_titan \
--names NAMES.txt \
--graph titan_graph.graphml \
  --rels Relationship_Descriptions.txt
Example query:
Which mitigations apply to techniques used by the Carberp malware?
The system generates a CoT reasoning trace, an executable path, and the final grounded entities.
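The plan-then-execute step can be pictured with a toy example: pull the <PATH>...</PATH> span out of the model output, then follow each typed relation over the graph while recording the intermediate frontier as a trace. This is a sketch under stated assumptions and not the code of test_titan.py or graph_algorithm.py. The " -> " separated path syntax with the start entity first, the edge label mitigated_by_course_of_action, and every node in TOY_GRAPH other than the name Carberp are placeholders invented for illustration.

```python
import re

# Toy typed graph: node -> {edge_type: [neighbours]}. Placeholder data only.
TOY_GRAPH = {
    "Carberp": {"uses_attack_pattern": ["technique-A", "technique-B"]},
    "technique-A": {"mitigated_by_course_of_action": ["mitigation-X"]},
    "technique-B": {"mitigated_by_course_of_action": ["mitigation-X", "mitigation-Y"]},
}

PATH_RE = re.compile(r"<PATH>(.*?)</PATH>", re.DOTALL)


def extract_path(model_output):
    """Return the hops of the first <PATH>...</PATH> span, or [] if none exists."""
    match = PATH_RE.search(model_output)
    if match is None:
        return []
    return [hop.strip() for hop in match.group(1).split("->") if hop.strip()]


def execute_path(graph, start, relations):
    """Follow each relation in turn; return final entities and a per-hop trace."""
    frontier = [start]
    trace = [(None, [start])]
    for rel in relations:
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, {}).get(rel, []):
                if neighbour not in next_frontier:  # de-duplicate, keep order
                    next_frontier.append(neighbour)
        frontier = next_frontier
        trace.append((rel, list(frontier)))
    return frontier, trace


output = ("Reasoning... <PATH>Carberp -> uses_attack_pattern -> "
          "mitigated_by_course_of_action</PATH>")
hops = extract_path(output)
entities, trace = execute_path(TOY_GRAPH, hops[0], hops[1:])
print(entities)
for relation, nodes in trace:
    print(relation, nodes)
```

Because execution is a deterministic lookup, the returned entities can only be nodes that exist in the graph, and the trace shows which hop produced each of them.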
- Missing columns: rename question → Question before splitting.
- Unknown mappings: may be excluded or labeled as Unknown.
- Small sections: the splitter balances small groups automatically.
- GPU unavailable: training runs on CPU but will be slow.
- CLI arguments not supported: set paths directly in scripts.
# 1. Build graph
python utils/build_graph.py --base ../attack-stix-data --out titan_graph.graphml
# 2. Build dataset
python utils/build_dataset.py \
--templates utils/useful_cot.yaml \
--graph titan_graph.graphml \
--out datasets/CoT/NAVIGATION_DATASET.json \
--out datasets/CoT/NAVIGATION_QUESTION_PER_SECTION.json
# 3. (Optional) Apply paraphrased targets
python modify_target.py --csv target_variations.csv \
--in datasets/CoT/NAVIGATION_DATASET.json \
--out datasets/CoT/NAVIGATION_DATASET.improved.json
# 4. Convert to CSV
python - <<'PY'
import json, pandas as pd, os
inp="datasets/CoT/NAVIGATION_DATASET.json"; out="datasets/CoT/NAVIGATION_DATASET.csv"
data=json.load(open(inp,"r",encoding="utf-8")); df=pd.DataFrame(data)
if "question" in df.columns: df=df.rename(columns={"question":"Question"})
os.makedirs(os.path.dirname(out), exist_ok=True); df.to_csv(out, index=False, encoding="utf-8")
print("Saved", out)
PY
# 5. Split
python datasets/create_dataset_splits.py \
--csv datasets/CoT/NAVIGATION_DATASET.csv \
--json datasets/CoT/NAVIGATION_QUESTION_PER_SECTION.json \
--out datasets/CoT/COMPLETE --train 0.80 --val 0.05 --test 0.15
# 6. Train
python train_titan.py --data TITAN_COMPLETE_DATASET --out MODELS/phi_titan
# 7. Test
python test_titan.py --model MODELS/phi_titan --names NAMES.txt --graph titan_graph.graphml --rels Relationship_Descriptions.txt