AtlasKV is a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g., 1B triples) at very low GPU memory cost (e.g., less than 20GB VRAM), while achieving superior knowledge grounding performance and strong generalization ability.
- Create and activate the conda environment

```shell
conda create -n atlaskv python=3.9
conda activate atlaskv
```

- Install the AtlasKV package

```shell
git clone https://github.com/your-repo/AtlasKV.git
cd AtlasKV
pip install -e .
```

- Configure Hugging Face access (required for Llama models)

```shell
pip install huggingface_hub
huggingface-cli login
```
AtlasKV supports two dataset construction methods:
| Method | Description |
|---|---|
| Synthetic | Fully synthetic method proposed in KBLaM |
| KG2KV | KG2KV method proposed in our work |
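The core KG2KV idea of turning each KG triple into a question-style key and an answer-style value can be sketched as follows. This is a minimal illustration only; the function name and templates are hypothetical, not the AtlasKV implementation:

```python
# Hypothetical sketch of the KG2KV idea: convert a (head, relation, tail)
# triple into a question-style key and an answer-style value.
# Function name and templates are illustrative, not the AtlasKV API.
def triple_to_kv(head, relation, tail):
    key = f"What is the {relation} of {head}?"
    value = f"The {relation} of {head} is {tail}."
    return {"key": key, "value": value}

kv = triple_to_kv("Alan Turing", "birthplace", "London")
print(kv["key"])    # What is the birthplace of Alan Turing?
print(kv["value"])  # The birthplace of Alan Turing is London.
```

The key is what gets embedded for retrieval; the value carries the grounded answer text.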
| Encoder | Provider |
|---|---|
| text-embedding-ada-002 | OpenAI |
| all-MiniLM-L6-v2 | Hugging Face |
| text-embedding-3-large | OpenAI |
🔥 Pre-built Dataset: We also provide a pre-constructed Synthetic dataset for download.
⚠️ Configuration Reminder: Make sure to replace all path configurations in the following scripts with your own paths.
```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_construct.sh
```

Choose your preferred sentence encoder:
🔧 all-MiniLM-L6-v2

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_allminilm_embd.sh
```

🔧 text-embedding-ada-002

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_oai_embd.sh
```

🔧 text-embedding-3-large

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_bigoai_embd.sh
```

Run the corresponding split script based on your chosen encoder:

```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/syn_allminilm_split.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/syn_oai_split.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/syn_bigoai_split.sh     # text-embedding-3-large
```

After completing the above steps, you will obtain the following files:
📦 Dataset File Structure

```
├── synthetic_data_qkv.json                            # Raw data
├── synthetic_data_qkv_[encoder]_embd_key.npy          # Key embeddings
├── synthetic_data_qkv_[encoder]_embd_value.npy        # Value embeddings
├── train_synthetic_data_qkv.json                      # Training data
├── train_synthetic_data_qkv_[encoder]_embd_key.npy    # Training key embeddings
├── train_synthetic_data_qkv_[encoder]_embd_value.npy  # Training value embeddings
├── test_synthetic_data_qkv.json                       # Test data
├── test_synthetic_data_qkv_[encoder]_embd_key.npy     # Test key embeddings
└── test_synthetic_data_qkv_[encoder]_embd_value.npy   # Test value embeddings
```
💡 Tip: `[encoder]` will be replaced with the corresponding encoder name (e.g., `all-MiniLM-L6-v2`, `oai`, `bigoai`).
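Once the files are generated, a quick sanity check is that each embedding matrix has one row per JSON record. A minimal sketch with stand-in data (the two dummy records, the `oai` encoder name, and the 1536-dimensional embeddings here are illustrative; the real files are produced by the embedding scripts above):

```python
import json
import numpy as np

# Stand-in raw data and embeddings, following the file-name pattern above.
records = [{"key": "q1", "value": "a1"}, {"key": "q2", "value": "a2"}]
with open("synthetic_data_qkv.json", "w") as f:
    json.dump(records, f)
np.save("synthetic_data_qkv_oai_embd_key.npy", np.random.rand(2, 1536))
np.save("synthetic_data_qkv_oai_embd_value.npy", np.random.rand(2, 1536))

# Sanity check: row counts of key/value embeddings match the record count.
with open("synthetic_data_qkv.json") as f:
    n_records = len(json.load(f))
keys = np.load("synthetic_data_qkv_oai_embd_key.npy")
values = np.load("synthetic_data_qkv_oai_embd_value.npy")
assert keys.shape[0] == values.shape[0] == n_records
print(keys.shape)  # (2, 1536)
```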
🔥 Pre-built Datasets:
- ATLAS-Wiki-QKV (based on Wikipedia)
- ATLAS-CC-QKV (based on Common Crawl)
```shell
cd dataset_generation
python build_atlas_training_data.py
```

Using ATLAS-Wiki as an example, choose your sentence encoder:
🔧 all-MiniLM-L6-v2

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_allminilm_embd.sh
```

🔧 text-embedding-ada-002

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_oai_embd.sh
```

🔧 text-embedding-3-large

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_bigoai_embd.sh
```

Using ATLAS-Wiki as an example:
```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_split.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_split.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_split.sh     # text-embedding-3-large
```
⚠️ Important Configuration: In this step, you need to modify the script settings:
- Change `CLUSTER=False` to `CLUSTER=True`
- Change `GENERATING_EMBEDDINGS=True` to `GENERATING_EMBEDDINGS=False`
- Change the dataset name to the test set (e.g., `your_output_path/atlas_wiki_qa.json` → `your_output_path/test_atlas_wiki_qa.json`)
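If you prefer not to edit the script by hand, the three changes above can be applied with `sed`. The sketch below works on a stand-in config file created for illustration; in practice, point the `sed` commands at the real split script under `experiments/scripts/data`:

```shell
# Stand-in config file for illustration; in practice edit the real split script.
printf 'CLUSTER=False\nGENERATING_EMBEDDINGS=True\nDATA=your_output_path/atlas_wiki_qa.json\n' > split_cfg.sh

# Apply the three required changes in place.
sed -i 's/CLUSTER=False/CLUSTER=True/' split_cfg.sh
sed -i 's/GENERATING_EMBEDDINGS=True/GENERATING_EMBEDDINGS=False/' split_cfg.sh
sed -i 's#/atlas_wiki_qa.json#/test_atlas_wiki_qa.json#' split_cfg.sh
cat split_cfg.sh
```

(Note: on macOS, `sed -i` requires a backup suffix argument, e.g. `sed -i ''`.)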
```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_embd.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_embd.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_embd.sh     # text-embedding-3-large
```

After completing the above steps, you will obtain the following files:
📦 KG2KV Dataset File Structure

```
├── atlas_wiki_qkv.json                                               # Raw data
├── atlas_wiki_qkv_[encoder]_embd_key.npy                             # Key embeddings
├── atlas_wiki_qkv_[encoder]_embd_value.npy                           # Value embeddings
├── train_atlas_wiki_qkv.json                                         # Training data
├── train_atlas_wiki_qkv_[encoder]_embd_key.npy                       # Training key embeddings
├── train_atlas_wiki_qkv_[encoder]_embd_value.npy                     # Training value embeddings
├── test_atlas_wiki_qkv.json                                          # Test data
├── test_atlas_wiki_qkv_[encoder]_embd_key.npy                        # Test key embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_value.npy                      # Test value embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_c2id_mapping.json   # Inter1 cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_id2c_mapping.json   # Inter1 reverse mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1.npy                 # Inter1 embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_c2id_mapping.json     # Root cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_id2c_mapping.json     # Root reverse mapping
└── test_atlas_wiki_qkv_[encoder]_embd_key_root.npy                   # Root embeddings
```
💡 Tip: `[encoder]` will be replaced with the corresponding encoder name (e.g., `all-MiniLM-L6-v2`, `oai`, `bigoai`).
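The `*_c2id_mapping.json` and `*_id2c_mapping.json` files are a forward and inverse lookup pair between cluster labels and entry ids. A small sketch of the assumed structure (illustrative only, not taken from the AtlasKV code):

```python
import json

# Assumed structure: c2id maps a cluster label to the entry ids it contains;
# id2c is the inverse (entry id -> cluster label). Labels/ids are illustrative.
c2id = {"0": [0, 2], "1": [1, 3]}
id2c = {str(i): c for c, ids in c2id.items() for i in ids}

with open("key_root_c2id_mapping.json", "w") as f:
    json.dump(c2id, f)
with open("key_root_id2c_mapping.json", "w") as f:
    json.dump(id2c, f)

# Consistency check: every id listed under a cluster maps back to that cluster.
for cluster, ids in c2id.items():
    assert all(id2c[str(i)] == cluster for i in ids)
print(id2c["2"])  # 0
```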
To train the model, run the following scripts:
⚠️ Important: Make sure to replace all path configurations in the scripts with your actual paths.
🔧 Train with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_OAI.sh
```

🔧 Train with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_BigOAI.sh
```

🔧 Train with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_allminilm.sh
```

🔧 Train with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_OAI.sh
```

🔧 Train with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_BigOAI.sh
```

🔧 Train with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_allminilm.sh
```

We use the ATLAS-CC-QKV dataset as an example for evaluation.
We report only the maximum GPU memory cost during the prefilling and decoding steps of generation. The GPU cost of the offline process of encoding KGKVs is not included.
```shell
cd experiments
bash ../experiments/scripts/test_mem/test_wiki_on_cc_bigoai.sh
```

```shell
cd experiments
bash ../experiments/scripts/test_mem/test_syn_on_cc_bigoai.sh
```

To test the GPU cost of in-context and zero-shot learning, simply change the `eval_mode` of any script in this section to `icl` or `zeroshot`.
⚠️ Configuration Required: Before running `result_disp.py`, configure `kb_size`, `model_str`, and `your_result_save_dir`.
🔧 Test with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_allminilm.sh
python result_disp.py
```

🔧 Test with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_oai.sh
python result_disp.py
```

🔧 Test with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_bigoai.sh
python result_disp.py
```

🔧 Test with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_allminilm.sh
python result_disp.py
```

🔧 Test with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_oai.sh
python result_disp.py
```

🔧 Test with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_bigoai.sh
python result_disp.py
```

Collect the output results from the knowledge grounding accuracy evaluation scripts. Configure the paths, GPT endpoint URL, and GPT endpoint API key in `output_scorer.py`. Then run:

```shell
python output_scorer.py
```

We provide some training checkpoints of AtlasKV that can be used directly.
| Model Component | Download Link |
|---|---|
| Main Model | Download |
| Encoder | Download |
| Model Component | Download Link |
|---|---|
| Main Model | Download |
| Encoder | Download |
💡 Usage: Download both the main model and encoder files for complete functionality.
We gratefully acknowledge the use of the following open-source projects in our work:
| Project | Description |
|---|---|
| KBLaM | A new method for augmenting LLMs with external knowledge. |
| AutoSchemaKG | A novel framework for automatic knowledge graph construction with schema generation via conceptualization. |
If you use AtlasKV in your research, please cite our paper:
```bibtex
@misc{huang2025atlaskvaugmentingllmsbillionscale,
  title={AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM},
  author={Haoyu Huang and Hong Ting Tsang and Jiaxin Bai and Xi Peng and Gong Zhang and Yangqiu Song},
  year={2025},
  eprint={2510.17934},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.17934},
}
```