AtlasKV is a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g., 1B triples) at very low GPU memory cost (e.g., less than 20GB VRAM), while achieving superior knowledge grounding performance and strong generalization ability.
- Create and activate the conda environment

```shell
conda create -n atlaskv python=3.9
conda activate atlaskv
```

- Install the AtlasKV package

```shell
git clone https://github.com/your-repo/AtlasKV.git
cd AtlasKV
pip install -e .
```

- Configure Hugging Face access (required for Llama models)

```shell
pip install huggingface_hub
huggingface-cli login
```
AtlasKV supports two dataset construction methods:
| Method | Description |
|---|---|
| Synthetic | Fully synthetic method proposed in KBLaM |
| KG2KV | KG2KV method proposed in our work |
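The core KG2KV idea of turning each KG triple into a question-style key and an answer-style value can be sketched as follows. This is a minimal illustration only; the function name and templates are hypothetical, not the AtlasKV implementation:

```python
# Hypothetical sketch of the KG2KV idea: convert a (head, relation, tail)
# triple into a question-style key and an answer-style value.
# Function name and templates are illustrative, not the AtlasKV API.
def triple_to_kv(head, relation, tail):
    key = f"What is the {relation} of {head}?"
    value = f"The {relation} of {head} is {tail}."
    return {"key": key, "value": value}

kv = triple_to_kv("Alan Turing", "birthplace", "London")
print(kv["key"])    # What is the birthplace of Alan Turing?
print(kv["value"])  # The birthplace of Alan Turing is London.
```

The key is what gets embedded for retrieval; the value carries the grounded answer text.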
| Encoder | Provider |
|---|---|
| text-embedding-ada-002 | OpenAI |
| all-MiniLM-L6-v2 | Hugging Face |
| text-embedding-3-large | OpenAI |
🔥 Pre-built Dataset: We also provide a pre-constructed Synthetic dataset for download.
⚠️ Configuration Reminder: Make sure to replace all path configurations in the following scripts with your own paths.
```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_construct.sh
```

Choose your preferred sentence encoder:
🔧 all-MiniLM-L6-v2

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_allminilm_embd.sh
```

🔧 text-embedding-ada-002

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_oai_embd.sh
```

🔧 text-embedding-3-large

```shell
cd dataset_generation
bash ../experiments/scripts/data/syn_bigoai_embd.sh
```

Run the corresponding split script based on your chosen encoder:

```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/syn_allminilm_split.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/syn_oai_split.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/syn_bigoai_split.sh     # text-embedding-3-large
```

After completing the above steps, you will obtain the following files:
📦 Dataset File Structure

```
├── synthetic_data_qkv.json                            # Raw data
├── synthetic_data_qkv_[encoder]_embd_key.npy          # Key embeddings
├── synthetic_data_qkv_[encoder]_embd_value.npy        # Value embeddings
├── train_synthetic_data_qkv.json                      # Training data
├── train_synthetic_data_qkv_[encoder]_embd_key.npy    # Training key embeddings
├── train_synthetic_data_qkv_[encoder]_embd_value.npy  # Training value embeddings
├── test_synthetic_data_qkv.json                       # Test data
├── test_synthetic_data_qkv_[encoder]_embd_key.npy     # Test key embeddings
└── test_synthetic_data_qkv_[encoder]_embd_value.npy   # Test value embeddings
```
💡 Tip: `[encoder]` will be replaced with the corresponding encoder name (e.g., `all-MiniLM-L6-v2`, `oai`, `bigoai`).
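Once the files are generated, a quick sanity check is that each embedding matrix has one row per JSON record. A minimal sketch with stand-in data (the two dummy records, the `oai` encoder name, and the 1536-dimensional embeddings here are illustrative; the real files are produced by the embedding scripts above):

```python
import json
import numpy as np

# Stand-in raw data and embeddings, following the file-name pattern above.
records = [{"key": "q1", "value": "a1"}, {"key": "q2", "value": "a2"}]
with open("synthetic_data_qkv.json", "w") as f:
    json.dump(records, f)
np.save("synthetic_data_qkv_oai_embd_key.npy", np.random.rand(2, 1536))
np.save("synthetic_data_qkv_oai_embd_value.npy", np.random.rand(2, 1536))

# Sanity check: row counts of key/value embeddings match the record count.
with open("synthetic_data_qkv.json") as f:
    n_records = len(json.load(f))
keys = np.load("synthetic_data_qkv_oai_embd_key.npy")
values = np.load("synthetic_data_qkv_oai_embd_value.npy")
assert keys.shape[0] == values.shape[0] == n_records
print(keys.shape)  # (2, 1536)
```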
🔥 Pre-built Datasets:
- ATLAS-Wiki-QKV (based on Wikipedia)
- ATLAS-CC-QKV (based on Common Crawl)
```shell
cd dataset_generation
python build_atlas_training_data.py
```

Using ATLAS-Wiki as an example, choose your sentence encoder:
🔧 all-MiniLM-L6-v2

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_allminilm_embd.sh
```

🔧 text-embedding-ada-002

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_oai_embd.sh
```

🔧 text-embedding-3-large

```shell
cd dataset_generation
bash ../experiments/scripts/data/wiki_bigoai_embd.sh
```

Using ATLAS-Wiki as an example:
```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_split.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_split.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_split.sh     # text-embedding-3-large
```
⚠️ Important Configuration: In this step, you need to modify the script settings:
- Change `CLUSTER=False` to `CLUSTER=True`
- Change `GENERATING_EMBEDDINGS=True` to `GENERATING_EMBEDDINGS=False`
- Change the dataset name to the test set (e.g., `your_output_path/atlas_wiki_qa.json` → `your_output_path/test_atlas_wiki_qa.json`)
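If you prefer not to edit the script by hand, the three changes above can be applied with `sed`. The sketch below works on a stand-in config file created for illustration; in practice, point the `sed` commands at the real split script under `experiments/scripts/data`:

```shell
# Stand-in config file for illustration; in practice edit the real split script.
printf 'CLUSTER=False\nGENERATING_EMBEDDINGS=True\nDATA=your_output_path/atlas_wiki_qa.json\n' > split_cfg.sh

# Apply the three required changes in place.
sed -i 's/CLUSTER=False/CLUSTER=True/' split_cfg.sh
sed -i 's/GENERATING_EMBEDDINGS=True/GENERATING_EMBEDDINGS=False/' split_cfg.sh
sed -i 's#/atlas_wiki_qa.json#/test_atlas_wiki_qa.json#' split_cfg.sh
cat split_cfg.sh
```

(Note: on macOS, `sed -i` requires a backup suffix argument, e.g. `sed -i ''`.)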
```shell
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_embd.sh  # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_embd.sh        # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_embd.sh     # text-embedding-3-large
```

After completing the above steps, you will obtain the following files:
📦 KG2KV Dataset File Structure

```
├── atlas_wiki_qkv.json                                               # Raw data
├── atlas_wiki_qkv_[encoder]_embd_key.npy                             # Key embeddings
├── atlas_wiki_qkv_[encoder]_embd_value.npy                           # Value embeddings
├── train_atlas_wiki_qkv.json                                         # Training data
├── train_atlas_wiki_qkv_[encoder]_embd_key.npy                       # Training key embeddings
├── train_atlas_wiki_qkv_[encoder]_embd_value.npy                     # Training value embeddings
├── test_atlas_wiki_qkv.json                                          # Test data
├── test_atlas_wiki_qkv_[encoder]_embd_key.npy                        # Test key embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_value.npy                      # Test value embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_c2id_mapping.json   # Inter1 cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_id2c_mapping.json   # Inter1 reverse mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1.npy                 # Inter1 embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_c2id_mapping.json     # Root cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_id2c_mapping.json     # Root reverse mapping
└── test_atlas_wiki_qkv_[encoder]_embd_key_root.npy                   # Root embeddings
```
💡 Tip: `[encoder]` will be replaced with the corresponding encoder name (e.g., `all-MiniLM-L6-v2`, `oai`, `bigoai`).
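The `*_c2id_mapping.json` and `*_id2c_mapping.json` files are a forward and inverse lookup pair between cluster labels and entry ids. A small sketch of the assumed structure (illustrative only, not taken from the AtlasKV code):

```python
import json

# Assumed structure: c2id maps a cluster label to the entry ids it contains;
# id2c is the inverse (entry id -> cluster label). Labels/ids are illustrative.
c2id = {"0": [0, 2], "1": [1, 3]}
id2c = {str(i): c for c, ids in c2id.items() for i in ids}

with open("key_root_c2id_mapping.json", "w") as f:
    json.dump(c2id, f)
with open("key_root_id2c_mapping.json", "w") as f:
    json.dump(id2c, f)

# Consistency check: every id listed under a cluster maps back to that cluster.
for cluster, ids in c2id.items():
    assert all(id2c[str(i)] == cluster for i in ids)
print(id2c["2"])  # 0
```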
To train the model, run the following scripts:
⚠️ Important: Make sure to replace all path configurations in the scripts with your actual paths.
🔧 Train with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_OAI.sh
```

🔧 Train with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_BigOAI.sh
```

🔧 Train with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_syn_allminilm.sh
```

🔧 Train with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_OAI.sh
```

🔧 Train with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_BigOAI.sh
```

🔧 Train with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/train/train_wiki_allminilm.sh
```

We use the ATLAS-CC-QKV dataset as an example for evaluation.
We report only the maximum GPU memory cost during the prefilling and decoding steps of generation. The GPU cost of the offline process of encoding KGKVs is not included.
```shell
cd experiments
bash ../experiments/scripts/test_mem/test_wiki_on_cc_bigoai.sh
```

```shell
cd experiments
bash ../experiments/scripts/test_mem/test_syn_on_cc_bigoai.sh
```

To test the GPU cost of in-context and zero-shot learning, simply change the `eval_mode` of any script in this section to `icl` or `zeroshot`.
⚠️ Configuration Required: Before running `result_disp.py`, configure `kb_size`, `model_str`, and `your_result_save_dir`.
🔧 Test with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_allminilm.sh
python result_disp.py
```

🔧 Test with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_oai.sh
python result_disp.py
```

🔧 Test with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_wiki_on_cc_bigoai.sh
python result_disp.py
```

🔧 Test with all-MiniLM-L6-v2 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_allminilm.sh
python result_disp.py
```

🔧 Test with text-embedding-ada-002 encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_oai.sh
python result_disp.py
```

🔧 Test with text-embedding-3-large encoder

```shell
cd experiments
bash ../experiments/scripts/test_acc/test_syn_on_cc_bigoai.sh
python result_disp.py
```

Collect the output results from the knowledge grounding accuracy evaluation scripts. Configure the paths, GPT endpoint URL, and GPT endpoint API key in `output_scorer.py`. Then run:

```shell
python output_scorer.py
```

We provide some training checkpoints of AtlasKV that can be used directly.
| Model Component | Download Link |
|---|---|
| Main Model | Download |
| Encoder | Download |
| Model Component | Download Link |
|---|---|
| Main Model | Download |
| Encoder | Download |
💡 Usage: Download both the main model and encoder files for complete functionality.
We gratefully acknowledge the use of the following open-source projects in our work:
| Project | Description |
|---|---|
| KBLaM | A new method for augmenting LLMs with external knowledge. |
| AutoSchemaKG | A novel framework for automatic knowledge graph construction with schema generation via conceptualization. |
If you use AtlasKV in your research, please cite our paper:
```bibtex
@misc{huang2025atlaskvaugmentingllmsbillionscale,
  title={AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM},
  author={Haoyu Huang and Hong Ting Tsang and Jiaxin Bai and Xi Peng and Gong Zhang and Yangqiu Song},
  year={2025},
  eprint={2510.17934},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.17934},
}
```