πŸͺ AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM


[Figure: Memory comparison chart]

AtlasKV is a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g., 1B triples) at very low GPU memory cost (e.g., less than 20GB of VRAM), while achieving superior knowledge grounding performance and strong generalization abilities.


🚀 Quick Start

Installation

  1. Create and activate a conda environment

    conda create -n atlaskv python=3.9
    conda activate atlaskv

  2. Install the AtlasKV package

    git clone https://github.com/HKUST-KnowComp/AtlasKV.git
    cd AtlasKV
    pip install -e .

  3. Configure Hugging Face access (required for Llama models)

    pip install huggingface_hub
    huggingface-cli login
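
To verify your token works before any Llama weights are downloaded, you can query the Hub directly; a minimal check, assuming you completed step 3:

    # Minimal sanity check that the Hugging Face token is valid; requires
    # having run `huggingface-cli login` (step 3 above).
    from huggingface_hub import whoami

    print(whoami()["name"])  # prints your Hugging Face username if the token works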

📚 Data Preparation

AtlasKV supports two dataset construction methods:

| Method | Description |
| --- | --- |
| Synthetic | Fully synthetic method proposed in KBLaM |
| KG2KV | KG2KV method proposed in our work |

Supported Sentence Encoders

| Encoder | Provider |
| --- | --- |
| text-embedding-ada-002 | OpenAI |
| all-MiniLM-L6-v2 | Hugging Face |
| text-embedding-3-large | OpenAI |
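
If you want to see what these encoders produce before running the full pipeline, the Hugging Face option can be exercised directly with sentence-transformers; a minimal sketch with illustrative strings (not the project's own embedding scripts, which are referenced below):

    # Illustrative use of the all-MiniLM-L6-v2 encoder via sentence-transformers;
    # the actual embedding pipeline runs through the scripts referenced below.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = ["the capital of France", "Paris is the capital of France"]  # hypothetical inputs
    embeddings = model.encode(texts)
    print(embeddings.shape)  # (2, 384): one 384-dim vector per input string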

Synthetic Dataset Construction

📥 Pre-built Dataset: We also provide a pre-constructed dataset via the Synthetic download link.

⚠️ Configuration Reminder: Be sure to replace all path configurations in the following scripts with your own paths.

Step 1: Construct Raw Synthetic Data

cd dataset_generation
bash ../experiments/scripts/data/syn_construct.sh

Step 2: Generate Knowledge Base Embeddings

Choose your preferred sentence encoder:

🔧 all-MiniLM-L6-v2

    cd dataset_generation
    bash ../experiments/scripts/data/syn_allminilm_embd.sh

🔧 text-embedding-ada-002

    cd dataset_generation
    bash ../experiments/scripts/data/syn_oai_embd.sh

🔧 text-embedding-3-large

    cd dataset_generation
    bash ../experiments/scripts/data/syn_bigoai_embd.sh

Step 3: Split Training and Testing Sets

Run the corresponding split script based on your chosen encoder:

cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/syn_allminilm_split.sh    # all-MiniLM-L6-v2
bash ../experiments/scripts/data/syn_oai_split.sh          # text-embedding-ada-002
bash ../experiments/scripts/data/syn_bigoai_split.sh       # text-embedding-3-large

πŸ“ Generated Dataset Files

After completing the above steps, you will obtain the following files:

📦 Dataset File Structure
├── synthetic_data_qkv.json                                    # Raw data
├── synthetic_data_qkv_[encoder]_embd_key.npy                  # Key embeddings
├── synthetic_data_qkv_[encoder]_embd_value.npy                # Value embeddings
├── train_synthetic_data_qkv.json                              # Training data
├── train_synthetic_data_qkv_[encoder]_embd_key.npy            # Training key embeddings
├── train_synthetic_data_qkv_[encoder]_embd_value.npy          # Training value embeddings
├── test_synthetic_data_qkv.json                               # Test data
├── test_synthetic_data_qkv_[encoder]_embd_key.npy             # Test key embeddings
└── test_synthetic_data_qkv_[encoder]_embd_value.npy           # Test value embeddings

💡 Tip: [encoder] will be replaced with the corresponding encoder name (e.g., all-MiniLM-L6-v2, oai, bigoai)
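
A quick way to confirm the split produced consistent files is to load them and compare counts; a minimal sketch using the all-MiniLM-L6-v2 naming, and assuming each JSON file is a list of records aligned row-for-row with the embedding matrices:

    # Quick consistency check of the generated files. Assumes each JSON file
    # holds a list of records aligned row-for-row with the .npy matrices.
    import json
    import numpy as np

    with open("train_synthetic_data_qkv.json") as f:
        records = json.load(f)

    keys = np.load("train_synthetic_data_qkv_all-MiniLM-L6-v2_embd_key.npy")
    values = np.load("train_synthetic_data_qkv_all-MiniLM-L6-v2_embd_value.npy")
    assert len(records) == keys.shape[0] == values.shape[0]
    print(f"{len(records)} records, {keys.shape[1]}-dim embeddings")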

KG2KV Dataset Construction

📥 Pre-built Datasets:

Step 1: Construct Raw KGKV Data

cd dataset_generation
python build_atlas_training_data.py

Step 2: Generate KGKV Embeddings

Using ATLAS-Wiki as an example, choose your sentence encoder:

🔧 all-MiniLM-L6-v2

    cd dataset_generation
    bash ../experiments/scripts/data/wiki_allminilm_embd.sh

🔧 text-embedding-ada-002

    cd dataset_generation
    bash ../experiments/scripts/data/wiki_oai_embd.sh

🔧 text-embedding-3-large

    cd dataset_generation
    bash ../experiments/scripts/data/wiki_bigoai_embd.sh

Step 3: Split Training and Testing Sets

Using ATLAS-Wiki as an example:

cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_split.sh    # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_split.sh          # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_split.sh       # text-embedding-3-large

Step 4: Hierarchical Clustering on Test Data

⚠️ Important Configuration: In this step, you need to modify the settings of the embedding script before rerunning it:

  • Change CLUSTER=False to CLUSTER=True
  • Change GENERATING_EMBEDDINGS=True to GENERATING_EMBEDDINGS=False
  • Change the dataset name to the test set (e.g., your_output_path/atlas_wiki_qa.json → your_output_path/test_atlas_wiki_qa.json)
cd dataset_generation
# Choose one of the following based on your encoder:
bash ../experiments/scripts/data/wiki_allminilm_embd.sh    # all-MiniLM-L6-v2
bash ../experiments/scripts/data/wiki_oai_embd.sh          # text-embedding-ada-002
bash ../experiments/scripts/data/wiki_bigoai_embd.sh       # text-embedding-3-large

πŸ“ Generated Dataset Files

After completing the above steps, you will obtain the following files:

📦 KG2KV Dataset File Structure
├── atlas_wiki_qkv.json                                    # Raw data
├── atlas_wiki_qkv_[encoder]_embd_key.npy                  # Key embeddings
├── atlas_wiki_qkv_[encoder]_embd_value.npy                # Value embeddings
├── train_atlas_wiki_qkv.json                              # Training data
├── train_atlas_wiki_qkv_[encoder]_embd_key.npy            # Training key embeddings
├── train_atlas_wiki_qkv_[encoder]_embd_value.npy          # Training value embeddings
├── test_atlas_wiki_qkv.json                               # Test data
├── test_atlas_wiki_qkv_[encoder]_embd_key.npy             # Test key embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_value.npy           # Test value embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_c2id_mapping.json    # Inter1 cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1_id2c_mapping.json    # Inter1 reverse mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_inter1.npy                  # Inter1 embeddings
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_c2id_mapping.json      # Root cluster mapping
├── test_atlas_wiki_qkv_[encoder]_embd_key_root_id2c_mapping.json      # Root reverse mapping
└── test_atlas_wiki_qkv_[encoder]_embd_key_root.npy                    # Root embeddings

💡 Tip: [encoder] will be replaced with the corresponding encoder name (e.g., all-MiniLM-L6-v2, oai, bigoai)
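
The clustering hierarchy can be inspected by loading the cluster embeddings alongside their mappings; a rough sketch, where the exact JSON layout of the mapping files is an assumption you should check against your generated files:

    # Rough inspection of the hierarchical-clustering outputs. The mapping file
    # is assumed to map cluster labels to member ids (c2id); verify the actual
    # layout of your generated files before relying on this.
    import json
    import numpy as np

    root = np.load("test_atlas_wiki_qkv_all-MiniLM-L6-v2_embd_key_root.npy")
    with open("test_atlas_wiki_qkv_all-MiniLM-L6-v2_embd_key_root_c2id_mapping.json") as f:
        c2id = json.load(f)

    print(root.shape[0], "root cluster embeddings,", len(c2id), "mapping entries")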

🔥 Model Training

To train the model, run the following scripts:

⚠️ Important: Be sure to replace all path configurations in the scripts with your actual paths.

KBLaM Training

🔧 Train with text-embedding-ada-002 encoder

    cd experiments
    bash ../experiments/scripts/train/train_syn_OAI.sh

🔧 Train with text-embedding-3-large encoder

    cd experiments
    bash ../experiments/scripts/train/train_syn_BigOAI.sh

🔧 Train with all-MiniLM-L6-v2 encoder

    cd experiments
    bash ../experiments/scripts/train/train_syn_allminilm.sh

AtlasKV Training

🔧 Train with text-embedding-ada-002 encoder

    cd experiments
    bash ../experiments/scripts/train/train_wiki_OAI.sh

🔧 Train with text-embedding-3-large encoder

    cd experiments
    bash ../experiments/scripts/train/train_wiki_BigOAI.sh

🔧 Train with all-MiniLM-L6-v2 encoder

    cd experiments
    bash ../experiments/scripts/train/train_wiki_allminilm.sh

📊 Model Evaluation

We use the ATLAS-CC-QKV dataset as an example for evaluation.

GPU Cost Evaluation

We only consider the peak GPU memory cost during the prefilling and decoding steps of generation; the GPU cost of the offline process of encoding KGKVs is not counted.
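
For reference, peak VRAM during prefilling and decoding can be tracked with PyTorch's built-in memory counters; a minimal sketch (illustrative only; the test scripts below do their own measurement and reporting):

    # Minimal sketch of measuring peak VRAM around prefilling + decoding using
    # PyTorch's counters; the test scripts below report their own numbers.
    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run prefilling and decoding here, e.g. outputs = model.generate(**inputs) ...
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak allocated VRAM: {peak_gib:.2f} GiB")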

Test AtlasKV

cd experiments
bash ../experiments/scripts/test_mem/test_wiki_on_cc_bigoai.sh

Test KBLaM

cd experiments
bash ../experiments/scripts/test_mem/test_syn_on_cc_bigoai.sh

Test In-Context or Zero-Shot Learning

To test the GPU cost of in-context and zero-shot learning, simply change the eval_mode in any script in this section to icl or zeroshot.

Knowledge Accuracy Evaluation

⚠️ Configuration Required: Before running result_disp.py, configure kb_size, model_str, and your_result_save_dir.

Test AtlasKV

🔧 Test with all-MiniLM-L6-v2 encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_wiki_on_cc_allminilm.sh
    python result_disp.py

🔧 Test with text-embedding-ada-002 encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_wiki_on_cc_oai.sh
    python result_disp.py

🔧 Test with text-embedding-3-large encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_wiki_on_cc_bigoai.sh
    python result_disp.py

Test KBLaM

🔧 Test with all-MiniLM-L6-v2 encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_syn_on_cc_allminilm.sh
    python result_disp.py

🔧 Test with text-embedding-ada-002 encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_syn_on_cc_oai.sh
    python result_disp.py

🔧 Test with text-embedding-3-large encoder

    cd experiments
    bash ../experiments/scripts/test_acc/test_syn_on_cc_bigoai.sh
    python result_disp.py

Generation Relevance Evaluation

Collect the output results from the knowledge grounding accuracy evaluation scripts. Configure the paths, GPT endpoint URL, and GPT endpoint API key in output_scorer.py. Then run:

python output_scorer.py
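
output_scorer.py queries a GPT endpoint to judge each output. As a rough illustration of the general pattern (not the project's actual scorer), an OpenAI-compatible endpoint can be called like this:

    # Rough illustration of LLM-as-judge relevance scoring against an
    # OpenAI-compatible endpoint; output_scorer.py contains the real logic,
    # and the endpoint URL, key, and model name here are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: use whatever model your endpoint serves
        messages=[{
            "role": "user",
            "content": "Question: ...\nAnswer: ...\nRate the answer's relevance from 1 to 5.",
        }],
    )
    print(resp.choices[0].message.content)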

💾 Model Checkpoints

We provide trained AtlasKV checkpoints that can be used directly.

AtlasKV trained on ATLAS-CC-QKV (3K steps)

| Model Component | Download Link |
| --- | --- |
| Main Model | Download |
| Encoder | Download |

AtlasKV trained on ATLAS-Wiki-QKV (3K steps)

| Model Component | Download Link |
| --- | --- |
| Main Model | Download |
| Encoder | Download |

💡 Usage: Download both the main model and encoder files for complete functionality.

πŸ™ Acknowledgments

We gratefully acknowledge the use of the following open-source projects in our work:

| Project | Description |
| --- | --- |
| KBLaM | A new method for augmenting LLMs with external knowledge. |
| AutoSchemaKG | A framework for automatic knowledge graph construction with schema generation via conceptualization. |

📖 Citation

If you use AtlasKV in your research, please cite our paper:

@misc{huang2025atlaskvaugmentingllmsbillionscale,
      title={AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM}, 
      author={Haoyu Huang and Hong Ting Tsang and Jiaxin Bai and Xi Peng and Gong Zhang and Yangqiu Song},
      year={2025},
      eprint={2510.17934},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.17934}, 
}
