
Commit

[docs] update readme
fate-ubw committed Aug 24, 2024
1 parent 6ab2acb commit ad67300
Showing 4 changed files with 159 additions and 175 deletions.
78 changes: 78 additions & 0 deletions docs/process_wiki.md
@@ -0,0 +1,78 @@
# Process knowledge database into vector database
## 💽 Process wiki2023 as vector database

### 10-samples test
- The 10-samples test is intended to validate that the environment is set up correctly
- Run the ColBERT embedding process on `enwiki-20230401-10samples.tsv`
1. Change the root path for the variables `checkpoint`, `index_dbPath`, and `collection` in
[wiki2023-10samples_tsv-2-colbert_embedding.py](./preprocess/colbert-wiki2023-preprocess/wiki2023-10samples_tsv-2-colbert_embedding.py). ColBERT requires absolute paths, so you need to modify the following three variables
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
2. run
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
~~~
- The embedding process takes around 15 minutes the first time.
- The first run is relatively slow because ColBERT needs to recompile the `torch_extensions`; loading the processed embeddings afterwards is fast. If the script finishes without errors and the retrieved text is printed, the environment is set up correctly (see the retrieval sketch below).
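- As a quick sanity check, a retrieval query like the sketch below can be run against the freshly built index. The experiment/index names and the root path here are assumptions; match them to the values configured in `wiki2023-10samples_tsv-2-colbert_embedding.py`.
~~~python
# Minimal retrieval sanity check (sketch). Index/experiment names and the root
# path are assumptions -- adjust them to your own configuration.
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="wiki2023-10samples")):
        config = ColBERTConfig(root="/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding")
        searcher = Searcher(index="wiki2023-10samples", config=config)

        pids, ranks, scores = searcher.search("Who wrote Hamlet?", k=3)
        for pid, rank, score in zip(pids, ranks, scores):
            # If passages print without errors, the environment is working.
            print(f"rank={rank} score={score:.2f} | {searcher.collection[pid]}")
~~~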

### Wiki2023 raw data source
- The wiki2023 source data comes from [factscore](https://github.com/shmsw25/FActScore)
- **Note**: RAGLAB already provides enwiki2023 source data on HuggingFace, so there's no need to download it again. This information is provided only to credit the source of the data.
- Download method: install and use gdown
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
~~~
### Preprocess wiki2023
- If the 10-samples test passes, you can proceed with processing the full wiki2023.
1. Preprocess `.db -> .tsv` (ColBERT can only read files in `.tsv` format.)
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
~~~
2. `.tsv -> embedding`
- remember to change the root path of `checkpoint`, `index_dbPath` and `collection`
~~~bash
vim preprocess/colbert-wiki2023-preprocess/wiki2023_tsv-2-colbert_embedding.py
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
- run bash script
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/4-wiki2023_tsv-2-colbert_embedding.sh
~~~
- This usually takes about 20 hours, depending on your computer's performance

## 💽 Process wiki2018 as vector database
- This section is a tutorial on using wiki2018

### Wiki2018 raw data source
- The wiki2018 source data comes from [DPR](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz)
- Download the raw wiki2018 database directly using wget
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
~~~

### Process wiki2018
1. Preprocess `.tsv -> .tsv` (convert the raw DPR file into the ColBERT passage format; a conversion sketch follows after these steps)
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/1-wiki2018_tsv_2_tsv.sh
~~~
2. `.tsv -> embedding`
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/2-wiki2018_tsv-2-colbert_embedding.sh
~~~
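- For intuition, the `.tsv -> .tsv` step essentially reshapes the DPR file (header row, columns `id`, `text`, `title`) into the plain `pid \t passage` collection layout that ColBERT reads. The sketch below illustrates that idea with assumed paths and an assumed output layout; the actual RAGLAB preprocessing script may differ.
~~~python
# Sketch of the DPR .tsv -> ColBERT collection .tsv idea (not the RAGLAB script
# itself). Input columns follow the DPR release: id, text, title. The output
# layout and file paths below are assumptions.
import csv
import sys

csv.field_size_limit(sys.maxsize)  # DPR passages can be long

src = "data/retrieval/colbertv2.0_passages/wiki2018/psgs_w100.tsv"  # assumed path
dst = "data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv"   # assumed path

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    next(reader)  # skip the DPR header row: id, text, title
    for pid, row in enumerate(reader):
        _dpr_id, text, title = row
        # ColBERT expects sequential integer passage ids starting from 0.
        writer.writerow([pid, f"{title} | {text}"])
~~~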
78 changes: 78 additions & 0 deletions docs/train_docs.md
@@ -0,0 +1,78 @@
# 🤖 Train models

## 10-samples test for finetune
- The 10-samples training dataset has already been processed; start the bash script directly to begin testing.
- Note: The test script only uses one GPU
- Full-weight finetuning requires an 80GB VRAM GPU
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-full_weight-10samples.sh
~~~
- LoRA (Low-Rank Adaptation) finetuning requires at least 26GB of VRAM (see the configuration sketch at the end of this section)
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora-10samples.sh
~~~
- Congratulations! You can now start finetuning the baseline model and selfrag-8B
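- The VRAM gap comes from LoRA training only small adapter matrices while the base weights stay frozen. Below is a minimal sketch of how such an adapter could be configured with `peft`; the hyperparameters and model id are illustrative assumptions, not the values from RAGLAB's finetuning scripts.
~~~python
# Minimal LoRA setup sketch with peft. Hyperparameters and the model id are
# illustrative assumptions, not RAGLAB's actual training configuration.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapter
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
~~~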
## Finetune selfrag-8B
- full weight finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-full_weight.sh
~~~
- lora finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-Lora.sh
~~~
## Finetune llama3-8b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- This produces baseline train_data with the special tokens and retrieved passages removed (Q: What are special tokens? A: Special tokens are a concept introduced by SelfRAG; see the sketch at the end of this section)
- Full-weight finetune llama3-8b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-baseline-full_weight.sh
~~~
- lora finetune llama3-8b-baseline
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora.sh
~~~
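- For intuition, the reflection tokens introduced by Self-RAG (e.g. `[Retrieval]`, `[Relevant]`, `[Utility:5]`) and the `<paragraph>...</paragraph>` passages are what gets stripped for the baseline. The sketch below illustrates that cleanup; the token list follows the Self-RAG paper, and the exact behavior of RAGLAB's preprocessing script may differ.
~~~python
# Rough sketch of stripping Self-RAG reflection tokens and retrieved passages
# from a training example. The token list follows the Self-RAG paper; it may
# not exactly match RAGLAB's preprocessing script.
import re

SPECIAL_TOKENS = [
    "[Retrieval]", "[No Retrieval]", "[Continue to Use Evidence]",
    "[Relevant]", "[Irrelevant]",
    "[Fully supported]", "[Partially supported]", "[No support / Contradictory]",
] + [f"[Utility:{i}]" for i in range(1, 6)]

def remove_special_tokens(text: str) -> str:
    # Drop retrieved passages enclosed in <paragraph>...</paragraph> first.
    text = re.sub(r"<paragraph>.*?</paragraph>", "", text, flags=re.DOTALL)
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return re.sub(r"\s+", " ", text).strip()

print(remove_special_tokens(
    "[Retrieval]<paragraph>some passage</paragraph>[Relevant]"
    "Paris is the capital of France.[Utility:5]"
))  # -> "Paris is the capital of France."
~~~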
## LoRA finetune llama3-70b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- LoRA finetune llama3-70b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-Lora.sh
~~~

## QLoRA finetune llama3-70B as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- 8-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune llama3-70B (see the quantization sketch below)
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-4bit.sh
~~~
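- QLoRA keeps the LoRA adapters in higher precision while the frozen base weights are loaded quantized to 8-bit or 4-bit, which is what lets a 70B model fit in far less VRAM. Below is a sketch of how such quantized loading could be set up through `transformers` and `bitsandbytes`; the exact settings in RAGLAB's scripts may differ.
~~~python
# Sketch of 4-bit / 8-bit quantized model loading for QLoRA finetuning.
# Settings and model id are illustrative assumptions, not RAGLAB's exact config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for 8-bit QLoRA
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=quant_config,
    device_map="auto",
)
# LoRA adapters would then be attached on top of this quantized base model
# (see the peft sketch earlier in this document).
~~~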

## QLoRA finetune selfrag-70B
- 8-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-4bit.sh
~~~
Empty file removed preprocess/README.md
Empty file.
178 changes: 3 additions & 175 deletions readme.md
@@ -224,185 +224,13 @@
> - During the Factscore evaluation process, we used GPT-3.5 as the evaluation model, so there's no need to download a local model. If you need to use a local model to evaluate Factscore, please refer to [Factscore](https://github.com/shmsw25/FActScore)
# Process knowledge database from source
## 💽 process wiki2023 as vector database
### 10-samples test
- The 10-samples test is intended to validate that the environment is set up correctly
- Run the ColBERT embedding process on `enwiki-20230401-10samples.tsv`
1. Change the root path for the variables `checkpoint`, `index_dbPath`, and `collection` in
[wiki2023-10samples_tsv-2-colbert_embedding.py](https://github.com/fate-ubw/RAGLAB/blob/main/preprocess/colbert-wiki2023-preprocess/wiki2023-db_into_tsv-10samples.py). ColBERT runs into many issues when relative paths are used to generate embeddings, so the current version of RAGLAB uses absolute paths.
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
2. run
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
~~~
- The embedding process takes around 15 minutes the first time.
- The first run is relatively slow because ColBERT needs to recompile the `torch_extensions`; loading the processed embeddings afterwards is fast. If the script finishes without errors and the retrieved text is printed, the environment is set up correctly.
### Embedding the whole wiki2023
- You can download the [colbert embedding wiki2023]() as the RAGLAB database (40GB)
~~~bash
cd /RAGLAB/data/retrieval/colbertv2.0_embedding
gdown --id xxxxxx
~~~
- Modify the path in the meta.json file
- Embedding the whole wiki2023 into vectors takes about 22 hours, so we recommend downloading the prepared embeddings
#### Download wiki2023 raw data
- The current version of RAGLAB uses wiki2023 as the knowledge database
- The wiki2023 source data comes from [factscore](https://github.com/shmsw25/FActScore)
- Method 1: download wiki2023 directly from [google_drive](https://drive.google.com/file/d/1mekls6OGOKLmt7gYtHs0WGf5oTamTNat/view)
- Method 2: download via gdown
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
~~~
### Preprocess wiki2023
- If the 10-samples test passes, you can proceed with processing the full wiki2023.
1. Preprocess `.db -> .tsv` (ColBERT can only read files in `.tsv` format.)
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
~~~
2. `.tsv -> embedding`
- remember to change the root path of `checkpoint`, `index_dbPath` and `collection`
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
- run bash script
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/4-wiki2023_tsv-2-colbert_embedding.sh
~~~
## 💽 Process wiki2018 as vector database
- This section is a tutorial on using wiki2018
### Download text files
- Download the raw wiki2018 database directly using wget
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
~~~
### Process raw wiki2018 into colbert format
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/1-wiki2018_tsv_2_tsv.sh
~~~
### Modify wiki2018 embedding config file
1. Change the path
~~~
cd /RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018/indexes/wiki2018
vim metadata.json
~~~
- You only need to modify two paths in the metadata.json file: delete the original paths and paste in the following ones. No other parameters need to be modified (a scripted alternative is sketched at the end of this subsection).
~~~sh
"collection": "/home/ec2-user/SageMaker/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv",
"experiment": "/home/ec2-user/SageMaker/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018",
~~~
- After the modification, you can start the ColBERT server directly. For how to launch experiments, refer to the last section of the readme: Inference experiments.
- RAGLAB has already uploaded the processed knowledge database to [Hugging Face](https://huggingface.co/datasets/RAGLAB/data). If you wish to process the knowledge database yourself, please refer to the following document:
- document: [process_wiki.md](./docs/process_wiki.md)
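- If you prefer not to edit metadata.json by hand, the two keys could also be patched with a short script. The sketch below assumes the `collection` and `experiment` keys sit at the top level of the file as shown above; adjust the lookup if they are nested in your version of the file.
~~~python
# Sketch: patch the two paths in metadata.json instead of editing it in vim.
# Assumes "collection" and "experiment" are top-level keys; paths are examples.
import json

meta_path = ("data/retrieval/colbertv2.0_embedding/wiki2018/"
             "indexes/wiki2018/metadata.json")

with open(meta_path, "r", encoding="utf-8") as f:
    meta = json.load(f)

meta["collection"] = "/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv"
meta["experiment"] = "/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018"

with open(meta_path, "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)
~~~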
# 🤖 Train models
- This section covers the process of training models in RAGLAB. You can either download all pre-trained models from HuggingFace🤗, or use the tutorial below to train from scratch📝.
- [All data](#all-data-for-reproduce-paper-results) provides all data necessary for finetuning.
## 10-samples test for finetune
- The 10-samples training dataset has already been processed; start the bash script directly to begin testing.
- Note: The test script only uses one GPU
- Full-weight finetuning requires an 80GB VRAM GPU
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-full_weight-10samples.sh
~~~
- LoRA (Low-Rank Adaptation) requires at least 26GB of VRAM
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora-10samples.sh
~~~
- Congratulations! You can now start finetuning the baseline model and selfrag-8B
## Finetune selfrag-8B
- full weight finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-full_weight.sh
~~~
- lora finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-Lora.sh
~~~
## Finetune llama3-8b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- This produces baseline train_data with the special tokens and retrieved passages removed (Q: What are special tokens? A: Special tokens are a concept introduced by SelfRAG)
- Full-weight finetune llama3-8b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-baseline-full_weight.sh
~~~
- lora finetune llama3-8b-baseline
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora.sh
~~~
## LoRA finetune llama3-70b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- LoRA finetune llama3-70b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-Lora.sh
~~~
## QLoRA finetune llama3-70B as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- 8-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-4bit.sh
~~~
## QLoRA finetune selfrag-70B
- 8-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-4bit.sh
~~~
- document: [train_docs.md](./docs/train_docs.md)
## :bookmark: License
