# Process the knowledge database into a vector database
## 💽 Process wiki2023 as a vector database

### 10-samples test
- The 10-samples test is intended to validate the environment.
- Run the ColBERT embedding process on `enwiki-20230401-10samples.tsv`:
1. Change the root path for the variables `checkpoint`, `index_dbPath`, and `collection` in
[wiki2023-10samples_tsv-2-colbert_embedding.py](./preprocess/colbert-wiki2023-preprocess/wiki2023-10samples_tsv-2-colbert_embedding.py). ColBERT requires absolute paths, so you need to modify these three variables:
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
2. Run:
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
~~~
- The embedding process takes around 15 minutes the first time.
- The first run is relatively slow because ColBERT needs to recompile its `torch_extensions`; loading the finished embeddings afterwards is fast. If the script finishes without errors and the retrieved text is printed, the environment is set up correctly. A sketch of what such an embedding script does with the ColBERT API appears below.
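For orientation, here is a minimal sketch of how an embedding script of this kind typically drives the ColBERT API (the `colbert-ai` package). The index name, `nbits` value, and test query are illustrative assumptions; the actual script in the repository is the source of truth.
~~~python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment='wiki2023-10samples')):
        # build the vector index from the .tsv collection (one passage per line)
        config = ColBERTConfig(nbits=2, root=index_dbPath)  # nbits=2 is an assumed example value
        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name='wiki2023-10samples', collection=collection, overwrite=True)

        # sanity check: retrieve a few passages for a test query and print them
        searcher = Searcher(index='wiki2023-10samples', config=config)
        ids, ranks, scores = searcher.search('What is the capital of France?', k=3)
        for pid, rank, score in zip(ids, ranks, scores):
            print(rank, round(score, 2), searcher.collection[pid])
~~~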
### Wiki2023 raw data source
- The source wiki2023 data comes from [factscore](https://github.com/shmsw25/FActScore).
- **Note**: RAGLAB already provides the enwiki2023 source data on HuggingFace, so there is no need to download it again. This is only provided as a reference to the data's origin.
- Download method: install `gdown` and fetch the file (a quick way to inspect the download is sketched below):
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
~~~
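If you do download the file yourself, a quick sanity check is to open it as a SQLite database and list its tables. The filename `enwiki-20230401.db` and the assumption that the file is SQLite are illustrative guesses; adjust them to whatever gdown actually saved.
~~~python
import sqlite3

# assumed filename; use the actual file produced by gdown
con = sqlite3.connect('wiki2023/enwiki-20230401.db')
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print('tables:', tables)
con.close()
~~~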
### Preprocess wiki2023
- If the 10-samples test passes, you can proceed with processing the full wiki2023.
1. Preprocess `.db -> .tsv` (ColBERT can only read files in `.tsv` format; a sketch of this conversion follows the list below).
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
~~~
2. `.tsv -> embedding`
- Remember to change the root path of `checkpoint`, `index_dbPath`, and `collection`:
~~~bash
vim preprocess/colbert-wiki2023-preprocess/wiki2023_tsv-2-colbert_embedding.py
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
- Run the bash script:
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/4-wiki2023_tsv-2-colbert_embedding.sh
~~~
- This usually takes about 20 hours, depending on your machine's performance.
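For reference, the `.db -> .tsv` step in 1. above boils down to dumping passages into the two-column TSV layout ColBERT expects (`id <tab> passage text`). The sketch below assumes the source is a SQLite database with a table named `documents` and a column named `text`; the real table and column names, file paths, and any chunking the repository's script performs may differ.
~~~python
import sqlite3

db_path = 'data/retrieval/colbertv2.0_passages/wiki2023/enwiki-20230401.db'   # assumed path
tsv_path = 'data/retrieval/colbertv2.0_passages/wiki2023/enwiki-20230401.tsv' # assumed path

con = sqlite3.connect(db_path)
with open(tsv_path, 'w', encoding='utf-8') as out:
    # ColBERT reads one passage per line: "<integer id>\t<passage text>"
    for pid, (text,) in enumerate(con.execute('SELECT text FROM documents')):  # assumed table/column
        out.write(f"{pid}\t{' '.join(text.split())}\n")  # collapse tabs/newlines inside the passage
con.close()
~~~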
## 💽 Process wiki2018 as a vector database
- This section is a tutorial on using wiki2018.

### Wiki2018 raw data source
- The source wiki2018 data comes from [DPR](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz).
- Download the raw wiki2018 database directly with wget:
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
~~~
### Process wiki2018
1. `tsv -> tsv` (see the sketch after this list)
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/1-wiki2018_tsv_2_tsv.sh
~~~
2. `tsv -> embedding`
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/2-wiki2018_tsv-2-colbert_embedding.sh
~~~
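Step 1 presumably reformats DPR's `psgs_w100.tsv` (which has `id`, `text`, `title` columns and a header row) into the plain `id <tab> passage` layout that ColBERT reads. Below is a minimal sketch under that assumption; whether the repository's script also keeps the title or re-chunks passages is not confirmed here, and the output filename is an assumption.
~~~python
import csv

src = 'data/retrieval/colbertv2.0_passages/wiki2018/psgs_w100.tsv'  # DPR dump (gunzip psgs_w100.tsv.gz first)
dst = 'data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv'   # assumed output name

with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
    reader = csv.reader(fin, delimiter='\t')
    next(reader)  # skip the "id / text / title" header row
    for new_pid, (dpr_id, text, title) in enumerate(reader):
        # ColBERT expects one passage per line: "<integer id>\t<passage text>"
        fout.write(f"{new_pid}\t{' '.join(text.split())}\n")
~~~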
# 🤖 Train models

## 10-samples test for finetune
- The 10-samples train dataset has already been processed; start the bash script directly to begin testing.
- Note: the test script only uses one GPU.
- Full-weight finetuning requires an 80GB-VRAM GPU:
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-full_weight-10samples.sh
~~~
- LoRA (Low-Rank Adaptation) requires at least 26GB of VRAM (a sketch of a typical LoRA setup follows this list):
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora-10samples.sh
~~~
- Congratulations! You can now start fine-tuning the baseline model and selfrag-8B.
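For context on why LoRA needs so much less memory than full-weight training, here is a minimal sketch of attaching LoRA adapters to Llama-3-8B with Hugging Face `peft`: only the small adapter matrices receive gradients and optimizer states. The hyperparameters (`r`, `lora_alpha`, target modules) are illustrative assumptions, not the values used by the repository's scripts.
~~~python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load the frozen base model in bf16; only the adapters will be trained
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B', torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                       # assumed rank
    lora_alpha=32,              # assumed scaling factor
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # assumed target layers
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 8B weights
~~~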
## Finetune selfrag-8B
- Full-weight finetune:
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-full_weight.sh
~~~
- LoRA finetune:
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-Lora.sh
~~~
## Finetune llama3-8B as baseline
- Preprocess the train data: the training data for the baseline model needs its special tokens removed (see the sketch after this list).
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- You then get baseline train data without special tokens and passages. (Q: what is a special token? A: special tokens are a concept introduced by SelfRAG.)
- Full-weight finetune llama3-8B-baseline using the processed data:
~~~bash
sh run/rag_train/script_finetune-llama3-baseline-full_weight.sh
~~~
- LoRA finetune llama3-8B-baseline:
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora.sh
~~~
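For illustration, this is roughly what removing SelfRAG's special tokens amounts to. The token list is taken from the Self-RAG paper's reflection tokens, and the helper name and example string are assumptions; the repository's preprocessing script is authoritative.
~~~python
import re

# reflection/control tokens used in SelfRAG training data (per the Self-RAG paper)
SPECIAL_TOKENS = [
    '[Retrieval]', '[No Retrieval]', '[Continue to Use Evidence]',
    '[Relevant]', '[Irrelevant]',
    '[Fully supported]', '[Partially supported]', '[No support / Contradictory]',
    '[Utility:1]', '[Utility:2]', '[Utility:3]', '[Utility:4]', '[Utility:5]',
]

def strip_selfrag_tokens(text: str) -> str:
    # drop retrieved passages enclosed in <paragraph>...</paragraph>
    text = re.sub(r'<paragraph>.*?</paragraph>', '', text, flags=re.DOTALL)
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, '')
    return ' '.join(text.split())

print(strip_selfrag_tokens(
    'Answer [Retrieval]<paragraph>some evidence</paragraph>[Relevant] final answer [Utility:5]'))
# -> "Answer final answer"
~~~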
## LoRA finetune llama3-70B as baseline
- Preprocess the train data: the training data for the baseline model needs its special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- LoRA finetune llama3-70B-baseline using the processed data:
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-Lora.sh
~~~
## QLoRA finetune llama3-70B as baseline
- Preprocess the train data: the training data for the baseline model needs its special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- 8-bit QLoRA finetune llama3-70B (a sketch of the quantized-loading setup follows this list):
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune llama3-70B:
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-4bit.sh
~~~
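QLoRA keeps the frozen base weights quantized (8-bit or 4-bit via bitsandbytes) and trains only the LoRA adapters on top, which is what lets 70B finetuning fit in far less VRAM. Below is a minimal sketch with Hugging Face `transformers`/`peft`; the quantization and LoRA settings are illustrative assumptions, not the scripts' exact configuration.
~~~python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit loading; for the 8-bit variant use BitsAndBytesConfig(load_in_8bit=True) instead
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-70B', quantization_config=bnb_config, device_map='auto')

model = prepare_model_for_kbit_training(model)  # freeze quantized weights, cast norms for stability
lora_config = LoraConfig(r=16, lora_alpha=32, task_type='CAUSAL_LM')  # assumed hyperparameters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
~~~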
## QLoRA finetune selfrag-70B
- 8-bit QLoRA finetune selfrag-70B:
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune selfrag-70B:
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-4bit.sh
~~~