
Commit

[docs] update readme
fate-ubw committed Aug 24, 2024
1 parent 6ab2acb commit ad67300
Showing 4 changed files with 159 additions and 175 deletions.
78 changes: 78 additions & 0 deletions docs/process_wiki.md
@@ -0,0 +1,78 @@
# Process knowledge database into vector database
## 💽 Process wiki2023 as vector database

### 10-samples test
- The 10-samples test is intended to validate that the environment is set up correctly
- Run the ColBERT embedding process on `enwiki-20230401-10samples.tsv`
1. Change the root path for the variables `checkpoint`, `index_dbPath`, and `collection` in
[wiki2023-10samples_tsv-2-colbert_embedding.py](./preprocess/colbert-wiki2023-preprocess/wiki2023-10samples_tsv-2-colbert_embedding.py). ColBERT requires absolute paths, so you need to modify the following three variables
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
2. run
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
~~~
- The embedding process takes around 15 minutes the first time.
- The first run is relatively slow because ColBERT needs to recompile the `torch_extensions`; loading the processed embeddings afterwards is fast. If the script finishes without errors and the retrieved text is printed, the environment is set up correctly (see the retrieval sketch below).
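- As a quick sanity check, a retrieval query like the sketch below can be run against the freshly built index. The experiment/index names and the root path here are assumptions; match them to the values configured in `wiki2023-10samples_tsv-2-colbert_embedding.py`.
~~~python
# Minimal retrieval sanity check (sketch). Index/experiment names and the root
# path are assumptions -- adjust them to your own configuration.
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="wiki2023-10samples")):
        config = ColBERTConfig(root="/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding")
        searcher = Searcher(index="wiki2023-10samples", config=config)

        pids, ranks, scores = searcher.search("Who wrote Hamlet?", k=3)
        for pid, rank, score in zip(pids, ranks, scores):
            # If passages print without errors, the environment is working.
            print(f"rank={rank} score={score:.2f} | {searcher.collection[pid]}")
~~~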

### Wiki2023 raw data source
- The wiki2023 source data comes from [factscore](https://github.com/shmsw25/FActScore)
- **Note**: RAGLAB already provides enwiki2023 source data on HuggingFace, so there's no need to download it again. This information is provided only to credit the source of the data.
- Download method: install and use gdown
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
~~~
### Preprocess wiki2023
- If the 10-samples test passes, you can proceed with processing the full wiki2023.
1. Preprocess `.db -> .tsv` (ColBERT can only read files in `.tsv` format.)
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
~~~
2. `.tsv -> embedding`
- remember to change the root path of `checkpoint`, `index_dbPath` and `collection`
~~~bash
vim preprocess/colbert-wiki2023-preprocess/wiki2023_tsv-2-colbert_embedding.py
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
- run bash script
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/4-wiki2023_tsv-2-colbert_embedding.sh
~~~
- This usually takes about 20 hours, depending on your computer's performance

## 💽 Process wiki2018 as vector database
- This section is a tutorial on using wiki2018

### Wiki2018 raw data source
- The wiki2018 source data comes from [DPR](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz)
- Download the raw wiki2018 database directly using wget
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
~~~

### Process wiki2018
1. Preprocess `.tsv -> .tsv` (convert the raw DPR file into the ColBERT passage format; a conversion sketch follows after these steps)
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/1-wiki2018_tsv_2_tsv.sh
~~~
2. `.tsv -> embedding`
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/2-wiki2018_tsv-2-colbert_embedding.sh
~~~
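- For intuition, the `.tsv -> .tsv` step essentially reshapes the DPR file (header row, columns `id`, `text`, `title`) into the plain `pid \t passage` collection layout that ColBERT reads. The sketch below illustrates that idea with assumed paths and an assumed output layout; the actual RAGLAB preprocessing script may differ.
~~~python
# Sketch of the DPR .tsv -> ColBERT collection .tsv idea (not the RAGLAB script
# itself). Input columns follow the DPR release: id, text, title. The output
# layout and file paths below are assumptions.
import csv
import sys

csv.field_size_limit(sys.maxsize)  # DPR passages can be long

src = "data/retrieval/colbertv2.0_passages/wiki2018/psgs_w100.tsv"  # assumed path
dst = "data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv"   # assumed path

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    next(reader)  # skip the DPR header row: id, text, title
    for pid, row in enumerate(reader):
        _dpr_id, text, title = row
        # ColBERT expects sequential integer passage ids starting from 0.
        writer.writerow([pid, f"{title} | {text}"])
~~~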
78 changes: 78 additions & 0 deletions docs/train_docs.md
@@ -0,0 +1,78 @@
# 🤖 Train models

## 10-samples test for finetune
- The 10-samples training dataset has already been processed; start the bash script directly to begin testing.
- Note: The test script only uses one GPU
- Full-weight finetuning requires an 80GB VRAM GPU
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-full_weight-10samples.sh
~~~
- LoRA (Low-Rank Adaptation) finetuning requires at least 26GB of VRAM (see the configuration sketch at the end of this section)
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora-10samples.sh
~~~
- Congratulations! You can now start finetuning the baseline model and selfrag-8B
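- The VRAM gap comes from LoRA training only small adapter matrices while the base weights stay frozen. Below is a minimal sketch of how such an adapter could be configured with `peft`; the hyperparameters and model id are illustrative assumptions, not the values from RAGLAB's finetuning scripts.
~~~python
# Minimal LoRA setup sketch with peft. Hyperparameters and the model id are
# illustrative assumptions, not RAGLAB's actual training configuration.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapter
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
~~~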
## Finetune selfrag-8B
- full weight finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-full_weight.sh
~~~
- lora finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-Lora.sh
~~~
## Finetune llama3-8b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- This produces baseline train_data with the special tokens and retrieved passages removed (Q: What are special tokens? A: Special tokens are a concept introduced by SelfRAG; see the sketch at the end of this section)
- Full-weight finetune llama3-8b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-baseline-full_weight.sh
~~~
- lora finetune llama3-8b-baseline
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora.sh
~~~
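- For intuition, the reflection tokens introduced by Self-RAG (e.g. `[Retrieval]`, `[Relevant]`, `[Utility:5]`) and the `<paragraph>...</paragraph>` passages are what gets stripped for the baseline. The sketch below illustrates that cleanup; the token list follows the Self-RAG paper, and the exact behavior of RAGLAB's preprocessing script may differ.
~~~python
# Rough sketch of stripping Self-RAG reflection tokens and retrieved passages
# from a training example. The token list follows the Self-RAG paper; it may
# not exactly match RAGLAB's preprocessing script.
import re

SPECIAL_TOKENS = [
    "[Retrieval]", "[No Retrieval]", "[Continue to Use Evidence]",
    "[Relevant]", "[Irrelevant]",
    "[Fully supported]", "[Partially supported]", "[No support / Contradictory]",
] + [f"[Utility:{i}]" for i in range(1, 6)]

def remove_special_tokens(text: str) -> str:
    # Drop retrieved passages enclosed in <paragraph>...</paragraph> first.
    text = re.sub(r"<paragraph>.*?</paragraph>", "", text, flags=re.DOTALL)
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return re.sub(r"\s+", " ", text).strip()

print(remove_special_tokens(
    "[Retrieval]<paragraph>some passage</paragraph>[Relevant]"
    "Paris is the capital of France.[Utility:5]"
))  # -> "Paris is the capital of France."
~~~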
## LoRA finetune llama3-70b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- LoRA finetune llama3-70b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-Lora.sh
~~~

## QLoRA finetune llama3-70B as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- 8-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune llama3-70B (see the quantization sketch below)
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-4bit.sh
~~~
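- QLoRA keeps the LoRA adapters in higher precision while the frozen base weights are loaded quantized to 8-bit or 4-bit, which is what lets a 70B model fit in far less VRAM. Below is a sketch of how such quantized loading could be set up through `transformers` and `bitsandbytes`; the exact settings in RAGLAB's scripts may differ.
~~~python
# Sketch of 4-bit / 8-bit quantized model loading for QLoRA finetuning.
# Settings and model id are illustrative assumptions, not RAGLAB's exact config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for 8-bit QLoRA
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=quant_config,
    device_map="auto",
)
# LoRA adapters would then be attached on top of this quantized base model
# (see the peft sketch earlier in this document).
~~~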

## QLoRA finetune selfrag-70B
- 8-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-4bit.sh
~~~
Empty file removed preprocess/README.md
Empty file.
178 changes: 3 additions & 175 deletions readme.md
@@ -224,185 +224,13 @@
> - During the Factscore evaluation process, we used GPT-3.5 as the evaluation model, so there's no need to download a local model. If you need to use a local model to evaluate Factscore, please refer to [Factscore](https://github.com/shmsw25/FActScore)
# Process knowledge database from source
## 💽 process wiki2023 as vector database
### 10-samples test
- The 10-samples test is intended to validate that the environment is set up correctly
- Run the ColBERT embedding process on `enwiki-20230401-10samples.tsv`
1. Change the root path for the variables `checkpoint`, `index_dbPath`, and `collection` in
[wiki2023-10samples_tsv-2-colbert_embedding.py](https://github.com/fate-ubw/RAGLAB/blob/main/preprocess/colbert-wiki2023-preprocess/wiki2023-db_into_tsv-10samples.py). ColBERT runs into many issues when relative paths are used to generate embeddings, so the current version of RAGLAB uses absolute paths.
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
2. run
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
~~~
- The embedding process takes around 15 minutes the first time.
- The first run is relatively slow because ColBERT needs to recompile the `torch_extensions`; loading the processed embeddings afterwards is fast. If the script finishes without errors and the retrieved text is printed, the environment is set up correctly.
### Embedding the whole wiki2023
- You can download the [colbert embedding wiki2023]() as the RAGLAB database (40GB)
~~~bash
cd /RAGLAB/data/retrieval/colbertv2.0_embedding
gdown --id xxxxxx
~~~
- Modify the path in the meta.json file
- Embedding the whole wiki2023 into vectors takes about 22 hours, so we recommend downloading the prepared embeddings
#### Download wiki2023 raw data
- The current version of RAGLAB uses wiki2023 as the knowledge database
- The wiki2023 source data comes from [factscore](https://github.com/shmsw25/FActScore)
- Method 1: download wiki2023 directly from [google_drive](https://drive.google.com/file/d/1mekls6OGOKLmt7gYtHs0WGf5oTamTNat/view)
- Method 2: download via gdown
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
~~~
### Preprocess wiki2023
- If the 10-samples test passes, you can proceed with processing the full wiki2023.
1. Preprocess `.db -> .tsv` (ColBERT can only read files in `.tsv` format.)
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
~~~
2. `.tsv -> embedding`
- remember to change the root path of `checkpoint`, `index_dbPath` and `collection`
~~~bash
# change root path
checkpoint = '/your_root_path/RAGLAB/model/colbertv2.0'
index_dbPath = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples'
collection = '/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv'
~~~
- run bash script
~~~bash
cd RAGLAB
sh run/wiki2023_preprocess/4-wiki2023_tsv-2-colbert_embedding.sh
~~~
## 💽 Process wiki2018 as vector database
- This section is a tutorial on using wiki2018
### Download text files
- Download the raw wiki2018 database directly using wget
~~~bash
cd RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
~~~
### Process raw wiki2018 into colbert format
~~~bash
cd RAGLAB
sh run/wiki2018_preprocess/1-wiki2018_tsv_2_tsv.sh
~~~
### Modify wiki2018 embedding config file
1. Change the path
~~~
cd /RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018/indexes/wiki2018
vim metadata.json
~~~
- You only need to modify two paths in the metadata.json file: delete the original paths and paste in the following ones. No other parameters need to be modified (a scripted alternative is sketched at the end of this subsection).
~~~sh
"collection": "/home/ec2-user/SageMaker/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv",
"experiment": "/home/ec2-user/SageMaker/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018",
~~~
- After the modification, you can start the ColBERT server directly. For how to launch experiments, refer to the last section of the readme: Inference experiments.
- RAGLAB has already uploaded the processed knowledge database to [Hugging Face](https://huggingface.co/datasets/RAGLAB/data). If you wish to process the knowledge database yourself, please refer to the following document:
- document: [process_wiki.md](./docs/process_wiki.md)
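- If you prefer not to edit metadata.json by hand, the two keys could also be patched with a short script. The sketch below assumes the `collection` and `experiment` keys sit at the top level of the file as shown above; adjust the lookup if they are nested in your version of the file.
~~~python
# Sketch: patch the two paths in metadata.json instead of editing it in vim.
# Assumes "collection" and "experiment" are top-level keys; paths are examples.
import json

meta_path = ("data/retrieval/colbertv2.0_embedding/wiki2018/"
             "indexes/wiki2018/metadata.json")

with open(meta_path, "r", encoding="utf-8") as f:
    meta = json.load(f)

meta["collection"] = "/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv"
meta["experiment"] = "/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018"

with open(meta_path, "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)
~~~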
# 🤖 Train models
- This section covers the process of training models in RAGLAB. You can either download all pre-trained models from HuggingFace🤗, or use the tutorial below to train from scratch📝.
- [All data](#all-data-for-reproduce-paper-results) provides all data necessary for finetuning.
## 10-samples test for finetune
- The 10-samples training dataset has already been processed; start the bash script directly to begin testing.
- Note: The test script only uses one GPU
- Full-weight finetuning requires an 80GB VRAM GPU
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-full_weight-10samples.sh
~~~
- LoRA (Low-Rank Adaptation) requires at least 26GB of VRAM
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora-10samples.sh
~~~
- Congratulations! You can now start finetuning the baseline model and selfrag-8B
## Finetune selfrag-8B
- full weight finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-full_weight.sh
~~~
- lora finetune
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-selfrag_8b-Lora.sh
~~~
## Finetune llama3-8b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- This produces baseline train_data with the special tokens and retrieved passages removed (Q: What are special tokens? A: Special tokens are a concept introduced by SelfRAG)
- Full-weight finetune llama3-8b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-baseline-full_weight.sh
~~~
- lora finetune llama3-8b-baseline
~~~bash
cd RAGLAB
sh run/rag_train/script_finetune-llama3-baseline-Lora.sh
~~~
## LoRA finetune llama3-70b as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- LoRA finetune llama3-70b-baseline using the processed data
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-Lora.sh
~~~
## QLoRA finetune llama3-70B as baseline
- Preprocess the training data: the baseline model's training data needs the special tokens removed.
~~~bash
cd RAGLAB
sh run/traindataset_preprocess/selfrag_traindata-remove_special_tokens.sh
~~~
- 8-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune llama3-70B
~~~bash
sh run/rag_train/script_finetune-llama3-70B-baseline-QLora-4bit.sh
~~~
## QLoRA finetune selfrag-70B
- 8-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-8bit.sh
~~~
- 4-bit QLoRA finetune selfrag-70B
~~~bash
sh run/rag_train/script_finetune-selfrag_llama3-70b-QLora-4bit.sh
~~~
- document: [train_docs.md](./docs/train_docs.md)
## :bookmark: License
