
Merge branch 'main' into doc-edit
hendrydong committed Aug 7, 2023
2 parents 2679fb9 + 4e385b5 commit c0ab0e6
Showing 23 changed files with 232 additions and 50 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -21,7 +21,7 @@
[![Doc](https://img.shields.io/badge/Website-Doc-ff69b4.svg)](https://optimalscale.github.io/LMFlow/)
[![Embark](https://img.shields.io/badge/Discord-LMFlow-%237289da.svg?logo=discord)](https://discord.gg/u9VJNpzhvA)
[![slack badge](https://img.shields.io/badge/Slack-Join-blueviolet?logo=slack&amp)](https://join.slack.com/t/lmflow/shared_invite/zt-1wju9nicy-woXbNtS~5MavHSAtiMxmxQ)
[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://i.imgloc.com/2023/07/13/VgJyaZ.jpeg)
[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://s1.ax1x.com/2023/08/06/pPAQTPI.jpg)

An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.

@@ -33,6 +33,7 @@ Large Model for All.


## Latest News
* [2023-08-07] Support [Flash Attention-2](https://crfm.stanford.edu/2023/07/17/flash2.html). Check out [flash_attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md) for more details.
* [2023-08-02] Support [Llama2](https://ai.meta.com/llama/), [ChatGLM2](https://huggingface.co/THUDM/chatglm2-6b), and [Baichuan](https://huggingface.co/baichuan-inc/Baichuan-7B) models.
* [2023-07-23] :rocket: [LMFlow multimodal chatbot](https://github.com/OptimalScale/LMFlow/blob/main/scripts/run_vis_chatbot_gradio_minigpt4.sh) is now available! It supports multimodal inputs of images and text. An [Online Demo](http://multimodal.lmflow.online) is also provided. (The service runs on a single GPU, so you may see "queuing" or "application busy" messages when multiple users access it at the same time; please wait and try again later.) :rocket: ![image](https://github.com/OptimalScale/LMFlow/blob/rpan-vision-encoder/assets/multimodal-chatbot-demo.gif)
* [2023-06-22] [LMFlow paper](https://arxiv.org/abs/2306.12420) is out! Check out our implementation details at https://arxiv.org/abs/2306.12420
@@ -213,7 +214,7 @@ cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
./install.sh
```

## 2. Prepare Dataset
@@ -336,6 +337,16 @@ You can configure DeepSpeed under configs. Details can be found at [DeepSpee

Thanks to the great efforts of [llama.cpp](https://github.com/ggerganov/llama.cpp), everyone can run their LLaMA models on a CPU with 4-bit quantization. We provide a script to convert LLaMA LoRA weights to `.pt` files; you then only need `convert-pth-to-ggml.py` in llama.cpp to perform the quantization.

### 4.4 Vocabulary List Extension

Now you can train your own SentencePiece tokenizer and merge it with the model's original Hugging Face tokenizer. Check out [vocab_extension](https://github.com/OptimalScale/LMFlow/blob/main/scripts/vocab_extension) for more details.

### 4.5 Position Interpolation for LLaMA Models
Now LMFlow supports the latest Linear & NTK (Neural Tangent Kernel) scaling techniques for LLaMA models. Check out [position_interpolation](https://github.com/OptimalScale/LMFlow/blob/main/readme/Position_Interpolation.md) for more details.

### 4.6 FlashAttention-2
Now LMFlow supports the latest [FlashAttention-2](https://crfm.stanford.edu/2023/07/17/flash2.html). Check out [flash_attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md) for more details.

## 5. Model Release

@@ -385,7 +396,6 @@ Then you can check the model performance at our [Doc](https://optimalscale.githu
Please refer to our [Documentation](https://optimalscale.github.io/LMFlow/) for more API reference and experimental results.



## Acknowledgement
LMFlow draws inspiration from various studies, including but not limited to:
- Alpaca: https://github.com/tatsu-lab/stanford_alpaca
8 changes: 8 additions & 0 deletions install.sh
@@ -0,0 +1,8 @@
#!/bin/bash

pip install -e .

# Install FlashAttention-2 only when a GPU known to support it (A100 / A40) is detected
gpu_state="$(nvidia-smi --query-gpu=name --format=csv,noheader)"
if [[ "${gpu_state}" == *"A100"* || "${gpu_state}" == *"A40"* ]]; then
  pip install flash-attn==2.0.2
fi
40 changes: 40 additions & 0 deletions readme/Position_Interpolation.md
@@ -0,0 +1,40 @@
# Position Interpolation
Now LMFlow supports the latest Linear & NTK (Neural Tangent Kernel) scaling techniques for LLaMA models. \
For more details on these techniques, check out the links below:
* Linear scaling: \
https://arxiv.org/abs/2306.15595
* NTK scaling: \
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
## Usage
To use the position interpolation techniques, set the following options:
```
--truncate_to_model_max_length False
--do_rope_scaling True
```
For linear scaling, set the extension ratio with:
```
--rope_pi_ratio 4
```
For NTK scaling, set the extension ratio with:
```
--rope_ntk_ratio 4
```
Here is an example of an evaluation script:
```
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 \
deepspeed examples/evaluation.py \
--answer_type text \
--model_name_or_path pinkmanlove/llama-7b-hf \
--dataset_path data/wiki_en_eval \
--deepspeed examples/ds_config.json \
--inference_batch_size_per_device 1 \
--truncate_to_model_max_length False \
--block_size 4096 \
--use_flash_attention True \
--do_rope_scaling True \
--rope_pi_ratio 2 \
--rope_ntk_ratio 4 \
--metric ppl
```
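
Conceptually, the two options modify the rotary position embedding (RoPE) in different ways: linear scaling (`--rope_pi_ratio`) compresses position indices into the trained range, while NTK scaling (`--rope_ntk_ratio`) enlarges the rotary base so high-frequency dimensions are stretched less. The sketch below only illustrates the idea and is not LMFlow's actual implementation; the function name and the exact NTK formula are assumptions.
```
import torch

def rope_angles(dim, max_pos, base=10000.0, pi_ratio=1.0, ntk_ratio=1.0):
    """Illustrative RoPE angle table with linear / NTK scaling applied."""
    # NTK scaling: grow the rotary base so low-frequency dimensions absorb
    # most of the context extension (common community formula, assumed here).
    if ntk_ratio > 1.0:
        base = base * ntk_ratio ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_pos).float()
    # Linear scaling (position interpolation): shrink position indices by the
    # extension ratio so they stay within the original training window.
    if pi_ratio > 1.0:
        positions = positions / pi_ratio
    return torch.outer(positions, inv_freq)  # shape: (max_pos, dim // 2)

# e.g. a 4x NTK extension for a model with head dim 128 and 8192 positions
angles = rope_angles(dim=128, max_pos=8192, ntk_ratio=4.0)
```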
18 changes: 18 additions & 0 deletions readme/flash_attn2.md
@@ -0,0 +1,18 @@
# Flash Attention 2.0
We're thrilled to announce that LMFlow now supports training and inference using **FlashAttention-2**! This cutting-edge feature will take your language modeling to the next level. To use it, simply add `--use_flash_attention True` to the corresponding bash script.
Here is an example of how to use it:
```
#!/bin/bash
pip install flash_attn==2.0.2
deepspeed --master_port=11000 \
examples/chatbot.py \
--deepspeed configs/ds_config_chatbot.json \
--model_name_or_path LMFlow/Full-Robin-7b-v2 \
--max_new_tokens 1024 \
--prompt_structure "###Human: {input_text}###Assistant:" \
--end_string "#" \
--use_flash_attention True
```
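
Under the hood, the `--use_flash_attention True` flag patches the model's attention to call a flash-attn kernel instead of the standard PyTorch implementation (the real patches live under `src/lmflow/utils/flash_attention/`). The snippet below is only a rough sketch of that idea with assumed tensor shapes; it is not LMFlow's exact code.
```
# Rough sketch: calling a FlashAttention-2 kernel directly (illustrative only).
# Assumes flash-attn >= 2.0 is installed and tensors are fp16/bf16 on a GPU.
import torch
from flash_attn.flash_attn_interface import flash_attn_func

batch, seq_len, num_heads, head_dim = 2, 1024, 32, 128
q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention computed by the fused flash-attn kernel.
out = flash_attn_func(q, k, v, causal=True)  # (batch, seq_len, num_heads, head_dim)
```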

Upgrade to LMFlow now and experience the future of language modeling!
4 changes: 4 additions & 0 deletions scripts/run_evaluation.sh
@@ -1,5 +1,9 @@
#!/bin/bash

if [ ! -d data/MedQA-USMLE ]; then
cd data && ./download.sh MedQA-USMLE && cd -
fi

CUDA_VISIBLE_DEVICES=0 \
deepspeed examples/evaluation.py \
--answer_type medmcqa \
4 changes: 4 additions & 0 deletions scripts/run_evaluation_accelerator.sh
@@ -1,5 +1,9 @@
#!/bin/bash

if [ ! -d data/MedQA-USMLE ]; then
cd data && ./download.sh MedQA-USMLE && cd -
fi

CUDA_VISIBLE_DEVICES=0 accelerate launch --config_file configs/accelerator_singlegpu_config.yaml examples/evaluation.py \
--answer_type usmle \
--model_name_or_path gpt2-large \
5 changes: 5 additions & 0 deletions scripts/run_evaluation_with_lora.sh
@@ -3,6 +3,11 @@
# --model_name_or_path specifies the original huggingface model
# --lora_model_path specifies the model difference introduced by finetuning,
# i.e. the one saved by ./scripts/run_finetune_with_lora.sh

if [ ! -d data/alpaca ]; then
cd data && ./download.sh alpaca && cd -
fi

CUDA_VISIBLE_DEVICES=0 \
deepspeed examples/evaluation.py \
--answer_type text \
5 changes: 4 additions & 1 deletion scripts/run_finetune.sh
@@ -14,6 +14,9 @@ output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/alpaca/train
if [ ! -d ${dataset_path} ]; then
cd data && ./download.sh alpaca && cd -
fi

mkdir -p ${output_dir} ${log_dir}

@@ -27,7 +30,7 @@ deepspeed ${deepspeed_args} \
--block_size 512 \
--per_device_train_batch_size 1 \
--deepspeed configs/ds_config_zero3.json \
--bf16 \
--fp16 \
--run_name finetune \
--validation_split_percentage 0 \
--logging_steps 20 \
5 changes: 4 additions & 1 deletion scripts/run_finetune_with_lora.sh
@@ -12,6 +12,9 @@ output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/alpaca/train
if [ ! -d ${dataset_path} ]; then
cd data && ./download.sh alpaca && cd -
fi

mkdir -p ${output_dir} ${log_dir}

@@ -28,7 +31,7 @@ deepspeed ${deepspeed_args} \
--lora_r 8 \
--save_aggregated_lora 0\
--deepspeed configs/ds_config_zero2.json \
--bf16 \
--fp16 \
--run_name finetune_with_lora \
--validation_split_percentage 0 \
--logging_steps 20 \
5 changes: 4 additions & 1 deletion scripts/run_finetune_with_lora_save_aggregated_weights.sh
@@ -13,6 +13,9 @@ log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/alpaca/train
eval_dataset_path=${project_dir}/data/alpaca/test
if [ ! -d ${dataset_path} ]; then
cd data && ./download.sh alpaca && cd -
fi

mkdir -p ${output_dir} ${log_dir}

@@ -29,7 +32,7 @@ deepspeed ${deepspeed_args} \
--lora_r 8 \
--save_aggregated_lora 1\
--deepspeed configs/ds_config_zero2.json \
--bf16 \
--fp16 \
--run_name finetune_with_lora \
--validation_split_percentage 0 \
--logging_steps 20 \
3 changes: 3 additions & 0 deletions scripts/run_multistage_finetune.sh
@@ -11,6 +11,9 @@ project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}
dataset_path="${project_dir}/data/example_dataset/train"
if [ ! -d ${dataset_path} ]; then
cd data && ./download.sh example_dataset && cd -
fi

mkdir -p ${output_dir} ${log_dir}

4 changes: 4 additions & 0 deletions scripts/run_raft_align.sh
@@ -11,6 +11,10 @@ project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

if [ ! -d data/hh_rlhf ]; then
cd data && ./download.sh hh_rlhf && cd -
fi

mkdir -p ${output_dir} ${log_dir}

export PYTHONPATH=.
3 changes: 3 additions & 0 deletions scripts/run_reward_modeling.sh
@@ -14,6 +14,9 @@ output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/hh_rlhf/rm/hh_rlhf_rm_training.json
if [ ! -d data/hh_rlhf ]; then
cd data && ./download.sh hh_rlhf && cd -
fi

mkdir -p ${output_dir} ${log_dir}

23 changes: 23 additions & 0 deletions scripts/vocab_extension/README.md
@@ -0,0 +1,23 @@
# Vocab Extension
## Train & Merge Tokenizer
To automatically convert data, train a SentencePiece tokenizer, and merge the tokenizer, you can run the following script:
```
bash scripts/vocab_extension/train_merge_tokenizer.sh
```
Alternatively, you can run each of the three steps separately:

## Convert JSON Data to TXT
To convert JSON data to TXT for sentencepiece tokenizer training, run:
```
bash scripts/vocab_extension/convert_json_to_txt.sh
```
## Train SentencePiece Tokenizer
To train a SentencePiece tokenizer, run:
```
bash scripts/vocab_extension/train_tokenizer.sh
```
## Merge the New Tokenizer with the Original One
To merge a new tokenizer with the original one, run:
```
bash scripts/vocab_extension/merge_tokenizer.sh
```
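
As a rough illustration of what the merge step does, the sketch below follows the common SentencePiece-merge recipe (it is an assumption, not necessarily how `utils/merge_tokenizer.py` is implemented): the newly trained tokenizer's pieces are appended to the base model's SentencePiece proto.
```
# Illustrative sketch of merging a newly trained SentencePiece model into the
# base tokenizer; paths and the 0.0 score for new pieces are assumptions.
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

base_tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_3b")
base_spm = sp_pb2_model.ModelProto()
base_spm.ParseFromString(base_tokenizer.sp_model.serialized_model_proto())

new_spm = sp_pb2_model.ModelProto()
with open("./output_models/new_tokenizer/example.model", "rb") as f:
    new_spm.ParseFromString(f.read())

existing_pieces = {p.piece for p in base_spm.pieces}
for p in new_spm.pieces:
    if p.piece not in existing_pieces:          # only append genuinely new tokens
        piece = sp_pb2_model.ModelProto().SentencePiece()
        piece.piece, piece.score = p.piece, 0.0
        base_spm.pieces.append(piece)

with open("./output_models/merged_tokenizer/merged.model", "wb") as f:
    f.write(base_spm.SerializeToString())       # load later via LlamaTokenizer(vocab_file=...)
```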
2 changes: 1 addition & 1 deletion scripts/vocab_extension/merge_tokenizer.sh
@@ -1,5 +1,5 @@
#!/bin/bash
mkdir -p ./output_models/new_tokenizer
python utils/merge_tokenizer.py --tokenizer_dir pinkmanlove/llama-7b-hf \
python utils/merge_tokenizer.py --tokenizer_dir openlm-research/open_llama_3b \
--chinese_sp_model_file ./output_models/new_tokenizer/example.model \
--output_dir ./output_models/merged_tokenizer \
5 changes: 3 additions & 2 deletions scripts/vocab_extension/train_merge_tokenizer.sh
@@ -14,10 +14,11 @@ python utils/train_tokenizer.py --dataset_path ./data/wiki_zh_eval/converted_dat
--model_type bpe \
--output_dir ./output_models/new_tokenizer \
--user_defined_symbols 0,1,2,3,4,5,6,7,8,9,% \
--vocab_size 20000
--vocab_size 20000 \
--max_sentencepiece_length 4

# merge the new tokenizer with the old one
mkdir -p ./output_models/merged_tokenizer
python utils/merge_tokenizer.py --chinese_sp_model_file ./output_models/new_tokenizer/example.model \
--tokenizer_dir pinkmanlove/llama-7b-hf \
--tokenizer_dir openlm-research/open_llama_3b \
--output_dir ./output_models/merged_tokenizer
3 changes: 2 additions & 1 deletion scripts/vocab_extension/train_tokenizer.sh
@@ -4,4 +4,5 @@ python utils/train_tokenizer.py --dataset_path ./data/wiki_zh_eval/converted_dat
--model_type bpe \
--output_dir ./output_models/new_tokenizer \
--user_defined_symbols 0,1,2,3,4,5,6,7,8,9,% \
--vocab_size 20000
--vocab_size 20000 \
--max_sentencepiece_length 4
51 changes: 34 additions & 17 deletions src/lmflow/models/hf_decoder_model.py
@@ -75,12 +75,7 @@
"A100": ["LlamaForCausalLM", "GPTNeoForCausalLM", "GPT2ForCausalLM", "BloomForCausalLM"],
"A40": ["LlamaForCausalLM","GPTNeoForCausalLM", "GPT2ForCausalLM", "BloomForCausalLM"]
}
if int(flash_attn.__version__.split(".")[0]) == 1:
GPU_SUPPORT_FLASH_ATTENTION = {
"A100": ["LlamaForCausalLM", "GPTNeoForCausalLM", "GPT2ForCausalLM", "BloomForCausalLM"],
"A40": ["GPTNeoForCausalLM", "GPT2ForCausalLM", "BloomForCausalLM"]
}
except ImportError:
except:
pass

class HFDecoderModel(DecoderModel, Tunable):
@@ -140,18 +135,40 @@ def __init__(
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is"
" not supported by this script. You can do it from another"
" script, save it, and load it from here, using"
" --tokenizer_name."
)

try:
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is"
" not supported by this script. You can do it from another"
" script, save it, and load it from here, using"
" --tokenizer_name."
)

except RecursionError:
logger.warning("The tokenizer_config.json file doesn't set the special tokens. Using default values: <unk>, <s>, </s> for unknown token, bos token and eos token respectively.")
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
**tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
**tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is"
" not supported by this script. You can do it from another"
" script, save it, and load it from here, using"
" --tokenizer_name."
)

self.tokenizer = tokenizer

torch_dtype = (
8 changes: 4 additions & 4 deletions src/lmflow/utils/flash_attention/gpt2_flash_attention.py
@@ -8,11 +8,11 @@

from einops import rearrange

import flash_attn
if int(flash_attn.__version__.split(".")[0]) == 1:
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
if int(flash_attn.__version__.split(".")[0]) == 2:
#try to import flash_attn 2.x.x, if not, import flash_attn 1.x.x
try:
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
except:
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func

from flash_attn.bert_padding import unpad_input, pad_input

8 changes: 4 additions & 4 deletions src/lmflow/utils/flash_attention/gpt_neo_flash_attention.py
@@ -4,11 +4,11 @@
import transformers
from einops import rearrange

import flash_attn
if int(flash_attn.__version__.split(".")[0]) == 1:
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
if int(flash_attn.__version__.split(".")[0]) == 2:
#try to import flash_attn 2.x.x, if not, import flash_attn 1.x.x
try:
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
except:
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func

from flash_attn.bert_padding import unpad_input, pad_input
