update dataset related documentation #317

Merged: 1 commit, Apr 16, 2023
6 changes: 5 additions & 1 deletion applications/DeepSpeed-Chat/README.md
@@ -43,6 +43,7 @@ A fast, affordable, scalable and open system framework for enabling end-to-end R
- [🕐 Step 1 - Supervised Fine-Tuning](#-step-1---supervised-fine-tuning)
- [🕑 Step 2 - Reward Model](#-step-2---reward-model)
- [🕒 Step 3 - Reinforcement Learning with Human Feedback](#-step-3---reinforcement-learning-with-human-feedback)
- [🐼 Adding and using your own datasets in DeepSpeed-Chat](#-adding-and-using-your-own-datasets-in-deepspeed-chat)
- [🐼 Customizing RLHF training pipeline via DeepSpeed-Chat’s APIs](#-customizing-your-own-rlhf-training-pipeline-using-deepspeed-chats-rlhf-apis)
- [🐼 Serving Your Model: Plug-in and Test!](#-serving-plug-in-your-final-model-trained-by-deepspeed-chat-and-test-it-out)
- [🔥 Training Performance Evaluation 🔥](#-training-performance-evaluation-)
@@ -230,9 +231,12 @@ bash training_scripts/single_gpu/run_1.3b.sh
</p></details>


### 🐼 Adding and using your own datasets in DeepSpeed-Chat
In addition to the datasets used in our example scripts, you can also add and use your own datasets. To do so, first add a new class in [training/utils/data/raw_datasets.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/utils/data/raw_datasets.py) that defines the format of your data. The class must follow the APIs and format defined by the PromptRawDataset class, which ensures the consistent data format that DeepSpeed-Chat relies on; the existing classes are good examples to learn from.
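
As a rough illustration, here is a minimal sketch of such a class. The class name, the my_org/my_dataset identifier, and the column names are all placeholders; the method set mirrors the existing PromptRawDataset subclasses (e.g., DahoasRmstaticDataset), so check raw_datasets.py for the exact `__init__` signature your version of the file expects.

```python
# Hypothetical example class, added alongside the existing ones in
# training/utils/data/raw_datasets.py (load_dataset is already imported there).
class MyCustomDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank):
        super().__init__(output_path, seed, local_rank)
        self.dataset_name = "my_org/my_dataset"        # name passed on the command line
        self.dataset_name_clean = "my_org_my_dataset"  # filesystem-safe variant
        self.raw_datasets = load_dataset("my_org/my_dataset")

    def get_train_data(self):
        return self.raw_datasets["train"]

    def get_eval_data(self):
        return self.raw_datasets["test"]

    # A "sample" is one row of the dataset. The prompt should carry the
    # same "Human:"/"Assistant:" markers the example datasets use.
    def get_prompt(self, sample):
        return " Human: " + sample["prompt"] + " Assistant:"

    def get_chosen(self, sample):
        return " " + sample["chosen"]

    # Step 1 (SFT) only exercises the prompt+chosen path; the rejected
    # accessors matter for datasets that also feed steps 2 and 3.
    def get_rejected(self, sample):
        return " " + sample["rejected"]

    def get_prompt_and_chosen(self, sample):
        return self.get_prompt(sample) + self.get_chosen(sample)

    def get_prompt_and_rejected(self, sample):
        return self.get_prompt(sample) + self.get_rejected(sample)
```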

Second, add an if condition for your new dataset to the get_raw_dataset function in [training/utils/data/data_utils.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/utils/data/data_utils.py). The dataset_name string in the if condition should be the dataset name you will provide as an arg to the training scripts. Finally, add your new dataset's dataset_name to the "--data_path" arg in your training scripts, as shown in the sketch below.
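
For instance, the new branch might look like the following sketch, reusing the hypothetical MyCustomDataset from above; only the marked branch is new, and the rest of the function already exists in data_utils.py.

```python
# In training/utils/data/data_utils.py
def get_raw_dataset(dataset_name, output_path, seed, local_rank):
    if "Dahoas/rm-static" in dataset_name:
        return raw_datasets.DahoasRmstaticDataset(output_path, seed, local_rank)
    # ... existing branches for the other built-in datasets ...
    elif "my_org/my_dataset" in dataset_name:  # hypothetical new branch
        return raw_datasets.MyCustomDataset(output_path, seed, local_rank)
    else:
        raise RuntimeError(f"We do not have configs for dataset {dataset_name}")
```

The training scripts would then pick the dataset up via, e.g., `--data_path Dahoas/rm-static my_org/my_dataset`.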


Note that some datasets have only one response instead of two. Those datasets can only be used in step 1, and their dataset_name should be added to the "--sft_only_data_path" arg instead of the "--data_path" arg. If you plan to do only step 1 SFT, adding more single-response datasets is definitely beneficial. If you plan to run steps 2 and 3, however, adding too many single-response datasets during SFT could backfire: because these data can differ from the data used in steps 2 and 3, the mismatched distributions may cause training instability or worse model quality during steps 2 and 3. That is part of the reason we focused on datasets with two responses and an explicit preference between them, and always split each dataset across all three steps.

### 🐼 Customizing your own RLHF training pipeline using DeepSpeed-Chat’s RLHF APIs

@@ -52,6 +52,7 @@ Most of the arguments used in the main.py file have clear explanations and are u
|----------------------------------------------------------|------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| --data_path | Data used to finetune the model | You can specify multiple data resources to train the model, e.g., Dahoas/rm-static Dahoas/full-hh-rlhf |
| --data_split | Split the data for three-step training | Following InstructGPT, we make sure the three-step training has no overlapping data. We use 20%, 40%, and 40% of the data for each step respectively. You can change it to 10,0,0 if you only do SFT. |
| --sft_only_data_path | Single-response data used to finetune the model | Single-response data that will only be used in step 1 should be passed in this arg instead of the data_path arg above. Datasets in this arg will not be split and will be used, in full, in step 1 only. |
| --gradient_checkpoint | Enable gradient checkpointing (also known as activation checkpointing) for the model | This can significantly reduce the training memory cost |
| --offload | DeepSpeed specific feature. Offload the model to CPU/NVMe to save memory | This makes it possible to train larger models with less memory consumption, but it slows down training. |
| --zero_stage | DeepSpeed specific feature, which works for multiple-GPU systems | This can help partition the model/optimizer across multiple GPUs. Please see [here](https://www.deepspeed.ai/tutorials/zero/) |