From 934f24cdb3b9761a54ec838d7eab9dcb7c0d8413 Mon Sep 17 00:00:00 2001 From: jianzhnie Date: Wed, 24 May 2023 10:40:27 +0800 Subject: [PATCH 1/2] Update README --- README.md | 102 +++++++++++++++- chatgpt/dataset/README.md | 6 - docs/README.md | 243 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 340 insertions(+), 11 deletions(-) create mode 100644 docs/README.md diff --git a/README.md b/README.md index ddd6d48..abd96c9 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ If you like the project, please show your support by [leaving a star ⭐](https: - [2023/05] 🔥 We implement **Stanford Alpaca**. - [2023/04] We released **RLHF(Reinforcement Learning with Human Feedback) Pipeline** . -- [2023/03] We released the code **OpenChatGPT An Open-Source libraray to train ChatBot like ChatGPT **. +- [2023/03] We released the code **OpenChatGPT: An Open-Source libraray to train ChatBot like ChatGPT**. ## Table of Contents @@ -39,9 +39,13 @@ If you like the project, please show your support by [leaving a star ⭐](https: - [Introduction](#introduction) - [News](#news) - [Table of Contents](#table-of-contents) - - [Instruction Data](#instruction-data) + - [Data Collection](#data-collection) + - [Instruction Datasets](#instruction-datasets) + - [RLHF Datasets](#rlhf-datasets) + - [Data Preprocessing](#data-preprocessing) + - [Data Fomatting](#data-fomatting) - [Install](#install) - - [Fintune](#fintune) + - [Instruction Fintune](#instruction-fintune) - [Fine-tuning Alpaca-7B](#fine-tuning-alpaca-7b) - [Using DeepSpeed](#using-deepspeed) - [Fine-tuning Alpaca-7B with Lora](#fine-tuning-alpaca-7b-with-lora) @@ -52,8 +56,88 @@ If you like the project, please show your support by [leaving a star ⭐](https: - [Acknowledgements](#acknowledgements) - [Citation](#citation) +## Data Collection -## Instruction Data +### Instruction Datasets + +A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT,LLaMA,Alpaca). 
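
Most of the datasets listed in the table below are hosted on the Hugging Face Hub and can be loaded directly with the `datasets` library. A minimal sketch, using `tatsu-lab/alpaca` from the table purely as an example:

```python
from datasets import load_dataset

# Any Hugging Face dataset id from the table below can be swapped in here;
# "tatsu-lab/alpaca" is used only as an example.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

print(alpaca.column_names)        # ['instruction', 'input', 'output', 'text']
print(alpaca[0]["instruction"])   # first instruction in the dataset
```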
+ +Referring to [this](https://github.com/jianzhnie/awesome-instruction-datasets) ([@jianzhnie](https://github.com/jianzhnie)), we labeled each collected dataset according to the following rules: + +(Lang)Lingual-Tags: + +- EN: Instruction datasets in English +- CN: Instruction datasets in Chinese +- ML: [Multi-lingual] Instruction datasets in multiple languages + +(Task)Task-Tags: + +- MT: [Multi-task] Datasets containing multiple tasks +- TS: [Task-specific] Datasets tailored for specific tasks + +(Gen)Generation-method: + +- HG: [Human Generated Dataset] Datasets created by humans +- SI: [Self-Instruct] Datasets generated using self-instruct methods +- MIX: [Mixed Dataset] Dataset contains both human and machine generated data +- COL: [Collection of Dataset] Dataset made from a collection of other datasets + +| Project | Datasets | Org | Nums | Lang | Task | Gen | Type | Src | +| :----------------------------------------------------------- | :----------------------------------------------------------: | -------------------------- | :------ | :---- | :---- | :--- | :---------------------------------------- | :---------------------------------- | +| [Chain of Thought](https://github.com/google-research/FLAN) | [cot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/niv2_few_shot_data) | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | +| [GPT4all](https://github.com/nomic-ai/gpt4all) | [nomic-ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | nomic-ai | 806199 | EN | MT | COL | code, storys and dialogs | distillation from GPT-3.5-turbo | +| [GPTeacher](https://github.com/teknium1/GPTeacher) | [GPT-4 General-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Instruct)\|[Roleplay-Instruct](https://github.com/teknium1/GPTeacher/tree/main/Roleplay) \|[Code-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Codegen)\| [Toolformer](https://github.com/teknium1/GPTeacher/tree/main/Toolformer) | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | +| [Guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | [JosephusCheung/GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | JosephusCheung | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | +| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | [Hello-SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Hello-SimpleAI \| 万得资讯 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | +| [HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | [Hello-SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | Hello-SimpleAI\|万得资讯 | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | +| [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) | tatsu-lab | 52002 | EN | MT | SI | general instruct | text-davinci-003 | +| [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | +| [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 
[alpaca_data_zh_51k](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/data/alpaca_data_zh_51k.json) | ymcui(讯飞) | 51k | CN | MT | SI | general instruct | text-davinci-003 | +| [Luotuo-Chinese-LLM](https://github.com/LC1332/Luotuo-Chinese-LLM) 骆驼 | [trans_chinese_alpaca_data](https://github.com/LC1332/Luotuo-Chinese-LLM/blob/main/data/trans_chinese_alpaca_data.json) | LC1332(商汤) | 52k | CN | MT | SI | general instruct | text-davinci-003 | +| [Natural Instructions](https://github.com/allenai/natural-instructions) | [Allen AI 61 task](https://instructions.apps.allenai.org/#:~:text=Download%20Natural%2DInstructions%20%2D%20v1.1)\|[1.5k task](https://instructions.apps.allenai.org/#:~:text=Natural%2DInstructions%20%2D%20v2-,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | +| [belle_cn](https://huggingface.co/BelleGroup) | [BelleGroup/train_1M_CN](https://huggingface.co/datasets/bellegroup/train_1M_CN) \|[BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/bellegroup/train_0.5M_CN) | BelleGroup(链家) | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | | + +Here, we only list a small part of the instruction tuning dataset list, to find more datasets, please check out the following links: +[jianzhnie/awesome-instruction-datasets](https://github.com/jianzhnie/awesome-instruction-datasets): A collection of open-source dataset to train instruction-following LLMs (ChatGPT,LLaMA,Alpaca). + +### RLHF Datasets + +Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. Follwing is a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources. + +| Project | Org | Nums | Lang | Summary | +| :----------------------------------------------------------: | :---------------------------: | ------ | :-----: | ------------------------------------------------------------ | +| [webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons) | Openai | 19,578 | English | In the [WebGPT paper](https://arxiv.org/abs/2112.09332), the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total. | +| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | stanfordnlp | 349 K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., [SteamSHP](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl)). | +| [rlhf-reward-datasets](https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets) | yitingxie | 76.3 k | English | | +| [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) | Dahoas | 112 k | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. 
| +| [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) | Dahoas | | English | | +| [Dahoas/rm-static](https://huggingface.co/datasets/Dahoas/rm-static) | Dahoas | 76.3k | English | Split of [hh-static](https://huggingface.co/datasets/Dahoas/static-hh) used for training reward models after supervised fine-tuning. | +| [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Anthropic | 22k | English | This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. | +| [Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | Instruction-Tuning-with-GPT-4 | 52k | English | Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" | +| [thu-coai/Safety-Prompts](https://github.com/thu-coai/Safety-Prompts) | thu-coai | 100k | Chinese | 中文安全prompts,用于评测和提升大模型的安全性,将模型的输出与人类的价值观对齐。 | +| [Chatgpt-Comparison-Detection project](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection) | | | | | + +To find more datasets, please check out the following links: +[jianzhnie/awesome-instruction-datasets](https://github.com/jianzhnie/awesome-instruction-datasets): A collection of open-source dataset to train instruction-following LLMs (ChatGPT,LLaMA,Alpaca). + +### Data Preprocessing + +We has developed a data preprocessing code that offers a unified interface for various large language models. This code can be used to preprocess data for a variety of purposes, such as Instruct Tuning and RLHF modeling tasks. If you're interested in learning more, please check out the following links to our prompt dataset and data utilities: + +- [prompt_dataset.py](https://github.com/jianzhnie/open-chatgpt/blob/main/chatgpt/dataset/prompt_dataset.py) +- [data_utils.py](https://github.com/jianzhnie/open-chatgpt/blob/main/chatgpt/dataset/data_utils.py) + +### Data Fomatting + +In our collection, all data has been formatted using the same templates. Each sample follows the following structure: + +``` +[ +{"instruction": instruction string, +"input": input string, # (may be empty) +"output": output string} +] +``` ## Install @@ -62,7 +146,15 @@ git clone https://github.com/jianzhnie/open-chatgpt.git pip install -r requirements.txt ``` -## Fintune +**PEFT** + +- If you would like to use LORA along with other parameter-efficient methods, please install [peft](https://github.com/huggingface/peft) as an additional dependency. + +**DeepSpeed** + +- If you want to accelerate LLM training using techniques such as pipeline parallelism, gradient checkpointing, and tensor fusion. Please install [DeepSpeed](https://github.com/microsoft/DeepSpeed). + +## Instruction Fintune ### Fine-tuning Alpaca-7B diff --git a/chatgpt/dataset/README.md b/chatgpt/dataset/README.md index d03b617..6b02846 100644 --- a/chatgpt/dataset/README.md +++ b/chatgpt/dataset/README.md @@ -98,12 +98,6 @@ LocalDataClass:Dict[str, Type] = { -## Using for training RM model - - - - - ## Reference - https://github.com/yaodongC/awesome-instruction-dataset diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..bff73dc --- /dev/null +++ b/docs/README.md @@ -0,0 +1,243 @@ +
[中文](README_zh.md) | English
+ +## Table of Contents +- [Table of Contents](#table-of-contents) +- [Introduction](#introduction) +- [Illustrating RLHF](#illustrating--rlhf) + - [Step 1: Train Supervised Fine-Tuning (SFT)](#step-1-train-supervised-fine-tuning-sft) + - [Step 2: Train Reward Model (RM)](#step-2-train-reward-model-rm) + - [Step 3: Optimize the Policy Using Reinforcement Learning(RLHF)](#step-3-optimize-the-policy-using-reinforcement-learningrlhf) +- [RLHF Dataset preparation](#rlhf-dataset-preparation) +- [☕ Quick Start ☕](#-quick-start-) +- [Examples](#examples) + - [Example1: Learning to summarize with human feedback](#example1-learning-to-summarize-with-human-feedback) + - [Step1: Supervised Fine-Tuning (SFT)](#step1-supervised-fine-tuning-sft) + - [Step2: Training the Reward Model](#step2-training-the-reward-model) + - [Step3: Fine-Tuning the Model using PPO](#step3-fine-tuning-the-model-using-ppo) + - [Example2: Learning to generate positive sentiment with human feedback](#example2-learning-to-generate-positive-sentiment-with-human-feedback) + - [Example3: StackLLaMA: Train LLaMA with RLHF on StackExchange](#example3-stackllama-train-llama-with-rlhf-on-stackexchange) +- [Support Model](#support-model) +- [Contributing](#contributing) +- [License](#license) + + +## Introduction + +`Open-ChatGPT` is a open-source library that allows you to train a hyper-personalized ChatGPT-like ai model using your own data and the least amount of compute possible. + +`Open-ChatGPT` is a general system framework for enabling an end-to-end training experience for ChatGPT-like models. It can automatically take your favorite pre-trained large language models though an OpenAI InstructGPT style three stages to produce your very own high-quality ChatGPT-style model. + +We have Impleamented RLHF (Reinforcement Learning with Human Feedback) powered by transformer library and DeepsSpeed. It supports distributed training and offloading, which can fit extremly large models. + +If you like the project, please show your support by [leaving a star ⭐](https://github.com/jianzhnie/open-chatgpt/stargazers). + + +## Illustrating RLHF + +ChatGPT continues the technical path of [InstructGPT/GPT3.5](https://arxiv.org/abs/2203.02155) and adds RLHF (Reinforcement Learning from Human Feedback) which enhances the adjustment of the model output by humans and sorts the results with greater understanding. + +Reinforcement learning from human feedback (RLHF) is a challenging concept as it involves multiple model training processes and different deployment stages. We break down the training process into three core steps: + +
*Figure: the three-step training process, from the official ChatGPT blog post.*

### Step 1: Train Supervised Fine-Tuning (SFT)

Out of the box, GPT-3.5 struggles to understand the different intentions behind the many kinds of human instructions, and it is hard to judge whether its generated content is of high quality. To give [GPT-3.5](https://arxiv.org/abs/2203.02155) an initial understanding of instruction intent, human annotators write high-quality answers for randomly selected questions from the dataset, and the GPT-3.5 model is fine-tuned on this manually labeled data to obtain the SFT model (Supervised Fine-Tuning).

The SFT model is already better than GPT-3 at following instructions and holding dialogues, but it does not necessarily align with human preferences yet.

+ +
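
To make the idea concrete, supervised fine-tuning is ordinary next-token prediction on (prompt, human demonstration) pairs. A minimal sketch with a small Hugging Face model as a stand-in (the model name and the toy example are illustrative only, not this repo's training script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small stand-in; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Explain reinforcement learning in one sentence."
demonstration = " It trains an agent to maximize reward through trial and error."

# SFT = cross-entropy (next-token) loss on prompt + human-written answer.
# (In practice the prompt tokens are usually masked out of the loss.)
batch = tokenizer(prompt + demonstration, return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])

outputs.loss.backward()
optimizer.step()  # one supervised fine-tuning step
```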

### Step 2: Train Reward Model (RM)

The main objective of this stage is to train a reward model on manually labeled comparison data (about 33K examples). Questions are randomly selected from the dataset, and several different answers are generated for each question with the model obtained in the first stage. Human annotators review these answers as a whole and rank them from best to worst, much like a coach or teacher giving guidance.

This ranking data is then used to train the reward model: every pairwise combination of the ranked answers becomes a training pair. The RM takes a question and an answer as input and outputs a score for the quality of that answer, and its parameters are adjusted so that the preferred answer in each pair scores higher than the rejected one.

+ +
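
The ranked answers are usually turned into a pairwise ranking loss: for each (chosen, rejected) pair, the model is trained so that the chosen answer gets the higher score. A minimal sketch of that objective (the scores below are made-up placeholders for what the reward model's scalar head would output):

```python
import torch
import torch.nn.functional as F

# Placeholder scores for two (chosen, rejected) answer pairs.
r_chosen = torch.tensor([1.2, 0.3], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.9], requires_grad=True)

# Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)),
# which pushes the chosen answer's score above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss.item())
```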

### Step 3: Optimize the Policy Using Reinforcement Learning(RLHF)

Finally, once we have the trained SFT model and the reward model (RM), we can use reinforcement learning (RL) to fine-tune the SFT model based on the RM's feedback. This step keeps the SFT model aligned with human preferences.

This stage uses the reward model from the second stage to update the parameters of the pre-trained policy. Questions are randomly selected from the dataset, the PPO model generates answers for them, and the RM trained in the previous stage scores their quality. The reward score is then turned into a policy gradient, and the PPO model's parameters are updated through reinforcement learning.

+ +
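
In most InstructGPT-style implementations, the score from the reward model is combined with a KL penalty that keeps the updated policy close to the SFT model, and that shaped reward drives the PPO update. A conceptual sketch (all numbers are placeholders):

```python
import torch

# Placeholder values for a single generated answer.
rm_score = torch.tensor(0.8)          # scalar score from the reward model
logprob_policy = torch.tensor(-42.0)  # log-prob of the answer under the current policy
logprob_ref = torch.tensor(-45.0)     # log-prob under the frozen SFT (reference) model

kl_coef = 0.2  # strength of the KL penalty
reward = rm_score - kl_coef * (logprob_policy - logprob_ref)
print(reward)  # this shaped reward is what PPO maximizes
```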

If you want to learn more details about RLHF, I strongly recommend reading Hugging Face's blog post [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf) and its [Chinese translation](https://jianzhnie.github.io/machine-learning-wiki/#/deep-rl/papers/RLHF).



## RLHF Dataset preparation

To successfully train a ChatGPT-like assistant, you need three different datasets: `actor_training_data`, `rlhf_training_data` and `reward_training_data`.

Alternatively, training can be bootstrapped from a pre-existing dataset on the Hugging Face Hub. High-quality candidates are the `Anthropic HH RLHF`, `Stanford Human Preferences (SHP)`, `Reddit TL;DR` and `Comparisons` datasets.

| Dataset | Description |
| :------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| [Anthropic HH RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | This dataset consists of structured question/response pairs with an LLM chatbot that include chosen and rejected responses. |
| [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP) | This dataset is curated from selected "ask" subreddits and contains questions spanning a wide array of question/answer pairs based on the most upvoted responses. |
| [Reddit TL;DR dataset](https://huggingface.co/datasets/CarperAI/openai_summarize_tldr) | The TL;DR Summary Dataset is a collection of carefully selected Reddit posts that contain both the main content and a human-written summary. |
| [Comparisons dataset](https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons) | It includes Reddit posts and two summaries for each post, as well as a selection value indicating which of the two summaries the human annotator preferred. |

To find more datasets, please check out the following links:
[jianzhnie/awesome-prompt-datasets](https://github.com/jianzhnie/awesome-prompt-datasets): A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).

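
All of the datasets above can be pulled straight from the Hugging Face Hub. As a sketch, here is one way the `Anthropic HH RLHF` records could be reshaped into (prompt, chosen, rejected) triples for reward-model training (the splitting rule is an assumption about the dataset's dialogue format, not code from this repo):

```python
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")

def to_pair(example):
    # Each record is a full dialogue; the text after the last "Assistant:"
    # is the final answer, everything before it is the shared prompt.
    prompt, chosen = example["chosen"].rsplit("Assistant:", 1)
    _, rejected = example["rejected"].rsplit("Assistant:", 1)
    return {"prompt": prompt, "chosen": chosen.strip(), "rejected": rejected.strip()}

pairs = hh.map(to_pair)
print(pairs[0]["prompt"][:200])
```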


## ☕ Quick Start ☕

```bash
git clone https://github.com/jianzhnie/open-chatgpt.git
pip install -r requirements.txt
```

## Examples

### Example1: Learning to summarize with human feedback

#### Step1: Supervised Fine-Tuning (SFT)

First, we fine-tune a transformer model for text summarization on the [`TL;DR`](https://huggingface.co/datasets/CarperAI/openai_summarize_tldr) dataset.

This is relatively straightforward: load the dataset, tokenize it, and then train the model. The entire pipeline is built with Hugging Face libraries.

- Training with the Hugging Face Transformers `Trainer` API.

First, modify the `training_args` in `train_fintune_summarize.py` with your own parameters.

```shell
cd scripts/
python train_fintune_summarize.py
```

- Speeding up training with DeepSpeed.

First, add the `deepspeed` argument to the `TrainingArguments` in `train_fintune_summarize.py`.

```python
# Prepare the trainer and start training
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_steps=500,
    save_steps=1000,
    warmup_steps=100,
    learning_rate=1e-5,
    weight_decay=0.001,
    half_precision_backend='auto',  # let Transformers pick the fp16 backend
    fp16=True,
    adam_beta1=0.9,
    adam_beta2=0.95,
    fp16_opt_level='O2',  # mixed precision mode
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    save_strategy='steps',
    save_total_limit=5,
    evaluation_strategy='steps',
    eval_accumulation_steps=1,
    load_best_model_at_end=True,
    gradient_checkpointing=True,
    logging_steps=50,
    logging_dir='./logs',
    deepspeed='./ds_config_opt.json',
)
```

Then, run the following command to start training.

```shell
deepspeed train_fintune_summarize.py
```

The model is evaluated with the ROUGE score, and the best checkpoint is selected based on the average ROUGE score on the validation set. This model is used to initialize the reward model and is later fine-tuned further with PPO.

#### Step2: Training the Reward Model

Our reward model is trained on the [Comparisons dataset](https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons), a collection of human quality judgements. The dataset is downloaded automatically from the Hugging Face Hub.

We initialize the reward model from the SFT model and attach a randomly initialized linear head on top that outputs a scalar value.

Next, we will delve into how the data is fed to the model, the loss function, and other details of the reward model.

Use the following commands to train your reward model.

```shell
cd scripts/
python train_reward_model.py
```

#### Step3: Fine-Tuning the Model using PPO

We use [awesome-chatgpt-prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts) as the example dataset. It is a small dataset with hundreds of prompts.

```shell
python train_ppo_rlhf.py
```

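
For orientation, here is a heavily condensed sketch of what a single PPO iteration looks like, written against the `trl`-style API; the actual `train_ppo_rlhf.py` script may be organized differently, and the reward below is a hard-coded placeholder instead of the trained reward model:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import respond_to_batch

model_name = "gpt2"  # stand-in; a real run starts from the SFT checkpoint
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer.encode("Summarize: reinforcement learning is ...", return_tensors="pt")
response = respond_to_batch(model, query)   # sample a response from the current policy
reward = [torch.tensor(1.0)]                # placeholder for the reward model's score
stats = ppo_trainer.step([query[0]], [response[0]], reward)  # one PPO update
```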

+ + + +### Example2: Learning to generate positive sentiment with human feedback + +```shell +python gpt2-sentiment.py +``` + + +### Example3: StackLLaMA: Train LLaMA with RLHF on StackExchange + + +```shell +``` + +## Support Model + +
**LLM**

We support models that can be run efficiently with a limited amount of compute. The models currently supported, all with at most 20B parameters, are:

- GPT-J: 6B
- GPT-NeoX: 1.3B, 20B
- OPT: 125M, 359M, 1.3B, 2.7B, 6.7B, 13B
- BLOOM: 560M, 1.1B, 1.7B, 3B, 7.1B
- BLOOMZ: 560M, 1.1B, 1.7B, 3B, 7.1B

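
All of these are standard Hugging Face checkpoints, so switching the base model is typically just a matter of changing the model name passed to the training scripts. For example (using `facebook/opt-1.3b` as an arbitrary choice from the list above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # any checkpoint from the list above works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

n_params = sum(p.numel() for p in model.parameters())
print(f"{model_name}: {n_params / 1e9:.2f}B parameters")
```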
+ + + +## Contributing + +Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for instructions in contribution. + +## License + +`Openn-ChatGPT` is released under the Apache 2.0 license. From 7be26f4e08c3ce5de25a71e0a6968018524ee134 Mon Sep 17 00:00:00 2001 From: jianzhnie Date: Wed, 24 May 2023 10:40:55 +0800 Subject: [PATCH 2/2] Update README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index abd96c9..e05453b 100644 --- a/README.md +++ b/README.md @@ -60,7 +60,7 @@ If you like the project, please show your support by [leaving a star ⭐](https: ### Instruction Datasets -A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT,LLaMA,Alpaca). +A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT,LLaMA,Alpaca). Referring to [this](https://github.com/jianzhnie/awesome-instruction-datasets) ([@jianzhnie](https://github.com/jianzhnie)), we labeled each collected dataset according to the following rules: @@ -102,7 +102,7 @@ Here, we only list a small part of the instruction tuning dataset list, to find ### RLHF Datasets -Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. Follwing is a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources. +Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. Follwing is a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources. | Project | Org | Nums | Lang | Summary | | :----------------------------------------------------------: | :---------------------------: | ------ | :-----: | ------------------------------------------------------------ | @@ -222,7 +222,7 @@ torchrun --nproc_per_node=8 train_alpaca.py \ --deepspeed "scripts/ds_config_zero3_auto.json" ``` -- [LoRA](https://arxiv.org/abs/2106.09685) fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112GB to about 7x4=28GB. +- [LoRA](https://arxiv.org/abs/2106.09685) fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112GB to about 7x4=28GB. ### Fine-tuning Alpaca-7B with Lora @@ -265,7 +265,7 @@ Example usage: ```bash python generate_server.py \ --model_name_or_path decapoda-research/llama-7b-hf \ - --lora_model_name_or_path tloen/alpaca-lora-7b + --lora_model_name_or_path tloen/alpaca-lora-7b ``` ### No Enough Memory