Merge pull request #55 from jianzhnie/dev
Merge dev into main
jianzhnie authored Dec 31, 2024
2 parents 64b65e5 + 8953847 commit 2f835ec
Showing 25 changed files with 2 additions and 1,818 deletions.
88 changes: 0 additions & 88 deletions README.md
@@ -28,7 +28,6 @@ If you like the project, please show your support by [leaving a star ⭐](https:
## News

- [2023/05] 🔥 We implement **Stanford Alpaca Lora**.
- [2023/05] 🔥 We implement **Stanford Alpaca**.
- [2023/04] We released the **RLHF (Reinforcement Learning with Human Feedback) Pipeline**.
- [2023/03] We released the code for **OpenChatGPT: An Open-Source Library to train ChatBot like ChatGPT**.
@@ -39,11 +38,6 @@ If you like the project, please show your support by [leaving a star ⭐](https:
- [Introduction](#introduction)
- [News](#news)
- [Table of Contents](#table-of-contents)
- [Data Collection](#data-collection)
- [Instruction Datasets](#instruction-datasets)
- [RLHF Datasets](#rlhf-datasets)
- [Data Preprocessing](#data-preprocessing)
- [Data Formatting](#data-formatting)
- [Install](#install)
- [Instruction Finetune](#instruction-finetune)
- [Fine-tuning Alpaca-7B](#fine-tuning-alpaca-7b)
@@ -56,88 +50,6 @@ If you like the project, please show your support by [leaving a star ⭐](https:
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

## Data Collection

### Instruction Datasets

A collection of open-source instruction-tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca).

Referring to [this repository](https://github.com/jianzhnie/awesome-instruction-datasets) ([@jianzhnie](https://github.com/jianzhnie)), we labeled each collected dataset according to the following rules:

(Lang) Lingual tags:

- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages

(Task) Task tags:

- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks

(Gen) Generation method:

- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
- COL: [Collection of Datasets] Datasets assembled from a collection of other datasets

| Project | Datasets | Org | Nums | Lang | Task | Gen | Type | Src |
| :----------------------------------------------------------- | :----------------------------------------------------------: | -------------------------- | :------ | :---- | :---- | :--- | :---------------------------------------- | :---------------------------------- |
| [Chain of Thought](https://github.com/google-research/FLAN) | [cot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/niv2_few_shot_data) | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data |
| [GPT4all](https://github.com/nomic-ai/gpt4all) | [nomic-ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo |
| [GPTeacher](https://github.com/teknium1/GPTeacher) | [GPT-4 General-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Instruct)\|[Roleplay-Instruct](https://github.com/teknium1/GPTeacher/tree/main/Roleplay) \|[Code-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Codegen)\| [Toolformer](https://github.com/teknium1/GPTeacher/tree/main/Toolformer) | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer |
| [Guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | [JosephusCheung/GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | JosephusCheung | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 |
| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | [Hello-SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Hello-SimpleAI \| Wind Information | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT |
| [HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | [Hello-SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | Hello-SimpleAI \| Wind Information | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT |
| [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) | tatsu-lab | 52002 | EN | MT | SI | general instruct | text-davinci-003 |
| [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 |
| [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | [alpaca_data_zh_51k](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/data/alpaca_data_zh_51k.json) | ymcui (iFLYTEK) | 51k | CN | MT | SI | general instruct | text-davinci-003 |
| [Luotuo-Chinese-LLM](https://github.com/LC1332/Luotuo-Chinese-LLM) 骆驼 | [trans_chinese_alpaca_data](https://github.com/LC1332/Luotuo-Chinese-LLM/blob/main/data/trans_chinese_alpaca_data.json) | LC1332 (SenseTime) | 52k | CN | MT | SI | general instruct | text-davinci-003 |
| [Natural Instructions](https://github.com/allenai/natural-instructions) | [Allen AI 61 task](https://instructions.apps.allenai.org/#:~:text=Download%20Natural%2DInstructions%20%2D%20v1.1)\|[1.5k task](https://instructions.apps.allenai.org/#:~:text=Natural%2DInstructions%20%2D%20v2-,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection |
| [belle_cn](https://huggingface.co/BelleGroup) | [BelleGroup/train_1M_CN](https://huggingface.co/datasets/bellegroup/train_1M_CN) \|[BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/bellegroup/train_0.5M_CN) | BelleGroup (Lianjia) | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | |

Here we list only a small part of the instruction-tuning datasets. To find more datasets, please check out the following link:
[jianzhnie/awesome-instruction-datasets](https://github.com/jianzhnie/awesome-instruction-datasets): a collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).
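Most of the datasets above are hosted on the Hugging Face Hub, so they can be inspected with a few lines of Python. Below is a minimal sketch using the `datasets` library, taking `tatsu-lab/alpaca` from the table as the example; the field names follow that dataset's card.

```python
from datasets import load_dataset

# Fetch the Stanford Alpaca instruction-tuning data from the Hugging Face Hub.
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Each record pairs an instruction (plus an optional input) with a target output.
sample = dataset[0]
print(sample["instruction"])
print(sample["input"])   # may be an empty string
print(sample["output"])
```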

### RLHF Datasets

Instruction-tuning / Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. The following is a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.

| Project | Org | Nums | Lang | Summary |
| :----------------------------------------------------------: | :---------------------------: | ------ | :-----: | ------------------------------------------------------------ |
| [webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons) | OpenAI | 19,578 | English | In the [WebGPT paper](https://arxiv.org/abs/2112.09332), the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total. |
| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | stanfordnlp | 349 K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., [SteamSHP](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl)). |
| [rlhf-reward-datasets](https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets) | yitingxie | 76.3 k | English | |
| [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) | Dahoas | 112 k | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. |
| [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) | Dahoas | | English | |
| [Dahoas/rm-static](https://huggingface.co/datasets/Dahoas/rm-static) | Dahoas | 76.3k | English | Split of [hh-static](https://huggingface.co/datasets/Dahoas/static-hh) used for training reward models after supervised fine-tuning. |
| [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Anthropic | 22k | English | This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. |
| [Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | Instruction-Tuning-with-GPT-4 | 52k | English | Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" |
| [thu-coai/Safety-Prompts](https://github.com/thu-coai/Safety-Prompts) | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large language models, aligning model outputs with human values. |
| [Chatgpt-Comparison-Detection project](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection) | | | | |

To find more datasets, please check out the following link:
[jianzhnie/awesome-instruction-datasets](https://github.com/jianzhnie/awesome-instruction-datasets): a collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).
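To make the structure of these preference datasets concrete, here is a minimal sketch that loads `Anthropic/hh-rlhf` from the table above with the Hugging Face `datasets` library; per that dataset's card, each record holds a `chosen` and a `rejected` conversation sharing the same prompt.

```python
from datasets import load_dataset

# Load Anthropic's helpful/harmless comparison data.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

pair = dataset[0]
# A reward model is trained to score the chosen response above the rejected one.
print(pair["chosen"][:300])
print(pair["rejected"][:300])
```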

### Data Preprocessing

We have developed data preprocessing code that offers a unified interface for various large language models. It can be used to preprocess data for a variety of purposes, such as instruction tuning and RLHF modeling tasks. If you're interested in learning more, please check out the following links to our prompt dataset and data utilities; an illustrative sketch follows the links:

- [prompt_dataset.py](https://github.com/jianzhnie/open-chatgpt/blob/main/chatgpt/dataset/prompt_dataset.py)
- [data_utils.py](https://github.com/jianzhnie/open-chatgpt/blob/main/chatgpt/dataset/data_utils.py)
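As an illustration only, here is a minimal PyTorch-style sketch of what such a unified prompt-dataset interface can look like. All names below are hypothetical, not the repo's actual API; see `prompt_dataset.py` above for the real implementation.

```python
from torch.utils.data import Dataset


class PromptDataset(Dataset):
    """Hypothetical sketch of a unified prompt-dataset interface."""

    def __init__(self, samples, tokenizer, max_length=512):
        # `samples` is a list of {"instruction", "input", "output"} dicts.
        self.samples = samples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        # Join the instruction and optional input into a single prompt string.
        prompt = s["instruction"] + ("\n" + s["input"] if s["input"] else "")
        # Tokenize prompt and target together for supervised fine-tuning.
        return self.tokenizer(
            prompt,
            text_target=s["output"],
            truncation=True,
            max_length=self.max_length,
        )
```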

### Data Formatting

In our collection, all data has been formatted using the same template. Each sample has the following structure:

```
[
  {
    "instruction": "<instruction string>",
    "input": "<input string, may be empty>",
    "output": "<output string>"
  }
]
```
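A small helper (ours, not part of the repo) to sanity-check that a data file follows this template:

```python
import json


def validate_samples(path):
    """Raise if any record is missing the instruction/input/output keys."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    for i, s in enumerate(samples):
        missing = {"instruction", "input", "output"} - s.keys()
        if missing:
            raise ValueError(f"sample {i} is missing keys: {missing}")
    return len(samples)
```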

## Install

36 changes: 0 additions & 36 deletions configs/NVMe-Support/zero3.json

This file was deleted.

49 changes: 0 additions & 49 deletions configs/full_ds_config/ds_config_zero2.json

This file was deleted.

51 changes: 0 additions & 51 deletions configs/full_ds_config/ds_config_zero2_auto.json

This file was deleted.

58 changes: 0 additions & 58 deletions configs/full_ds_config/ds_config_zero3.json

This file was deleted.

59 changes: 0 additions & 59 deletions configs/full_ds_config/ds_config_zero3_auto.json

This file was deleted.

5 changes: 0 additions & 5 deletions configs/minimal_ds_config/zero0.json

This file was deleted.

5 changes: 0 additions & 5 deletions configs/minimal_ds_config/zero1.json

This file was deleted.

15 changes: 0 additions & 15 deletions configs/minimal_ds_config/zero2.json

This file was deleted.
