Update documentation #392

Merged (32 commits, Aug 21, 2021). Changes shown are from 1 commit.

Commits:
- `962eacc` Update README.md (StellaAthena, Jun 25, 2021)
- `50ea5cc` Update README.md (StellaAthena, Jun 25, 2021)
- `5018594` Update README.md (StellaAthena, Jun 25, 2021)
- `7a75cd4` Update README.md (StellaAthena, Jun 25, 2021)
- `c221109` Update README.md (StellaAthena, Jun 25, 2021)
- `8f43bac` Update README.md (StellaAthena, Jun 28, 2021)
- `3783d7f` Update README.md (StellaAthena, Jun 28, 2021)
- `f22769e` Update README.md (StellaAthena, Jun 28, 2021)
- `0b7d2fe` Update README.md (StellaAthena, Jul 7, 2021)
- `5f978c8` add info about installing fused kernels (ShivanshuPurohit, Jul 9, 2021)
- `20cadc3` Update README.md (ShivanshuPurohit, Jul 9, 2021)
- `2686396` Update README.md (ShivanshuPurohit, Jul 9, 2021)
- `94980dd` sparsity + minor typos (ShivanshuPurohit, Jul 9, 2021)
- `7d44d97` change path to ssd-1 (ShivanshuPurohit, Jul 10, 2021)
- `4333716` typo (ShivanshuPurohit, Jul 10, 2021)
- `05249ab` Update README.md (StellaAthena, Jul 10, 2021)
- `28a830e` Update README.md (StellaAthena, Jul 10, 2021)
- `ae00018` Update README.md (StellaAthena, Jul 10, 2021)
- `5245c6d` Update README.md (StellaAthena, Jul 10, 2021)
- `4c6469e` Update README.md (StellaAthena, Jul 10, 2021)
- `c695714` Update README.md (StellaAthena, Jul 10, 2021)
- `1cccfd2` Update README.md (StellaAthena, Jul 10, 2021)
- `486ed38` Update README.md (StellaAthena, Jul 10, 2021)
- `1e97f7d` Update README.md (ShivanshuPurohit, Jul 16, 2021)
- `a3d06bc` Update README.md (ShivanshuPurohit, Jul 24, 2021)
- `3e4f6d9` Merge pull request #380 from EleutherAI/main (StellaAthena, Jul 24, 2021)
- `173dfd4` Update README.md (StellaAthena, Jul 27, 2021)
- `91bb070` Update README.md (StellaAthena, Jul 27, 2021)
- `ff74c8a` Update README.md (StellaAthena, Jul 27, 2021)
- `74a6cdd` Update README.md (StellaAthena, Jul 30, 2021)
- `e84a344` Merge pull request #385 from EleutherAI/main (StellaAthena, Jul 30, 2021)
- `b6de20b` Update README.md (ShivanshuPurohit, Aug 21, 2021)
Commit c695714326d1313abd40197b0fda5c0caffcc3aa: Update README.md (authored by StellaAthena, Jul 10, 2021)
`README.md`: 139 changes (73 additions, 66 deletions)
If you're looking for our TPU codebase, see [GPT-Neo](https://github.com/EleutherAI/gpt-neo).

GPT-NeoX is under active development.
- [Why GPT-NeoX](#why-gpt-neox)
  * [Straightforward configuration](#straightforward-configuration)
  * [Diverse Modeling Options](#diverse-modeling-options)
  * [HuggingFace Integration](#huggingface-integration)
  * [Large Pretrained Models](#large-pretrained-models)
- [Quick Start](#quick-start)
  * [Getting Started](#getting-started)
  * [Configuration and Parameters](#configuration-and-parameters)
  * [Datasets](#datasets)
  * [Running the Code](#running-the-code)
- [Features](#features)
  * [3D Parallelism](#3d-parallelism)
  * [Model Structure](#model-structure)
  * [Optimizers](#optimizers)
  * [High-Precision Training:](#high-precision-training-)
- [Datasets](#datasets-1)
  * [Preconfigured Datasets](#preconfigured-datasets)
  * [Using Custom Data](#using-custom-data)
  * [Using and Training Tokenizers](#using-and-training-tokenizers)
- [Training and Finetuning](#training-and-finetuning)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Distilling](#distilling)
  * [Monitoring](#monitoring)
- [Placeholder Name](#placeholder-name)
  * [Citing GPT-NeoX](#citing-gpt-neox)
  * [Licensing](#licensing)
  * [Acknowledgements](#acknowledgements)

## Why GPT-NeoX

### Straightforward configuration

- Other libraries such as Megatron-LM require you to configure them using command-line arguments and global variables, which can be difficult to work with and iterate on. We offer straightforward configuration using `.yaml` files, which enables you to launch training runs across hundreds of GPUs with a single-line bash script (see the example below).
- Additionally, we hope to make data preparation easier for the user by providing scripts to automatically download and pretokenize a number of large-scale datasets.
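As a quick illustration of that single-line launch, here is a sketch (the config file names are the examples shipped in the `configs` folder):

```bash
# Sketch: one line launches a whole training run; everything else lives in the YAML configs.
# small.yml describes the model, local_setup.yml describes data paths and hardware.
# Scaling to more GPUs or nodes is a config change, not a longer command line.
./deepy.py train.py -d configs small.yml local_setup.yml
```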

### Diverse Modeling Options

### HuggingFace Integration

### Large Pretrained Models

## Quick Start

**Coming Soon:** a Colab notebook for trying out the model.

### Getting Started

**Warning:** Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before installing from `requirements/requirements.txt`. Failure to do so may cause other repositories that rely on DeepSpeed to break.

First make sure you are in an environment with Python 3.8 or later and `torch>=1.8` installed. Then run `pip install -r requirements/requirements.txt`.
You may need to change the version of `cupy-cudaxxx` to match your machine's CUDA version.
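For example, one way to set up an isolated environment is sketched below (it assumes conda and CUDA 11.1; adjust versions to your machine):

```bash
# Sketch of an isolated environment; package versions are illustrative.
conda create -n gpt-neox python=3.8 -y
conda activate gpt-neox
pip install "torch>=1.8"
pip install -r requirements/requirements.txt
# If the pinned cupy-cudaxxx wheel does not match your CUDA version,
# install the matching one instead, e.g. for CUDA 11.1:
# pip install cupy-cuda111
```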

Some features rely on apex, which you can install with the command below:

```bash
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690
```

We also host a Docker image on Docker Hub at `leogao2/gpt-neox`, which enables easy multi-node training.
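Pulling and entering the image might look like this (a sketch; the flags are illustrative and assume the NVIDIA container toolkit is installed):

```bash
# Sketch: fetch the image and start an interactive shell with all GPUs visible.
docker pull leogao2/gpt-neox
docker run --gpus all -it --rm leogao2/gpt-neox bash
```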

### Configuration and Parameters

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the `deepy.py` launcher; for examples, see the `configs` folder.

For a full list of parameters and documentation see the [configuration readme](configs).

### Datasets

For a quick overview of the datasets we provide and how to prepare them, see the [Datasets](#datasets-1) section below.

### Running the Code

All functionality follows the pattern `./deepy.py main_function.py -d configs small.yml local_configs.yml`. By default we split our configs into one file about the model and one file about the hardware, but you may use any number of configuration files you like.
We currently offer four main functions:
1. `train.py` is used for training and finetuning models
2. `evaluate.py` is used to evaluate a trained model using the evaluation harness
3. `generate.py` is used to sample text from a model.
4. `distill.py` is used to distill a larger model into a smaller model.

For information on the required arguments for each function, see the corresponding section below.
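As a combined sketch of the four entry points (the config file names are the examples from the `configs` folder; the exact arguments for each function may differ, see the sections below):

```bash
# Sketch: every main function goes through the deepy.py launcher with your config files.
./deepy.py train.py -d configs small.yml local_setup.yml      # training and finetuning
./deepy.py evaluate.py -d configs small.yml local_setup.yml   # evaluation harness
./deepy.py generate.py -d configs small.yml local_setup.yml   # text sampling
./deepy.py distill.py -d configs small.yml local_setup.yml    # distillation
```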

## Features

GPT-NeoX offers a wide variety of state-of-the-art and bespoke features
- Due to a known issue with `PyTorch`, `bf16` models require doing the all-reduce operation in `fp32`. If you have a patch for this problem, you can turn off the default setting `"fp32_allreduce": True`.
- Additionally, you have to run `python /home/$USER/gpt-neox/megatron/fused_kernels/setup.py install` to be able to use bf16 (may require root access).

## Datasets

Once you've installed all the requirements and set up your model configuration, the next step is obtaining and preprocessing your dataset.

### Preconfigured Datasets

For demonstrative purposes we've hosted the Enron Emails corpus and made it available for downloading. Running `python prepare_data.py` will download the tokenizer files and dataset, pretokenize the dataset, and save it into a folder named `./data`.

In the future we will also be adding a single command to preprocess our 800GB language modelling dataset, [The Pile](https://arxiv.org/abs/2101.00027), and all its constituent datasets.
Next, make sure to download the GPT2 tokenizer vocab and merge files from the following links (for example, with the commands shown after this list):
- Vocab: https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
- Merge: https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
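A minimal way to fetch both files (a sketch; downloading into `data/` is an assumption, use whatever directory your config points at):

```bash
# Sketch: download the GPT2 vocab and merges files into ./data
wget -P data https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget -P data https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
```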

### Using Custom Data

### Using and Training Tokenizers

We plan to integrate HuggingFace's `Tokenizers` library soon to make this process smoother.

You can now pretokenize your data using `tools/preprocess_data.py`.
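A pretokenization call might look roughly like the following. This is a sketch only: the flag names are assumptions modeled on Megatron-style preprocessing scripts, so check `python tools/preprocess_data.py --help` for the actual interface.

```bash
# Sketch: convert a JSONL corpus into a pretokenized dataset under data/mydataset.
# Flag names are assumptions; verify them against the script's --help output.
python tools/preprocess_data.py \
  --input data/mydataset.jsonl \
  --output-prefix data/mydataset/mydataset \
  --vocab data/gpt2-vocab.json \
  --merge-file data/gpt2-merges.txt \
  --tokenizer-type GPT2BPETokenizer \
  --append-eod
```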
You would then run training with the following settings added to your configuration file:

```yaml
  "data-path": "data/mydataset/mydataset",
```

## Training and Finetuning

Training is launched using `deepy.py`, a wrapper around DeepSpeed's launcher, which launches the same script in parallel across many GPUs/nodes.

This will deploy the `pretrain_gpt2.py` script on all nodes with one process per GPU.
* Model parameters are defined in the config file `configs/small.yml`.
* Data path parameters are defined in the config file `configs/local_setup.yml`. If you are an EleutherAI member and using the [Kubernetes cluster](kubernetes), the `eleutherai_cluster.yml` config should be instead.


## Inference

[WIP]

## Evaluation

GPT-NeoX supports evaluation on downstream tasks through the [language model evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).

To evaluate a trained model on the evaluation harness, use `./deepy.py evaluate.py configs/your_config.yml`.

## Distilling

[WIP]


### Monitoring

EleutherAI is currently using [Weights & Biases to record experiments](https://wandb.ai/eleutherai/neox). If you are logged into Weights & Biases on your machine (you can do this by executing `wandb login`), your runs will automatically be recorded. Additionally, set the config parameter `wandb_team` if you would like the run to be added to an organisation/team account.

We also support using TensorBoard via the `tensorboard-dir` argument. To use TensorBoard, install the optional packages found in `requirements/requirements-tensorboard.txt`.
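Both pieces are optional; a sketch of the setup (it assumes you already have a Weights & Biases account):

```bash
# Sketch: enable experiment tracking and install the optional TensorBoard extras.
wandb login
pip install -r requirements/requirements-tensorboard.txt
```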

## Placeholder Name

### Citing GPT-NeoX

If you have found GPT-NeoX helpful in your work, you can cite this repository as:


In the above BibTeX entry, names are in alphabetical order, and the year corresponds to the project's open-source release.

### Licensing

This repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright (c) 2021, EleutherAI contributors (in alphabetical order): Alex Andonian, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, Samuel Weinbach. Licensed under the Apache License:

This repository is based on code written by NVIDIA that is licensed under the Apache License, Version 2.0.

For full terms, see the `LICENSE` file. If you have any questions, comments, or concerns about licensing please email us at contact@eleuther.ai.

### Acknowledgements

We run our experiments on a Kubernetes cluster generously provided by [CoreWeave](https://coreweave.com/).

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>