Update documentation (#392)
* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add info about installing fused kernels

* Update README.md

* Update README.md

* sparsity + minor typos

add the instructions to install triton

* change path to ssd-1

* typo

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Shivanshu Purohit <42869065+ShivanshuPurohit@users.noreply.github.com>
StellaAthena and ShivanshuPurohit authored Aug 21, 2021
1 parent 88b84a2 commit 1d46283
Showing 2 changed files with 114 additions and 56 deletions.
150 changes: 104 additions & 46 deletions README.md
@@ -7,13 +7,79 @@ This repository records [EleutherAI](www.eleuther.ai)'s work-in-progress for tra

We aim to make this repo a centralized and accessible place to gather techniques for training large scale autoregressive language models, and accelerate research into large scale training. Additionally, we hope to train and open source a 175B parameter GPT3 replication along the way.

For more info on our progress, please [join our discord](https://discord.gg/zBGx3azzUn) and head to the `#gpt-neo` channel. We're working with cloud compute provider [Coreweave](https://www.coreweave.com/) for training, and hope to release the weights of smaller models as we progress up to 175B parameters.
If you are interested in contributing, please [join our discord](https://discord.gg/zBGx3azzUn) and head to the `#gpt-neo` channel. We're working with cloud compute provider [Coreweave](https://www.coreweave.com/) for training, and hope to release the weights of smaller models as we progress up to 175B parameters.

If you're looking for our TPU codebase, see [GPT-Neo](https://github.com/EleutherAI/gpt-neo).

GPT-NeoX is under active development.
- [GPT-NeoX](#gpt-neox)
* [Why GPT-NeoX](#why-gpt-neox)
* [Quick Start](#quick-start)
* [Features](#features)
+ [3D Parallelism](#3d-parallelism)
+ [Model Structure](#model-structure)
+ [Optimizers](#optimizers)
+ [High-Precision Training](#high-precision-training)
* [Datasets](#datasets)
+ [Preconfigured Datasets](#preconfigured-datasets)
+ [Using Custom Data](#using-custom-data)
+ [Using and Training Tokenizers](#using-and-training-tokenizers)
* [Training and Finetuning](#training-and-finetuning)
* [Inference](#inference)
* [Evaluation](#evaluation)
* [Distilling](#distilling)
* [Monitoring](#monitoring)
+ [WandB](#wandb)
+ [Tensorboard](#tensorboard)
* [Placeholder Name](#placeholder-name)
+ [Citing GPT-NeoX](#citing-gpt-neox)
+ [Licensing](#licensing)
+ [Acknowledgements](#acknowledgements)

## Features:
## Why GPT-NeoX

**Straightforward configuration:** Other libraries such as Megatron-LM require you to configure them using command line arguments and global variables, which can often be difficult to work with and iterate upon. We offer straightforward configuration using .yaml files, which enables you to launch training runs across hundreds of GPUs with a single-line bash script. Additionally, we hope to make data preparation easier on the user by providing scripts to automatically download and pretokenize a number of large-scale datasets.
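To give a flavor of the format (the field names below are illustrative rather than a complete or authoritative config; see the `configs` folder for maintained examples), a model definition looks roughly like:

```yaml
# Illustrative excerpt only -- consult the configs folder for real, complete examples.
{
   "num-layers": 12,
   "hidden-size": 768,
   "num-attention-heads": 12,
   "seq-length": 2048,
   "max-position-embeddings": 2048
}
```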

**Diverse Modeling Options:** We provide a wide collection of options for constructing your model.

**HuggingFace Integration:** Our code is designed to work with the HuggingFace `transformers` library. All models trained using this codebase can be uploaded to a custom HuggingFace class with ease, and all HuggingFace tokenizers and datasets can be used to train models.

**Large Pretrained Models:** We offer several large, pretrained models to iterate on. For people who are unable to train billion parameter scale models themselves, this framework allows you to easily interact with models that we have released.

## Quick Start

**Google Colab**

Coming soon: a colab notebook for trying out the model.

**Warning:** Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before installing from `requirements/requirements.txt`. Failure to do so may cause other repositories that rely on DeepSpeed to break.

First make sure you are in an environment with Python 3.8 or later and `torch>=1.8` installed. Then run `pip install -r requirements/requirements.txt`.
You may need to change the version of `cupy-cudaxxx` to match your machine's cuda version.

Some features rely on apex, which you can install with the command below:

```bash
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690
```

We also host a Docker Image on Dockerhub at `leogao2/gpt-neox`, which enables easy multi-node training.

Once you've installed all the requirements and set up your model configuration, the next step is obtaining and preprocessing your dataset. We provide a data processing library that is easily interfaced with via the script `prepare_data.py`. Calling `python prepare_data.py enron -t CharLevelTokenizer -d ./data/` will download the dataset `enron`, tokenize it with a character-level tokenizer, and save the results to `./data/`.

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the `deepy.py` launcher. We provide baseline examples for the models found in the paper [Language Models are Few Shot Learners](https://arxiv.org/abs/2005.14165). Configs such as file locations that are dependent on your particular system go in `local_configs.yml`. We have filled it out with some placeholder examples, but you will need to update this for your system.
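For instance, a minimal system-specific config might contain little more than paths. The values below are placeholders; the same fields appear in `configs/eleutherai_cluster.yml` further down:

```yaml
# Placeholder paths -- point these at your own data and checkpoint locations.
{
   "data-path": "./data/enron/enron_text_document",
   "vocab-file": "./data/gpt2-vocab.json",
   "merge-file": "./data/gpt2-merges.txt",
   "save": "./checkpoints",
   "load": "./checkpoints"
}
```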

All functionality follows the pattern `./deepy.py main_function.py -d configs small.yml local_configs.yml`
We currently offer four main functions:
1. `train.py` is used for training and finetuning models.
2. `evaluate.py` is used to evaluate a trained model using the evaluation harness.
3. `generate.py` is used to sample text from a trained model.
4. `distill.py` is used to distill a trained model into another model.

For now, run `./deepy.py train.py -d configs small.yml local_configs.yml` to begin training a model and complete this tutorial.

## Features

GPT-NeoX offers a wide variety of state-of-the-art and bespoke features.

### 3D Parallelism

@@ -23,55 +89,35 @@ GPT-NeoX is under active development.

- **Positional Encodings:**

- Choose between T5 RPE style positional encodings, a learned encoding added to the input (GPT2-style), Sinusoidal positional encoding, [rotary positional encodings](https://arxiv.org/abs/2104.09864), and no positional encodings at all (which [recent](https://arxiv.org/abs/1905.04226) [research](https://arxiv.org/abs/2102.11174) has found to even outperform other positional encodings in autoregressive models).
- Choose between T5-style relative positional encodings, a learned encoding added to the input (GPT2-style), sinusoidal positional encoding, [rotary positional encodings](https://arxiv.org/abs/2104.09864), and no positional encodings at all (which [recent](https://arxiv.org/abs/1905.04226) [research](https://arxiv.org/abs/2102.11174) has found to be competitive with other positional encodings in autoregressive models). Use the `pos-emb` field to select a positional encoding; a combined example follows this feature list.

- **Sparsity:**

- Deepspeed's sparse attention kernels are supported, but don't work with cuda 11.0+, and require a specific hardware setup (V100s/RTX2080s). add `"sparsity": "all"` to your config to use sparse attention on all layers, or `"sparsity": "interspersed"` to use it every other layer.
- Deepspeed's sparse attention kernels are supported, but don't work with CUDA 11.0+, and require a specific hardware setup (V100s/RTX2080s/A100s). Add `"sparsity": "all"` to your config file to use sparse attention on all layers, or `"sparsity": "interspersed"` to use it every other layer. To use sparsity, first run `pip install -r requirements/requirements-sparseattention.txt` to install triton.

- **Norms:**

- A [recent Google paper](https://arxiv.org/abs/2102.11972) has shown layernorm may not be the best option for transformer models.
We offer a choice of layernorm, scalenorm and RMSNorm easily configured by changing a single line in your config file.
- We offer a choice of layernorm, scalenorm, RMSNorm, and a custom layernorm kernel. Use the `norm` field to select a normalization.
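As a rough sketch, the three model-structure options above might be set together as follows. The values are illustrative; check the configuration readme for the accepted values on your version:

```yaml
# Illustrative model-structure settings -- verify accepted values in the configuration readme.
{
   "pos-emb": "rotary",        # e.g. "learned", "sinusoidal", or "none"
   "norm": "layernorm",        # e.g. "scalenorm" or "rmsnorm"
   "sparsity": "interspersed"  # or "all"; requires the sparse-attention requirements above
}
```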

### Optimizers

- NeoX supports Adam, CPUAdam, 1-Bit Adam, SM3 and madgrad_wd optimizers, as well as Deepspeed's [Zero Redundancy Optimizer](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer).
- NeoX supports Adam, CPUAdam, 1-Bit Adam, SM3 and madgrad_wd optimizers, as well as Deepspeed's [Zero Redundancy Optimizer](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer). Use the `optimizer` and (if applicable) `zero_optimization` fields to configure your optimizer, as sketched after this list.

- **Zero Redundancy Optimizer (ZeRO):**

- ZeRO stage 1 works seamlessly with NeoX, while ZeRO stage 2 requires pipeline parallelism be set to 0. We are additionally working on integrating ZeRO 3 into the codebase.
Turning on ZeRO is as simple as adding one field to your configuration file.
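A rough sketch of the two fields described above (hyperparameter values are placeholders; see the configuration readme and DeepSpeed's documentation for the full set of options):

```yaml
# Placeholder optimizer and ZeRO settings -- tune the values for your own run.
{
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0006,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8
     }
   },
   # ZeRO stage 1; stage 2 additionally requires pipeline parallelism to be set to 0.
   "zero_optimization": {
     "stage": 1
   }
}
```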

### Straightforward configuration

- Other libraries such as Megatron-LM require you configure them using command line arguments and global variables, which can often be difficult to work with and iterate upon. We offer straightforward configuration using .yaml files, which enables you to launch training runs across 100s of GPUs with a single line bash script.
- Additionally, we hope to make data preparation easier on the user by providing scripts to automatically download and pretokenize a number of large-scale datasets.

## Getting Started

Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before installing from `requirements/requirements.txt`. Failure to do so may cause other repositories that rely on DeepSpeed to break. Python 3.8 or later is required.

First make sure you are in an environment with `torch>=1.8` installed. Then run `pip install -r requirements/requirements.txt`.
You may need to change the version of `cupy-cudaxxx` to match your machine's cuda version.

Finally, certain features rely on apex, which you can install with the command below:

```bash
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690
```

We also host a Docker Image on Dockerhub at `leogao2/gpt-neox`, which enables easy multi-node training.

### Configuration and parameters
### High-Precision Training

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the `deepy.py` launcher - for examples see the `configs` folder.
- Choose between `fp16`, `bf16`, and `fp32` operations to get the most performance out of your available compute. Use the `precision` field to configure your precision settings, e.g. by adding `"type": "bfloat16"` to the config; a sketch follows this list.
- Due to a known issue with `PyTorch`, `bf16` models require doing the all-reduce operation in `fp32`. If you have a patch for this problem, you can turn off the default `"fp32_allreduce": True`.
- Additionally, you have to run `python ./megatron/fused_kernels/setup.py install` (assuming you're inside `gpt-neox/`) to be able to use bf16 (may require root access).
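The sketch below shows one plausible shape for these settings, pieced together only from the field names mentioned above; treat it as an assumption and verify the exact schema in the configuration readme:

```yaml
# Assumed shape only -- confirm the exact schema in the configuration readme.
{
   "precision": {
     "type": "bfloat16"
   },
   # bf16 all-reduce runs in fp32 by default due to a known PyTorch issue;
   # only disable this if you have a patch for that problem.
   "fp32_allreduce": true
}
```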

For a full list of parameters and documentation see the [configuration readme](configs).
## Datasets

### Datasets

Once you've installed all the requirements and set up your model configuration, the next step is obtaining and preprocessing your dataset.
### Preconfigured Datasets

For demonstrative purposes we've hosted the Enron Emails corpus and made it available for downloading. Running `python prepare_data.py` will download the tokenizer files and dataset, pretokenize the dataset, and save it into a folder named `./data`.

@@ -85,7 +131,9 @@ Next make sure to download the GPT2 tokenizer vocab, and merge files from the fo
- Vocab: https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
- Merge: https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

We plan to integrate HuggingFace's `Tokenizers` library soon to make this process smoother.
### Using Custom Data

### Using and Training Tokenizers

You can now pretokenize your data using `tools/preprocess_data.py`.

@@ -142,14 +190,14 @@ You would then run training with the following settings added to your configurat
"data-path": "data/mydataset/mydataset",
```
### Training
## Training and Finetuning
Training is launched using `deepy.py`, a wrapper around Deepspeed's launcher, which launches the same script in parallel across many GPUs / nodes.

The general usage pattern is:

```bash
./deepy.py [TRAINING_SCRIPT] [path/to/config1.yml] [path/to/config2/yml] ...
./deepy.py [TRAINING_SCRIPT] [path/to/config1.yml] [path/to/config2.yml] ...
```

You can pass in an arbitrary number of configs which will all be merged at runtime.
@@ -166,28 +214,35 @@ This will deploy the `pretrain_gpt2.py` script on all nodes with one process per
* Model parameters are defined in the config file `configs/small.yml`.
* Data path parameters are defined in the config file `configs/local_setup.yml`. If you are an EleutherAI member and using the [Kubernetes cluster](kubernetes), the `eleutherai_cluster.yml` config should be used instead.

## Monitoring

EleutherAI is currently using [Weights & Biases to record experiments](https://wandb.ai/eleutherai/neox). If you are logged into Weights & Biases on your machine - you can do this by executing `wandb login` - your runs will automatically be recorded. Additionally, set the config parameter `wandb_team` if you would like the run to be added to an organisation/team account.

We also support using Tensorboard via the `tensorboard-dir` argument. To use tensorboard, install the optional packages found at `requirements/requirements-tensorboard.txt`

## Inference

[WIP]

## Evaluation

[WIP]
GPT-NeoX supports evaluation on downstream tasks through the [language model evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).

To evaluate a trained model on the evaluation harness, use `./deepy.py evaluate.py configs/your_config.yml`

## Distilling

[WIP]

## Citing GPT-NeoX

## Monitoring

### WandB

EleutherAI is currently using [Weights & Biases to record experiments](https://wandb.ai/eleutherai/neox). If you are logged into Weights & Biases on your machine (you can do this by executing `wandb login`), your runs will automatically be recorded. Additionally, set the config parameter `wandb_team` if you would like the run to be added to an organisation/team account.

### Citing
### Tensorboard

We also support using Tensorboard via the `tensorboard-dir` argument. To use Tensorboard, install the optional packages found at `requirements/requirements-tensorboard.txt`.
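Both monitoring options come down to a couple of config fields, shown here with placeholder values (the same fields appear in `configs/eleutherai_cluster.yml` below):

```yaml
# Placeholder monitoring settings -- substitute your own team name and directory.
{
   "wandb_team": "my-team-name",
   "tensorboard-dir": "./tensorboard"
}
```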

## Placeholder Name

### Citing GPT-NeoX

If you have found GPT-Neo helpful in your work, you can cite this repository as

@@ -202,7 +257,7 @@ If you have found GPT-Neo helpful in your work, you can cite this repository as
In the above bibtex entry, names are in alphabetical order, and the year corresponds to the project's open-source release.
## Licensing
### Licensing
This repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright (c) 2021, EleutherAI contributors (in alphabetical order): Alex Andonian, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, Samuel Weinbach. Licensed under the Apache License:
@@ -222,6 +277,9 @@ This repository is based off code written by NVIDIA that is licensed under the A
For full terms, see the `LICENSE` file. If you have any questions, comments, or concerns about licensing please email us at contact@eleuther.ai.
## Acknowledgements
### Acknowledgements
We run our experiments on a Kubernetes cluster generously provided by [CoreWeave](https://coreweave.com/).
<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>
20 changes: 10 additions & 10 deletions configs/eleutherai_cluster.yml
@@ -1,19 +1,19 @@
# Data paths and options when using EleutherAI cluster
{
"data-path": "/mnt/ssd-cluster/data/enron/enron_text_document",
"data-path": "/mnt/ssd-1/data/enron/enron_text_document",
# or for weighted datasets:
# "train-data-paths": ["/mnt/ssd-cluster/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "test-data-paths": ["/mnt/ssd-cluster/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "valid-data-paths": ["/mnt/ssd-cluster/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "train-data-paths": ["/mnt/ssd-1/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "test-data-paths": ["/mnt/ssd-1/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "valid-data-paths": ["/mnt/ssd-1/data/enron/enron_text_document", "/mnt/ssd-cluster/data/enron/enron_text_document"],
# "train-data-weights": [1., 2.],
# "test-data-weights": [2., 1.],
# "valid-data-weights": [0.5, 0.4],

"vocab-file": "/mnt/ssd-cluster/data/gpt2-vocab.json",
"merge-file": "/mnt/ssd-cluster/data/gpt2-merges.txt",
"save": "/mnt/ssd-cluster/checkpoints",
"load": "/mnt/ssd-cluster/checkpoints",
"tensorboard-dir": "/mnt/ssd-cluster/tensorboard",
"log-dir": "/mnt/ssd-cluster/logs",
"vocab-file": "/mnt/ssd-1/data/gpt2-vocab.json",
"merge-file": "/mnt/ssd-1/data/gpt2-merges.txt",
"save": "/mnt/ssd-1/checkpoints",
"load": "/mnt/ssd-1/checkpoints",
"tensorboard-dir": "/mnt/ssd-1/tensorboard",
"log-dir": "/mnt/ssd-1/logs",
"wandb_team": "eleutherai",
}
