Commit

Add notebook text classification (#88)
* wip

* added guide and notebook

* Update docs/source/tutorials/fine_tune_bert.mdx

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update docs/source/tutorials/fine_tune_bert.mdx

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/README.md

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/text-classification/scripts/train.py

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/text-classification/scripts/train.py

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

---------

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
philschmid and michaelbenayoun authored Jun 5, 2023
1 parent 1fa29a6 commit 872a38e
Showing 4 changed files with 683 additions and 1 deletion.
208 changes: 207 additions & 1 deletion docs/source/tutorials/fine_tune_bert.mdx
@@ -16,4 +16,210 @@ limitations under the License.

# Fine-tune BERT for Text Classification on AWS Trainium

This tutorial will help you to get started with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and Hugging Face Transformers. It will cover how to set up a Trainium instance on AWS, and how to load and fine-tune a Transformers model for text classification.

You will learn how to:

1. [Setup AWS environment](#1-setup-aws-environment)
2. [Load and process the dataset](#2-load-and-process-the-dataset)
3. [Fine-tune BERT using Hugging Face Transformers and Optimum Neuron](#3-fine-tune-bert-using-hugging-face-transformers)

Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments.

## Quick intro: AWS Trainium

[AWS Trainium (Trn1)](https://aws.amazon.com/de/ec2/instance-types/trn1/) is a purpose-built EC2 instance family for deep learning (DL) training workloads. Trainium is the successor of [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf1/?nc1=h_ls), focused on high-performance training workloads, and claims up to 50% cost-to-train savings over comparable GPU-based instances.

Trainium has been optimized for training natural language processing, computer vision, and recommender models. The accelerator supports a wide range of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8.

The biggest Trainium instance, the `trn1.32xlarge`, comes with over 500GB of accelerator memory, making it easy to fine-tune ~10B-parameter models on a single instance. Below you will find an overview of the available instance types. More details [here](https://aws.amazon.com/de/ec2/instance-types/trn1/#Product_details):

| instance size                 | accelerators | accelerator memory (GB) | vCPUs | CPU memory (GB) | price per hour |
| ----------------------------- | ------------ | ----------------------- | ----- | --------------- | -------------- |
| trn1.2xlarge                  | 1            | 32                      | 8     | 32              | $1.34          |
| trn1.32xlarge                 | 16           | 512                     | 128   | 512             | $21.50         |
| trn1n.32xlarge (2x bandwidth) | 16           | 512                     | 128   | 512             | $24.78         |

---

Now that we know what Trainium offers, let's get started. 🚀

_Note: This tutorial was created on a trn1.2xlarge AWS EC2 Instance._

## 1. Setup AWS environment

In this example, we will use the `trn1.2xlarge` instance on AWS with one Trainium accelerator (two Neuron cores) and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).

This tutorial doesn't cover how to create the instance in detail. You can check out the previous blog post about [“Setting up AWS Trainium for Hugging Face Transformers”](https://www.philschmid.de/setup-aws-trainium), which includes a step-by-step guide on setting up the environment.

Once the instance is up and running, we can ssh into it. But instead of developing inside a terminal, we want to use a `Jupyter` environment, which we can use for preparing our dataset and launching the training. For this, we need to add port forwarding to the `ssh` command, which will tunnel our localhost traffic to the Trainium instance.

```bash
PUBLIC_DNS="" # IP address, e.g. ec2-3-80-....
KEY_PATH="" # local path to key, e.g. ssh/trn.pem

ssh -L 8080:localhost:8080 -i ${KEY_NAME}.pem ubuntu@$PUBLIC_DNS
```

We can now start our **`jupyter`** server.

```bash
python -m notebook --allow-root --port=8080
```

You should see a familiar **`jupyter`** output with a URL to the notebook.

**`http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9`**

We can click on it, and a **`jupyter`** environment opens in our local browser.

![jupyter.webp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/neuron/tutorial-fine-tune-bert-jupyter.png)

We are going to use the Jupyter environment only for preparing the dataset, and then `torchrun` for launching our training script on both Neuron cores for distributed training. Let's create a new notebook and get started.
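
Optionally, you can run a quick sanity check from the new notebook to confirm that a Neuron device is reachable. This is a minimal sketch; it assumes `torch-neuronx` (which brings `torch_xla`) is installed, as it is on the Hugging Face Neuron AMI.

```python
# Optional sanity check: confirm an XLA (Neuron) device is reachable from the notebook.
# Assumes torch-neuronx / torch_xla are installed (they come with the Hugging Face Neuron AMI).
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # e.g. xla:0
```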

## 2. Load and process the dataset

We are training a text classification model on the [emotion](https://huggingface.co/datasets/philschmid/emotion) dataset to keep the example straightforward. `emotion` is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.

We will use the `load_dataset()` method from the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the `emotion` dataset.

```python
from datasets import load_dataset

# Dataset id from huggingface.co/dataset
dataset_id = "philschmid/emotion"

# Load raw dataset
raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")
# Train dataset size: 16000
# Test dataset size: 2000
```
Let’s check out an example of the dataset.
```python
from random import randrange
random_id = randrange(len(raw_dataset['train']))
raw_dataset['train'][random_id]
# {'text': 'i feel isolated and alone in my trade', 'label': 0}
```
We must convert our "Natural Language" to token IDs to train our model. This is done by a tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). If you want to learn more about this, check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=pt) of the [Hugging Face Course](https://huggingface.co/course/chapter1/1).

Our Neuron Accelerator expects a fixed shape of inputs. We need to truncate or pad all samples to the same length.

```python
from transformers import AutoTokenizer
import os

# Model id to load the tokenizer
model_id = "bert-base-uncased"
save_dataset_path = "lm_dataset"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, return_tensors="pt")

# Tokenize dataset
raw_dataset = raw_dataset.rename_column("label", "labels")  # to match Trainer
tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized_dataset = tokenized_dataset.with_format("torch")

# Save dataset to disk
tokenized_dataset["train"].save_to_disk(os.path.join(save_dataset_path, "train"))
tokenized_dataset["test"].save_to_disk(os.path.join(save_dataset_path, "eval"))
```
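
The training script in the next section expects `num_labels`, `label2id`, and `id2label` for the classification head. Below is a minimal sketch of how these could be derived from the dataset's `ClassLabel` feature; the variable names are illustrative, and the full `train.py` builds them itself.

```python
# Derive the label mappings from the dataset's ClassLabel feature
# (illustrative sketch; the full train.py constructs these on its own).
labels = raw_dataset["train"].features["labels"].names  # e.g. ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
num_labels = len(labels)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for idx, label in enumerate(labels)}
```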

## 3. Fine-tune BERT using Hugging Face Transformers

Normally you would use the [Trainer](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.Trainer) and [TrainingArguments](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments) to fine-tune PyTorch-based transformer models.

But together with AWS, we have developed a `TrainiumTrainer` to improve performance, robustness, and safety when training on Trainium instances. The `TrainiumTrainer` also comes with a [model cache](https://www.notion.so/Getting-started-with-AWS-Trainium-and-Hugging-Face-Transformers-8428c72556194aed9c393de101229dcf), which allows us to use precompiled models and configurations from the Hugging Face Hub to skip the compilation step that would otherwise be needed at the beginning of training. This can reduce the training time by ~3x.

The `TrainiumTrainer` is part of the `optimum-neuron` library and can be used as a 1-to-1 replacement for the `Trainer`. You only have to adjust the import in your training script.

```diff
- from transformers import Trainer
+ from optimum.neuron import TrainiumTrainer as Trainer
```

We prepared a simple [train.py](https://github.com/philschmid/aws-neuron-samples/blob/main/training/scripts/train.py) training script based on the ["Getting started with Pytorch 2.0 and Hugging Face Transformers”](https://www.philschmid.de/getting-started-pytorch-2-0-transformers#3-fine-tune--evaluate-bert-model-with-the-hugging-face-trainer) blog post with the `TrainiumTrainer`. Below is an excerpt:

```python
import os

from datasets import load_from_disk
from transformers import AutoModelForSequenceClassification, TrainingArguments
from optimum.neuron import TrainiumTrainer as Trainer


def parse_args():
    ...


def training_function(args):
    # Load dataset from disk and tokenizer
    train_dataset = load_from_disk(os.path.join(args.dataset_path, "train"))
    ...

    # Download the model from huggingface.co/models
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
    )

    training_args = TrainingArguments(
        ...
    )

    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    # Start training
    trainer.train()
```
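
The excerpt references a `compute_metrics` helper that is defined in the full script. A minimal sketch of what such a helper could look like, assuming the `evaluate` library and a macro-averaged F1 (check `train.py` for the actual implementation):

```python
# Illustrative compute_metrics sketch (the real one lives in train.py).
# Assumes the `evaluate` library is installed.
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # Macro-averaged F1 over the six emotion classes
    return f1_metric.compute(predictions=predictions, references=labels, average="macro")
```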

We can load the training script into our environment using the `wget` command or manually copy it into the notebook from [here](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/scripts/train.py).

```python
!wget https://raw.githubusercontent.com/huggingface/optimum-neuron/main/notebooks/text-classification/scripts/train.py
```

We will use `torchrun` to launch our training script on both Neuron cores for distributed training. `torchrun` is a tool that automatically distributes a PyTorch model across multiple accelerators. We can pass the number of accelerators via the `--nproc_per_node` argument alongside our hyperparameters.

We'll use the following command to launch training:

```bash
!torchrun --nproc_per_node=2 train.py \
--model_id bert-base-uncased \
--dataset_path lm_dataset \
--lr 5e-5 \
--per_device_train_batch_size 16 \
--bf16 True \
--epochs 3
```

_**Note**: If you see unexpectedly bad accuracy, you might want to deactivate `bf16` for now._

After 9 minutes, the training completed, achieving an excellent F1 score of `0.914`.

```bash
***** train metrics *****
epoch = 3.0
train_runtime = 0:08:30
train_samples = 16000
train_samples_per_second = 96.337

***** eval metrics *****
eval_f1 = 0.914
eval_runtime = 0:00:08
```

Last but not least, terminate the EC2 instance to avoid unnecessary charges. Looking at the price-performance, our training run only cost about **`$0.20`** (**`$1.34/h * 0.15h ≈ $0.20`**).
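
For reference, here is a quick back-of-the-envelope calculation of the training cost, using the on-demand `trn1.2xlarge` price from the table above:

```python
# Back-of-the-envelope training cost for this run (on-demand trn1.2xlarge pricing).
price_per_hour = 1.34    # $ per hour for trn1.2xlarge
training_hours = 9 / 60  # roughly 9 minutes of training
print(f"~${price_per_hour * training_hours:.2f}")  # ~$0.20
```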
5 changes: 5 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,5 @@
# Notebooks

| Notebook | Task | Model Architectures |
| ------------------------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------------------ |
| [Getting started with AWS Trainium and Hugging Face Transformers](./text-classification/notebook.ipynb) | text-classification | BERT, RoBERTa, DistilBERT, XLM-RoBERTa, ALBERT, Electra, CamemBERT |
