Commit

Add notebook text classification (#88)
* wip

* added guide and notebook

* Update docs/source/tutorials/fine_tune_bert.mdx

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update docs/source/tutorials/fine_tune_bert.mdx

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/README.md

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/text-classification/scripts/train.py

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

* Update notebooks/text-classification/scripts/train.py

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

---------

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
philschmid and michaelbenayoun authored Jun 5, 2023
1 parent 1fa29a6 commit 872a38e
Showing 4 changed files with 683 additions and 1 deletion.
208 changes: 207 additions & 1 deletion docs/source/tutorials/fine_tune_bert.mdx
@@ -16,4 +16,210 @@ limitations under the License.

# Fine-tune BERT for Text Classification on AWS Trainium

This tutorial will help you to get started with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and Hugging Face Transformers. It will cover how to set up a Trainium instance on AWS, and how to load and fine-tune a Transformers model for text classification.

You will learn how to:

1. [Setup AWS environment](#1-setup-aws-environment)
2. [Load and process the dataset](#2-load-and-process-the-dataset)
3. [Fine-tune BERT using Hugging Face Transformers and Optimum Neuron](#3-fine-tune-bert-using-hugging-face-transformers)

Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments.

## Quick intro: AWS Trainium

[AWS Trainium (Trn1)](https://aws.amazon.com/de/ec2/instance-types/trn1/) is a purpose-built EC2 instance family for deep learning (DL) training workloads. Trainium is the successor of [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf1/?nc1=h_ls), focused on high-performance training workloads, and claims up to 50% cost-to-train savings over comparable GPU-based instances.

Trainium has been optimized for training natural language processing, computer vision, and recommender models. The accelerator supports a wide range of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8.

The biggest Trainium instance, the `trn1.32xlarge`, comes with over 500GB of accelerator memory, making it easy to fine-tune ~10B-parameter models on a single instance. Below you will find an overview of the available instance types. More details [here](https://aws.amazon.com/de/ec2/instance-types/trn1/#Product_details):

| instance size                 | accelerators | accelerator memory (GB) | vCPUs | CPU memory (GB) | price per hour |
| ----------------------------- | ------------ | ----------------------- | ----- | --------------- | -------------- |
| trn1.2xlarge                  | 1            | 32                      | 8     | 32              | $1.34          |
| trn1.32xlarge                 | 16           | 512                     | 128   | 512             | $21.50         |
| trn1n.32xlarge (2x bandwidth) | 16           | 512                     | 128   | 512             | $24.78         |

---

Now that we know what Trainium offers, let's get started. 🚀

_Note: This tutorial was created on a trn1.2xlarge AWS EC2 Instance._

## 1. Setup AWS environment

In this example, we will use the `trn1.2xlarge` instance on AWS with one Trainium accelerator (two Neuron cores) and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).

This tutorial doesn't cover how to create the instance in detail. You can check out the previous blog post about [“Setting up AWS Trainium for Hugging Face Transformers”](https://www.philschmid.de/setup-aws-trainium), which includes a step-by-step guide on setting up the environment.

Once the instance is up and running, we can ssh into it. But instead of developing inside a terminal, we want to use a `Jupyter` environment, which we can use for preparing our dataset and launching the training. For this, we need to add port forwarding to the `ssh` command, which will tunnel our localhost traffic to the Trainium instance.

```bash
PUBLIC_DNS="" # IP address, e.g. ec2-3-80-....
KEY_PATH="" # local path to key, e.g. ssh/trn.pem

ssh -L 8080:localhost:8080 -i ${KEY_NAME}.pem ubuntu@$PUBLIC_DNS
```

We can now start our **`jupyter`** server.

```bash
python -m notebook --allow-root --port=8080
```

You should see a familiar **`jupyter`** output with a URL to the notebook.

**`http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9`**

We can click on it, and a **`jupyter`** environment opens in our local browser.

![jupyter.webp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/neuron/tutorial-fine-tune-bert-jupyter.png)

We are going to use the Jupyter environment only for preparing the dataset, and then `torchrun` for launching our training script on both Neuron cores for distributed training. Let's create a new notebook and get started.
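
Optionally, you can run a quick sanity check from the new notebook to confirm that a Neuron device is reachable. This is a minimal sketch; it assumes `torch-neuronx` (which brings `torch_xla`) is installed, as it is on the Hugging Face Neuron AMI.

```python
# Optional sanity check: confirm an XLA (Neuron) device is reachable from the notebook.
# Assumes torch-neuronx / torch_xla are installed (they come with the Hugging Face Neuron AMI).
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # e.g. xla:0
```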

## 2. Load and process the dataset

We are training a text classification model on the [emotion](https://huggingface.co/datasets/philschmid/emotion) dataset to keep the example straightforward. `emotion` is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.

We will use the `load_dataset()` method from the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the `emotion` dataset.

```python
from datasets import load_dataset

# Dataset id from huggingface.co/dataset
dataset_id = "philschmid/emotion"

# Load raw dataset
raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")
# Train dataset size: 16000
# Test dataset size: 2000
```
Let’s check out an example of the dataset.
```python
from random import randrange
random_id = randrange(len(raw_dataset['train']))
raw_dataset['train'][random_id]
# {'text': 'i feel isolated and alone in my trade', 'label': 0}
```
We must convert our "Natural Language" to token IDs to train our model. This is done by a tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). If you want to learn more about this, check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=pt) of the [Hugging Face Course](https://huggingface.co/course/chapter1/1).

Our Neuron Accelerator expects a fixed shape of inputs. We need to truncate or pad all samples to the same length.

```python
from transformers import AutoTokenizer
import os

# Model id to load the tokenizer
model_id = "bert-base-uncased"
save_dataset_path = "lm_dataset"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, return_tensors="pt")

# Tokenize dataset
raw_dataset = raw_dataset.rename_column("label", "labels")  # to match Trainer
tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized_dataset = tokenized_dataset.with_format("torch")

# Save dataset to disk
tokenized_dataset["train"].save_to_disk(os.path.join(save_dataset_path, "train"))
tokenized_dataset["test"].save_to_disk(os.path.join(save_dataset_path, "eval"))
```
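
The training script in the next section expects `num_labels`, `label2id`, and `id2label` for the classification head. Below is a minimal sketch of how these could be derived from the dataset's `ClassLabel` feature; the variable names are illustrative, and the full `train.py` builds them itself.

```python
# Derive the label mappings from the dataset's ClassLabel feature
# (illustrative sketch; the full train.py constructs these on its own).
labels = raw_dataset["train"].features["labels"].names  # e.g. ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
num_labels = len(labels)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for idx, label in enumerate(labels)}
```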

## 3. Fine-tune BERT using Hugging Face Transformers

Normally you would use the [Trainer](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.Trainer) and [TrainingArguments](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments) to fine-tune PyTorch-based transformer models.

But together with AWS, we have developed a `TrainiumTrainer` to improve performance, robustness, and safety when training on Trainium instances. The `TrainiumTrainer` also comes with a [model cache](https://www.notion.so/Getting-started-with-AWS-Trainium-and-Hugging-Face-Transformers-8428c72556194aed9c393de101229dcf), which allows us to use precompiled models and configurations from the Hugging Face Hub to skip the compilation step that would otherwise be needed at the beginning of training. This can reduce the training time by ~3x.

The `TrainiumTrainer` is part of the `optimum-neuron` library and can be used as a 1-to-1 replacement for the `Trainer`. You only have to adjust the import in your training script.

```diff
- from transformers import Trainer
+ from optimum.neuron import TrainiumTrainer as Trainer
```

We prepared a simple [train.py](https://github.com/philschmid/aws-neuron-samples/blob/main/training/scripts/train.py) training script based on the ["Getting started with Pytorch 2.0 and Hugging Face Transformers”](https://www.philschmid.de/getting-started-pytorch-2-0-transformers#3-fine-tune--evaluate-bert-model-with-the-hugging-face-trainer) blog post with the `TrainiumTrainer`. Below is an excerpt:

```python
import os

from datasets import load_from_disk
from transformers import AutoModelForSequenceClassification, TrainingArguments
from optimum.neuron import TrainiumTrainer as Trainer


def parse_args():
    ...


def training_function(args):
    # Load dataset from disk and tokenizer
    train_dataset = load_from_disk(os.path.join(args.dataset_path, "train"))
    ...

    # Download the model from huggingface.co/models
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
    )

    training_args = TrainingArguments(
        ...
    )

    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    # Start training
    trainer.train()
```
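
The excerpt references a `compute_metrics` helper that is defined in the full script. A minimal sketch of what such a helper could look like, assuming the `evaluate` library and a macro-averaged F1 (check `train.py` for the actual implementation):

```python
# Illustrative compute_metrics sketch (the real one lives in train.py).
# Assumes the `evaluate` library is installed.
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # Macro-averaged F1 over the six emotion classes
    return f1_metric.compute(predictions=predictions, references=labels, average="macro")
```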

We can load the training script into our environment using the `wget` command or manually copy it into the notebook from [here](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/scripts/train.py).

```python
!wget https://raw.githubusercontent.com/huggingface/optimum-neuron/main/notebooks/text-classification/scripts/train.py
```

We will use `torchrun` to launch our training script on both Neuron cores for distributed training. `torchrun` is a tool that automatically distributes a PyTorch model across multiple accelerators. We can pass the number of accelerators via the `--nproc_per_node` argument alongside our hyperparameters.

We'll use the following command to launch training:

```bash
!torchrun --nproc_per_node=2 train.py \
--model_id bert-base-uncased \
--dataset_path lm_dataset \
--lr 5e-5 \
--per_device_train_batch_size 16 \
--bf16 True \
--epochs 3
```

_**Note**: If you see unexpectedly bad accuracy, you might want to deactivate `bf16` for now._

After 9 minutes, the training completed, achieving an excellent F1 score of `0.914`.

```bash
***** train metrics *****
epoch = 3.0
train_runtime = 0:08:30
train_samples = 16000
train_samples_per_second = 96.337

***** eval metrics *****
eval_f1 = 0.914
eval_runtime = 0:00:08
```

Last but not least, terminate the EC2 instance to avoid unnecessary charges. Looking at the price-performance, our training run only cost about **`$0.20`** (**`$1.34/h * 0.15h ≈ $0.20`**).
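
For reference, here is a quick back-of-the-envelope calculation of the training cost, using the on-demand `trn1.2xlarge` price from the table above:

```python
# Back-of-the-envelope training cost for this run (on-demand trn1.2xlarge pricing).
price_per_hour = 1.34    # $ per hour for trn1.2xlarge
training_hours = 9 / 60  # roughly 9 minutes of training
print(f"~${price_per_hour * training_hours:.2f}")  # ~$0.20
```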
5 changes: 5 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,5 @@
# Notebooks

| Notebook | Task | Model Architectures |
| ------------------------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------------------ |
| [Getting started with AWS Trainium and Hugging Face Transformers](./text-classification/notebook.ipynb) | text-classification | BERT, RoBERTa, DistilBERT, XLM-RoBERTa, ALBERT, Electra, CamemBERT |
