This page explains how to fine-tune a pre-sparsified BERT model onto a downstream dataset with SparseML's Trainer.
Sparse Transfer Learning is quite similar to typical NLP transfer learning, where we fine-tune a checkpoint pretrained on a large dataset like Wikipedia and BookCorpus onto a smaller downstream dataset and task. With Sparse Transfer Learning, however, we simply start the fine-tuning process from a pre-sparsified model and maintain sparsity during training.
SparseZoo contains pre-sparsified checkpoints of common NLP models like BERT and RoBERTa. These models can be used as the starting checkpoint for the sparse transfer learning workflow.
Check out the full list of pre-sparsified models in SparseZoo.
Install SparseML via pip:
pip install sparseml[transformers]
Let's try a simple example of fine-tuning a pre-sparsified model onto the SST2 dataset. SST2 is a sentiment analysis dataset in which each sentence is labeled 0 or 1, representing negative or positive sentiment.
We will fine-tune a 90% pruned BERT-base, identified by the following stub:
zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none
Run the following to download it:
from sparsezoo import Model
model_stub = "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none"
zoo_model = Model(model_stub, download_path="./model")
model_path = zoo_model.training.path
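If you want to confirm what was downloaded, you can list the contents of the training directory. This is an optional check, and the exact file list may vary by model:
import os

# optional: inspect the downloaded training files (expect pytorch_model.bin, config.json, tokenizer files, etc.)
print(sorted(os.listdir(model_path)))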
Additionally, SparseML allows you to apply model distillation from a dense teacher model during the fine-tuning process. This is an optional step, but it can help increase accuracy.
For SST2, there is a dense BERT-base trained on SST2, identified by the following stub:
zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none
Run the following to download it:
from sparsezoo import Model
teacher_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none"
zoo_model = Model(teacher_stub, download_path="./teacher")
teacher_path = zoo_model.training.path
With the models downloaded, we will set up the Hugging Face tokenizer, config, and model. These are all native Hugging Face objects, so check out the Hugging Face docs for more details on AutoModel, AutoConfig, and AutoTokenizer as needed.
We instantiate these classes by passing the local path to the directory containing the pytorch_model.bin, tokenizer.json, and config.json files from the SparseZoo download.
from sparseml.transformers.utils import SparseAutoModel
from transformers import AutoModelForSequenceClassification, AutoConfig, AutoTokenizer
NUM_LABELS = 2
# see examples for how to use models with different tokenizers
tokenizer = AutoTokenizer.from_pretrained(model_path)
# setup configs
model_config = AutoConfig.from_pretrained(model_path, num_labels=NUM_LABELS)
teacher_config = AutoConfig.from_pretrained(teacher_path, num_labels=NUM_LABELS)
# load the sparse model from the local directory (model_path)
model_kwargs = {"config": model_config}
model_kwargs["state_dict"], s_delayed = SparseAutoModel._loadable_state_dict(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, **model_kwargs)
# load the dense teacher from the local directory (teacher_path)
teacher_kwargs = {"config": teacher_config}
teacher_kwargs["state_dict"], t_delayed = SparseAutoModel._loadable_state_dict(teacher_path)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_path, **teacher_kwargs)
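Before training, you can optionally verify that the checkpoint is in fact heavily pruned by counting zero-valued weights in the model's Linear layers. This is plain PyTorch rather than a SparseML API, and the measured percentage may differ slightly from 90% since the pooler and classifier layers are not pruned:
import torch

# optional: measure sparsity of the Linear layers in the loaded model
total_params, zero_params = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        total_params += module.weight.numel()
        zero_params += (module.weight == 0).sum().item()

print(f"Sparsity of Linear weights: {zero_params / total_params:.2%}")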
Next, download a dataset and prepare it for training. We can use the Hugging Face datasets library.
Begin by loading the SST2 dataset:
from datasets import load_dataset
from pprint import pprint
# load dataset natively
dataset = load_dataset("glue", "sst2")
pprint(dataset["train"][0])
# >> {'idx': 0,
# >> 'label': 0,
# >> 'sentence': 'hide new secretions from the parental units '}
Next, tokenize the dataset using the tokenizer created above.
def preprocess_fn(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        max_length=min(tokenizer.model_max_length, 128),
        truncation=True,
    )

tokenized_dataset = dataset.map(preprocess_fn, batched=True)
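Optionally, you can peek at the tokenized dataset to confirm the model inputs were added. The exact column names depend on the tokenizer; the comments below reflect what a BERT tokenizer typically produces:
# optional: confirm the tokenizer added input_ids / attention_mask columns
print(tokenized_dataset["train"].column_names)
# e.g. ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']
print(len(tokenized_dataset["train"][0]["input_ids"]))  # 128, since we padded to max_length above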
To run sparse transfer learning, we first need to create/select a sparsification recipe. For sparse transfer, we need a recipe that instructs SparseML to maintain sparsity during training and to quantize the model over the final epochs.
For the SST2 dataset, there is a transfer learning recipe available in SparseZoo, identified by the following stub:
zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
We will use this recipe for the example. This is what it looks like:
version: 1.1.0
# General Variables
num_epochs: &num_epochs 13
init_lr: 1.5e-4
final_lr: 0
qat_start_epoch: &qat_start_epoch 8.0
observer_epoch: &observer_epoch 12.0
quantize_embeddings: &quantize_embeddings 1
distill_hardness: &distill_hardness 1.0
distill_temperature: &distill_temperature 2.0
weight_decay: 0.01
# Modifiers:
training_modifiers:
  - !EpochRangeModifier
    end_epoch: eval(num_epochs)
    start_epoch: 0.0

  - !LearningRateFunctionModifier
    start_epoch: 0
    end_epoch: eval(num_epochs)
    lr_func: linear
    init_lr: eval(init_lr)
    final_lr: eval(final_lr)

quantization_modifiers:
  - !QuantizationModifier
    start_epoch: eval(qat_start_epoch)
    disable_quantization_observer_epoch: eval(observer_epoch)
    freeze_bn_stats_epoch: eval(observer_epoch)
    quantize_embeddings: eval(quantize_embeddings)
    quantize_linear_activations: 0
    exclude_module_types: ['LayerNorm', 'Tanh']
    submodules:
      - bert.embeddings
      - bert.encoder
      - bert.pooler
      - classifier

distillation_modifiers:
  - !DistillationModifier
    hardness: eval(distill_hardness)
    temperature: eval(distill_temperature)
    distill_output_keys: [logits]

constant_modifiers:
  - !ConstantPruningModifier
    start_epoch: 0.0
    params: __ALL_PRUNABLE__

regularization_modifiers:
  - !SetWeightDecayModifier
    start_epoch: 0.0
    weight_decay: eval(weight_decay)
The key Modifiers for sparse transfer learning are the following:
- ConstantPruningModifier instructs SparseML to maintain the sparsity structure of the network during the fine-tuning process
- QuantizationModifier instructs SparseML to apply quantization-aware training to quantize the weights over the final epochs
- DistillationModifier instructs SparseML to apply model distillation at the logit layer
Run the following to download the recipe to your local directory:
from sparsezoo import Model
transfer_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
zoo_model = Model(transfer_stub, download_path="./transfer_recipe")
recipe_path = zoo_model.recipes.default.path
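If you'd like to confirm the recipe contents locally, you can simply print the downloaded file (optional):
# optional: inspect the downloaded transfer learning recipe
with open(recipe_path) as recipe_file:
    print(recipe_file.read())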
We can use the load_metric utility from the Hugging Face datasets library to compute and report accuracy.
from transformers import EvalPrediction
from datasets import load_metric
import numpy as np
metric = load_metric("accuracy")
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return metric.compute(predictions=preds, references=p.label_ids)
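As a quick sanity check, you can call compute_metrics directly on dummy logits. The arrays below are made up purely for illustration:
# optional: verify compute_metrics on dummy logits and labels
dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2]])  # argmax predicts classes [1, 0]
dummy_labels = np.array([1, 0])
print(compute_metrics(EvalPrediction(predictions=dummy_logits, label_ids=dummy_labels)))
# expect: {'accuracy': 1.0}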
With the recipe created, we are now ready to kick off transfer learning.
SparseML offers a custom Trainer class that inherits from the familiar Hugging Face Trainer. SparseML's Trainer extends the functionality to enable passing a recipe (such as the one we downloaded above). SparseML's Trainer parses the recipe and adjusts the training loop to apply the specified algorithms.
As you can see, it works just like Hugging Face's Trainer:
from sparseml.transformers.sparsification import Trainer, TrainingArguments
from transformers import default_data_collator
# create TrainingArguments
training_args = TrainingArguments(
    output_dir="./training_output",
    do_train=True,
    do_eval=True,
    resume_from_checkpoint=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
)

# create SparseML Trainer
trainer = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    teacher=teacher,
    metadata_args=["per_device_train_batch_size", "per_device_eval_batch_size", "fp16"],
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
# kick off training
train_result = trainer.train(resume_from_checkpoint=False)
trainer.save_model() # Saves the tokenizer too for easy upload
trainer.save_state()
trainer.save_optimizer_and_scheduler(training_args.output_dir)
The Trainer class parses the transfer learning recipe and updates the training loop to maintain sparsity and apply quantization-aware training over the final epochs. After training for 13 epochs, the final model is 90% pruned and quantized, reaching ~92% accuracy on the validation set.
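You can reproduce the validation number with the same Trainer, since SparseML's Trainer inherits the standard Hugging Face evaluate() method (optional):
# optional: evaluate the final sparse-quantized model on the SST2 validation set
eval_metrics = trainer.evaluate(eval_dataset=tokenized_dataset["validation"])
print(eval_metrics)  # includes eval_accuracy from compute_metrics above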
Note that in this case, we passed the dense teacher we downloaded from SparseZoo to the Trainer. This is an optional argument (turn it off by setting teacher="disable"), but it can help to increase accuracy during the training process. You can configure the hyperparameters of the distillation process via the DistillationModifier in the recipe.
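For reference, here is what the Trainer construction from above would look like with distillation turned off. This is a sketch that reuses the earlier arguments unchanged except for the teacher:
# optional: sparse transfer learning without a distillation teacher
trainer_no_distill = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    teacher="disable",  # turn off distillation from a dense teacher
    metadata_args=["per_device_train_batch_size", "per_device_eval_batch_size", "fp16"],
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)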
Check out the Trainer and TrainingArguments API-level docs for more details.
SparseML provides a sparseml.transformers.export_onnx command that you can use to export your trained model to ONNX. Be sure the --model_path argument points to your trained model:
sparseml.transformers.export_onnx \
--model_path ./training_output \
--task text_classification
The command creates a ./deployment folder in your local directory, which contains the ONNX file and the necessary Hugging Face tokenizer and configuration files for deployment with DeepSparse.
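As a quick test of the exported model, you can load the deployment folder with DeepSparse's Pipeline API. This is a sketch and assumes the deepsparse package is installed separately (pip install deepsparse); see the DeepSparse docs for the full deployment workflow:
from deepsparse import Pipeline

# load the exported ONNX model plus tokenizer/config from the deployment folder
sst2_pipeline = Pipeline.create(
    task="text_classification",
    model_path="./deployment",
)

# run a sample sentence through the sparse-quantized model
print(sst2_pipeline(sequences=["The movie was absolutely wonderful!"]))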
Take a look at the tutorials for more examples in other use cases:
- Sparse Transfer with GLUE Datasets (SST2) for sentiment analysis
- Sparse Transfer with Custom Datasets (RottenTomatoes) and Custom Teacher from HF Hub for sentiment analysis
- Sparse Transfer with GLUE Datasets (QQP) for multi-input text classification
- Sparse Transfer with Custom Datasets (SICK) for multi-input text classification
- Sparse Transfer with Custom Datasets (TweetEval) and Custom Teacher for single input text classification
- Sparse Transfer with Custom Datasets (GoEmotions) for multi-label text classification
- Sparse Transfer with Conll2003 for named entity recognition
- Sparse Transfer with Custom Datasets (WNUT) and Custom Teacher for named entity recognition
- Sparse Transfer with SQuAD (example coming soon!)
- Sparse Transfer with Squadshifts Amazon (example coming soon!)