Implement README for NLP toolkit (intel#35)
* Implement README for NLP toolkit

* Update default value for pruning and distillation config

* Update README and documents
PenghuiCheng authored Apr 16, 2022
1 parent 4cedaa6 commit 1d14327
Showing 18 changed files with 515 additions and 39 deletions.
109 changes: 107 additions & 2 deletions README.md
@@ -1,2 +1,107 @@
# NLP Toolkit: Optimization for Natural Language Processing (NLP) Models
NLP Toolkit is a powerful toolkit for automatically applying model optimizations to Natural Language Processing (NLP) models. It leverages [Intel® Neural Compressor](https://intel.github.io/neural-compressor) to provide a variety of optimization methods: quantization, pruning, distillation, and more.

## What does NLP Toolkit offer?
This toolkit improves developer productivity through easy-to-use model compression APIs that extend the Hugging Face Transformers API for deep learning models in the NLP domain, and it accelerates inference with the compressed models.

- Model Compression

|Framework |Quantization |Pruning/Sparsity |Distillation |
|-------------------|:--------------------:|:---------------:|:--------------------:|
|PyTorch |✔ |✔ |✔ |

- Data Augmentation for NLP Datasets
- NLP Executor for Inference Acceleration

## Getting Started
### Installation
#### Install Dependency
```bash
pip install -r requirements.txt
```

#### Install NLP Toolkit
```bash
git clone https://github.com/intel-innersource/frameworks.ai.nlp-toolkit.intel-nlp-toolkit.git nlp_toolkit
cd nlp_toolkit
git submodule update --init --recursive
python setup.py install
```

### Quantization
```python
from nlp_toolkit import NLPTrainer, QuantizationConfig, metrics, objectives

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_f1", is_relative=True, criterion=0.01)
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metric],
    objectives=[objectives.performance]
)
model = trainer.quantize(quant_config=q_config)
```

Please refer to [quantization document](docs/quantization.md) for more details.

### Pruning
```python
from nlp_toolkit import NLPTrainer, Pruner, PruningConfig, metrics

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
pruner = Pruner(prune_type='BasicMagnitude', target_sparsity_ratio=0.9)
p_conf = PruningConfig(pruner=[pruner], metrics=metric)
model = trainer.prune(pruning_config=p_conf)
```

Please refer to [pruning document](docs/pruning.md) for more details.

### Distillation
```python
from nlp_toolkit import NLPTrainer, DistillationConfig, Criterion, metrics

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
teacher_model = ...  # an existing fine-tuned model to serve as the teacher
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
d_conf = DistillationConfig(metrics=metric)
model = trainer.distill(distillation_config=d_conf, teacher_model=teacher_model)
```

Please refer to [distillation document](docs/distillation.md) for more details.

### Data Augmentation
Data augmentation provides facilities to generate synthesized NLP datasets for further model optimization. It supports text generation with popular fine-tuned models such as GPT and GPT-2, as well as other text synthesis approaches from [nlpaug](https://github.com/makcedward/nlpaug).

```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

result_path = "output"  # directory for the augmented data file
aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "original_dataset.csv"  # example: https://huggingface.co/datasets/glue/viewer/sst2/train
aug.column_names = "sentence"
aug.output_path = os.path.join(result_path, "test2.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
```

Please refer to [data augmentation document](docs/data_augmentation.md) for more details.

### NLP Executor
NLP Executor is an inference executor for Natural Language Processing (NLP) models, providing optimal performance through quantization and sparsity. The executor is a baremetal reference engine of NLP Toolkit and supports typical NLP models.

```python
from engine.compile import compile
# /path/to/your/model is a TensorFlow pb model or ONNX model
model = compile('/path/to/your/model')
inputs = ...  # a list of NumPy arrays, e.g. [input_ids, segment_ids, input_mask]
model.inference(inputs)
```

Please refer to [NLP executor document](docs/nlp_executor.md) for more details.

114 changes: 114 additions & 0 deletions docs/data_augmentation.md
@@ -0,0 +1,114 @@
# Data Augmentation: The Tool for Augmenting NLP Datasets
Data Augmentation is a tool that helps you augment NLP datasets for your machine learning projects. It integrates [nlpaug](https://github.com/makcedward/nlpaug) and other methods from Intel Labs.

## Getting Started
### Installation
#### Install Dependency
```bash
pip install nlpaug
pip install "transformers>=4.12.0"
```

#### Install NLP Toolkit
```bash
git clone https://github.com/intel-innersource/frameworks.ai.nlp-toolkit.intel-nlp-toolkit.git nlp_toolkit
cd nlp_toolkit
git submodule update --init --recursive
python setup.py install
```

### Data Augmentation
#### Script (please refer to [example](tests/test_data_augmentation.py))
```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

result_path = "output"  # directory for the augmented data file
aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
assert len(raw_datasets) == 10  # the original unit test expects 10 augmented rows
```

#### Parameters of DataAugmentation
|Parameter |Type |Description |Default value |
|:---------|:----|:------------------------------------------------------------------|:-------------|
|augmenter_type|String|Augmentation type |NA |
|input_dataset|String|Dataset name, or a csv or json file |None |
|output_path|String|Path and file name of the augmented data file |"save_path/augmented_dataset.csv"|
|data_config_or_task_name|String|Task name of a GLUE dataset, or a data config name |None |
|augmenter_arguments|Dict|Parameters for the augmenter; each augmenter takes different parameters |None|
|column_names|String|Column(s) to augment; used with Hugging Face `datasets` |"sentence"|
|split|String|Dataset split to augment, e.g. 'validation' or 'train' |"validation" |
|num_samples|Integer|Number of augmented samples generated per input sentence |1 |
|device|String|Device to use, 'cuda' or 'cpu' |"cpu" |
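As an illustration of these parameters, here is a sketch that configures an augmenter on a GLUE task (the values shown are illustrative, not required defaults):

```python
import os

from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "glue"             # dataset name, or a csv/json file
aug.data_config_or_task_name = "sst2"  # GLUE task name
aug.column_names = "sentence"          # column to augment
aug.split = "validation"               # dataset split to augment
aug.num_samples = 2                    # augmented samples per input sentence
aug.device = "cpu"                     # or "cuda"
aug.output_path = os.path.join("output", "augmented_dataset.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
```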

#### Supported Augmenter
|augmenter_type |augmenter_arguments |default value |
|:--------------|:-------------------------------------------------------------------|:-------------|
|"TextGenerationAug"|refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

#### Text Generation Augmenter
The text generation augmenter runs a data augmentation algorithm based on conditional text generation with an auto-regressive transformer model (e.g. GPT, GPT-2, Transformer-XL, XLNet, CTRL) to automatically generate labeled data.
Our approach follows the algorithms described in [Not Enough Data? Deep Learning to the Rescue!](https://arxiv.org/abs/1911.03118) and [Natural Language Generation for Effective Knowledge Distillation](https://www.aclweb.org/anthology/D19-6122.pdf).

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both the label and the sentence.
- Prepare datasets:

example:
```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.utils import EOS

os.makedirs('SST-2', exist_ok=True)
for split in {'train', 'validation'}:
    dataset = load_dataset('glue', 'sst2', split=split)
    with open('SST-2/' + split + '.txt', 'w') as fw:
        for d in dataset:
            fw.write(str(d['label']) + '\t' + d['sentence'] + EOS + '\n')
```

- Fine-tuning Causal Language Model

You can use the script [run_clm.py](https://github.com/huggingface/transformers/tree/v4.6.1/examples/pytorch/language-modeling/run_clm.py) from the transformers examples to fine-tune GPT-2 (gpt2-medium) on SST-2. The loss is that of causal language modeling.

```shell
DATASET=SST-2
TRAIN_FILE=$DATASET/train.txt
VALIDATION_FILE=$DATASET/validation.txt
MODEL=gpt2-medium
MODEL_DIR=model/$MODEL-$DATASET

python3 transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path $MODEL \
    --train_file $TRAIN_FILE \
    --validation_file $VALIDATION_FILE \
    --do_train \
    --do_eval \
    --output_dir $MODEL_DIR \
    --overwrite_output_dir
```


- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with the script below:
```python
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "/your/original/training_set.csv"
aug.output_path = "/your/augmented/dataset.csv"
aug.augmenter_arguments = {'model_name_or_path': '/your/fine-tuned/model'}
aug.data_augment()
```

This data augmentation algorithm can be used in several scenarios, like model distillation.


- augmenter_arguments:
|Parameter |Type|Description |Default value |
|:---------|:---|:---------------------------------------------------|:-------------|
|"model_name_or_path"|String|Language model used to generate data; refer to [line](nlp_toolkit/preprocessing/data_augmentation.py#L181)|NA|
|"stop_token"|String|Stop token used in the input data file |[EOS](nlp_toolkit/preprocessing/utils.py#L7)|
|"num_return_sentences"|Integer|Total number of samples to generate; -1 means match the number of input samples |-1|
|"temperature"|Float|Sampling temperature for the causal language model |1.0|
|"k"|Float|Top-k sampling parameter |0.0|
|"p"|Float|Top-p (nucleus) sampling parameter |0.9|
|"repetition_penalty"|Float|Penalty applied to repeated tokens during generation |1.0|

68 changes: 68 additions & 0 deletions docs/distillation.md
@@ -0,0 +1,68 @@
# Distillation
## Script
```python
from nlp_toolkit import metrics, DistillationConfig, NLPTrainer

# Create a trainer as you would in the transformers examples, just replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
teacher_model = ...  # an existing fine-tuned model to serve as the teacher
tune_metric = metrics.Metric(name="eval_accuracy")
distillation_conf = DistillationConfig(metrics=tune_metric)
model = trainer.distill(
    distillation_config=distillation_conf, teacher_model=teacher_model
)
```

Please refer to [example](../examples/optimize/pytorch/huggingface/text-classification/distillation/run_glue.py) for the details.

## Create an instance of Metric
The Metric defines which metric will be used to measure the performance of tuned models.
- example:
```python
Metric(name="eval_accuracy")
```

Please refer to [metrics document](metrics.md) for the details.

## Create an instance of Criterion (optional)
The Criterion defines the loss criterion used in the training phase.

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String|Name of the criterion, e.g. "KnowledgeLoss", "IntermediateLayersLoss" |"KnowledgeLoss"|
|temperature|Float |Softmax temperature for KnowledgeDistillationLoss |1.0 |
|loss_types|List of String|Types of the loss terms |['CE', 'CE'] |
|loss_weight_ratio|List of Float|Weight ratio of the loss terms |[0.5, 0.5] |
|layer_mappings|List|Layer mappings for IntermediateLayersLoss |[] |
|add_origin_loss|Bool|Whether to add the original loss for IntermediateLayersLoss |False |

- example:
```python
Criterion(name='KnowledgeLoss')
```
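A fuller sketch using the arguments from the table above (the values shown are the documented defaults):

```python
from nlp_toolkit import Criterion

criterion = Criterion(
    name="KnowledgeLoss",
    temperature=1.0,               # softmax temperature for KnowledgeDistillationLoss
    loss_types=['CE', 'CE'],       # loss types, per the table above
    loss_weight_ratio=[0.5, 0.5],  # weight of each loss term
)
# Pass it to DistillationConfig via the criterion argument (see the next section)
```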

## Create an instance of DistillationConfig
The DistillationConfig contains all the information related to the model distillation behavior. Once you have created Metric and Criterion instances, you can create an instance of DistillationConfig. Metric and Criterion are optional.

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|framework |String |Which framework you use |"pytorch" |
|criterion|Criterion |Criterion used during training |"KnowledgeLoss"|
|metrics |Metric |Used to evaluate the accuracy of the tuned model; not needed for NoTrainerOptimizer|None |

- example:
```python
distillation_conf = DistillationConfig(metrics=tune_metric)
```

## Distill with Trainer
NLPTrainer inherits from transformers.Trainer, so you can create a trainer just as in the transformers examples, then distill the model with the `trainer.distill` function.
```python
from nlp_toolkit import NLPTrainer

trainer = NLPTrainer(...)
# distillation_conf and teacher_model are defined as in the sections above
model = trainer.distill(
    distillation_config=distillation_conf, teacher_model=teacher_model
)
```
16 changes: 16 additions & 0 deletions docs/metrics.md
@@ -0,0 +1,16 @@
# Metric
The Metric defines which metric will be used to measure the performance of tuned models.
- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String |Metric name returned by the evaluation function, e.g. "eval_f1", "eval_accuracy"| |
|greater_is_better|Bool |Whether a higher metric value is better (e.g. greater is better for F1). Only used for quantization.|True|
|is_relative|Bool |Used together with "criterion". If True, a "criterion" of 0.01 allows a metric drop of <1% relative; if False, the drop must be <1% absolute. Only used for quantization.|True |
|criterion |Float |Accuracy-loss tolerance used together with "is_relative"; e.g. 0.01 allows a <1% relative drop, 0.02 allows <2%. Only used for quantization.|0.01 |
|weight_ratio|Float |Used when there are multiple metrics. For example, to weigh F1 at 0.3 and accuracy at 0.7, create one Metric instance for each; the tuning target becomes f1*0.3 + accuracy*0.7. Only used for quantization.|None |

- example:
```python
from nlp_toolkit import metrics
metrics.Metric(name="eval_f1", greater_is_better=True, is_relative=True, criterion=0.01, weight_ratio=None)
```
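When tuning against more than one metric, each Metric instance carries its weight via `weight_ratio`, as described above. A sketch (passing a list of metrics mirrors the `metrics=[metric]` usage in the README):

```python
from nlp_toolkit import metrics

# The tuning target becomes eval_f1*0.3 + eval_accuracy*0.7
f1 = metrics.Metric(name="eval_f1", weight_ratio=0.3)
acc = metrics.Metric(name="eval_accuracy", weight_ratio=0.7)
# e.g. QuantizationConfig(approach="PostTrainingStatic", metrics=[f1, acc])
```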
7 changes: 4 additions & 3 deletions docs/nlp_executor.md
@@ -1,5 +1,5 @@
# NLP Executor: A baremetal inference engine for Natural Language Processing (NLP) Models
NLP Executor is an inference executor for Natural Language Processing (NLP) models, providing optimal performance through quantization and sparsity. The executor is a baremetal reference engine of NLP Toolkit and supports typical NLP models.

## Deployment Architecture
The executor supports model optimization and high-performance kernels for CPUs.
@@ -65,12 +65,13 @@
If you use `pip install -e .` to install the executor in your current folder, please…

You can skip this step if you installed it another way.

```python
from engine_py import Model
# load the model; config_path: path of the generated yaml, weight_path: path of the generated bin
model = Model(config_path, weight_path)
# use model.forward to do inference
out = model.forward([input_ids, segment_ids, input_mask])
```

The `input_ids`, `segment_ids` and `input_mask` are the input NumPy arrays of a BERT model, each of shape (batch_size, seq_len).
Note that `out` is a list containing the BERT model's output NumPy data (`out=[output numpy data]`).
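A minimal sketch of preparing dummy inputs of that shape (the file names and the int32 dtype are assumptions; check the dtypes your model expects):

```python
import numpy as np
from engine_py import Model

# Paths produced by the executor's model compilation step (illustrative names)
config_path = "conf.yaml"
weight_path = "model.bin"

batch_size, seq_len = 1, 128
# Dummy BERT-style inputs, each of shape (batch_size, seq_len)
input_ids = np.random.randint(0, 30522, size=(batch_size, seq_len), dtype=np.int32)  # 30522: bert-base vocab size
segment_ids = np.zeros((batch_size, seq_len), dtype=np.int32)
input_mask = np.ones((batch_size, seq_len), dtype=np.int32)

model = Model(config_path, weight_path)
out = model.forward([input_ids, segment_ids, input_mask])  # out = [output numpy data]
```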
19 changes: 19 additions & 0 deletions docs/objectives.md
@@ -0,0 +1,19 @@
# Objective
To evaluate the status of a specific model during tuning, we need general objectives that measure the status of different models.

NLP Toolkit supports optimized low-precision recipes for deep learning models to achieve product objectives, such as inference performance and memory usage, under expected accuracy criteria. It supports the built-in objectives listed in [INC](https://github.com/intel-innersource/frameworks.ai.lpot.intel-lpot/blob/master/docs/objective.md#built-in-objective-support-list).

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String |The objective name in [INC](https://github.com/intel-innersource/frameworks.ai.lpot.intel-lpot/blob/master/docs/objective.md#built-in-objective-support-list), e.g. "performance", "modelsize"| |
|greater_is_better|Bool |Whether a higher value is better, e.g. greater is better for performance, but lower is better for modelsize|True|
|weight_ratio|Float |Used when there are multiple objectives. For example, to focus on both performance and model size, create one Objective instance for each and set their weight proportions|None |

- example:
```python
from nlp_toolkit import objectives
objectives.Objective(name="performance", greater_is_better=True, weight_ratio=None)
```

- Built-in Objective instances: performance, modelsize.
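For example, weighting the two built-in objectives against each other (a sketch; the list form mirrors `objectives=[objectives.performance]` in the README):

```python
from nlp_toolkit import objectives

# Favor inference performance over model size during tuning
perf = objectives.Objective(name="performance", greater_is_better=True, weight_ratio=0.8)
size = objectives.Objective(name="modelsize", greater_is_better=False, weight_ratio=0.2)
# e.g. QuantizationConfig(approach="PostTrainingStatic", objectives=[perf, size])
```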