Implement README for NLP toolkit (intel#35)
* Implement README for NLP toolkit

* Update default value for pruning and distillation config

* Update README and documents
PenghuiCheng authored Apr 16, 2022
1 parent 4cedaa6 commit 1d14327
Showing 18 changed files with 515 additions and 39 deletions.
109 changes: 107 additions & 2 deletions README.md
@@ -1,2 +1,107 @@
# NLP Toolkit: Optimization for Natural Language Processing (NLP) Models
NLP Toolkit is a powerful toolkit for automatically applying model optimizations to Natural Language Processing (NLP) models. It leverages [Intel® Neural Compressor](https://intel.github.io/neural-compressor) to provide a variety of optimization methods: quantization, pruning, distillation, and more.

## What does NLP Toolkit offer?
This toolkit improves developer productivity through easy-to-use model compression APIs that extend the Hugging Face Transformers API for deep learning models in the NLP domain, and it accelerates inference with the compressed models.

- Model Compression

|Framework |Quantization |Pruning/Sparsity |Distillation |
|-------------------|:--------------------:|:---------------:|:--------------------:|
|PyTorch |✔ |✔ |✔ |

- Data Augmentation for NLP Datasets
- NLP Executor for Inference Acceleration

## Getting Started
### Installation
#### Install Dependency
```bash
pip install -r requirements.txt
```

#### Install NLP Toolkit
```bash
git clone https://github.com/intel-innersource/frameworks.ai.nlp-toolkit.intel-nlp-toolkit.git nlp_toolkit
cd nlp_toolkit
git submodule update --init --recursive
python setup.py install
```

### Quantization
```python
from nlp_toolkit import NLPTrainer, QuantizationConfig, metrics, objectives

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_f1", is_relative=True, criterion=0.01)
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metric],
    objectives=[objectives.performance]
)
model = trainer.quantize(quant_config=q_config)
```

Please refer to [quantization document](docs/quantization.md) for more details.

### Pruning
```python
from nlp_toolkit import NLPTrainer, Pruner, PruningConfig, metrics

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
pruner = Pruner(prune_type='BasicMagnitude', target_sparsity_ratio=0.9)
p_conf = PruningConfig(pruner=[pruner], metrics=metric)
model = trainer.prune(pruning_config=p_conf)
```

Please refer to [pruning document](docs/pruning.md) for more details.

### Distillation
```python
from nlp_toolkit import NLPTrainer, DistillationConfig, Criterion, metrics

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
teacher_model = ...  # an existing fine-tuned model to serve as the teacher
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
d_conf = DistillationConfig(metrics=metric)
model = trainer.distill(distillation_config=d_conf, teacher_model=teacher_model)
```

Please refer to [distillation document](docs/distillation.md) for more details.

### Data Augmentation
Data augmentation provides facilities to generate synthesized NLP datasets for further model optimization. It supports text generation with popular fine-tuned models such as GPT and GPT-2, as well as other text synthesis approaches from [nlpaug](https://github.com/makcedward/nlpaug).

```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

result_path = "output"  # directory for the augmented data file
aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "original_dataset.csv"  # example: https://huggingface.co/datasets/glue/viewer/sst2/train
aug.column_names = "sentence"
aug.output_path = os.path.join(result_path, "test2.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
```

Please refer to [data augmentation document](docs/data_augmentation.md) for more details.

### NLP Executor
NLP Executor is an inference executor for Natural Language Processing (NLP) models, providing optimal performance through quantization and sparsity. The executor is a baremetal reference engine of NLP Toolkit and supports typical NLP models.

```python
from engine.compile import compile
# /path/to/your/model is a TensorFlow pb model or ONNX model
model = compile('/path/to/your/model')
inputs = ...  # a list of NumPy arrays, e.g. [input_ids, segment_ids, input_mask]
model.inference(inputs)
```

Please refer to [NLP executor document](docs/nlp_executor.md) for more details.

114 changes: 114 additions & 0 deletions docs/data_augmentation.md
@@ -0,0 +1,114 @@
# Data Augmentation: The Tool for Augmenting NLP Datasets
Data Augmentation is a tool that helps you augment NLP datasets for your machine learning projects. It integrates [nlpaug](https://github.com/makcedward/nlpaug) and other methods from Intel Labs.

## Getting Started
### Installation
#### Install Dependency
```bash
pip install nlpaug
pip install "transformers>=4.12.0"
```

#### Install NLP Toolkit
```bash
git clone https://github.com/intel-innersource/frameworks.ai.nlp-toolkit.intel-nlp-toolkit.git nlp_toolkit
cd nlp_toolkit
git submodule update --init --recursive
python setup.py install
```

### Data Augmentation
#### Script (please refer to [example](tests/test_data_augmentation.py))
```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

result_path = "output"  # directory for the augmented data file
aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
assert len(raw_datasets) == 10  # the original unit test expects 10 augmented rows
```

#### Parameters of DataAugmentation
|Parameter |Type |Description |Default value |
|:---------|:----|:------------------------------------------------------------------|:-------------|
|augmenter_type|String|Augmentation type |NA |
|input_dataset|String|Dataset name, or a csv or json file |None |
|output_path|String|Path and file name of the augmented data file |"save_path/augmented_dataset.csv"|
|data_config_or_task_name|String|Task name of a GLUE dataset, or a data config name |None |
|augmenter_arguments|Dict|Parameters for the augmenter; each augmenter takes different parameters |None|
|column_names|String|Column(s) to augment; used with Hugging Face `datasets` |"sentence"|
|split|String|Dataset split to augment, e.g. 'validation' or 'train' |"validation" |
|num_samples|Integer|Number of augmented samples generated per input sentence |1 |
|device|String|Device to use, 'cuda' or 'cpu' |"cpu" |
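As an illustration of these parameters, here is a sketch that configures an augmenter on a GLUE task (the values shown are illustrative, not required defaults):

```python
import os

from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "glue"             # dataset name, or a csv/json file
aug.data_config_or_task_name = "sst2"  # GLUE task name
aug.column_names = "sentence"          # column to augment
aug.split = "validation"               # dataset split to augment
aug.num_samples = 2                    # augmented samples per input sentence
aug.device = "cpu"                     # or "cuda"
aug.output_path = os.path.join("output", "augmented_dataset.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
```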

#### Supported Augmenter
|augmenter_type |augmenter_arguments |default value |
|:--------------|:-------------------------------------------------------------------|:-------------|
|"TextGenerationAug"|refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

#### Text Generation Augmenter
The text generation augmenter runs a data augmentation algorithm based on conditional text generation with an auto-regressive transformer model (e.g. GPT, GPT-2, Transformer-XL, XLNet, CTRL) to automatically generate labeled data.
Our approach follows the algorithms described in [Not Enough Data? Deep Learning to the Rescue!](https://arxiv.org/abs/1911.03118) and [Natural Language Generation for Effective Knowledge Distillation](https://www.aclweb.org/anthology/D19-6122.pdf).

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both the label and the sentence.
- Prepare datasets:

example:
```python
import os

from datasets import load_dataset
from nlp_toolkit.preprocessing.utils import EOS

os.makedirs('SST-2', exist_ok=True)
for split in {'train', 'validation'}:
    dataset = load_dataset('glue', 'sst2', split=split)
    with open('SST-2/' + split + '.txt', 'w') as fw:
        for d in dataset:
            fw.write(str(d['label']) + '\t' + d['sentence'] + EOS + '\n')
```

- Fine-tuning Causal Language Model

You can use the script [run_clm.py](https://github.com/huggingface/transformers/tree/v4.6.1/examples/pytorch/language-modeling/run_clm.py) from the transformers examples to fine-tune GPT-2 (gpt2-medium) on SST-2. The loss is that of causal language modeling.

```shell
DATASET=SST-2
TRAIN_FILE=$DATASET/train.txt
VALIDATION_FILE=$DATASET/validation.txt
MODEL=gpt2-medium
MODEL_DIR=model/$MODEL-$DATASET

python3 transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path $MODEL \
    --train_file $TRAIN_FILE \
    --validation_file $VALIDATION_FILE \
    --do_train \
    --do_eval \
    --output_dir $MODEL_DIR \
    --overwrite_output_dir
```


- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with the script below:
```python
from nlp_toolkit.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "/your/original/training_set.csv"
aug.output_path = "/your/augmented/dataset.csv"
aug.augmenter_arguments = {'model_name_or_path': '/your/fine-tuned/model'}
aug.data_augment()
```

This data augmentation algorithm can be used in several scenarios, like model distillation.


- augmenter_arguments:
|Parameter |Type|Description |Default value |
|:---------|:---|:---------------------------------------------------|:-------------|
|"model_name_or_path"|String|Language model used to generate data; refer to [line](nlp_toolkit/preprocessing/data_augmentation.py#L181)|NA|
|"stop_token"|String|Stop token used in the input data file |[EOS](nlp_toolkit/preprocessing/utils.py#L7)|
|"num_return_sentences"|Integer|Total number of samples to generate; -1 means match the number of input samples |-1|
|"temperature"|Float|Sampling temperature for the causal language model |1.0|
|"k"|Float|Top-k sampling parameter |0.0|
|"p"|Float|Top-p (nucleus) sampling parameter |0.9|
|"repetition_penalty"|Float|Penalty applied to repeated tokens during generation |1.0|

68 changes: 68 additions & 0 deletions docs/distillation.md
@@ -0,0 +1,68 @@
# Distillation
## Script
```python
from nlp_toolkit import metrics, DistillationConfig, NLPTrainer

# Create a trainer as you would in the transformers examples, just replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
teacher_model = ...  # an existing fine-tuned model to serve as the teacher
tune_metric = metrics.Metric(name="eval_accuracy")
distillation_conf = DistillationConfig(metrics=tune_metric)
model = trainer.distill(
    distillation_config=distillation_conf, teacher_model=teacher_model
)
```

Please refer to [example](../examples/optimize/pytorch/huggingface/text-classification/distillation/run_glue.py) for the details.

## Create an instance of Metric
The Metric defines which metric will be used to measure the performance of tuned models.
- example:
```python
Metric(name="eval_accuracy")
```

Please refer to [metrics document](metrics.md) for the details.

## Create an instance of Criterion (optional)
The Criterion defines the loss criterion used in the training phase.

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String|Name of the criterion, e.g. "KnowledgeLoss", "IntermediateLayersLoss" |"KnowledgeLoss"|
|temperature|Float |Softmax temperature for KnowledgeDistillationLoss |1.0 |
|loss_types|List of String|Types of the loss terms |['CE', 'CE'] |
|loss_weight_ratio|List of Float|Weight ratio of the loss terms |[0.5, 0.5] |
|layer_mappings|List|Layer mappings for IntermediateLayersLoss |[] |
|add_origin_loss|Bool|Whether to add the original loss for IntermediateLayersLoss |False |

- example:
```python
Criterion(name='KnowledgeLoss')
```
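A fuller sketch using the arguments from the table above (the values shown are the documented defaults):

```python
from nlp_toolkit import Criterion

criterion = Criterion(
    name="KnowledgeLoss",
    temperature=1.0,               # softmax temperature for KnowledgeDistillationLoss
    loss_types=['CE', 'CE'],       # loss types, per the table above
    loss_weight_ratio=[0.5, 0.5],  # weight of each loss term
)
# Pass it to DistillationConfig via the criterion argument (see the next section)
```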

## Create an instance of DistillationConfig
The DistillationConfig contains all the information related to the model distillation behavior. Once you have created Metric and Criterion instances, you can create an instance of DistillationConfig. Metric and Criterion are optional.

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|framework |String |Which framework you use |"pytorch" |
|criterion|Criterion |Criterion used during training |"KnowledgeLoss"|
|metrics |Metric |Used to evaluate the accuracy of the tuned model; not needed for NoTrainerOptimizer|None |

- example:
```python
distillation_conf = DistillationConfig(metrics=tune_metric)
```

## Distill with Trainer
NLPTrainer inherits from transformers.Trainer, so you can create a trainer just as in the transformers examples, then distill the model with the `trainer.distill` function.
```python
from nlp_toolkit import NLPTrainer

trainer = NLPTrainer(...)
# distillation_conf and teacher_model are defined as in the sections above
model = trainer.distill(
    distillation_config=distillation_conf, teacher_model=teacher_model
)
```
16 changes: 16 additions & 0 deletions docs/metrics.md
@@ -0,0 +1,16 @@
# Metric
The Metric defines which metric will be used to measure the performance of tuned models.
- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String |Metric name returned by the evaluation function, e.g. "eval_f1", "eval_accuracy"| |
|greater_is_better|Bool |Whether a higher metric value is better (e.g. greater is better for F1). Only used for quantization.|True|
|is_relative|Bool |Used together with "criterion". If True, a "criterion" of 0.01 allows a metric drop of <1% relative; if False, the drop must be <1% absolute. Only used for quantization.|True |
|criterion |Float |Accuracy-loss tolerance used together with "is_relative"; e.g. 0.01 allows a <1% relative drop, 0.02 allows <2%. Only used for quantization.|0.01 |
|weight_ratio|Float |Used when there are multiple metrics. For example, to weigh F1 at 0.3 and accuracy at 0.7, create one Metric instance for each; the tuning target becomes f1*0.3 + accuracy*0.7. Only used for quantization.|None |

- example:
```python
from nlp_toolkit import metrics
metrics.Metric(name="eval_f1", greater_is_better=True, is_relative=True, criterion=0.01, weight_ratio=None)
```
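When tuning against more than one metric, each Metric instance carries its weight via `weight_ratio`, as described above. A sketch (passing a list of metrics mirrors the `metrics=[metric]` usage in the README):

```python
from nlp_toolkit import metrics

# The tuning target becomes eval_f1*0.3 + eval_accuracy*0.7
f1 = metrics.Metric(name="eval_f1", weight_ratio=0.3)
acc = metrics.Metric(name="eval_accuracy", weight_ratio=0.7)
# e.g. QuantizationConfig(approach="PostTrainingStatic", metrics=[f1, acc])
```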
7 changes: 4 additions & 3 deletions docs/nlp_executor.md
@@ -1,5 +1,5 @@
# NLP Executor: A baremetal inference engine for Natural Language Processing (NLP) Models
NLP Executor is an inference executor for Natural Language Processing (NLP) models, providing optimal performance through quantization and sparsity. The executor is a baremetal reference engine of NLP Toolkit and supports typical NLP models.

## Deployment Architecture
The executor supports model optimization and high-performance kernels for CPUs.
@@ -65,12 +65,13 @@
If you use `pip install -e .` to install the executor in your current folder, please…

You can skip this step if you installed it another way.

```python
from engine_py import Model
# load the model; config_path: path of the generated yaml, weight_path: path of the generated bin
model = Model(config_path, weight_path)
# use model.forward to do inference
out = model.forward([input_ids, segment_ids, input_mask])
```

The `input_ids`, `segment_ids` and `input_mask` are the input NumPy arrays of a BERT model, each of shape (batch_size, seq_len).
Note that `out` is a list containing the BERT model's output NumPy data (`out=[output numpy data]`).
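A minimal sketch of preparing dummy inputs of that shape (the file names and the int32 dtype are assumptions; check the dtypes your model expects):

```python
import numpy as np
from engine_py import Model

# Paths produced by the executor's model compilation step (illustrative names)
config_path = "conf.yaml"
weight_path = "model.bin"

batch_size, seq_len = 1, 128
# Dummy BERT-style inputs, each of shape (batch_size, seq_len)
input_ids = np.random.randint(0, 30522, size=(batch_size, seq_len), dtype=np.int32)  # 30522: bert-base vocab size
segment_ids = np.zeros((batch_size, seq_len), dtype=np.int32)
input_mask = np.ones((batch_size, seq_len), dtype=np.int32)

model = Model(config_path, weight_path)
out = model.forward([input_ids, segment_ids, input_mask])  # out = [output numpy data]
```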
19 changes: 19 additions & 0 deletions docs/objectives.md
@@ -0,0 +1,19 @@
# Objective
To evaluate the status of a specific model during tuning, we need general objectives that measure the status of different models.

NLP Toolkit supports optimized low-precision recipes for deep learning models to achieve product objectives, such as inference performance and memory usage, under expected accuracy criteria. It supports the built-in objectives listed in [INC](https://github.com/intel-innersource/frameworks.ai.lpot.intel-lpot/blob/master/docs/objective.md#built-in-objective-support-list).

- arguments:
|Argument |Type |Description |Default value |
|:----------|:----------|:-----------------------------------------------|:----------------|
|name |String |The objective name in [INC](https://github.com/intel-innersource/frameworks.ai.lpot.intel-lpot/blob/master/docs/objective.md#built-in-objective-support-list), e.g. "performance", "modelsize"| |
|greater_is_better|Bool |Whether a higher value is better, e.g. greater is better for performance, but lower is better for modelsize|True|
|weight_ratio|Float |Used when there are multiple objectives. For example, to focus on both performance and model size, create one Objective instance for each and set their weight proportions|None |

- example:
```python
from nlp_toolkit import objectives
objectives.Objective(name="performance", greater_is_better=True, weight_ratio=None)
```

- Built-in Objective instances: performance, modelsize.
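For example, weighting the two built-in objectives against each other (a sketch; the list form mirrors `objectives=[objectives.performance]` in the README):

```python
from nlp_toolkit import objectives

# Favor inference performance over model size during tuning
perf = objectives.Objective(name="performance", greater_is_better=True, weight_ratio=0.8)
size = objectives.Objective(name="modelsize", greater_is_better=False, weight_ratio=0.2)
# e.g. QuantizationConfig(approach="PostTrainingStatic", objectives=[perf, size])
```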