Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, especially effective on 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids). The toolkit aims to provide the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced compression-aware software optimizations and runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Accelerated end-to-end Transformer-based applications such as Stable Diffusion, GPT-J-6B, T5, and SetFit
```bash
pip install intel-extension-for-transformers
```
For more installation methods, please refer to the Installation Page.
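To confirm the package installed correctly, a minimal import check along the following lines works; the `__version__` attribute is an assumption, so fall back to `pip show intel-extension-for-transformers` if it is absent.

```python
# Minimal post-install check. The __version__ attribute is an assumption;
# if it is absent, `pip show intel-extension-for-transformers` works instead.
import intel_extension_for_transformers as itrex

print(itrex.__version__)
```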
```python
# Prepare the GLUE SST-2 dataset and tokenize it to a fixed length.
from datasets import load_dataset, load_metric
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
```
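As a quick sanity check of the preprocessing above, you can inspect one tokenized example; the `sentence` and `label` fields come from the GLUE SST-2 schema, and `input_ids` is added by the map call:

```python
# Peek at one tokenized training example (GLUE SST-2 schema).
sample = raw_datasets["train"][0]
print(sample["sentence"])        # original text
print(sample["label"])           # 0 = negative, 1 = positive
print(len(sample["input_ids"]))  # 128, matching max_length above
```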
```python
from intel_extension_for_transformers.optimization import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", config=config
)
model.config.label2id = {"NEGATIVE": 0, "POSITIVE": 1}
model.config.id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(
    model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer,
)

# Tune INT8 quantization against eval_loss (lower is better).
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

# Run the quantized model on a sample sentence.
inputs = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**inputs).logits.argmax().item()
```
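The import above also pulls in `objectives`, which lets the tuning loop trade accuracy against latency. A minimal sketch of that usage follows, continuing the snippet above; the `PostTrainingStatic` approach string and the `is_relative`/`criterion` arguments follow the toolkit's quantization docs, so treat them as assumptions to verify against your installed version.

```python
# Sketch: accuracy-driven tuning, continuing the snippet above.
# "PostTrainingStatic" and the Metric arguments (is_relative, criterion)
# are assumptions based on the toolkit's quantization docs.
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metrics.Metric(name="eval_accuracy", is_relative=True, criterion=0.01)],
    objectives=[objectives.performance],
)
model = trainer.quantize(quant_config=q_config)

# NLPTrainer inherits transformers.Trainer, so save_model() persists the
# tuned model and tokenizer for later reuse.
trainer.save_model("./quantized-sst2")
```

Here the relative criterion tolerates up to 1% accuracy loss, and the performance objective steers tuning toward the fastest configuration within that bound.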
For more quick-start samples, please refer to the Get Started Page. For more validated examples, please refer to the Support Model Matrix.
| OVERVIEW | | | |
|:---:|:---:|:---:|:---:|
| Model Compression | Neural Engine | Kernel Libraries | Examples |
| MODEL COMPRESSION | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics/Objectives | Pipeline |
| NEURAL ENGINE | | | |
| Model Compilation | Custom Pattern | Deployment | Profiling |
| KERNEL LIBRARIES | | | |
| Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark |
| ALGORITHMS | | | |
| Length Adaptive | Data Augmentation | | |
| TUTORIALS AND RESULTS | | | |
| Tutorials | Supported Models | Model Performance | Kernel Performance |
- Blog published on Medium: MLefficiency — Optimizing transformer models for efficiency (Dec 2022)
- NeurIPS'2022: Fast DistilBERT on CPUs (Nov 2022)
- NeurIPS'2022: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM (Nov 2022)
- Blog published by Cohere: Top NLP Papers—November 2022 (Nov 2022)
- Blog published by Alibaba: Deep learning inference optimization for Address Purification (Aug 2022)
- NeurIPS'2021: Prune Once for All: Sparse Pre-Trained Language Models (Nov 2021)