| Scenario | Model | Examples | Hardware-Targeted Optimization |
|---|---|---|---|
| NLP | llama2 | Link | CPU: with ONNX Runtime optimizations for optimized FP32 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for optimized ONNX model<br>AzureML compute: with AzureML compute to fine-tune and optimize for your local GPUs |
| | mistral | Link | CPU: with Optimum conversion and ONNX Runtime optimizations and Intel® Neural Compressor static quantization for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | open llama | Link | GPU: with Optimum conversion and merging and ONNX Runtime optimizations for optimized ONNX model<br>GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity<br>AzureML compute: with Optimum conversion and merging and ONNX Runtime optimizations in AzureML<br>CPU: with Optimum conversion and merging and ONNX Runtime optimizations and Intel® Neural Compressor 4-bit weight-only quantization for optimized INT4 ONNX model |
| | phi2 | Link | CPU: with ONNX Runtime optimizations for optimized FP32/INT4 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16/INT4 ONNX model, and with PyTorch QLoRA for model fine-tuning<br>GPU: with SliceGPT for an optimized PyTorch model with sparsity |
| | falcon | Link | GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | red pajama | Link | CPU: with Optimum conversion and merging and ONNX Runtime optimizations for a single optimized ONNX model |
| | bert | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model (a minimal sketch of this flow follows the table)<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for optimized INT8 ONNX model<br>CPU: with PyTorch QAT customized training loop and ONNX Runtime optimizations for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for CUDA EP<br>GPU: with ONNX Runtime optimizations for TensorRT EP |
| | deberta | Link | GPU: optimize an AzureML registry model with ONNX Runtime optimizations and quantization |
| | gptj | Link | CPU: with Intel® Neural Compressor static/dynamic quantization for optimized INT8 ONNX model |
| Audio | whisper | Link | CPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>CPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor dynamic quantization for all-in-one ONNX model in INT8<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP16<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8 |
| | audio spectrogram transformer | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model |
| Vision | stable diffusion<br>stable diffusion XL | Link | GPU: with ONNX Runtime optimizations for DirectML EP<br>GPU: with ONNX Runtime optimizations for CUDA EP<br>Intel CPU: with OpenVINO toolkit |
| | squeezenet | Link | GPU: with ONNX Runtime optimizations for DirectML EP |
| | mobilenet | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP |
| | resnet | Link | CPU: with ONNX Runtime static/dynamic quantization for optimized INT8 ONNX model<br>CPU: with PyTorch QAT default training loop and ONNX Runtime optimizations for optimized INT8 ONNX model<br>CPU: with PyTorch QAT Lightning Module and ONNX Runtime optimizations for optimized INT8 ONNX model<br>AMD DPU: with AMD Vitis AI quantization<br>Intel GPU: with ONNX Runtime optimizations with multiple EPs |
| | VGG | Link | Qualcomm NPU: with SNPE toolkit |
| | inception | Link | Qualcomm NPU: with SNPE toolkit |
| | super resolution | Link | CPU: with ONNX Runtime pre/post-processing integration for a single ONNX model |
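
To make the recurring "export, optimize, quantize" pattern in the table concrete, here is a minimal sketch of the bert-style CPU INT8 flow using plain ONNX Runtime quantization. The checkpoint name, file paths, and export settings below are illustrative assumptions, not taken from any example above; the linked examples drive the same kind of pipeline through their own configurations.

```python
# Minimal sketch: PyTorch -> FP32 ONNX -> INT8 ONNX for CPU inference.
# Assumes torch, transformers, onnx, and onnxruntime are installed.
# "bert-base-uncased" and the output filenames are hypothetical choices.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from onnxruntime.quantization import quantize_dynamic, QuantType

model_id = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput

# 1. Export the PyTorch model to an FP32 ONNX model.
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bert_fp32.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# 2. Dynamic INT8 quantization of the weights for CPU targets.
quantize_dynamic("bert_fp32.onnx", "bert_int8.onnx", weight_type=QuantType.QInt8)
```

The rows that mention static QDQ quantization (for example, the mobilenet/QNN row) follow the same shape but additionally feed calibration data to the quantizer, since static quantization needs representative inputs to fix activation ranges ahead of time.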