Before you begin, make sure you install all necessary libraries by running:
pip install "optimum-onnx[onnxruntime]"@git+https://github.com/huggingface/optimum-onnx.git
If you want to use the GPU version of ONNX Runtime, make sure the CUDA and cuDNN requirements are satisfied, then install the additional dependencies by running:
pip install "optimum-onnx[onnxruntime-gpu]"@git+https://github.com/huggingface/optimum-onnx.git
To avoid conflicts between onnxruntime and onnxruntime-gpu, make sure the package onnxruntime is not installed by running pip uninstall onnxruntime prior to installing Optimum.
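As a quick sanity check before installing, the following is a minimal sketch that reports whether both distributions are present in the current environment. It relies only on the standard library; the helper names installed_runtimes and has_conflict are hypothetical and not part of Optimum.

```python
from importlib import metadata

# Distribution names as published on PyPI.
CPU_PACKAGE = "onnxruntime"
GPU_PACKAGE = "onnxruntime-gpu"


def installed_runtimes(candidates=(CPU_PACKAGE, GPU_PACKAGE)):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed, nothing to record
    return found


def has_conflict(found):
    """True when both the CPU and GPU distributions are present."""
    return CPU_PACKAGE in found and GPU_PACKAGE in found


if __name__ == "__main__":
    found = installed_runtimes()
    print("Installed ONNX Runtime packages:", found)
    if has_conflict(found):
        print("Both packages are installed; consider running: pip uninstall onnxruntime")
```

The check looks at installed distribution metadata rather than importing onnxruntime, since both distributions provide the same import name.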
It is possible to export 🤗 Transformers, Diffusers, Timm and Sentence Transformers models to the ONNX format and perform graph optimization as well as quantization easily:
optimum-cli export onnx --model meta-llama/Llama-3.2-1B onnx_llama/
The model can also be optimized and quantized with onnxruntime.
For more information on the ONNX export, please check the documentation.
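The export can also be triggered from Python instead of the command line. The sketch below assumes the export=True argument of ORTModelForCausalLM.from_pretrained, which converts a PyTorch checkpoint on the fly; the wrapper function export_to_onnx is a hypothetical helper introduced here for illustration.

```python
def export_to_onnx(model_id: str, output_dir: str) -> None:
    """Export a Transformers causal LM to ONNX and save it to output_dir."""
    # Imported lazily so this module can be loaded without optimum installed.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    # export=True converts the PyTorch checkpoint to ONNX while loading.
    model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Save the ONNX model and the tokenizer files side by side.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling export_to_onnx("meta-llama/Llama-3.2-1B", "onnx_llama/") should roughly mirror the CLI command above, although the exact files written may differ between the two paths.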
Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner using ONNX Runtime in the backend:
from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM
- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
More details on how to run ONNX models with ORTModelForXXX classes here.
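The pipeline is not the only entry point: the ORT model can also be driven through generate directly, as with a regular Transformers model. This is a minimal sketch reusing the checkpoints from the example above; the function generate_text and the max_new_tokens default are illustrative assumptions.

```python
def generate_text(prompt: str, max_new_tokens: int = 20) -> str:
    """Generate a continuation of prompt with an ONNX Runtime backed model."""
    # Imported lazily so this module can be loaded without optimum installed.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    model = ORTModelForCausalLM.from_pretrained(
        "onnx-community/Llama-3.2-1B", subfolder="onnx"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

    # Tokenize to PyTorch tensors, generate, then decode the first sequence.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, generate_text("He never went out without a book under his arm") returns the prompt followed by the generated continuation as a single string.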