Feature request
Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.
Ideally, users can have an e2e experience: load a pretrained transformers model from Hugging Face, export and lower it to ExecuTorch, and get reasonable performance out-of-the-box.
For example:
- Load a model with `StaticCache`:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True,
        "cache_implementation": "static",
        "max_cache_length": 128,
    },  # Mandatory to set ONLY for the "Export to ExecuTorch" workflow, optional in other use-cases
)
```
- Then export the model with `StaticCache`:

```python
from transformers.integrations.executorch import convert_and_export_with_cache

exported_program = convert_and_export_with_cache(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)
```
And then further lower the exported program to ExecuTorch with delegates for performance:

```python
executorch_m = lower_to_executorch(
    exported_program,
    recipes="xnnpack_fp32",  # Delegate to the XNNPACK backend
)
# The lowered artifact can be saved into a `.pte` binary format for integration and distribution.
```
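The `lower_to_executorch` helper above is illustrative; the recipe-based API is still being designed. As a rough sketch, lowering and serialization with today's ExecuTorch APIs could look like the following (import paths may differ across ExecuTorch versions; `exported_program` is the `torch.export` artifact produced above):

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

edge = to_edge(exported_program)              # convert the exported program to the Edge dialect
edge = edge.to_backend(XnnpackPartitioner())  # delegate supported subgraphs to the XNNPACK backend
et_program = edge.to_executorch()             # lower to the ExecuTorch program format

# Serialize to a `.pte` file for on-device deployment.
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```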
With that, you get an on-device model with reasonable performance to start from. From there, still within the ExecuTorch stack, you can easily tailor the experience to your use case for even better performance! Note that ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, Arm Ethos-U, Vulkan GPU, and more. You can learn more by reading our tutorial.
- Use the exported/lowered artifact for inference:

```python
# The lowered artifact can run on a local device in the ExecuTorch runtime, in C++ or via pybind,
# providing the same experience as running inference with the eager model on a server.
generate(model=executorch_m, prompt="Hello world")  # Generates up to the maximal sequence/cache length
```
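For a picture of what the pybind path could look like, here is a minimal sketch that loads the serialized `.pte` and runs a single decoding step. The file name and the `(input_ids, cache_position)` signature are assumptions based on the static-cache export wrapper, not a final API:

```python
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load the serialized program produced by the lowering step (file name is an example).
et_module = _load_for_executorch("model.pte")

# One decoding step; the exported forward is assumed to take (input_ids, cache_position).
input_ids = torch.tensor([[1]], dtype=torch.long)
cache_position = torch.tensor([0], dtype=torch.long)
logits = et_module.forward((input_ids, cache_position))[0]
next_token = torch.argmax(logits[:, -1, :], dim=-1)
```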
The example workflow above shows direct integration between ExecuTorch and HF transformers models. Eventually this workflow could be accessible via optimum exporters-et, Transformers.js, or in ExecuTorch and torchchat.
Motivation
Unlock a whole new on-device experience of using Hugging Face models w/o leaving the PyTorch ecosystem (ExecuTorch is native PyTorch!).
Issue Tracker
Fundamental
- Make `StaticCache` compatible with `torch.export`: PR Make static cache compatible with torch.export #32168
- Make Cache statically configurable at model construction time #32500: PR Make StaticCache configurable at model construct time #32830
- Fix get_usable_length for StaticCache #32503
- Support dynamic length slicing in `StaticCache`: PR [WIP] Dynamic length in static cache #30862
- Implement `generate` (inference) for torch exported text-generation models #32504: PR Generate using exported model and enable gemma2-2b in ExecuTorch #33707
- Convert Hugging Face tokenizer files to be consumable by the C++ `llm_runner`: How to convert tokenizer of SmolLM model as accepted by executorch pytorch/executorch#6813
E2E workflow
- Umbrella task for `Optimum` enablement: Export-to-ExecuTorch via Optimum integration optimum#2128
- Umbrella task for `Transformers.js` enablement: Export-to-ExecuTorch via transformers.js integration transformers.js#1039
Optimization
- Support quantized models w/ ExecuTorch + TorchAO: Export to ExecuTorch with Quantization #34787 (see the sketch below)
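As a rough sketch of what the TorchAO side could look like (using torchao's `quantize_` flow; `model` is the eager model loaded earlier in the workflow, and int8 weight-only is just an example scheme):

```python
from torchao.quantization import quantize_, int8_weight_only

# Quantize the eager model's weights in place, then run the same
# export + lowering flow shown in the workflow above.
quantize_(model, int8_weight_only())
```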
Models
- Gemma is ExecuTorch compatible #33709: PR Generate using exported model and enable gemma2-2b in ExecuTorch #33707
- Llama is ExecuTorch compatible #32505: PR Llama3 and Llama2 are ExecuTorch compatible #34101
- CLIP is ExecuTorch compatible #32506
- Bert is ExecuTorch compatible #32507: PR Bert is ExecuTorch compatible #34424
- Bart/Wav2Vec2 is ExecuTorch compatible #32508
- TrOCR is ExecuTorch compatible #32509
- Qwen is ExecuTorch compatible #33833: PR Qwen2.5 is ExecuTorch Compatible #34102
- T5 is ExecuTorch compatible #33834: PR Export T5 (encoder-decoder) to ExecuTorch #36486
- DistillBERT is ExecuTorch compatible #33835: PR DistilBERT is ExecuTorch compatible #34475
- Albert is ExecuTorch compatible #33836: PR Albert is ExecuTorch compatible #34476
- Stable Diffusion is ExecuTorch compatible #33837
- Phi3 is ExecuTorch compatible #33838
- Mamba is ExecuTorch compatible #33839
- Olmo is ExecuTorch compatible #33840: PR Olmo is ExecuTorch Compatible #34181
- Roberta is ExecuTorch compatible #33841: PR Roberta is ExecuTorch compatible #34425
- MobileBERT is ExecuTorch compatible #33843: PR MobileBERT is ExecuTorch compatible #34473
- SmolLM is ExecuTorch Compatible #34879
- InternVL is ExecuTorch Compatible #35327
- Gemma3 is ExecuTorch compatible #37727: PR Gemma3 is Torch Exportable #37728
And more! We're ambitious about expanding model coverage massively. Please comment below if you are interested in a particular model for an on-device use case!
Even better, we warmly welcome direct contributions from the community to support exporting more models to ExecuTorch!
- Cohere2: Add Cohere2 model #35224
- OLMo2: Add OLMo November 2024 #34551
- DPT, DepthAnything & ZoeDepth: fix(DPT,Depth-Anything) `torch.export` #34103
- Whisper is ExecuTorch compatible #33842: PR Support Whisper optimum-executorch#45
- Qwen3 is ExecuTorch compatible #37844: Adding Qwen3 and Qwen3MoE #36878
Your contribution
- Co-design the "Export to ExecuTorch" workflow.
- Co-design `generate` for exported models and the integration in `Optimum`.
- Identify and fill gaps in DevX and UX.
Here is how ExecuTorch implements `generate()` for llama2/3 in eager Python and C++.
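For reference, a minimal greedy-decoding loop over a `torch.export`-ed model with a `StaticCache` could look like the sketch below. The `(input_ids, cache_position)` signature and the helper name are illustrative assumptions, not the actual transformers `generate` integration:

```python
import torch

def greedy_generate(exported_program, prompt_ids, max_new_tokens, eos_token_id=None):
    """Greedy decoding over an exported model whose StaticCache lives inside the program.

    Assumes the exported forward signature is (input_ids, cache_position) and that
    one token is fed per step, which keeps all input shapes static.
    """
    tokens = list(prompt_ids)

    # Prefill: feed the prompt one token at a time so input shapes stay static.
    for pos, tok in enumerate(tokens):
        input_ids = torch.tensor([[tok]], dtype=torch.long)
        cache_position = torch.tensor([pos], dtype=torch.long)
        logits = exported_program.module()(input_ids, cache_position)

    # Decode: append the argmax token until max_new_tokens or EOS.
    for _ in range(max_new_tokens):
        next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
        tokens.append(next_token)
        if eos_token_id is not None and next_token == eos_token_id:
            break
        input_ids = torch.tensor([[next_token]], dtype=torch.long)
        cache_position = torch.tensor([len(tokens) - 1], dtype=torch.long)
        logits = exported_program.module()(input_ids, cache_position)

    return tokens
```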