
Export to ExecuTorch #32253

@guangy10

Feature request

Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.

Ideally, users get an e2e experience: load a pretrained transformer model from HuggingFace, export and lower it to ExecuTorch, and get reasonable performance out of the box.

For example:

  1. Load a model with StaticCache:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True, 
        "cache_implementation": "static", 
        "max_cache_length": 128,
    },  # Mandatory field to set ONLY for "Export to ExecuTorch" workflow, optional in other use-cases
)
  2. Export the model with StaticCache:
exported_program = convert_and_export_with_cache(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)

Then further lower the exported program to ExecuTorch with delegates for performance:

executorch_m = lower_to_executorch(
    exported_program,
    recipes="xnnpack_fp32",  # Delegate to the XNNPACK backend
)

# The lowered artifact can be saved into a `.pte` binary format for integration and distribution.
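
For reference, here is a minimal sketch of how the lowering and `.pte` serialization above could look with ExecuTorch's current public APIs (module paths and the XnnpackPartitioner name follow the ExecuTorch XNNPACK tutorial and may change between releases; the proposed lower_to_executorch would be a convenience wrapper around these steps):

from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# `exported_program` is the torch.export artifact produced in step 2.
edge_program = to_edge(exported_program)

# Delegate supported subgraphs to the XNNPACK backend (fp32).
edge_program = edge_program.to_backend(XnnpackPartitioner())

# Lower to the ExecuTorch program format and serialize to a `.pte` binary.
et_program = edge_program.to_executorch()
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)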

With that, you get an on-device model with reasonable performance as a starting point.

From there, still within the ExecuTorch stack, you can easily tailor the experience to your use cases, with even better performance! Note that ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, ARM Ethos-U, Vulkan GPU, and more. You can learn more by reading our tutorial.

  3. Use the exported/lowered artifact for inference:

# The lowered artifact can run on a local device in the ExecuTorch runtime, in C++ or via pybind, providing the same experience as running inference with the eager model on a server.

generate(model=executorch_m, prompt="Hello world")  # Generates up to the maximum sequence length/cache length
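
As a concrete illustration of the pybind path mentioned in the comment above, a minimal sketch could look like the following (the _load_for_executorch entry point and its module path are taken from ExecuTorch's Python bindings at the time of writing, and the inputs are assumed to match the signature exported in step 2):

import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load the `.pte` program saved after lowering (file name assumed).
et_module = _load_for_executorch("model.pte")

# Run one decoding step; the input tensors must match the signature used at
# export time (assumed here to be input_ids plus cache_position, per step 2).
input_ids = torch.tensor([[1]], dtype=torch.long)
cache_position = torch.tensor([0], dtype=torch.long)
logits = et_module.forward((input_ids, cache_position))[0]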

The example workflow above shows a direct integration between ExecuTorch and HF transformers models. Eventually, this workflow could be made accessible via optimum exporters-et, Transformers.js, or within ExecuTorch and torchchat.

Motivation

Unlock a whole new on-device experience for using HuggingFace models without leaving the PyTorch ecosystem (ExecuTorch is native PyTorch!).

Issues Tracker

Fundamental

E2E workflow

Optimization

Models

And more! We aim to expand model coverage massively. Please comment below if you are interested in a particular model for an on-device use case!

Even better, we warmly welcome direct contributions from the community to support exporting more models to ExecuTorch!

Your contribution

  1. Co-design the "Export to ExecuTorch" workflow.
  2. Co-design generate() for exported models and the integration in Optimum.
  3. Identify and fill gaps in DevX and UX.

Here is how ExecuTorch implements generate() for Llama 2/3 in eager Python and C++.
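
As a rough illustration only (not ExecuTorch's actual implementation), a greedy-decoding loop over a model exported with a static cache might look like the sketch below; exported_program is the artifact from step 2, and the assumption that the exported module takes (input_ids, cache_position) and returns logits directly mirrors the kwargs used there:

import torch

def generate(exported_program, prompt_ids, max_cache_length=128):
    """Greedy-decoding sketch for a model exported with a static KV cache."""
    model = exported_program.module()  # callable produced by torch.export
    tokens = list(prompt_ids)

    # Prefill: run the whole prompt through cache positions 0..len(prompt)-1.
    input_ids = torch.tensor([tokens], dtype=torch.long)
    cache_position = torch.arange(len(tokens), dtype=torch.long)
    logits = model(input_ids, cache_position=cache_position)

    # Decode greedily, one token per step, until the static cache is full.
    while len(tokens) < max_cache_length:
        next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
        tokens.append(next_token)
        input_ids = torch.tensor([[next_token]], dtype=torch.long)
        cache_position = torch.tensor([len(tokens) - 1], dtype=torch.long)
        logits = model(input_ids, cache_position=cache_position)

    return tokens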

cc: @amyeroberts @gante @ArthurZucker @michaelbenayoun
