🚀 The feature, motivation and pitch
Currently, ExecuTorch supports a single file format, ‘PTE’. A PTE file contains everything required to execute the model: instructions, delegated blobs, and constant weights.
If two PTE files are based on a common model, there is currently no way for them to share weights or other data. A system that downloads both PTE files must duplicate that data on disk. Loading has the same problem: even with enough disk space, loading both PTE files at the same time duplicates the data in RAM. For very large models, this can mean gigabytes of duplication, which is often not feasible on edge systems with constrained disk space and RAM.
Note: this doc covers backend data separation. For backend weight sharing, please see: [RFC] Enable Weight Sharing across a single Backend
RFC (Optional)
Scope
Assumptions
Goals
Non-goals
Overview
Data separation is a proposed feature that allows parts of the PTE file to live in separate, shareable files. This unblocks data sharing between separate PTE files.
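To make the proposed split concrete, here is a minimal, purely illustrative Python sketch: the program file keeps instructions and delegate blobs but refers to large tensors by name, and those named blobs live in separate files that several programs can point at. The types and field names below (ProgramFile, ExternalDataFile, named_data_refs) are hypothetical stand-ins, not ExecuTorch APIs.

```python
from dataclasses import dataclass, field

# Hypothetical model of the proposed on-disk layout; these are NOT ExecuTorch
# types, just an illustration of which file owns which piece of data.

@dataclass
class ExternalDataFile:
    """A standalone data file holding named blobs (e.g. weights)."""
    path: str
    blobs: dict[str, bytes] = field(default_factory=dict)

@dataclass
class ProgramFile:
    """A PTE file: instructions + delegate blobs, plus *references* to
    named data that lives in external files instead of being embedded."""
    path: str
    instructions: bytes
    named_data_refs: list[str] = field(default_factory=list)

def resolve(program: ProgramFile, data_files: list[ExternalDataFile]) -> dict[str, bytes]:
    """Resolve the program's named references against the supplied external
    data files, mimicking what a runtime would do at load time."""
    resolved = {}
    for name in program.named_data_refs:
        for f in data_files:
            if name in f.blobs:
                resolved[name] = f.blobs[name]
                break
        else:
            raise KeyError(f"missing named data: {name}")
    return resolved

# Two programs can reference the same shared file, so its bytes exist once on
# disk and can be mapped once into RAM.
shared = ExternalDataFile("shared_data", {"foundation.weight": b"\x00" * 16})
pte1 = ProgramFile("pte1.pte", b"...", ["foundation.weight"])
pte2 = ProgramFile("pte2.pte", b"...", ["foundation.weight"])
print(resolve(pte1, [shared]).keys(), resolve(pte2, [shared]).keys())
```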
Example
[Diagram: PTE1 → data1 + shared_data; PTE2 → data2 + shared_data]
Note: each box is a separate file, and the arrows indicate dependencies, e.g. PTE1 requires data1 and shared_data to execute.
PTE1 and PTE2 are separate models that share data. An example use case is LoRA. Multiple LoRA programs may share the same foundation weights while being optimized for different tasks, e.g. assistant or summarization. Here, PTE1 and PTE2 contain separate LoRA programs, and ‘shared_data’ contains the foundation weights for both. For LLMs, foundation weights can be on the order of gigabytes; without sharing, PTE1 and PTE2 must each hold a copy, duplicating potentially gigabytes of data.
‘data1’ and ‘data2’ may contain LoRA adapter weights. LoRA adapter weights are usually small, on the order of megabytes, though the size varies with the degree of fine-tuning. Keeping ‘data1’ and ‘data2’ in standalone files helps with deployment efficiency: LoRA adapter weights are likely on a faster deployment cadence than the foundation weights, and deploying a smaller file OTA is quicker and less prone to failure. If the PTE/LoRA weights are small, it’s reasonable to keep them in a single file and update them together.
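As a rough back-of-the-envelope illustration of why sharing matters in the LoRA case (the sizes below are made-up placeholders, not measurements):

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Illustrative placeholder sizes, not measurements.
foundation_weights = 4 * GIB      # shared_data: foundation weights
adapter = 20 * MIB                # data1 / data2: one LoRA adapter each
program = 1 * MIB                 # instructions + delegate blobs per PTE

# Without separation: each PTE embeds its own copy of the foundation weights.
monolithic_total = 2 * (program + adapter + foundation_weights)

# With separation: the foundation weights are stored (and loaded) once.
separated_total = 2 * (program + adapter) + foundation_weights

print(f"monolithic: {monolithic_total / GIB:.2f} GiB")
print(f"separated:  {separated_total / GIB:.2f} GiB")
print(f"saved:      {(monolithic_total - separated_total) / GIB:.2f} GiB")
```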
Design
We propose new ahead-of-time APIs that provide backends with all of the graphs, across partitions and methods, that are to be lowered. This enables backends to identify components shared across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC (#8187); see the ‘AoT: Preprocess’ and ‘Runtime: NamedDataMap’ sections.
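To sketch how the pieces could fit together, here is a hypothetical Python illustration of the flow: backends register shared blobs with a blob storage service during ahead-of-time preprocessing, and look them up by key at runtime. The class and method names (NamedDataStore.add_named_data, NamedDataMap.get_data) are illustrative stand-ins for the interfaces discussed in #8187, not confirmed APIs.

```python
import hashlib

class NamedDataStore:
    """AoT-side blob storage sketch: backends hand over shared blobs during
    preprocess and get back a stable key to embed in their payload."""
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def add_named_data(self, name: str, data: bytes) -> str:
        # Deduplicate by content so two methods/partitions that hand over the
        # same tensor end up referencing a single stored blob.
        key = f"{name}-{hashlib.sha256(data).hexdigest()[:8]}"
        self._blobs.setdefault(key, data)
        return key

    def serialize(self) -> dict[str, bytes]:
        """What would ultimately be written into the shared data file."""
        return dict(self._blobs)

class NamedDataMap:
    """Runtime-side sketch: backends look up shared blobs by key during init."""
    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs

    def get_data(self, key: str) -> bytes:
        return self._blobs[key]

# AoT: two lowered graphs register the same foundation weights once.
store = NamedDataStore()
key_a = store.add_named_data("foundation.weight", b"\x01\x02\x03")
key_b = store.add_named_data("foundation.weight", b"\x01\x02\x03")
assert key_a == key_b  # shared, not duplicated

# Runtime: each backend resolves its key against the shared data.
data_map = NamedDataMap(store.serialize())
print(len(data_map.get_data(key_a)), "bytes of shared data")
```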
cc @mcr229, @iseeyuan, @dbort, @JacobSzwejbka, @tarun292