Add readme for other backends
Differential Revision: D64997867

Pull Request resolved: pytorch#6556
cccclai authored Oct 29, 2024
1 parent 47bca20 commit 461d61d
Showing 3 changed files with 31 additions and 1 deletion.
7 changes: 6 additions & 1 deletion examples/models/llama/README.md
@@ -136,6 +136,8 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus
</em>
</p>

[Please visit this section to try it on non-CPU backends, including CoreML, MPS, Qualcomm HTP, and MediaTek](non_cpu_backends.md).

# Instructions

## Tested on
@@ -242,6 +244,9 @@ You can export and run the original Llama 3 8B instruct model.

Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
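
Since the full export command referenced above is collapsed in this diff view, here is a minimal sketch of an invocation with this flag (the checkpoint and params paths are placeholders; the remaining flags mirror the MPS example added later in this commit, minus the backend selector):
```
# Sketch only: llama3.pt and params.json are placeholder paths.
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```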


If you're interested in deploying on non-CPU backends, [please refer to the non-CPU backends section](non_cpu_backends.md).

## Step 3: Run on your computer to validate

1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -261,7 +266,7 @@ You can export and run the original Llama 3 8B instruct model.
cmake --build cmake-out -j16 --target install --config Release
```
Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the session of Common Issues and Mitigations below for solutions.
Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the section of Common Issues and Mitigations below for solutions.
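
   The configure step that precedes the build command above is collapsed in this diff; as a rough sketch using only standard CMake placeholders (the real ExecuTorch options live in the CMakeLists.txt linked in step 1):
   ```
   # Generic configure sketch -- standard CMake flags only; consult the linked
   # CMakeLists.txt for the actual ExecuTorch build options.
   cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out -Bcmake-out .
   ```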

2. Build llama runner.
1 change: 1 addition & 0 deletions examples/models/llama/UTILS.md
@@ -37,6 +37,7 @@ For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML
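
As a concrete sketch, the two flags can be combined in a single export invocation (this mirrors the CoreML example added in non_cpu_backends.md below; the checkpoint and params paths are placeholders):
```
# Placeholder paths; combines the iOS 18 optimizations with 4-bit weight quantization.
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```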

To deploy the larger 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensor format to state dict

24 changes: 24 additions & 0 deletions examples/models/llama/non_cpu_backends.md
@@ -0,0 +1,24 @@

# Running Llama 3/3.1 8B on non-CPU backends

### QNN
Please follow [the instructions](https://pytorch.org/executorch/stable/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html) to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.

### MPS
Export:
```
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```

After exporting the MPS model to a .pte file, the [iOS LLAMA](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embeddings to reduce the model size.

### CoreML
Export:
```
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```

After exporting the CoreML model to a .pte file, please [follow the instructions to build the llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama#step-3-run-on-your-computer-to-validate) with the CoreML flags enabled, as described in those instructions.
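
As a rough sketch of what "CoreML flags enabled" might look like at configure time, assuming the `EXECUTORCH_BUILD_COREML` CMake option (verify the option name against the linked instructions):
```
# EXECUTORCH_BUILD_COREML is an assumed option name -- confirm it in the
# linked build instructions before use.
cmake -DEXECUTORCH_BUILD_COREML=ON -DCMAKE_BUILD_TYPE=Release -Bcmake-out .
```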

### MTK
Please [follow the instructions](https://github.com/pytorch/executorch/tree/main/examples/mediatek#llama-example-instructions) to deploy Llama 3 8B to an Android phone with a MediaTek chip.
