(Note the script should be run from the executorch/examples/apple/coreml/llama directory.)

The runner is written in Python and is only intended to serve as an example of how the model inputs should be processed; it is not performant.
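
Because the exported model runs on fixed-size inputs, the prompt has to be padded or chunked into seq_length-token pieces before each call (see the Export args section below). Here is a minimal, purely illustrative sketch of that preprocessing; the function name and pad_id are assumptions, not the actual run.py code:

```python
def chunk_and_pad(tokens, seq_length, pad_id=0):
    # Break the prompt into seq_length-sized pieces and pad the last piece,
    # so every forward call sees exactly seq_length tokens.
    chunks = []
    for start in range(0, len(tokens), seq_length):
        chunk = tokens[start:start + seq_length]
        chunk = chunk + [pad_id] * (seq_length - len(chunk))
        chunks.append(chunk)
    return chunks

print(chunk_and_pad(list(range(10)), seq_length=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```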

Run the model with:

```
python run.py -m /path/to/model.pte -p /path/to/params.json -t /path/to/tokenizer.model --seq_length 64 --max_seq_length 1024 --prompt "Once upon a time," --n_steps 512
```
(Note the script should be run from the executorch/examples/apple/coreml/llama directory.)

The model here is based on a "sliding" cache, where old tokens are evicted from the cache. There is no actual sliding in the implementation, though.
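
To make the eviction idea concrete, here is a minimal sketch of a fixed-size key cache that keeps only the most recent cache_size entries. It is purely illustrative (the exported model does not actually shift data like this), and the tensor shapes and names are assumptions:

```python
import torch

def update_k_cache(k_cache, new_k):
    # k_cache: (batch, n_heads, cache_size, head_dim)
    # new_k:   (batch, n_heads, seq_length, head_dim)
    # Append the new keys and keep only the last cache_size positions,
    # so the oldest tokens are evicted and stop participating in attention.
    cache_size = k_cache.shape[2]
    return torch.cat([k_cache, new_k], dim=2)[:, :, -cache_size:, :]

k_cache = torch.zeros(1, 8, 512, 64)
new_k = torch.randn(1, 8, 64, 64)
k_cache = update_k_cache(k_cache, new_k)  # shape stays (1, 8, 512, 64)
```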
## Export args
* seq_length: the number of tokens processed by the model. Sequences shorter than seq_length must be padded, and sequences longer than it must be chunked.
* max_seq_length: the maximum number of context tokens that can be processed.
* cache_size: the size of the KV cache sequences. This parameter is optional, and defaults to max_seq_length - seq_length. If a smaller cache_size is used, older tokens are evicted from the cache and no longer play a role in attention. For example, if max_seq_length=1024, but cache_size is 512, the model can generate up to 1024 tokens, but only the current tokens and the previous 512 will participate in attention. In terms of computation, cache_size plays a similar role to max_seq_length in models without cache eviction.
* use_cache_list: boolean option that controls whether the KV caches are passed as a list of 4D tensors, one per layer, or as a single 5D tensor. (Note that use_cache_list does not work with ExecuTorch pybindings.)
* target_split_size: this option splits linear layers into chunks of the target size. For example, if target_split_size is 1024, a linear layer with (in_features=512, out_features=8192) will be split into 8 linear layers with (in_features=512, out_features=1024) and the results concatenated (see the sketch after this list). If not specified, the default is no splitting.
* max_splits: this controls the maximum number of splits for linear layers. It is only relevant if target_split_size is passed and defaults to 8.
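
As a rough illustration of what target_split_size and max_splits do, the sketch below splits one large nn.Linear along its output dimension into several smaller layers whose outputs are concatenated. It is an assumption-laden sketch, not the code used by the exporter:

```python
import torch
import torch.nn as nn

def split_linear(linear, target_split_size, max_splits=8):
    # Choose the number of chunks from target_split_size, capped at max_splits.
    n_splits = min(max(linear.out_features // target_split_size, 1), max_splits)
    weight_chunks = linear.weight.chunk(n_splits, dim=0)  # rows = output features
    bias_chunks = (linear.bias.chunk(n_splits) if linear.bias is not None
                   else [None] * n_splits)
    pieces = nn.ModuleList()
    for w, b in zip(weight_chunks, bias_chunks):
        piece = nn.Linear(linear.in_features, w.shape[0], bias=b is not None)
        piece.weight = nn.Parameter(w.clone())
        if b is not None:
            piece.bias = nn.Parameter(b.clone())
        pieces.append(piece)
    return pieces

layer = nn.Linear(512, 8192)
pieces = split_linear(layer, target_split_size=1024)  # 8 layers with out_features=1024
x = torch.randn(2, 512)
# Concatenating the split outputs reproduces the original layer's output.
assert torch.allclose(torch.cat([m(x) for m in pieces], dim=-1), layer(x), atol=1e-6)
```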
## Llama1B on iPhone 15
We are actively experimenting with different settings, but here are ones we've found work well for Llama1B on iPhone 15 Pro:
* Set use_cache_list
* Split linear layers with target_split_size=1024, max_splits=8
* Use seq_length=32 or seq_length=64, both of which offer reasonable tradeoffs for prefill and decode performance. seq_length=32 is better at decode and seq_length=64 is better at prefill.

In our tests, we set max_seq_length=1024, but if your application allows for it, performance can improve with max_seq_length=512 or by keeping max_seq_length=1024 and setting cache_size=512-seq_length.