For better performance, use the "--use_cache_list" export arg (note that it does not work with pybindings). You can also set "--target_size", which splits linear layers into smaller sizes for the ANE (the default is no splitting). This can have a substantial impact on performance. For example, setting "--target_size" to 1024 for Llama1B, we see a 1.34x increase in inference speed on an M1 Pro (although loading time increases). We need further experiments to tune this.
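As a rough illustration of what "--use_cache_list" changes, here is a minimal sketch of the two KV-cache layouts. The tensor shapes and dimension names below are assumptions for illustration only, not the exact ones produced by the export script.

```python
import torch

# Illustrative dimensions only (assumed, not the exporter's exact values).
# cache_len follows the default cache_size = max_seq_length - seq_length (1024 - 64).
n_layers, batch, n_kv_heads, cache_len, head_dim = 16, 1, 8, 960, 64

# Default: the K (or V) caches for all layers are packed into one 5D tensor.
k_cache_packed = torch.zeros(n_layers, batch, n_kv_heads, cache_len, head_dim)

# With --use_cache_list: one 4D tensor per layer, passed as a list.
k_cache_list = [
    torch.zeros(batch, n_kv_heads, cache_len, head_dim) for _ in range(n_layers)
]
```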
The runner is written in Python and is only intended to serve as an example of how the model inputs should be processed; it is not performant.
Run model with:

```
python run.py -m /path/to/model.pte -p /path/to/params.json -t /path/to/tokenizer.model --seq_length 64 --max_seq_length 1024 --prompt "Once upon a time," --n_steps 512
```
The model here is based on a "sliding" cache, where old tokens are evicted from the cache. There is no actual sliding in the implementation, though.
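To make the eviction behavior concrete, here is a minimal, self-contained sketch of a fixed-size cache, assuming eviction simply keeps the most recent entries. It is an illustration of the behavior described above, not the model's actual implementation.

```python
import torch

def append_to_cache(cache: torch.Tensor, new_kv: torch.Tensor) -> torch.Tensor:
    """Append new_kv along the sequence dimension, evicting the oldest entries
    so the cache never grows beyond its fixed length. Illustrative shapes:
    cache is (batch, heads, cache_size, head_dim),
    new_kv is (batch, heads, seq_length, head_dim)."""
    cache_size = cache.shape[2]
    combined = torch.cat([cache, new_kv], dim=2)
    # Keep only the most recent cache_size positions; evicted tokens
    # no longer participate in attention.
    return combined[:, :, -cache_size:, :]

# Example: a cache of 8 positions, appending 3 new tokens per step.
cache = torch.zeros(1, 2, 8, 4)
new_kv = torch.randn(1, 2, 3, 4)
cache = append_to_cache(cache, new_kv)  # the 3 oldest positions are dropped
```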
## Export args
* seq_length: the number of tokens processed by the model. Sequences shorter than seq_length must be padded, and longer sequences must be chunked.
* max_seq_length: the maximum number of context tokens that can be processed.
* cache_size: the size of the KV cache sequences. This parameter is optional, and defaults to max_seq_length - seq_length. If a smaller cache_size is used, older tokens are evicted from the cache and no longer play a role in attention. For example, if max_seq_length=1024, but cache_size is 512, the model can generate up to 1024 tokens, but only the current tokens and the previous 512 will participate in attention. In terms of computation, cache_size plays a similar role to max_seq_length in models without cache eviction.
* use_cache_list: boolean option that controls whether KV caches are passed as a list of 4D tensors, one per layer, or as a single 5D tensor. (Note that use_cache_list does not work with ExecuTorch pybindings.)
* target_size: this option splits linear layers into chunks of target_size. For example, if target_size is 1024, a linear layer with (in_features=512, out_features=8192) will be split into 8 linear layers with (in_features=512, out_features=1024) and the results concatenated (see the sketch after this list). If not specified, the default is no splitting.
* max_splits: this controls the maximum number of splits for linear layers. It is only relevant if target_size is passed and defaults to 8.
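The splitting controlled by target_size and max_splits can be pictured with the sketch below. It is a simplified stand-in for the exporter's actual transformation, assuming the split happens along out_features and the per-chunk outputs are concatenated back together; the helper name split_linear is made up for this example.

```python
import torch
import torch.nn as nn

def split_linear(linear: nn.Linear, target_size: int, max_splits: int = 8) -> nn.ModuleList:
    """Split a Linear along out_features into chunks of roughly target_size,
    capped at max_splits pieces. Simplified illustration only."""
    out_features = linear.out_features
    n_splits = min(max(1, -(-out_features // target_size)), max_splits)  # ceil, then cap
    chunk = -(-out_features // n_splits)  # ceil division
    pieces = nn.ModuleList()
    start = 0
    while start < out_features:
        end = min(start + chunk, out_features)
        piece = nn.Linear(linear.in_features, end - start, bias=linear.bias is not None)
        piece.weight.data.copy_(linear.weight.data[start:end])
        if linear.bias is not None:
            piece.bias.data.copy_(linear.bias.data[start:end])
        pieces.append(piece)
        start = end
    return pieces

# Example: (in_features=512, out_features=8192) with target_size=1024 -> 8 pieces.
big = nn.Linear(512, 8192)
pieces = split_linear(big, target_size=1024)
x = torch.randn(1, 512)
y = torch.cat([p(x) for p in pieces], dim=-1)
assert torch.allclose(y, big(x), atol=1e-5)
```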
## Llama1B on iPhone 15
We are actively experimenting with different settings, but here are ones we've found that work well on iPhone 15 Pro for Llama1B: