# ExLlama

A rewrite of the HF transformers implementation of Llama with the following goals, among others:

* Designed for use with quantized weights
* Memory-efficient inference (not just attention)
* Mapping across multiple devices
* Built-in (multi) LoRA support
* Companion library of funky sampling functions

Disclaimer: This is currently a preview of a work in progress. Or maybe a proof of concept. Either way, it will all
change, a lot. More will be added, much will be removed, etc. Don't use this yet.

## Dependencies

This list might be incomplete:

* `torch` tested on 2.0.0 with cu117
* `xformers` tested on 0.0.18 (or, comment out the xformers attention stuff which doesn't work anyway)
* `safetensors` 0.3.1
* `transformers` 4.28.0 (only for LlamaTokenizer, which will be removed soon)
* `gptq-llama` tested on 0.2, commit @eaa9955d8700dc8566f0c443054233e9c4503f66
* `sentencepiece`

These shouldn't be needed but PyCharm says they're referenced somewhere:

* `colorama`
* `numpy`

## Some preliminary results

|                             | Model     | Cache    | Inference | Total     | Max (actual) | Speed 1   | Speed 2  |
|-----------------------------|-----------|----------|-----------|-----------|--------------|-----------|----------|
| 7B 4bit 128g, HF            | 4,859 MB  | -        | 4,080 MB  | 8,940 MB  | 10,712 MB    | 2,342 t/s | 31 t/s   |
| 13B 4bit 128g, HF           | 8,393 MB  | -        | 6,509 MB  | 14,902 MB | 18,268 MB    | 1,267 t/s | 25 t/s   |
| 30B 4bit 128g, HF           | 19,071 MB | -        | OoM       | OoM       | OoM          | OoM       | OoM      |
|                             |           |          |           |           |              |           |          |
| 7B 4bit 128g, HF, 16-bit    | 3,796 MB  | -        | 2,252 MB  | 6,058 MB  | 7,670 MB     | 2,991 t/s | 31 t/s   |
| 13B 4bit 128g, HF, 16-bit   | 7,033 MB  | -        | 3,530 MB  | 10,563 MB | 12,370 MB    | 2,225 t/s | 25 t/s   |
| 30B 4bit 128g, HF, 16-bit * | 17,062 MB | -        | 3,689 MB  | 20,715 MB | 22,734 MB    | 996 t/s   | 17 t/s   |
|                             |           |          |           |           |              |           |          |
| 7B 4bit 128g, ExLlama       | 3,611 MB  | 1,023 MB | 966 MB    | 5,600 MB  | 7,062 MB     | 2,258 t/s | 66 t/s   |
| 13B 4bit 128g, ExLlama      | 6,827 MB  | 1,600 MB | 1,333 MB  | 9,760 MB  | 11,270 MB    | 1,498 t/s | 51 t/s   |
| 30B 4bit 128g, ExLlama **   | 17,036 MB | 3,119 MB | 893 MB    | 21,048 MB | 22,514 MB    | 150 t/s   | 14 t/s   |

*) 1024 tokens only. OoMs at the full context length.

**) Only quantized matmul, hence the lower speed. Could run at full speed on a 24 GB GPU, but I'd have to close my
podcasts, so...

All results (except for row #6) are for a 1920-token sequence (speed 1) grown to 2048 tokens one token at a time
(speed 2). The first six rows use the standard implementation in Transformers, loaded in 4-bit mode more or less
following the methods in [this repo](https://github.com/johnsmith0031/alpaca_lora_4bit). The last three rows use the
new implementation.
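
To make the two speed columns concrete, below is a minimal sketch of how they can be measured. It is not ExLlama's
actual benchmark code: the model is a tiny stand-in module rather than a real quantized Llama, and the interpretation
(consistent with the numbers above) is that speed 1 is the rate of pushing the initial 1920-token sequence through the
model, while speed 2 is the rate while the sequence grows to 2048 tokens, one token at a time.

```python
import time
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Placeholder standing in for a real model forward pass; sizes are deliberately tiny.
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = ToyModel()
ids = torch.randint(0, 1000, (1, 1920))

# Speed 1: one pass over the full 1920-token sequence
start = time.time()
logits = model(ids)
speed_1 = ids.shape[-1] / (time.time() - start)

# Speed 2: extend from 1920 to 2048 tokens, one token per step. Only the new
# token is fed to the (stateless) toy model here; a real cache-backed model
# would reuse the stored keys/values for all the earlier tokens.
start = time.time()
for _ in range(2048 - 1920):
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat((ids, next_id), dim=-1)
    logits = model(next_id)
speed_2 = (2048 - 1920) / (time.time() - start)

print(f"speed 1: {speed_1:.0f} t/s, speed 2: {speed_2:.0f} t/s")
```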
* **Model** is the base VRAM usage of each model before any inference is run.
* **Cache** is the size of the attention cache, which the new model pre-allocates in full rather than concatenating to
it on every inference step; the concatenation approach turns out to be very wasteful (see the sketch after this list).
* **Inference** is peak usage measured during inference. This is considerably higher than it should be right now,
because quantized parameters are converted back into floats for large enough tensors, so the much faster PyTorch
matmul can be used instead of the quant-cuda version. I'm hopeful this can be optimized a lot. Might look at fused
matmul with Triton.
* **Total** sums up the VRAM the model *should* be using, except for...
* **Max** is the actual VRAM usage as reported by `nvidia-smi`. Apparently PyTorch adds a bit of overhead for CUDA
kernels and whatnot. It seems very unpredictable, so maybe the behavior could be tweaked.
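
The cache and max-usage points are easier to see in code. The sketch below is not ExLlama's actual implementation (the
shapes and sequence length are made up for illustration); it contrasts growing a key/value cache with `torch.cat` on
every step against writing into a cache pre-allocated to the full sequence length, and reports the peak VRAM of each
approach via `torch.cuda.max_memory_allocated`. Whatever the numbers come out to, `nvidia-smi` will report more, since
the CUDA context and PyTorch's caching allocator add overhead on top of what the tensors themselves occupy.

```python
import torch

device = "cuda"
batch, heads, head_dim, max_seq_len = 1, 32, 128, 2048

def peak_mb():
    # Peak VRAM occupied by tensors since the last reset, in megabytes
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2

# Approach 1: concatenate on every step. Each torch.cat allocates a new,
# slightly larger tensor, so at the peak the old and new cache coexist.
torch.cuda.reset_peak_memory_stats(device)
cache = torch.zeros(batch, heads, 0, head_dim, dtype=torch.float16, device=device)
for _ in range(max_seq_len):
    new_kv = torch.zeros(batch, heads, 1, head_dim, dtype=torch.float16, device=device)
    cache = torch.cat((cache, new_kv), dim=2)
print(f"concatenated cache, peak: {peak_mb():.0f} MB")

del cache
torch.cuda.empty_cache()

# Approach 2: pre-allocate once. The cache never moves; each step just writes
# into the next slot, so the peak stays at the size of the full cache.
torch.cuda.reset_peak_memory_stats(device)
cache = torch.zeros(batch, heads, max_seq_len, head_dim, dtype=torch.float16, device=device)
for pos in range(max_seq_len):
    new_kv = torch.zeros(batch, heads, 1, head_dim, dtype=torch.float16, device=device)
    cache[:, :, pos:pos + 1, :] = new_kv
print(f"pre-allocated cache, peak: {peak_mb():.0f} MB")

# What nvidia-smi shows is larger still: it includes the CUDA context plus
# everything the caching allocator has reserved, not just live tensors.
print(f"reserved by the allocator: {torch.cuda.memory_reserved(device) / 1024 ** 2:.0f} MB")
```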

## Todo

- [ ] Write to-do list