
Commit 5c5fcaa

Initial commit

0 parents, commit 5c5fcaa

12 files changed: +1823 −0 lines

LICENSE

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# ExLlama

A rewrite of the HF transformers implementation of Llama, with the following goals, among others:

* Designed for use with quantized weights
* Memory-efficient inference (not just attention)
* Mapping across multiple devices
* Built-in (multi) LoRA support
* Companion library of funky sampling functions

Disclaimer: This is currently a preview of a work in progress, or maybe a proof of concept. Either way, it will all change, a lot. More will be added, much will be removed, etc. Don't use this yet.

## Dependencies
This list might be incomplete (a quick version check is sketched below):

* `torch`, tested on 2.0.0 with cu117
* `xformers`, tested on 0.0.18 (or comment out the xformers attention code, which doesn't work anyway)
* `safetensors` 0.3.1
* `transformers` 4.28.0 (only for LlamaTokenizer, which will be removed soon)
* `gptq-llama`, tested on 0.2, commit @eaa9955d8700dc8566f0c443054233e9c4503f66
* `sentencepiece`

These shouldn't be needed, but PyCharm says they're referenced somewhere:

* `colorama`
* `numpy`
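A hypothetical helper (not part of the repo) for checking what's installed against the versions above; the distribution names are assumed to match the names listed, which may not hold for `gptq-llama`:

```python
# Sketch: report installed versions of the dependencies listed above.
import importlib.metadata as metadata

packages = ("torch", "xformers", "safetensors", "transformers",
            "sentencepiece", "gptq-llama")  # the "gptq-llama" name is a guess

for pkg in packages:
    try:
        print(f"{pkg:>14}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg:>14}: not installed (or published under another name)")
```
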
## Some preliminary results
|                             | Model     | Cache    | Inference | Total     | Max (actual) | Speed 1   | Speed 2 |
|-----------------------------|-----------|----------|-----------|-----------|--------------|-----------|---------|
| 7B 4bit 128g, HF            | 4,859 MB  | -        | 4,080 MB  | 8,940 MB  | 10,712 MB    | 2,342 t/s | 31 t/s  |
| 13B 4bit 128g, HF           | 8,393 MB  | -        | 6,509 MB  | 14,902 MB | 18,268 MB    | 1,267 t/s | 25 t/s  |
| 30B 4bit 128g, HF           | 19,071 MB | -        | OoM       | OoM       | OoM          | OoM       | OoM     |
|                             |           |          |           |           |              |           |         |
| 7B 4bit 128g, HF, 16-bit    | 3,796 MB  | -        | 2,252 MB  | 6,058 MB  | 7,670 MB     | 2,991 t/s | 31 t/s  |
| 13B 4bit 128g, HF, 16-bit   | 7,033 MB  | -        | 3,530 MB  | 10,563 MB | 12,370 MB    | 2,225 t/s | 25 t/s  |
| 30B 4bit 128g, HF, 16-bit * | 17,062 MB | -        | 3,689 MB  | 20,715 MB | 22,734 MB    | 996 t/s   | 17 t/s  |
|                             |           |          |           |           |              |           |         |
| 7B 4bit 128g, ExLlama       | 3,611 MB  | 1,023 MB | 966 MB    | 5,600 MB  | 7,062 MB     | 2,258 t/s | 66 t/s  |
| 13B 4bit 128g, ExLlama      | 6,827 MB  | 1,600 MB | 1,333 MB  | 9,760 MB  | 11,270 MB    | 1,498 t/s | 51 t/s  |
| 30B 4bit 128g, ExLlama **   | 17,036 MB | 3,119 MB | 893 MB    | 21,048 MB | 22,514 MB    | 150 t/s   | 14 t/s  |

*) 1024 tokens only; OoMs at the full context length.

**) Only quantized matmul, hence the lower speed. Could run at full speed on a 24 GB GPU, but I'd have to close my podcasts, so...

All results (except for #6) are for 1920-token sequence lengths (speed 1) grown to 2048 tokens one token at a time (speed 2). The first six rows use the standard implementation in Transformers, loaded in 4-bit mode more or less following the methods in [this repo](https://github.com/johnsmith0031/alpaca_lora_4bit). The last three use the new implementation.

* **Model** is the base VRAM usage of each model before any inference is run.
* **Cache** is the size of the cache, which the new implementation pre-allocates in full to avoid concatenating the cache on every inference step, which turns out to be very wasteful (see the first sketch after this list).
* **Inference** is peak usage measured during inference. This is considerably higher than it should be right now, because for large enough tensors the quantized parameters are converted back into floats so the much faster PyTorch matmul can be used instead of the quant-cuda version (see the second sketch). I'm hopeful this can be optimized a lot; fused matmul with Triton is one option to look at.
* **Total** sums up the VRAM the model *should* be using, except for...
* **Max**, which is the actual VRAM usage as reported by `nvidia-smi`. Apparently PyTorch adds a bit of overhead for CUDA kernels and whatnot. It seems very unpredictable, so maybe the behavior could be tweaked.
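
The cache point is easy to see in isolation. Below is a sketch of the general idea (not ExLlama's actual code, and the shapes are made up): growing the key/value cache with `torch.cat` copies everything accumulated so far on every step, while writing into a buffer pre-allocated for the full context does not.

```python
import torch

batch, heads, head_dim, max_seq_len = 1, 32, 128, 2048

# Wasteful pattern: re-concatenate the cache every step.
# Each torch.cat allocates a new tensor and copies all previous entries.
cache = torch.empty(batch, heads, 0, head_dim)
for _ in range(4):
    new_kv = torch.randn(batch, heads, 1, head_dim)
    cache = torch.cat([cache, new_kv], dim=2)

# Pre-allocated pattern: reserve the full context length up front, write in place.
cache = torch.zeros(batch, heads, max_seq_len, head_dim)
pos = 0
for _ in range(4):
    new_kv = torch.randn(batch, heads, 1, head_dim)
    cache[:, :, pos:pos + 1, :] = new_kv  # no reallocation, old entries untouched
    pos += 1
```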
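And a simplified sketch of the dequantize-and-matmul fallback behind the inflated **Inference** numbers. Per-group int8 quantization stands in here for the actual 4-bit GPTQ format, the function names are made up, and the real code dispatches between this path and the quant-cuda kernel based on tensor size; the point is the temporary float copy of the weights.

```python
import torch

GROUP = 128  # quantization group size (same grouping idea as "128g" above)

def quantize(weight: torch.Tensor):
    # weight: (out_features, in_features); symmetric int8, one scale per input group.
    w = weight.reshape(weight.shape[0], -1, GROUP)
    scales = w.abs().amax(dim=-1, keepdim=True) / 127.0
    q = (w / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales

def dequant_matmul(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor):
    # The temporary float copy of the weights is the extra VRAM showing up
    # under "Inference"; a fused kernel (e.g. in Triton) would avoid it.
    w = (q.float() * scales).reshape(q.shape[0], -1)  # (out_features, in_features)
    return x @ w.t()

weight = torch.randn(4096, 4096)
q, scales = quantize(weight)
x = torch.randn(8, 4096)
print(dequant_matmul(x, q, scales).shape)  # torch.Size([8, 4096])
```
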
## Todo

- [ ] Write to-do list
