Add support for exl2-quantized models #1965
Conversation
Force-pushed from f3e8eac to 8e03024.
Things seem to be working.
In general, though, I feel the PR is currently adding more indirection than necessary; every piece of logic should live in its appropriate file and nowhere else:

* `layers/tensor_parallel` is about the actual parallel logic. It can have some slight variations based on quantization, but only to change the higher-order loading logic.
* `weights.py` is all about creating the actual individual tensors required by the model. This one knows about quantization and how to shard tensors.
* `layers/gptq` holds anything else that is GPTQ specific: not loading/sharding tensors, but the init phase (scratch buffers) and the actual forwards.
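As a rough illustration of that split (hypothetical names, not the actual TGI code): the tensor-parallel layer only asks for a column-sharded weight, while the weights loader decides how each tensor is materialized and sharded per quantization scheme.

```python
import torch


class Weights:
    """Sketch of weights.py: decides how tensors are materialized and sharded,
    including quantizer-specific layouts (names here are illustrative)."""

    def get_weights_col(self, prefix: str, quantize: str) -> torch.Tensor:
        if quantize in ("gptq", "exl2"):
            # A quantizer-specific tensor set would be loaded and sharded here.
            return torch.empty(0)
        # Dense path: a plain sharded weight.
        return torch.empty(0)


class TensorParallelColumnLinear:
    """Sketch of layers/tensor_parallel: only the higher-order loading call
    varies with quantization; sharding details stay inside Weights."""

    def __init__(self, weight: torch.Tensor):
        self.weight = weight

    @classmethod
    def load(cls, prefix: str, weights: Weights, quantize: str):
        return cls(weights.get_weights_col(prefix, quantize))
```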
revision="3.0bpw", | ||
# Set max input length to avoid OOM due to extremely large | ||
# scratch buffer. | ||
max_input_length=1024, |
Do we still need that? I thought you fixed it.
I fixed it so that it no longer unconditionally uses a 4096 input length, but instead the value given by warmup. The default is probably still too much. E.g., in Llama 3 8B, for the output layer:
4096 length * 16 batch size * 128,256 pieces * 2 bytes (sizeof(float16)) ≈ 15.7 GiB
The scratch buffer is allocated for the worst case.
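Spelling out that worst-case arithmetic (the numbers are the ones from the example above, not constants from the code):

```python
# Worst-case scratch size for the output projection in the example above.
max_input_length = 4096   # default warmup input length
batch_size = 16
vocab_size = 128_256      # output "pieces"
bytes_per_element = 2     # sizeof(float16)

scratch_bytes = max_input_length * batch_size * vocab_size * bytes_per_element
print(f"{scratch_bytes / 2**30:.1f} GiB")  # -> 15.7 GiB
```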
q_invperm: Optional[torch.Tensor] = None

@property
def device(self) -> torch.device:
Let's remove that. `weights.qweight.device` is just as easy to read and clearer IMHO.
w = Exl2Weight(**tensors)
w.q_scale_max /= 256
w.q_perm = w.q_perm.short()
Why?
Can't we make the dataclass immutable (or at least treat it as such)?
IMHO, once loaded, nothing in the tensors should be modified; they should always just be passed as-is. Nothing should be required at runtime.
Moved this to a proper dataclass constructor + `__post_init__`.
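A minimal sketch of what that looks like, assuming field names along the lines of the quoted diff (the real class in the PR has more fields):

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Exl2Weight:
    q_weight: torch.Tensor
    q_scale_max: torch.Tensor
    q_perm: torch.Tensor
    q_invperm: Optional[torch.Tensor] = None

    def __post_init__(self):
        # Normalize once at construction time so nothing needs to mutate the
        # loaded tensors later at runtime.
        self.q_scale_max /= 256
        self.q_perm = self.q_perm.short()
```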
# Find the size of the scratch space.
for layer in LAYERS:
    FIXED_BYTES = max(
        FIXED_BYTES, layer.scratch_space_fixed(max_input_len=max_total_tokens)
I'm wondering why we need two loops.
Do you have a link to that logic in the original repo?
We need to know the maximum memory use across all layers to allocate the scratch buffer, before we can pass (slices of) the scratch buffer to the layers in their post-init.
Also replaced the loop with a more readable for-comprehension now.
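A sketch of that two-pass structure; `LAYERS` and `max_total_tokens` come from the quoted hunk, while the `post_init` hook, the buffer dtype, and the device are placeholders, not the actual TGI API:

```python
import torch

# Pass 1: worst-case fixed scratch size across all quantized layers, written
# as a single max() over a generator expression instead of an explicit loop.
FIXED_BYTES = max(
    layer.scratch_space_fixed(max_input_len=max_total_tokens) for layer in LAYERS
)

# Pass 2: allocate one shared scratch buffer of that size, then hand (slices
# of) it to every layer in its post-initialization step.
scratch_buffer = torch.empty(FIXED_BYTES, dtype=torch.uint8, device="cuda")
for layer in LAYERS:
    layer.post_init(scratch_buffer)  # hypothetical hook name
```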
Force-pushed from d14c046 to 4057345.
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
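For the first point, the wrapper could be something as simple as the following sketch (the class and field names are illustrative, not necessarily the ones used in the PR):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizerConfig:
    # Illustrative replacement for an untyped (bits, groupsize, ...) tuple
    # that previously had to be repacked into a dict at call sites.
    bits: int
    groupsize: int
    desc_act: bool
    checkpoint_format: str
    sym: bool
```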
This test fails somewhat regularly due to non-determinism; it primarily verifies that we correctly load a model that doesn't have `float16` as its default dtype.
LGTM
What does this PR do?
Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
Draft: needs a rebase; the exllama kernels seem non-deterministic, so logprobs sometimes change slightly?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.