
Transformers save/load compatibility and inference kernels #3

Merged

merged 111 commits into main from transformers on Feb 7, 2024

Conversation

BlackSamorez
Collaborator

@BlackSamorez BlackSamorez commented Jan 16, 2024

This PR:

  • Merges QuantizedWeight and QuantizedLinear into one class
  • Decouples QuantizedLinear creation (with empty weights) from its initialisation with KMeans
  • Changes saving to save only the final state dict
  • Adds a custom modeling_llama.py, allowing the saved state dict to be loaded with load_pretrained (see the usage sketch below)
  • Adds a conversion script for previously saved models
  • Adds a custom matmul kernel written in Triton (maybe separate it into a different PR)
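For illustration, here is roughly what the save/load flow could look like from the user's side once this lands. This is a sketch only, assuming the standard transformers `from_pretrained` entry point with `trust_remote_code`; the repository name is taken from the README table further down in this PR, purely as an example.

```python
# Sketch only: loading an AQLM-quantized checkpoint through the transformers API.
# trust_remote_code lets transformers pick up the custom modeling code shipped
# with the checkpoint; the repository name is taken from the README table below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf"  # example repo from the README

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,   # load the custom modeling code from the repo
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("AQLM is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```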

@BlackSamorez force-pushed the transformers branch 2 times, most recently from f366016 to f498eaf on January 16, 2024 at 11:49
src/aq.py Outdated

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(input, self.reconstruct_weight(), self.bias)
        # original_shape = input.shape
Collaborator

commented code; do we need it?

Collaborator Author

Removed

src/aq.py Outdated
@@ -153,7 +209,7 @@ def get_scales(self) -> torch.Tensor:
         else:  # train scale codebook only
             return self.scales_clusters.gather(1, self.scales_indices)[:, :, None, None]

-    def forward(self, selection: Union[slice, ellipsis, torch.Tensor] = ...):
+    def reconstruct_weight(self, selection: Union[slice, ellipsis, torch.Tensor] = ...):
Collaborator

I'd request that we keep it as "forward" for notebook and experiment compatibility.
Alternatively, would something like this work?

def forward(self, *args, **kwargs): return self.reconstruct_weight(*args, **kwargs)

Collaborator Author

It would make sense for this layer to behave like a normal nn.Linear layer, that is, to actually perform a forward pass on forward(). That way we won't have to change much code when replacing nn.Linear instances with it.

Collaborator

Ah, I get it now. Maybe it's best to move that to the QuantizedLinear class and keep QuantizedWeight as is?

Collaborator Author

Done, more or less.
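For reference, a minimal sketch of the split settled on above: QuantizedWeight keeps a forward that reconstructs the dequantized weight, while QuantizedLinear behaves like nn.Linear. The reconstruction below is a simplified single-codebook lookup, not AQLM's actual additive multi-codebook scheme, and all constructor arguments are assumptions.

```python
# Sketch of the QuantizedWeight / QuantizedLinear split discussed above.
# Simplified single-codebook reconstruction; not the actual AQLM implementation.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantizedWeight(nn.Module):
    """Holds codes and a codebook; reconstructs the dense weight on demand."""

    def __init__(self, codes: torch.Tensor, codebook: torch.Tensor):
        super().__init__()
        # codes: (out_features, in_features // group_size) integer indices
        # codebook: (codebook_size, group_size) learnable entries
        self.register_buffer("codes", codes)
        self.codebook = nn.Parameter(codebook)

    def reconstruct_weight(self) -> torch.Tensor:
        out_features, num_groups = self.codes.shape
        group_size = self.codebook.shape[1]
        # Look up each group's codebook entry and lay the groups out densely.
        return self.codebook[self.codes.long()].reshape(out_features, num_groups * group_size)

    def forward(self) -> torch.Tensor:
        # Kept as `forward` for notebook/experiment compatibility.
        return self.reconstruct_weight()


class QuantizedLinear(nn.Module):
    """Acts like nn.Linear so callers don't change when it replaces one."""

    def __init__(self, quantized_weight: QuantizedWeight, bias: Optional[torch.Tensor] = None):
        super().__init__()
        self.quantized_weight = quantized_weight
        self.bias = nn.Parameter(bias) if bias is not None else None

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(input, self.quantized_weight(), self.bias)
```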

Collaborator

@justheuristic left a comment

Awesome work!

While we wait for other reviews, can you please run an experiment on this branch to check that everything works and achieves roughly the same perplexity?

@BlackSamorez force-pushed the transformers branch 2 times, most recently from 2f8a838 to 0647577 on January 16, 2024 at 18:32
@BlackSamorez changed the title from "[WIP] Transformers save/load compatibility and inference kernels" to "Transformers save/load compatibility and inference kernels" on Jan 20, 2024
@AlexKoff88

It looks really awesome!

I wonder if you guys have considered extending your work to the case where there can be multiple sets of lookup tables for one weight tensor. I mean having something like two additive 8-bit lookups for a tile of 32x4096 weights. The idea is to make it more HW-friendly, so that both lookup tables can be stored in the shared memory of the GPU and the weights can be quickly prefetched before the MatMul. An 8-bit + 8-bit = 16-bit scheme to represent 8 weights should also be HW-friendly from both the unpacking and storage points of view.

What do you think?
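A toy illustration of the proposed layout (all shapes follow the numbers in the comment above; nothing here is implemented in this PR):

```python
# Toy illustration of the proposed scheme: two additive 8-bit lookup tables per
# weight tile, so 8 + 8 = 16 bits of indices encode each group of 8 weights.
import torch

out_features, in_features, group_size = 32, 4096, 8   # one 32x4096 tile
num_groups = in_features // group_size

# Two codebooks of 256 entries each: small enough to sit in GPU shared memory.
codebook_a = torch.randn(256, group_size)
codebook_b = torch.randn(256, group_size)

# One 8-bit index per codebook per group of weights.
codes_a = torch.randint(0, 256, (out_features, num_groups), dtype=torch.uint8)
codes_b = torch.randint(0, 256, (out_features, num_groups), dtype=torch.uint8)

# Dequantization: each group is the sum of one entry from each codebook.
weight = (codebook_a[codes_a.long()] + codebook_b[codes_b.long()]).reshape(
    out_features, in_features
)
print(weight.shape)  # torch.Size([32, 4096])
```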

@BlackSamorez
Collaborator Author

@AlexKoff88 Thanks!
We're working on optimizing inference for both GPUs and CPUs, and indeed, having smaller codebooks has the potential to greatly improve performance. We'll make sure to publish the code once we have reliable results in terms of both model compression quality and inference speed, so stay tuned!

from aqlm.utils import get_int_dtype


class QuantizedLinear(nn.Module):
Collaborator

Please create an issue, for later, to examine and potentially deduplicate the code here,

with the full understanding that we're about as likely to fix it as we are to never open that issue at all.

To save you time, here's one possible issue text:

In the current version, some code is duplicated between `src` and `inference_lib/src`. For instance, inference_lib/src/inference.py:QuantizedLinear resembles src/aq.py:QuantizedLinear.

If we have time, it would be nice to selectively merge some of them.

Collaborator Author

My idea was that the inference code should be completely separated from the quantisation code, so as not to break the latter, and because there isn't much overlap anyway. Those two classes serve very different purposes and share surprisingly little code.

Owner

I agree with @BlackSamorez on this one.

@@ -0,0 +1,61 @@
#include <torch/all.h>
Collaborator

Strong opinion: we need compilation instructions OR a promise that you'll add them with a fixed deadline.

If I missed the instructions somewhere, please direct me to them.

Collaborator Author

It compiles at runtime here; no deliberate compilation steps are needed.
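For context, PyTorch extensions can be JIT-compiled the first time they are imported. A minimal sketch of that mechanism follows; the file and extension names are placeholders, not necessarily what this PR uses.

```python
# Sketch of runtime (JIT) compilation of a CUDA extension with PyTorch.
# File and extension names are placeholders; the PR may wire this up differently.
import os
from torch.utils.cpp_extension import load

CUDA_FOLDER = os.path.dirname(os.path.abspath(__file__))

cuda_kernel = load(
    name="codebook_cuda",
    sources=[
        os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"),
        os.path.join(CUDA_FOLDER, "cuda_kernel.cu"),
    ],
    verbose=True,  # prints the ninja/nvcc invocation on the first build
)
# The build is cached (by default under ~/.cache/torch_extensions), so later
# imports reuse the compiled artifact instead of recompiling.
```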

Collaborator

@justheuristic left a comment

LGTM.

In the future, it would be nice to minimize the amount of code we copy from transformers, but that can wait.

You did a gargantuan amount of work here :)

BlackSamorez and others added 13 commits February 6, 2024 22:50
Co-authored-by: justheuristic <justheuristic@gmail.com>
Owner

@Vahe1994 left a comment

Nice work!
I think the last major thing missing is instructions in README.md.
It would be nice to have there:

  1. A list of models on HF
  2. What to install and how
  3. Instructions on how to interact with the code (referring to the Colab notebook)

README.md Outdated
| Mixtral-8x7b| 1x15 | 4.61 | 12.6 | [Link](https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf)|


### Dependencies
Collaborator

@justheuristic Feb 7, 2024

Minor: I'd still call this "Installation".

Why: there's another "Dependencies" section later, which could cause confusion.

Collaborator Author

done

@BlackSamorez merged commit e1292e2 into main on Feb 7, 2024
2 checks passed