Transformers save/load compatibility and inference kernels #3

Merged: 111 commits, Feb 7, 2024

Commits (the diff below shows changes from a single commit):
e159c7f  Added 1x16 CUDA kernel (efrantar, Jan 27, 2024)
9ec6b70  Conversion script (Jan 16, 2024)
395a9f6  black, src, indent (Jan 16, 2024)
8e80100  pack_int_data (Jan 16, 2024)
4cabbc3  deprecated double quant, scales (Jan 14, 2024)
2c97a9f  estimate_nbits_per_parameter (Jan 14, 2024)
7864596  scales fix (Jan 14, 2024)
3f66363  First triton kernel (Jan 16, 2024)
07986d9  black, isort (Jan 16, 2024)
56359db  Quantization refactoring started (Jan 16, 2024)
6ad6a6b  restored double quant (Jan 16, 2024)
46be31d  estimate_nbits_per_parameter (Jan 16, 2024)
6ee4353  less diff aq_engine (Jan 16, 2024)
9e917f1  bias processing (Jan 16, 2024)
e281962  removed debug prints (Jan 16, 2024)
8c35e89  additional kwargs in config (Jan 16, 2024)
a795ee6  removed matmul kernel (Jan 16, 2024)
10c0a7e  packing and unpacking integers (Jan 16, 2024)
c41f09e  packs and unpacks (Jan 16, 2024)
ed1430f  undoing (Jan 17, 2024)
eb89781  FinalizedQuantizedLinear (Jan 17, 2024)
96301f7  tied up and saving (Jan 17, 2024)
c19a5f6  fixed saving (Jan 17, 2024)
0013d8e  removed unsupported kwargs (Jan 17, 2024)
89d6085  triton kernel again (Jan 18, 2024)
444f788  bias in triton (Jan 18, 2024)
0f5483a  renamed smt (Jan 18, 2024)
a9e65bb  new configuration ides (Jan 18, 2024)
27e3855  inference file copying (Jan 18, 2024)
0f37ea6  separated saving (Jan 18, 2024)
e3b623f  Fixed cloning (Jan 20, 2024)
fe9a9f2  skernel (Jan 20, 2024)
164c5d1  better saving (Jan 20, 2024)
db0d5a2  isort (Jan 20, 2024)
0d7e1af  removed unnecessary dependencies (Jan 20, 2024)
9f83593  lm_eval tokenizer trust remote code (Jan 21, 2024)
990d5b2  llama tokenizer (Jan 21, 2024)
b0b59f5  Deleted llama tokenizers (Jan 22, 2024)
fd8395f  faster triton kernel (Jan 22, 2024)
e519193  has_bias tl constexpr (Jan 22, 2024)
9e89f61  cpp_kernel benchmarks (Jan 26, 2024)
e673e0b  better order (Jan 26, 2024)
0d69cf6  better compile flags (Jan 26, 2024)
59e1f3c  removed unnecessary pragmas (Jan 26, 2024)
e98ed7c  fixed stuff (Jan 27, 2024)
c245c5e  removed test function (BlackSamorez, Jan 27, 2024)
147ed79  icpx (BlackSamorez, Jan 27, 2024)
a5ae331  inference_lib (BlackSamorez, Jan 28, 2024)
11e5dfd  inference lib done (Jan 28, 2024)
e5e9ee9  Correct modeling_llama.py (Jan 28, 2024)
48d6ddf  new version and fixed path (Jan 28, 2024)
51d664b  undoing src and main (Jan 28, 2024)
8979ae1  Merge remote-tracking branch 'origin/cuda-kernel' into transformers_cuda (Jan 28, 2024)
4f31eae  cuda kernel (Jan 28, 2024)
c9cf936  cuda kernel integration (Jan 28, 2024)
d0f6ed4  removed cpp kernel (Jan 28, 2024)
6cc2756  removed src changes (Jan 28, 2024)
7476833  rmd testing notebook (Jan 28, 2024)
eb8c2cd  dev3 (Jan 28, 2024)
1db5115  include nonpython files (Jan 28, 2024)
7f7e853  benchmarks (temp) (Jan 29, 2024)
5b3a5d2  test update (Jan 29, 2024)
1804499  Some fixes and added 2x8 kernel (efrantar, Jan 29, 2024)
07f72b6  Merge remote-tracking branch 'origin/cuda-kernel' into transformers (Jan 29, 2024)
bf0880f  new kernels (Jan 29, 2024)
d7c4561  kernel asserts fix (Jan 30, 2024)
823db17  numba kernel (Jan 30, 2024)
22a7994  cleaner benchmark (Jan 30, 2024)
b906bfd  handling flash-attn (Jan 30, 2024)
6a6ebd3  no cuda import (BlackSamorez, Jan 30, 2024)
3937640  numba kernel working (BlackSamorez, Jan 30, 2024)
c643fec  black isort (Jan 30, 2024)
d67d119  newer matmul benchmark (BlackSamorez, Jan 31, 2024)
c31d532  Merge branch 'transformers' of github.com:Vahe1994/AQLM into transfor… (BlackSamorez, Jan 31, 2024)
2d0cae8  fixed transposes (Jan 31, 2024)
3deeab2  updated benchmarks (BlackSamorez, Jan 31, 2024)
aca05dd  removed extra benchmarks (Feb 5, 2024)
cfa5e4a  less diff (Feb 5, 2024)
9498bf3  benchmarks (Feb 5, 2024)
7c6d234  Merge branch 'transformers' of github.com:Vahe1994/AQLM into transfor… (Feb 5, 2024)
426a7b6  numba parallel and style (Feb 6, 2024)
78cc9a8  cuda moved (Feb 6, 2024)
1278164  moved cuda kernel (Feb 6, 2024)
2dbd188  moved numba kernel (Feb 6, 2024)
935347e  removed unnecessary functions (Feb 6, 2024)
7b8faf8  dev7 (Feb 6, 2024)
33b0464  updated manifest (Feb 6, 2024)
ead1c00  dev9 (Feb 6, 2024)
88d9a93  Update transformers/llama/modeling_llama_aqlm.py (BlackSamorez, Feb 6, 2024)
28d70f8  Update benchmark/generate_benchmark.py (BlackSamorez, Feb 6, 2024)
b31a3fc  Update benchmark/generate_benchmark.py (BlackSamorez, Feb 6, 2024)
d9f6b25  Update inference_lib/setup.cfg (BlackSamorez, Feb 6, 2024)
503ff40  correct authors (Feb 6, 2024)
26ff8b0  cpp 1x16 (Feb 6, 2024)
09a7810  2x8 matmat cpp (Feb 6, 2024)
c434d42  dev10 (Feb 6, 2024)
788c289  colab example (Feb 6, 2024)
9fdf0a6  black (Feb 6, 2024)
f2ef38b  colab example notebook (Feb 6, 2024)
7342655  dev11 fix from Elias (Feb 6, 2024)
989d5d8  dev12 __CUDA_ARCH__ (Feb 6, 2024)
5d4f4f3  much stuff (Feb 7, 2024)
2a32c0a  readme, demo, req (Feb 7, 2024)
f019b4e  more readme (Feb 7, 2024)
d90c43b  dtype asserts (Feb 7, 2024)
e06a789  black (Feb 7, 2024)
098363a  installation (Feb 7, 2024)
d7b6dfa  1.0.0 (Feb 7, 2024)
4bd67b9  1.0.0 for colab (Feb 7, 2024)
d44c29d  deleted output (Feb 7, 2024)
79706d0  mistral and mixtral (Feb 7, 2024)

Viewing commit 10c0a7ee415ecb7b4e66d83030b98be3f571c0dd: packing and unpacking integers
Andrei Panferov committed Jan 28, 2024
src/aq.py: 10 changes (5 additions, 5 deletions)

```diff
--- a/src/aq.py
+++ b/src/aq.py
@@ -8,7 +8,7 @@
 from tqdm.auto import trange
 
 from src.kmeans import find_nearest_cluster, fit_faiss_kmeans, fit_kmeans, fit_kmeans_1d
-from src.utils import ellipsis, get_int_dtype, maybe_script
+from src.utils import ellipsis, get_int_dtype, maybe_script, pack_int_data, unpack_int_data
 
 
 class QuantizedLinear(nn.Module):
@@ -127,8 +127,8 @@ def initialize(
             self.scales.data = scales
         else:
             scales_clusters, scales_indices, _ = fit_kmeans_1d(scales.flatten(1, -1), k=2**self.scale_nbits)
-            self.scales_clusters = nn.Parameter(scales_clusters, requires_grad=True)
-            self.scales_indices = nn.Parameter(scales_indices, requires_grad=False)
+            self.scales_clusters.data = scales_clusters
+            self.scales_indices = pack_int_data(scales_indices, self.scale_nbits)
 
         weight_for_init = (weight_groupwise / scales).swapaxes(1, 2).reshape_as(reference_weight)
         del weight_groupwise
@@ -141,7 +141,7 @@ def initialize(
             codebook_size=self.codebook_size,
             **init_kwargs,
         )
-        self.codes.data = codes
+        self.codes.data = pack_int_data(codes, self.nbits_per_codebook)
         self.codebooks.data = codebooks
         if self.bias is not None and bias is not None:
             self.bias.data = bias
@@ -210,7 +210,7 @@ def reconstruct_weight(self, selection: Union[slice, ellipsis, torch.Tensor] = ...):
 
         """
         weight = _dequantize_weight(
-            self.codes[selection].to(torch.int64) % (2**self.nbits_per_codebook),
+            unpack_int_data(self.codes[selection], self.nbits_per_codebook),
             self.get_codebooks(),
             self.get_scales()[selection],
         )
```
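
For context, pack_int_data and unpack_int_data are the new helpers imported from src.utils; their bodies are not part of this diff. The sketch below is a minimal reconstruction, assuming only what the diff implies: codes live in [0, 2**nbits), were previously unpacked inline as `self.codes[selection].to(torch.int64) % (2**self.nbits_per_codebook)`, and get_int_dtype (already imported in aq.py) picks a narrow storage dtype. The actual implementations in src/utils.py may differ.

```python
import torch

# Hypothetical reconstruction of the src.utils helpers used in this commit.
# Idea: store unsigned nbits-wide codes in the smallest signed integer dtype
# that fits, and recover the original values with a modulo at unpack time.

def get_int_dtype(nbits: int) -> torch.dtype:
    # Smallest signed dtype that can hold an nbits-wide code (assumed behavior).
    for bits, dtype in ((8, torch.int8), (16, torch.int16), (32, torch.int32), (64, torch.int64)):
        if nbits <= bits:
            return dtype
    raise ValueError(f"no integer dtype for {nbits}-bit codes")

def pack_int_data(data: torch.Tensor, nbits: int) -> torch.Tensor:
    # Shift codes from [0, 2**nbits) into the signed range
    # [-2**(nbits-1), 2**(nbits-1)) so they fit the narrow dtype exactly.
    data = data.clone()  # avoid mutating the caller's tensor
    data[data >= 2 ** (nbits - 1)] -= 2**nbits
    return data.to(get_int_dtype(nbits))

def unpack_int_data(data: torch.Tensor, nbits: int) -> torch.Tensor:
    # Undo the shift: modulo 2**nbits maps signed storage back to [0, 2**nbits),
    # matching the old inline expression `codes.to(torch.int64) % (2**nbits)`.
    return data.to(torch.int64) % (2**nbits)

# Round-trip check on hypothetical 16-bit codes (the 1x16 kernel's case):
codes = torch.randint(0, 2**16, (4, 8))
packed = pack_int_data(codes, 16)                       # stored as torch.int16
assert torch.equal(unpack_int_data(packed, 16), codes)  # lossless round trip
```

The payoff is storage: 16-bit codes held as torch.int16 take a quarter of the memory of int64, and the round trip is exact. Presumably that is also why the commit packs scales_indices with scale_nbits in the initialize hunk above.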