A quantizer cache similar to what is described in the [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache paper](https://huggingface.co/papers/2402.02750).
It allows the model to generate longer sequences without allocating too much memory for the Key and Value cache by applying quantization.

The cache has two types of storage: one for the original precision states and one for the quantized cache. A `residual_length` is set as the maximum capacity of the
original precision cache. When the length goes beyond this maximum capacity, the contents of the original precision cache are quantized and moved into the quantized cache,
and the original precision cache is emptied. The quantization is done per-channel with a set `q_group_size` for both Keys and Values, in contrast to what was described in the paper.

It stores Keys and Values as a list of quantized tensors (tuples, in case we need to store metadata), one for each layer. Additionally, it stores the original precision
Key and Value states as a list of tensors, one for each layer. The size of each tensor
is `[batch_size, num_heads, seq_len - residual_length, head_dim]`.

Uses `quanto` as a backend to perform quantization. Current implementation supports `int2` and `int4` dtypes only.

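To make the overflow behaviour above concrete, here is a minimal, self-contained sketch of one layer's update step. The helper names (`quantize_per_channel`, `update_layer`) and the toy affine quantization are illustrative assumptions, not the library's internal API:

```python
import torch

def quantize_per_channel(x: torch.Tensor, nbits: int = 4, q_group_size: int = 64):
    """Toy per-channel affine quantization over the last dim (head_dim), in groups
    of `q_group_size`. Assumes head_dim is divisible by the group size. Returns the
    integer codes plus the scale and zero point needed to dequantize later."""
    *lead, dim = x.shape
    groups = x.reshape(*lead, dim // q_group_size, q_group_size)
    mn = groups.amin(dim=-1, keepdim=True)
    mx = groups.amax(dim=-1, keepdim=True)
    scale = (mx - mn).clamp(min=1e-8) / (2 ** nbits - 1)
    codes = ((groups - mn) / scale).round().to(torch.uint8)
    return codes, scale, mn

def update_layer(quantized, residual, new_states, residual_length=128, q_group_size=64):
    """Append `new_states` to the original precision (residual) cache; once it grows
    beyond `residual_length`, quantize its contents and move them into the quantized
    storage (a list of (codes, scale, zero_point) tuples), then start a fresh residual."""
    residual = torch.cat([residual, new_states], dim=-2)
    if residual.shape[-2] > residual_length:
        quantized.append(quantize_per_channel(residual, q_group_size=q_group_size))
        residual = residual[..., :0, :]  # empty residual cache, same batch/head/head_dim shape
    return quantized, residual
```

In the real cache one such pair of storages is kept per layer, for Keys and Values separately, and the quantization backend (`quanto` here, `HQQ` below) replaces the toy quantizer.
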
Parameters:
    cache_config (`QuantizedCacheConfig`):
        A configuration containing all the arguments to be used by the quantizer, including axis, qtype and group size.

Example:

```python
>>> # Run pip install quanto first if you don't have it yet
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoQuantizedCache, QuantizedCacheConfig

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
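>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

>>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

>>> # The original example is truncated here; a plausible continuation is to build a
>>> # config, wrap it in the quantized cache, and pass it to the model's forward.
>>> cache_config = QuantizedCacheConfig(nbits=4)
>>> past_key_values = QuantoQuantizedCache(cache_config=cache_config)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values  # access the cache filled with keys/values from the forward pass
QuantoQuantizedCache()
```
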
A quantizer cache similar to what is described in the [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache paper](https://arxiv.org/abs/2402.02750).
It allows the model to generate longer sequences without allocating too much memory for the Key and Value cache by applying quantization.

The cache has two types of storage: one for the original precision states and one for the quantized cache. A `residual_length` is set as the maximum capacity of the
original precision cache. When the length goes beyond this maximum capacity, the contents of the original precision cache are quantized and moved into the quantized cache,
and the original precision cache is emptied. The quantization is done per-channel with a set `q_group_size` for both Keys and Values, in contrast to what was described in the paper.

It stores Keys and Values as a list of quantized tensors (tuples, in case we need to store metadata), one for each layer. Additionally, it stores the original precision
Key and Value states as a list of tensors, one for each layer. The size of each tensor
is `[batch_size, num_heads, seq_len - residual_length, head_dim]`.

Uses `HQQ` as a backend to perform quantization. Current implementation supports `int2`, `int4` and `int8` dtypes.

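Aside from constructing the cache object directly (as in the example below), recent `transformers` versions can also build a quantized cache inside `generate`. A minimal sketch, assuming the `cache_implementation="quantized"` and `cache_config` generation arguments are available in your installed version and that the `hqq` package is installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
inputs = tokenizer("My name is Qwen2", return_tensors="pt")

# Let generate build the quantized KV cache itself, selecting HQQ as the backend.
# The cache_config keys mirror QuantizedCacheConfig (backend, nbits, q_group_size, ...).
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "HQQ", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
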
Parameters:
    cache_config (`QuantizedCacheConfig`):
        A configuration containing all the arguments to be used by the quantizer, including axis, qtype and group size.

Example:

```python
>>> # Run pip install hqq first if you don't have it yet
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, HQQQuantizedCache, QuantizedCacheConfig

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
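>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

>>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

>>> # The original example is truncated here; a plausible continuation mirrors the
>>> # quanto one above, with the quantization axis set the way HQQ usually expects.
>>> cache_config = QuantizedCacheConfig(nbits=4, axis_key=1, axis_value=1)
>>> past_key_values = HQQQuantizedCache(cache_config=cache_config)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values  # access the cache filled with keys/values from the forward pass
HQQQuantizedCache()
```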