Commit 17a42e5
Add BF16 to GGUF (lllyasviel#2877)
1 parent: 0ced1d0

File tree: 2 files changed, +4 −0 lines

backend/operations_gguf.py (1 addition, 0 deletions)

@@ -13,6 +13,7 @@
     gguf.GGMLQuantizationType.Q5_K: gguf.Q5_K,
     gguf.GGMLQuantizationType.Q6_K: gguf.Q6_K,
     gguf.GGMLQuantizationType.Q8_0: gguf.Q8_0,
+    gguf.GGMLQuantizationType.BF16: gguf.BF16,
 }

packages_3rdparty/gguf/quants.py (3 additions, 0 deletions)

@@ -268,6 +268,9 @@ def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
     def dequantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
         return (blocks.view(np.int16).astype(np.int32) << 16).view(np.float32)

+    @classmethod
+    def dequantize_blocks_pytorch(cls, blocks, block_size, type_size, parameter) -> torch.Tensor:
+        return (blocks.view(torch.int16).to(torch.int32) << 16).view(torch.float32)

 class Q4_0(__Quant, qtype=GGMLQuantizationType.Q4_0):
     @classmethod
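A note on why the shift-by-16 in both added methods works: a bfloat16 value is exactly the top 16 bits of an IEEE-754 float32, so dequantization only needs to move those 16 bits into the upper half of a 32-bit word and reinterpret the bit pattern as float32. The sketch below (not part of the commit; the helper name and round-trip check are illustrative) mirrors the NumPy version from `dequantize_blocks`:

```python
import numpy as np

def dequantize_bf16(raw: np.ndarray) -> np.ndarray:
    """Reinterpret a buffer of bfloat16 bit patterns as float32.

    Mirrors the commit's dequantize_blocks: view the raw bytes as
    int16, widen to int32 (sign extension is harmless because the
    shift pushes the extended bits out), shift into the high half,
    then reinterpret the 32-bit pattern as float32.
    """
    return (raw.view(np.int16).astype(np.int32) << 16).view(np.float32)

# Round-trip check: truncate float32 values that are exactly
# representable in bf16, then dequantize them back.
values = np.array([1.0, -2.5, 3.140625], dtype=np.float32)
bf16_bits = (values.view(np.int32) >> 16).astype(np.int16)  # keep top 16 bits
restored = dequantize_bf16(bf16_bits)
```

The PyTorch variant in the second hunk is the same trick with `torch.Tensor.view` and `.to(torch.int32)` in place of the NumPy calls.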
