Avoid dtoh copy for dequantization of f16/f32 #2424

EricLBuehler · 2024-08-17T15:19:10Z

Currently, dequantizing f16/f32 on CUDA executes a device-to-host (dtoh) copy, which is not necessary. We can add a simple cast kernel instead, so that the data stays on the device.

This PR only updates CUDA dequantization; Metal dequantization still executes on the CPU. Generally, dequantization performance shouldn't matter too much, as we should avoid dequantizing in a hot loop.
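For reference, the per-element work of an f16 to f32 cast can be sketched on the CPU in plain Rust. This is not the kernel from this PR; it only illustrates the conversion that a device-side cast kernel would presumably apply to each element. The function names `f16_bits_to_f32` and `cast_f16_to_f32` are hypothetical, and the f16 values are assumed to be stored as raw IEEE 754 binary16 bit patterns in `u16`.

```rust
// Hypothetical CPU reference for an f16 -> f32 cast (not the PR's CUDA kernel).
// Assumes f16 values are stored as raw IEEE 754 binary16 bits in a u16.

/// 2^-24, the value of the smallest positive f16 subnormal.
const SUBNORMAL_SCALE: f32 = 1.0 / 16_777_216.0;

/// Convert one binary16 bit pattern to f32. The conversion is exact,
/// because every f16 value is representable as an f32.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16; // move the sign to bit 31
    let exp = ((h >> 10) & 0x1F) as u32; // 5-bit biased exponent
    let frac = (h & 0x03FF) as u32; // 10-bit fraction

    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: value is frac * 2^-24, which is a normal f32.
        (0, _) => sign | (frac as f32 * SUBNORMAL_SCALE).to_bits(),
        // Infinity.
        (0x1F, 0) => sign | 0x7F80_0000,
        // NaN: keep the payload in the top fraction bits.
        (0x1F, _) => sign | 0x7F80_0000 | (frac << 13),
        // Normal: rebias the exponent (127 - 15 = 112) and widen the fraction.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}

/// Cast a whole buffer. A GPU kernel would do the same work with one
/// element per thread, writing into a device buffer instead of a Vec.
fn cast_f16_to_f32(src: &[u16]) -> Vec<f32> {
    src.iter().map(|&h| f16_bits_to_f32(h)).collect()
}
```

Because each output element depends only on the matching input element, the cast parallelizes trivially, which is why it can run on the device without a round trip through host memory.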

EricLBuehler · 2024-09-25T01:45:11Z

Closing to avoid excessive stagnant PRs.

Fast dequant using kernel for f32 and f16

3ef9510

This was referenced Aug 17, 2024

Question: How to use quantized tensors? #1006

Closed

Add GGUF BF16 dtype support #2387

Open

EricLBuehler closed this Sep 25, 2024

EricLBuehler deleted the fast_dequant_f32_f16 branch September 25, 2024 01:45
