Optimize quantization process with QTensor::quantize_onto #2408

EricLBuehler · 2024-08-10T00:08:58Z

Motivation:

The current QTensor::quantize places the quantized tensor on the same device as src. That is fine when src is on the CPU, but it is costly whenever src lives on any other device, because quantization itself is only supported on the CPU.

To quantize a tensor that lives on a non-CPU device, we currently do the following:

1. Trigger a synchronizing dtoh copy here (same for Metal): https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L436-L441
2. Quantize on the CPU.
3. Trigger a synchronizing htod copy here (same for Metal): https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L447

The two copies, together with the fact that we are synchronizing the CUDA device (I'm not sure about the semantics for Metal, but we are certainly copying the data), hurt performance!

The solution is a small change that introduces a new API, QTensor::quantize_onto. It takes a CPU tensor, quantizes it on the CPU, and then performs a single synchronizing htod copy onto the target device. This halves the number of data transfers and synchronizations that take place.
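To make the copy counts concrete, here is a minimal toy model of the two paths described above. It is a sketch, not candle's code: the `Device`, `Tensor`, `QTensor`, and `TransferLog` types and the `quantize_cpu` helper are hypothetical stand-ins that only count transfers, and the 8-bit rounding is a placeholder for the real quantization kernels.

```rust
// Toy model only: none of these types are candle's. They exist purely to
// count host<->device copies for the two quantization paths.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Device {
    Cpu,
    Gpu, // stands in for CUDA or Metal
}

#[derive(Default, Debug, PartialEq)]
struct TransferLog {
    dtoh: usize, // device-to-host copies
    htod: usize, // host-to-device copies
}

struct Tensor {
    data: Vec<f32>,
    device: Device,
}

struct QTensor {
    data: Vec<i8>,
    device: Device,
}

// Placeholder CPU-only quantizer (assumption): scale by 127 and round.
fn quantize_cpu(values: &[f32]) -> Vec<i8> {
    values
        .iter()
        .map(|v| (v * 127.0).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

// Existing behavior: the result lands on src's device, so a non-CPU source
// costs one dtoh copy before and one htod copy after the CPU quantization.
fn quantize(src: &Tensor, log: &mut TransferLog) -> QTensor {
    if src.device != Device::Cpu {
        log.dtoh += 1;
    }
    let data = quantize_cpu(&src.data);
    if src.device != Device::Cpu {
        log.htod += 1;
    }
    QTensor { data, device: src.device }
}

// Proposed behavior: the source is already on the CPU, so only the final
// htod copy onto the target device remains.
fn quantize_onto(src: &Tensor, dev: Device, log: &mut TransferLog) -> QTensor {
    assert_eq!(src.device, Device::Cpu, "source must live on the CPU");
    let data = quantize_cpu(&src.data);
    if dev != Device::Cpu {
        log.htod += 1;
    }
    QTensor { data, device: dev }
}

fn main() {
    let values = vec![0.0, 0.25, -1.0];

    let mut old_log = TransferLog::default();
    let on_gpu = Tensor { data: values.clone(), device: Device::Gpu };
    let q_old = quantize(&on_gpu, &mut old_log);

    let mut new_log = TransferLog::default();
    let on_cpu = Tensor { data: values, device: Device::Cpu };
    let q_new = quantize_onto(&on_cpu, Device::Gpu, &mut new_log);

    println!("quantize:      {:?} -> {:?} on {:?}", old_log, q_old.data, q_old.device);
    println!("quantize_onto: {:?} -> {:?} on {:?}", new_log, q_new.data, q_new.device);
}
```

In this model both paths produce the same quantized values on the same device; only the number of logged copies differs.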

Add quantize_onto

ab3c58e
