
Optimize quantization process with QTensor::quantize_onto #2408

Open · wants to merge 1 commit into main
Conversation

EricLBuehler (Member)
Motivation:

The current QTensor::quantize quantizes the src tensor onto the same device as src. This is fine for most use cases, but it becomes a problem whenever the tensor being quantized is not on the CPU, because we only support quantization on the CPU.

To quantize a tensor on a non-CPU device, the current implementation does the following:

  • Trigger a synchronizing device-to-host (dtoh) copy here (same for Metal):

https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L436-L441

  • Quantize on the CPU

  • Trigger a synchronizing host-to-device (htod) copy here (same for Metal):

https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L447

Because of the two copies and the fact that we synchronize the CUDA device (I'm not sure about the exact semantics for Metal, but we are certainly copying the data there too), this hurts performance!
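
For illustration, here is a minimal sketch of the round trip from the caller's side today; the shape and the GgmlDType::Q4K target are arbitrary, and the numbered comments paraphrase the steps linked above rather than the actual copy/kernel code:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn quantize_on_gpu(device: &Device) -> Result<QTensor> {
    // `src` lives on the GPU (e.g. a CUDA device).
    let src = Tensor::randn(0f32, 1.0f32, (4096, 4096), device)?;
    // With the current API, this internally performs:
    //   1. a synchronizing dtoh copy of `src` to the CPU,
    //   2. quantization on the CPU,
    //   3. a synchronizing htod copy of the quantized data back to the GPU.
    QTensor::quantize(&src, GgmlDType::Q4K)
}
```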

The solution is a small modification and the introduction of a new API, QTensor::quantize_onto. This new API takes a CPU tensor, quantizes it on the CPU, and then performs a single synchronizing htod copy onto the target device, halving the data transfers/synchronizations that take place.
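
A minimal usage sketch; the exact quantize_onto signature shown here (CPU src tensor, target dtype, target device) is assumed from the description above rather than taken from the diff:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn quantize_for_gpu(gpu: &Device) -> Result<QTensor> {
    // Keep the unquantized weights on the CPU...
    let src = Tensor::randn(0f32, 1.0f32, (4096, 4096), &Device::Cpu)?;
    // ...quantize on the CPU, then perform a single synchronizing htod copy
    // onto `gpu`. Note: the signature of `quantize_onto` is assumed here.
    QTensor::quantize_onto(&src, GgmlDType::Q4K, gpu)
}
```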
