Avoid dtoh copy for dequantization of f16/f32 #2424

Closed

EricLBuehler
Member

Currently, dequantizing f16/f32 on CUDA performs a device-to-host (dtoh) copy, which is not necessary. We can add a simple cast kernel instead, so that the data stays on the device.
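For illustration, here is a minimal sketch of what such a cast kernel could look like, assuming the stored data is f16 and the dequantized output is f32. The names `cast_f16_to_f32` and `launch_cast_f16_to_f32` are hypothetical and are not taken from this PR's diff; the f32 case would reduce to a plain device-to-device copy.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical element-wise cast kernel (not the PR's actual code):
// reads f16 values that already live on the device and writes them
// out as f32, so no host round trip is needed.
__global__ void cast_f16_to_f32(const __half* src, float* dst, size_t n) {
    // Grid-stride loop: any launch configuration covers all n elements.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        dst[i] = __half2float(src[i]);
    }
}

// Hypothetical host-side launcher. Both pointers must be device pointers.
cudaError_t launch_cast_f16_to_f32(const __half* d_src, float* d_dst,
                                   size_t n, cudaStream_t stream) {
    if (n == 0) return cudaSuccess;  // a zero-sized grid is an invalid launch
    const unsigned int threads = 256;
    size_t needed = (n + threads - 1) / threads;
    // Cap the grid size; the grid-stride loop handles the remainder.
    const unsigned int blocks = needed > 65535 ? 65535u : (unsigned int)needed;
    cast_f16_to_f32<<<blocks, threads, 0, stream>>>(d_src, d_dst, n);
    return cudaGetLastError();
}
```

The point of the sketch is only that both the input and output buffers are device allocations, so the conversion never has to touch host memory.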

This PR only updates CUDA dequantization; Metal dequantization still executes on the CPU. Generally, dequantization performance shouldn't matter too much, as we should avoid dequantizing in a hot loop.

@EricLBuehler
Member Author

Closing to avoid excessive stagnant PRs.

@EricLBuehler EricLBuehler deleted the fast_dequant_f32_f16 branch September 25, 2024 01:45