[BUG] Running FP8 quantized model fails on NVIDIA L4 (repack_fp8_for_marlin) #2388
While I did not find this issue reported here already, I did find a similar issue reported with vLLM. They have since fixed it, so maybe their changes could be ported over to TGI?
Side note: Why is TGI even falling back to Marlin kernels? As far as I know, the NVIDIA L4 uses the Ada Lovelace architecture with CUDA compute capability 8.9, which should have hardware support for FP8: NVIDIA CUDA Docs. Am I missing something? Having a quick look through the code, I found PR #2277, which was part of the latest release and basically "blocks" TGI from utilizing native FP8 support by forcing the Marlin kernels for CC 8.9. I did not find any issues or further explanation pertaining to these changes. Maybe @OlivierDehaene could shed some light on the reasoning behind this change?
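For reference, the compute capability can be confirmed directly on the host; the `compute_cap` query field assumes a reasonably recent NVIDIA driver:

```bash
# Print each visible GPU's name and CUDA compute capability.
# An NVIDIA L4 should report 8.9 (Ada Lovelace).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```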
We switched to …
If you want to use Llama 3.1 8B Instruct in FP8, you can use the original repo:

```bash
docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantize fp8
```
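Once the container is up, it can be smoke-tested against TGI's standard `/generate` endpoint; the prompt and parameters below are just an example:

```bash
# Simple generation request against the container started above
# (port 8080 as mapped by the docker run command).
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```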
System Info
Information
Tasks
Reproduction
To reproduce, please run the following shell script:
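A minimal sketch of such a script follows; the model id and image tag are assumptions (any FP8 pre-quantized checkpoint on a release containing PR #2277 should hit the same `repack_fp8_for_marlin` path on an L4):

```bash
#!/usr/bin/env bash
# Sketch of a reproduction, not necessarily the reporter's original script.
# Assumptions: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 stands in for
# the FP8-quantized checkpoint, and 2.2.0 for the affected release.
volume=$PWD/data   # cache model weights between runs

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$volume":/data \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
```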
The following exception appears during startup:
Expected behavior
I expect no `RuntimeError` during shard initialization. TGI should start up and serve the model without problems.