feat: enable pytorch xpu support for non-attention models #2561
Conversation
The XPU backend is available natively (without IPEX) in pytorch starting from pytorch 2.4. This commit extends TGI to cover the case where the user has XPU support through pytorch >= 2.4 but does not have IPEX installed. Models which don't require custom attention kernels can work. For models that do require attention, more work is needed to provide an attention implementation.

Tested with the following models:

* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
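For context, a minimal sketch (not part of the PR) of how the two setups discussed in this thread can be told apart at runtime; it assumes pytorch >= 2.4 so that `torch.xpu` exists natively:

```python
# Minimal sketch (not from the PR): distinguishing native pytorch XPU support
# from the optional intel_extension_for_pytorch (IPEX) plugin. Assumes pytorch >= 2.4.
import importlib.util

import torch

has_native_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
has_ipex = importlib.util.find_spec("intel_extension_for_pytorch") is not None

print(f"native XPU: {has_native_xpu}, IPEX installed: {has_ipex}")

if has_native_xpu and not has_ipex:
    # The configuration this PR targets: upstream pytorch XPU without IPEX,
    # so only models that don't need TGI's custom attention kernels can run.
    device = torch.device("xpu")
```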
Please hold the PR; the pytorch in the Dockerfile_intel for xpu is 2.3. See https://github.com/huggingface/text-generation-inference/blob/main/Dockerfile_intel#L94
And what do you mean by non-attention models? These models all have attention per my understanding; we now use the PagedAttention implementation from IPEX to support it.
In the dockerfile, yes. It's possible, however, to build TGI from sources against a different version of pytorch. I have a pytorch build from main, which is the current 2.6 candidate.
These models have a fallback mechanism which is triggered if there is no attention available. For example, for gemma it's defined as in the snippet referenced below; the fallback is on line 828. Since the IPEX container provides attention, you probably did not notice it. As you can see in this PR: text-generation-inference/server/text_generation_server/models/__init__.py, lines 812 to 830 in 7efcb5e
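For readers without the inline snippet, a paraphrased sketch of that gemma branch follows (names such as `FlashCausalLM`, `FlashGemmaForCausalLM` and `CausalLM.fallback` follow the TGI code at that revision, but the arguments shown here are approximate, not a verbatim copy):

```python
# Paraphrased from server/text_generation_server/models/__init__.py (get_model),
# around lines 812-830 at revision 7efcb5e; approximate, not the verbatim repo code.
if model_type == GEMMA:
    if FLASH_ATTENTION:
        # Preferred path: the custom attention kernels imported successfully.
        return FlashCausalLM(
            model_id=model_id,
            model_class=FlashGemmaForCausalLM,
            revision=revision,
            quantize=quantize,
            speculator=speculator,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
    elif sharded:
        raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Gemma"))
    else:
        # The fallback mentioned above (line 828 in the linked snippet):
        # a plain transformers-based CausalLM, which is what this PR enables
        # on upstream pytorch XPU without IPEX.
        return CausalLM.fallback(
            model_id,
            revision,
            quantize=quantize,
            speculator=speculator,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
```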
To be more precise, it's triggered if one of the attention-enabled models fails to load (which is potentially not the same as attention being unavailable). The logic is defined here: text-generation-inference/server/text_generation_server/models/__init__.py, lines 138 to 145 in 7efcb5e
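Roughly, the guard being referred to looks like the sketch below (paraphrased; the real try block imports many flash-attention model classes at once, which is why a failure of any single import flips the flag):

```python
# Paraphrased sketch of the attention-availability guard in
# server/text_generation_server/models/__init__.py (around lines 138-145 at 7efcb5e).
# The real block imports many more flash-attention model classes; a single
# ImportError disables FLASH_ATTENTION for all of them.
FLASH_ATTENTION = True
try:
    from text_generation_server.models.flash_causal_lm import FlashCausalLM
    from text_generation_server.models.custom_modeling.flash_gemma_modeling import (
        FlashGemmaForCausalLM,
    )
except ImportError as e:
    logger.warning(f"Could not import Flash Attention enabled models: {e}")
    FLASH_ATTENTION = False
```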
Have you compared the fallback perf with the current one? Per my understanding, the fallback path has a perf limitation, and TensorParallel/GPTQ/AWQ are not supported. So usually, if I find a model that needs to fall back to transformers, I will implement the custom model in TGI.
Also, we use Dockerfile_intel to generate the text-generation-inference docker image, and users just download it directly from docker hub. See https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference. That's why I suggest you hold the change.
Following this thread, my understanding is that this PR isn't ready yet, correct?
My opinion as the author of the patch is that it's ready. It enables TGI to work with upstream pytorch with xpu support for those models where this is currently possible, i.e. those which don't require custom kernels. On top of this PR, further work needs to happen to propose an implementation of the missing kernels. A few ways would be possible here:
I wonder what the TGI maintainers' opinion on this topic is. In particular, are there plans to rewrite the attention kernels residing in TGI with triton? Maybe such a step is planned or is being done by someone at the moment? I also think that the PR improves handling of the xpu path regardless of whether IPEX or upstream pytorch XPU is used; some of the conditions which were corrected are now more logical. If someone wants to use TGI against upstream pytorch xpu, they will need to build it from sources; I did not provide a dockerfile for this case. The existing IPEX path continues to work.

Apparently, there is another opinion expressed above that IPEX should currently be used. This, however, ties TGI on xpu to the older pytorch version and restricts trying out TGI against upstream pytorch xpu for those who are willing to do so. I consider the proposed PR to be a step in the right direction in any case, because the ultimate logic should be to enable all possible features with the base stack (pytorch) first, then enhance with additional features (custom kernels and additional 3rd party libraries). The current path with IPEX does this the other way around, due to the initial nature of IPEX as a plugin for pytorch. Things are changing, however, with xpu being available right out of the box in pytorch and the plugin aspect of IPEX going away. This PR is a step towards this change.
Triton has serious drawbacks as a technology for production because it's a JIT environment, meaning there's unbounded compilation time during the initial runtime phase (which could be long or even crash very late because of all the compilations). As much as I respect the work being done over there, it's really hard to use as long as it's not AOT compilation. If this JIT can be done entirely during the warmup then it's better already (still not great because of the super slow startup times, but at least you're not having atrocious runtimes initially). So far we've seen that custom-made cuda kernels work much better in practice than triton-made kernels. We've also tiptoed with
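To illustrate the "JIT during warmup" point above, a small hypothetical sketch: calling a Triton kernel once per expected shape/dtype bucket during the server warmup phase moves the compilation cost out of the first user request. The kernel below is a toy vector-add, not a TGI kernel:

```python
# Toy example (not a TGI kernel): force Triton's JIT compilation during warmup
# instead of on the first real request.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def warmup(device: str = "cuda") -> None:
    # One call per shape/dtype bucket the server expects to see;
    # after this, the compiled binary is cached for the actual requests.
    n = 1024
    x = torch.ones(n, device=device)
    y = torch.ones(n, device=device)
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 256),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=256)
```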
Thank you for sharing this @Narsil, very interesting.
CC: @Narsil