[RFC] Initial Support for AWS Inferentia #1866
Comments
Any targeted ETA for the implementation, @liangfu?
The implementation of the initial support (Orca-Max style, without paged attention) is actually done in a local branch (https://github.com/liangfu/vllm/tree/neuron); we just need time to upstream it. For paged attention, my rough estimate is around late March.
Thanks @liangfu for this update! Would love to discuss some of the code changes you've made. Do you plan to submit a PR of your branch?
Thanks @miladm for bringing torch-xla into the discussion.
I was trying to reduce the torch-xla code, since tracing two graphs for prefill and decode could be a headache. From my observation, the HLO graphs for prefill and decode differ in many aspects. Also, tracing from torch operators to HLO can be tricky; sometimes we want an exact HLO subgraph, but it is hard to build it via tracing. To achieve high performance, we ended up with manually designed HLO subgraphs. I would like to hear what your idea is. cc @miladm
The GPU backend is already here in vLLM, and it is considered to be high-performance, right? I guess there is no backend that is exclusive to Neuron. Even with transformers-neuronx, what we built are two separate graphs (prefill and decode) implemented with HLO operators. The HLO operators should be able to run on backends other than Neuron (e.g. TPU). The tricky part is that it is hard to achieve high performance with an easy-to-program interface that builds the exact HLO graph we want. Therefore, I'm open to discussing what could be a better option for integrating with vLLM.
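For concreteness, here is a minimal PyTorch-level sketch (illustrative only, not the actual transformers-neuronx HLO implementation) of why the prefill and decode graphs end up so different: prefill attends over the whole prompt with a causal mask, while decode runs a single new token against cached keys and values, so the shapes, masking, and cache handling all differ.

```python
import torch

def prefill_attention(q, k, v):
    # Prefill phase: attend over the full prompt in one shot.
    # q, k, v: [batch, seq_len, heads, head_dim]
    d = q.shape[-1]
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / d**0.5
    seq_len = q.shape[1]
    # Causal mask: each position may only attend to itself and earlier positions.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bkhd->bqhd", probs, v)

def decode_attention(q_new, k_cache, v_cache):
    # Decode phase: one new query token attends over all cached keys/values.
    # q_new: [batch, 1, heads, head_dim]; k_cache, v_cache: [batch, past_len, heads, head_dim]
    d = q_new.shape[-1]
    scores = torch.einsum("bqhd,bkhd->bhqk", q_new, k_cache) / d**0.5
    probs = torch.softmax(scores, dim=-1)  # no causal mask needed for a single new token
    return torch.einsum("bhqk,bkhd->bqhd", probs, v_cache)
```

Tracing these two call paths produces two distinct graphs, which is why hand-building the HLO for each phase can be more controllable than tracing.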
The initial PR, which makes the imports of CUDA function calls lazy: #2065. We are planning to move on from this and upstream the integration with transformers-neuronx.
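For readers unfamiliar with the pattern, a generic sketch of such a lazy import is below; the module and function names are hypothetical, not the actual vLLM internals. The point is simply that CUDA-only extensions are imported inside the code path that needs them, so importing the package never pulls in CUDA on hosts (e.g. Inferentia instances) that lack it.

```python
import importlib

def run_cuda_kernel(*args, **kwargs):
    # Deferred import: the CUDA extension is only loaded when a CUDA kernel
    # is actually requested, never at package import time.
    cuda_ext = importlib.import_module("my_project._cuda_ops")  # hypothetical module
    return cuda_ext.paged_attention(*args, **kwargs)            # hypothetical function
```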
The initial support has been merged. Closing this issue.
Thanks for this PR! One quick question: it seems the initial support only allows a maximum input sequence length of 128 tokens because it has to match the block size - is my understanding correct?
The plan is to introduce paged attention, so that we don't necessarily limit the block size to the sequence length. For now, we need to extend the list of choices for the block size.
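As a rough illustration of the point above (the block size and values below are made up, not vLLM or Neuron defaults): with paged attention, the KV cache is split into fixed-size blocks, so a long sequence simply occupies more blocks instead of requiring one block as large as the whole sequence.

```python
BLOCK_SIZE = 16  # illustrative value, not a vLLM/Neuron default

def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size KV-cache blocks a sequence of seq_len tokens occupies."""
    return (seq_len + block_size - 1) // block_size

# Without paging (the initial Neuron support discussed above), the whole sequence
# must fit in a single contiguous block, so block_size must be >= seq_len.
# With paging, a 128-token prompt simply spans blocks_needed(128) == 8 blocks.
print(blocks_needed(128))
```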
@liangfu @WoosukKwon |
Proposal
We propose to integrate transformers-neuronx as the execution engine in vLLM for supporting LLM inference on Inferentia. This would require changes to both transformers-neuronx and vLLM.
Changes to transformers-neuronx
Changes to vLLM
Implementation Details
Model-specific (e.g. Llama-specific code for Neuron) forward function
Model compilation
Model-agnostic (e.g. generic model loader)
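As a rough, hypothetical sketch of the model-agnostic piece (class and registry names here are illustrative, not the actual vLLM code), the generic loader could dispatch to a Neuron-backed implementation when running on Inferentia and fall back to the existing GPU path otherwise:

```python
# Hypothetical registry mapping HuggingFace architecture names to
# Neuron-backed implementations (names are illustrative only).
_NEURON_MODEL_REGISTRY = {
    "LlamaForCausalLM": "NeuronLlamaForCausalLM",
    "GPT2LMHeadModel": "NeuronGPT2LMHeadModel",
}

def resolve_model_impl(architecture: str, device: str) -> str:
    """Pick the model implementation for the given architecture and device."""
    if device == "neuron":
        impl = _NEURON_MODEL_REGISTRY.get(architecture)
        if impl is None:
            raise ValueError(f"{architecture} is not yet supported on Inferentia")
        return impl
    # Default path: the existing CUDA/GPU implementation registered in vLLM.
    return architecture
```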
Related Resources
Stable release versions of the transformers-neuronx packages can be found at https://pip.repos.neuron.amazonaws.com/transformers-neuronx/ . We can install the transformers-neuronx package with pip, using that repository as an extra package index.
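For example (the exact command and pinned version may vary with the Neuron SDK release): `pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com`.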