[RFC] Initial Support for AWS Inferentia #1866

Closed · 5 tasks done

liangfu opened this issue Nov 30, 2023 · 9 comments

liangfu commented Nov 30, 2023

Proposal

We propose integrating transformers-neuronx as the execution engine in vLLM to support LLM inference on AWS Inferentia. This requires changes to both transformers-neuronx and vLLM.

Changes to transformers-neuronx

  1. Support batch-size-1 prompt encoding that shares the same cache space with max-batch-size decoding (see the sketch after this list).
  2. Support batch-dependent KV cache updates, where each sequence carries a specified position_id for updating its portion of the cache.
  3. Support virtual dynamic batching, so that multi-batch prompt encoding is handled transparently from vLLM's perspective.
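
The following toy sketch (not the transformers-neuronx API; names and sizes are made up for illustration) shows the shared-cache idea behind items 1 and 2: prompts are encoded one at a time at batch size 1, but they write into a single KV cache allocated for the maximum decode batch, indexed by sequence id and position.

import torch

max_batch_size, max_len, hidden = 4, 8, 16
shared_k_cache = torch.zeros(max_batch_size, max_len, hidden)

def prefill(seq_id: int, prompt_len: int) -> None:
    # Batch-size-1 prompt encoding: fill positions [0, prompt_len) of one cache slot.
    shared_k_cache[seq_id, :prompt_len] = torch.randn(prompt_len, hidden)

def decode(seq_ids: torch.Tensor, positions: torch.Tensor) -> None:
    # Batch-dependent cache update: each sequence writes at its own position.
    shared_k_cache[seq_ids, positions] = torch.randn(len(seq_ids), hidden)

prefill(seq_id=0, prompt_len=4)
prefill(seq_id=1, prompt_len=3)
decode(seq_ids=torch.tensor([0, 1]), positions=torch.tensor([4, 3]))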

Changes to vLLM

Implementation Details

Model-specific forward function (e.g. Llama-specific code for Neuron)

def forward(
    self,
    input_ids: torch.Tensor,
    positions: torch.Tensor,
    kv_caches: List[KVCache],
    input_metadata: InputMetadata,
    cache_events: Optional[List[torch.cuda.Event]],
) -> SamplerOutput:
    batch_size, n_active_tokens = input_ids.shape

    with torch.inference_mode():
        block_size = self.model.context_buckets[-1]
        if input_metadata.num_generation_tokens == 0:
            # Prompt (prefill) phase: derive one sequence id per prompt from the
            # first cache slot assigned to that prompt.
            num_prompts = input_metadata.num_prompts
            seq_ids = torch.zeros(num_prompts, 1, dtype=torch.int64, device='cpu')
            anchor = 0
            for prompt_id in range(num_prompts):
                seq_ids[prompt_id] = input_metadata.slot_mapping[anchor] // block_size
                anchor += input_metadata.prompt_lens[prompt_id]
        else:
            # Decode phase: the block tables already identify the sequences.
            seq_ids = input_metadata.block_tables

        # transformers-neuronx takes positions as cache_ids and sequence ids as start_ids.
        logits = self.model(input_ids, cache_ids=positions, start_ids=seq_ids)
        next_tokens = self.sampler(logits, input_metadata)

    return next_tokens
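
For intuition, here is a small self-contained walk-through of how the prefill branch above derives sequence ids; the concrete numbers are hypothetical and chosen only to make the arithmetic visible.

# Hypothetical example: two prompts of lengths 3 and 2, a context bucket
# ("block_size") of 128, and a slot mapping whose first slot for each prompt
# falls inside that prompt's region of the cache.
block_size = 128
prompt_lens = [3, 2]
slot_mapping = [0, 1, 2, 128, 129]

anchor, seq_ids = 0, []
for prompt_len in prompt_lens:
    seq_ids.append(slot_mapping[anchor] // block_size)
    anchor += prompt_len

print(seq_ids)  # [0, 1]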

Model compilation

def load_weights(self,
                 model_name_or_path: str,
                 cache_dir: Optional[str] = None,
                 load_format: str = "auto",
                 revision: Optional[str] = None,
                 **kwargs):
    import os

    from transformers_neuronx.llama.model import LlamaForSampling

    # Save a split copy of the HuggingFace checkpoint on first use, as
    # expected by transformers-neuronx.
    if not os.path.exists(f"{model_name_or_path}-split"):
        from transformers.models.llama import LlamaForCausalLM
        from transformers_neuronx.module import save_pretrained_split

        hf_model = LlamaForCausalLM.from_pretrained(model_name_or_path, low_cpu_mem_usage=True)
        save_pretrained_split(hf_model, f"{model_name_or_path}-split")

    # Load the split checkpoint and compile the model for Neuron.
    self.model = LlamaForSampling.from_pretrained(f"{model_name_or_path}-split", **kwargs)
    self.model.to_neuron()

Model-agnostic code (e.g. the generic model loader)

# Load the weights from the cached or downloaded files.
from transformers_neuronx.config import NeuronConfig, ContinuousBatchingConfig

continuous_batching_config = ContinuousBatchingConfig(
    batch_size_for_shared_caches=scheduler_config.max_num_seqs)
neuron_config = NeuronConfig(continuous_batching=continuous_batching_config)
model.load_weights(model_config.model, model_config.download_dir,
                   model_config.load_format, model_config.revision,
                   tp_degree=parallel_config.tp_degree,
                   amp='f32', neuron_config=neuron_config,
                   context_length_estimate=[scheduler_config.max_model_len],
                   n_positions=[scheduler_config.max_model_len],
                   batch_size=scheduler_config.max_num_seqs)

Related Resources

Stable releases of the transformers-neuronx package can be found at https://pip.repos.neuron.amazonaws.com/transformers-neuronx/. The package can be installed with

pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
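
As a quick sanity check after installation (a minimal sketch; it uses only standard Python tooling, so it should work regardless of the transformers-neuronx release), the installed version can be printed with:

import importlib.metadata

import transformers_neuronx  # raises ImportError if the package is missing

print(importlib.metadata.version("transformers-neuronx"))
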
anyscalesam commented:

Any targeted ETA for the implementation, @liangfu?

liangfu commented Jan 4, 2024

The implementation of the initial support (Orca-Max style, without paged attention) is actually done in a local branch (https://github.com/liangfu/vllm/tree/neuron); we just need time to upstream it. For paged attention, my rough estimate is around late March.

miladm commented Jan 8, 2024

Thanks @liangfu for this update!
Does this implementation work only on the Neuron backend, or on any torch_xla backend? If only Neuron, how much work would it be to extend support to other backends (e.g. GPU, TPU)?

Would love to discuss some of the code changes you've made. Do you plan to submit a PR of your branch?

cc @allenwang28 @JackCaoG @shauheen

liangfu commented Jan 8, 2024

Thanks @miladm for bringing torch-xla into the discussion.

Does this implementation work only on the Neuron backend, or on any torch_xla backend?

I was trying to minimize the amount of torch-xla code, since tracing two graphs for prefill and decode could be a headache. From my observation, the HLO graphs for prefill and decode differ in many aspects. Tracing from torch operators to HLO can also be tricky: sometimes we want an exact HLO subgraph, but it is hard to produce via tracing. To achieve high performance, we ended up with manually designed HLO subgraphs.

I would like to hear what your idea is. cc @miladm

If only Neuron, how much work would it be to extend support to other backends (e.g. GPU, TPU)?

The GPU backend is already here in vLLM, and it is considered high-performance, right?

I guess there is no backend that is exclusive to Neuron. Even with transformers-neuronx, what we built is two separate graphs (prefill and decode) implemented with HLO operators. Those HLO operators should be able to run on backends other than Neuron (e.g. TPU).

The tricky part is that it is hard to achieve high performance through an easy-to-program interface that still builds the exact HLO graph we want. Therefore, I'm open to discussing what a better option for integrating with vLLM could be.

Do you plan to submit a PR of your branch?

The initial PR, which makes the import of CUDA function calls lazy: #2065

We are planning to move on from this, and upstream the integration with transformers-neuronx.

liangfu commented Mar 2, 2024

The initial support has been merged. Closing this issue.

liangfu closed this as completed Mar 2, 2024
a-creation commented:

Thanks for this PR! One quick question: it seems the initial support only allows a maximum input sequence length of 128 tokens because it has to match the block size - is my understanding correct?

liangfu commented Mar 11, 2024

it seems the initial support only allows a maximum input sequence length of 128 tokens because it has to match the block size - is my understanding correct?

The plan is to introduce paged attention, so that the block size doesn't have to match the sequence length. For now, we need to extend the list of supported block sizes.
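
As an illustration of the current constraint, a hypothetical offline-inference setup might pin the block size to the maximum model length (parameter names follow vLLM's engine arguments at the time of this thread; treat this as a sketch rather than tested guidance):

from vllm import LLM, SamplingParams

# Hypothetical example: keep block_size equal to max_model_len so that each
# sequence fits in a single block, matching the current Neuron limitation.
llm = LLM(
    model="openlm-research/open_llama_3b",
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)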

MojHnd commented Mar 14, 2024

@liangfu @WoosukKwon
Thank you. Is there any ETA for supporting paged attention in the Neuron SDK?

MojHnd commented Mar 30, 2024

@liangfu @WoosukKwon
Thank you for your invaluable effort. Is there any ETA for supporting paged attention in the Neuron SDK?
