Question about sampler. It takes too much time

I noticed that, the sampler stage uses lots of repeated cuda kernels. Seems you do sampling in a for loop, launch each kernel for a sequence? Why is this?
BTW, do you compare the performance with FasterTransformer? I didn't see about this.
Thank you! 

<img width="525" alt="image" src="https://github.com/vllm-project/vllm/assets/26128514/4853124e-a738-46d2-b7bb-9e6a82e04ca2">

below is my code:
```
path = '/data/llm/hf-llama-7b/'
llm = LLM(model=path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params.max_tokens = 1
cnt = 1
start = time.time()
for i in range(cnt):
    with nvtx.annotate("generate", color="red"):
        outputs = llm.generate(prompt_token_ids = input_ids, sampling_params = sampling_params)
end = time.time()
prefill_ticks = (end - start) / cnt
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Question about sampler. It takes too much time #249

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Question about sampler. It takes too much time #249

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions