Closed
Description
I noticed that the sampler stage launches many repeated CUDA kernels. It looks like sampling is done in a Python for loop, launching a separate kernel for each sequence. Why is this?
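For context, here is a minimal sketch of the pattern I mean, in plain PyTorch (the shapes and the use of torch.multinomial are my own illustrative assumptions, not vLLM's actual sampler code): sampling each sequence inside a Python loop launches a few small kernels per sequence, whereas a batched call does the same work in a handful of larger launches.

import torch

logits = torch.randn(8, 32000, device="cuda")  # [num_seqs, vocab_size]

# Per-sequence sampling: a few small kernel launches for every sequence.
tokens_loop = []
for i in range(logits.shape[0]):
    probs = torch.softmax(logits[i], dim=-1)
    tokens_loop.append(torch.multinomial(probs, num_samples=1))

# Batched sampling: the same work in a handful of larger kernel launches.
probs = torch.softmax(logits, dim=-1)
tokens_batched = torch.multinomial(probs, num_samples=1)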
By the way, have you compared the performance with FasterTransformer? I didn't see any results on that.
Thank you!

Below is my code:

import time

import nvtx
from vllm import LLM, SamplingParams

# input_ids: pre-tokenized prompts (prepared elsewhere, not shown)
path = '/data/llm/hf-llama-7b/'
llm = LLM(model=path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params.max_tokens = 1  # generate a single token so only the prefill step is timed
cnt = 1
start = time.time()
for i in range(cnt):
    with nvtx.annotate("generate", color="red"):
        outputs = llm.generate(prompt_token_ids=input_ids, sampling_params=sampling_params)
end = time.time()
prefill_ticks = (end - start) / cnt  # average prefill latency per run
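For reference, the repeated kernels show up when capturing a timeline with Nsight Systems. Assuming the script above is saved as benchmark.py (a hypothetical filename), a command like:

nsys profile -o sampler_trace python benchmark.py

produces a trace in which the "generate" NVTX range makes the sampler stage easy to locate.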