Description
Machine Info
- Ubuntu
- CUDA driver 12.1
- T4 GPU x4
Reproduction Steps
Change only the `llm` line in examples/offline_inference_with_prefix.py to:
llm = LLM(model="baichuan-inc/Baichuan2-13B-Chat", tensor_parallel_size=4, enforce_eager=True, dtype="half", trust_remote_code=True)
Error message:
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 65538, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
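The numbers in the error are very tight: the reported hardware limit (65536 bytes = 64 KiB of shared memory per thread block on the T4) is exceeded by only 2 bytes. A quick sketch of that arithmetic, using the two values straight from the error message:

```python
# Shared-memory arithmetic from the Triton error message above.
HARDWARE_LIMIT = 65536   # bytes, "Hardware limit" reported for the T4
REQUIRED = 65538         # bytes, "Required" reported by the autotuner

overflow = REQUIRED - HARDWARE_LIMIT
print(f"kernel exceeds the limit by {overflow} bytes")  # 2 bytes
```

Being only 2 bytes over suggests the kernel's block-size/`num_stages` configuration is right at the edge of what the T4 allows, so even a small per-block overhead (e.g. from the ALiBi bias tile) could push it over.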
When I use a model with a common Llama architecture, such as TinyLlama, it works fine. Might this be related to ALiBi?
@DouHappy @caoshiyi @zhuohan123 Please take a look!