Prefix caching error with Baichuan model #2513

@esmeetu

Description

Machine Info

  • Ubuntu
  • CUDA driver 12.1
  • 4× T4 GPUs

Reproduction Steps

Change only the `llm` line in examples/offline_inference_with_prefix.py to:

llm = LLM(model="baichuan-inc/Baichuan2-13B-Chat", tensor_parallel_size=4, enforce_eager=True, dtype="half", trust_remote_code=True)
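
For completeness, a condensed sketch of the modified example (the prompt strings are illustrative, and `prefix_pos` is the prefix-caching hook that examples/offline_inference_with_prefix.py used at the time):

```python
from vllm import LLM, SamplingParams

# Shared prefix whose KV cache should be reused across prompts.
prefix = ("You are an expert school principal. "
          "Draft a short answer to the question below.\n")
prompts = [prefix + "What qualities make a good teacher?",
           prefix + "How should a school handle tardiness?"]

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="baichuan-inc/Baichuan2-13B-Chat",
          tensor_parallel_size=4,
          enforce_eager=True,
          dtype="half",
          trust_remote_code=True)

# Mark how many leading tokens the prompts share; reusing their KV cache is
# what routes prefill through the Triton prefix kernel that crashes below.
prefix_len = len(llm.get_tokenizer().encode(prefix))
outputs = llm.generate(prompts, sampling_params,
                       prefix_pos=[prefix_len] * len(prompts))

for out in outputs:
    print(out.outputs[0].text)
```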

Error message:

triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 65538, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
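
The overflow is marginal: the kernel requests just 2 bytes more than the T4's 64 KiB per-block shared-memory limit.

```python
# Numbers taken from the traceback above; sm_75 (T4) caps static shared
# memory at 64 KiB per thread block.
required = 65538
hardware_limit = 64 * 1024  # 65536
print(required - hardware_limit)  # -> 2 (bytes over the limit)
```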

When I use a model with the common Llama architecture, such as TinyLlama, it works fine. Might this be related to ALiBi?
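
For contrast, the working control run with a RoPE-based Llama-architecture model on the same example (the exact repo id is an assumption; the report only says TinyLlama):

```python
# Control: a Llama-architecture (RoPE) model does not hit the shared-memory
# limit on the same example, which points at the ALiBi path in the prefill
# kernel. The repo id here is assumed, not taken from the report.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          enforce_eager=True,
          dtype="half")
```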

@DouHappy @caoshiyi @zhuohan123 Please take a look!
