
failure to launch codegeex4-all-9b Using vllm #11910

Open
YongZhuIntel opened this issue Aug 23, 2024 · 9 comments

@YongZhuIntel
We are trying to launch codegeex4-all-9b using vLLM, following the CodeGeeX4 GitHub instructions:
https://github.com/THUDM/CodeGeeX4?tab=readme-ov-file#vllm

The scripts are as follows:
codegeex_offline_example.py:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# CodeGeeX4-ALL-9B
# max_model_len, tp_size = 1048576, 4
# If OOM, please reduce max_model_len or increase tp_size
max_model_len, tp_size = 2048, 4
model_name = "/llm/models/codegeex4-all-9b"
prompt = [{"role": "user", "content": "Hello"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # If OOM, try using the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

codegeex_offline_example.sh:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
python codegeex_offline_example.py

When running codegeex_offline_example.sh in Docker we got an error:

  File "/llm/vllm/vllm/model_executor/layers/attention/backends/torch_sdpa.py", line 112, in for
ward
    output = PagedAttentionImpl.forward_decode(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm/vllm/vllm/model_executor/layers/attention/ops/paged_attn.py", line 66, in forward_d
ecode
    ops.paged_attention_v1(
RuntimeError: "paged_attention_xpu_v1_impl" not implemented for 'BFloat16'

Error log:
codegeex_offline_example_error.log

@gc-fu
Contributor

gc-fu commented Aug 23, 2024

Try adding torch_dtype="float16".

For instance:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    torch_dtype="float16",  # add this
    # If OOM, try using the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)

@YongZhuIntel
Author

torch_dtype is not recognized:

Traceback (most recent call last):
  File "/llm/zhuyong/vllm/codegeex_offline_example.py", line 13, in <module>
    llm = LLM(
          ^^^^
  File "/llm/vllm/vllm/entrypoints/llm.py", line 91, in __init__
    engine_args = EngineArgs(
                  ^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'torch_dtype'

@gc-fu
Contributor

gc-fu commented Aug 23, 2024

Sorry, it is dtype="float16".
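
For reference, a minimal sketch of the corrected constructor call (same parameters as before; only the keyword name changes from torch_dtype to dtype):

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",  # vLLM's LLM/EngineArgs expect `dtype`, not `torch_dtype`
)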

@Uxito-Ada
Contributor

Hi @YongZhuIntel ,

I successfully ran codegeex4-all-9b with vLLM on a single A770 card as well as on two cards. Note that for a single card, max-model-len should be reduced to no more than 6048, which is the size of the KV cache store.
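
For example, a single-card variant of the earlier offline script could look like the sketch below (6048 and tp_size=1 are the single-card settings mentioned above; everything else follows the original example):

# Single A770 card: keep max_model_len at or below 6048 (the KV cache limit noted above)
max_model_len, tp_size = 6048, 1
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",
)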

@YongZhuIntel
Author

@Uxito-Ada I ran codegeex4-all-9b with vLLM on a single card in int4 format:

model="/llm/models/codegeex4-all-9b"
served_model_name="codegeex4-all-9b"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/oneapi/setvars.sh --force
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1

but got an OOM error when running "python vllm_online_benchmark.py codegeex4-all-9b 2":

    |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 769, in forward
    |     result = xe_linear.forward_new(x_2d, self.weight.data,
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: Allocation is out of device memory on current platform.

@Uxito-Ada
Contributor

Hi @YongZhuIntel ,

With the script you provided, I can successfully start the vLLM server and then execute the inference request from vLLM-Serving's README.

Which version of ipex-llm is used in your environment? Please also provide the codegeex_offline_example.py content, as request workloads also influence the memory footprint.

@YongZhuIntel
Author

@Uxito-Ada I ran vLLM on the Docker image: intelanalytics/ipex-llm-serving-vllm-xpu-experiment:latest

The vllm_online_benchmark.py:
vllm_online_benchmark.py.txt

@YongZhuIntel
Author

INFO 08-27 09:33:39 gpu_executor.py:100] # GPU blocks: 12587, # CPU blocks: 6553
Error log:
start_codegeex4-all-9b_serving_1card_int4_err.log

@Uxito-Ada
Contributor

Hi @YongZhuIntel ,

GPU memory consumption can be decreased by tuning the server parameters. For example, after lowering gpu-memory-utilization from 0.95 to 0.8~0.9, I can successfully execute the workloads in vllm_online_benchmark.py with max_seq=2.
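
For illustration, only the gpu-memory-utilization flag in the earlier start script needs to change (0.85 here is just one value inside the suggested 0.8~0.9 range; all other arguments stay the same):

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1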
