Description
The problem I encountered
After deploying Qwen2-VL-7B-Instruct-GPTQ-Int4 with vLLM, continuous client requests cause CPU memory usage to rise steadily. Is some memory not being reclaimed?
My specific usage scenario is:
I have two GPUs. When I deploy with the Ray framework for distributed serving, CPU memory usage grows as the VL model processes more requests, eventually causing the Ray actors to crash.
I have tested the native (non-vLLM) loading of Qwen2-VL-7B-Instruct-GPTQ-Int4, and it does not cause CPU memory to grow. Once the model is loaded through the vLLM framework, CPU memory grows continuously.
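For reference, by "native loading" I mean running the model through Transformers alone. A minimal sketch along those lines (based on the standard Qwen2-VL Transformers usage, not my exact test script; the test image path and prompt are placeholders):

    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model_path = "/mnt/data/programdata/vl_model/Qwen2-VL-7B-Instruct-GPTQ-Int4"
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_path)

    messages = [{'role': 'user', 'content': [
        {'type': 'image', 'image': '/path/to/any_test_image.jpg'},
        {'type': 'text', 'text': 'Describe this image.'},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)

    # Generate and strip the prompt tokens before decoding
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

Looping this over different images does not show the growth; only the vLLM path below does.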
[Special note]: When reproducing, be sure to use a different image for each request so that the CPU memory growth is clearly visible. If the same image is reused, the leak happens only once and the growth is easy to miss.
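One convenient way to get a different image per request (a hypothetical helper, assuming Pillow and NumPy are installed; not part of my actual pipeline) is to generate random images on the fly:

    import os
    import tempfile

    import numpy as np
    from PIL import Image


    def make_unique_image(index, size=(1024, 768)):
        # Write a random RGB image to a temp file so every request sees new pixels.
        pixels = np.random.randint(0, 256, (size[1], size[0], 3), dtype=np.uint8)
        path = os.path.join(tempfile.gettempdir(), f"leak_test_{index}.jpg")
        Image.fromarray(pixels).save(path, format="JPEG")
        return path

Each returned path can then be fed to getMessage as pic_file in the code below.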
My code and environment
Here is my code
import os

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info


def getMessage(pic_file):
    messages = [
        {'role': 'system', 'content': 'You are a very useful assistant, please strictly follow the requirements to complete the task!'},
        {'role': 'user', 'content': [
            {'type': 'image_url', 'image_url': pic_file, 'min_pixels': 50176, 'max_pixels': 1411200},
            {'type': 'text', 'text': "Don't worry about the prompt words here, they are just examples"},
        ]},
    ]
    return messages


def vllm_extract_text(result_list, model_path, temperature, top_p, max_token, min_pixels, max_pixels):
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    model_path = "/mnt/data/programdata/vl_model/Qwen2-VL-7B-Instruct-GPTQ-Int4"
    llm = LLM(model=model_path, limit_mm_per_prompt={"image": 5, "video": 0})
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_token, stop_token_ids=[])
    processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)
    # Ignore result_list; it is the data returned from MongoDB.
    for doc in result_list:
        messages = getMessage(doc['pic'])
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, _ = process_vision_info(messages)
        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        llm_inputs = {
            "prompt": text,
            "multi_modal_data": mm_data,
        }
        outputs = llm.generate([llm_inputs], sampling_params=sampling_params, use_tqdm=False)
        for output in outputs:
            generated_text = output.outputs[0].text
        del llm_inputs, outputs
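To quantify the growth, I watch the resident set size of the process between iterations. A minimal sketch (assuming psutil is installed; the logging helper is illustrative and not part of my production code):

    import os

    import psutil

    _process = psutil.Process(os.getpid())


    def log_rss(step):
        # Print the resident set size (CPU memory) of the current process in MiB.
        rss_mib = _process.memory_info().rss / (1024 ** 2)
        print(f"step {step}: CPU RSS = {rss_mib:.1f} MiB")

Calling log_rss at the end of each loop iteration makes the per-request CPU memory growth visible when the image changes every time.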
This is the vLLM version information:
Name: vllm
Version: 0.7.2