Describe the bug
When running the latest version of vLLM (0.8.3), the backend hangs when evaluating with more than one GPU.
To Reproduce
First install vllm:

```
pip install vllm==0.8.3
```
Then run:

```
lighteval vllm "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=bfloat16,data_parallel_size=2,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}" "lighteval|$TASK|0|0" --use-chat-template --output-dir $OUTPUT_DIR
```
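For context, with `data_parallel_size=2` lighteval launches one Ray task per data-parallel rank, each hosting its own single-GPU vLLM engine (these are the `run_inference_one_model` tasks visible in the logs below). The following is a minimal standalone sketch of that pattern using plain vLLM + Ray, assuming two visible GPUs; it is not lighteval's actual implementation, and the prompts/shards are placeholders:

```python
# Sketch of vLLM data parallelism via Ray: one single-GPU engine per Ray
# task, mirroring the run_inference_one_model workers seen in the logs.
# Assumption: 2 visible GPUs; this is not lighteval's internal code.
import ray
from vllm import LLM, SamplingParams

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

@ray.remote(num_gpus=1)
def run_inference_one_model(prompts: list[str]) -> list[str]:
    # Each worker builds its own tensor_parallel_size=1 engine.
    llm = LLM(
        model=MODEL,
        dtype="bfloat16",
        max_model_len=32768,
        gpu_memory_utilization=0.8,
    )
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]

if __name__ == "__main__":
    ray.init()
    # Shard the prompts across the two data-parallel workers.
    shards = [["Prompt A"], ["Prompt B"]]
    futures = [run_inference_one_model.remote(shard) for shard in shards]
    print(ray.get(futures))  # blocks here if the workers never come up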
The logs hang at the Ray placement group step:

```
[2025-04-13 19:07:39,715] [ INFO]: PyTorch version 2.6.0 available. (config.py:54)
INFO 04-13 19:07:48 [__init__.py:239] Automatically detected platform cuda.
[2025-04-13 19:07:51,066] [ INFO]: --- LOADING MODEL --- (pipeline.py:189)
[2025-04-13 19:07:51,529] [ INFO]: --- INIT SEEDS --- (pipeline.py:263)
[2025-04-13 19:07:51,529] [ INFO]: --- LOADING TASKS --- (pipeline.py:216)
[2025-04-13 19:07:51,529] [ INFO]: Found 1 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/ifeval/main.py (registry.py:142)
[2025-04-13 19:07:51,529] [ INFO]: Found 6 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/tiny_benchmarks/main.py (registry.py:142)
[2025-04-13 19:07:51,529] [ INFO]: Found 1 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/mt_bench/main.py (registry.py:142)
[2025-04-13 19:07:51,529] [ INFO]: Found 4 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/mix_eval/main.py (registry.py:142)
[2025-04-13 19:07:51,529] [ INFO]: Found 5 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/olympiade_bench/main.py (registry.py:142)
[2025-04-13 19:07:51,530] [ INFO]: Found 1 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/hle/main.py (registry.py:142)
[2025-04-13 19:07:51,530] [ INFO]: Found 21 custom tasks in /fsx/lewis/git/hf/lighteval/src/lighteval/tasks/extended/lcb/main.py (registry.py:142)
[2025-04-13 19:07:51,534] [ INFO]: HuggingFaceH4/aime_2024 default (lighteval_task.py:187)
[2025-04-13 19:07:51,534] [ WARNING]: Careful, the task lighteval|aime24 is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-04-13 19:07:52,888] [ INFO]: --- RUNNING MODEL --- (pipeline.py:468)
[2025-04-13 19:07:52,888] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:472)
[2025-04-13 19:07:52,906] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 0%| | 0/2 [00:00<?, ?it/s][2025-04-13 19:07:52,915] [ WARNING]: context_size + max_new_tokens=33238 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:274)
[2025-04-13 19:07:55,258] [ INFO]: Started a local Ray instance. (worker.py:1841)
(pid=1276189) INFO 04-13 19:08:03 [__init__.py:239] Automatically detected platform cuda.
(run_inference_one_model pid=1276191) INFO 04-13 19:08:16 [config.py:600] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
(pid=1276191) INFO 04-13 19:08:03 [__init__.py:239] Automatically detected platform cuda.
(run_inference_one_model pid=1276191) INFO 04-13 19:08:16 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=32768.
(run_inference_one_model pid=1276189) INFO 04-13 19:08:16 [config.py:600] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
(run_inference_one_model pid=1276191) INFO 04-13 19:08:18 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
(run_inference_one_model pid=1276191) INFO 04-13 19:08:18 [ray_utils.py:288] Ray is already initialized. Skipping Ray initialization.
(run_inference_one_model pid=1276191) INFO 04-13 19:08:18 [ray_utils.py:335] No current placement group found. Creating a new placement group.
```
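The last log line shows vLLM, inside a Ray task, trying to create a new placement group. A common cause of a placement group never resolving is that the requested GPU bundles exceed what Ray believes is available. As a diagnostic (not part of the original report; the bundle shape below is an assumption based on `data_parallel_size=2`), one can check whether the reservation succeeds with a timeout instead of hanging:

```python
# Diagnostic sketch: verify Ray sees the GPUs a placement group would need.
# If the requested bundles exceed available resources, waiting on the
# placement group blocks indefinitely; the same symptom as the hang above.
import ray
from ray.util.placement_group import placement_group

ray.init()
print("Cluster resources:", ray.cluster_resources())  # expect GPU: 2.0

# Try to reserve one GPU bundle per data-parallel rank.
pg = placement_group([{"GPU": 1, "CPU": 1}] * 2, strategy="PACK")
ray.get(pg.ready(), timeout=60)  # raises GetTimeoutError instead of hanging
print("Placement group ready:", pg.bundle_specs)
```

Running `ray status` in a separate shell while the evaluation is stuck gives the same information from the CLI.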
Expected behavior
The evaluation should run to completion without hanging.
Version info
Running lighteval@cc95ff274718186f587500556d7001645a273ce8. Additional info:
- `transformers` version: 4.51.2
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.11
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: yes (data_parallel_size=2 via Ray)
- Using GPU in script?: yes (2 GPUs)
- GPU type: NVIDIA H100 80GB HBM3