Name of failing test
samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
🧪 Describe the failing test
It seems the issue is that we are now passing empty lists to `_flatten_embeddings`:
FAILED samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half] - RuntimeError: torch.cat(): expected a non-empty list of Tensors
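For context, the `RuntimeError` comes straight from PyTorch: `torch.cat()` refuses an empty sequence, which is effectively what `_flatten_embeddings` ends up doing when handed an empty list. A minimal standalone repro, independent of vLLM:

```python
import torch

# torch.cat() rejects an empty sequence of tensors, which is effectively
# what _flatten_embeddings does when it receives an empty embeddings list.
try:
    torch.cat(tuple(t.flatten(0, -2) for t in []))
except RuntimeError as e:
    print(e)  # torch.cat(): expected a non-empty list of Tensors
```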
Full output:
pytest -s -v "samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]"
INFO 06-17 09:19:56 [__init__.py:244] Automatically detected platform cuda.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================================================================ test session starts =============================================================================================
platform linux -- Python 3.12.4, pytest-8.3.3, pluggy-1.5.0 -- /home/mgoin/venvs/vllm/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/mgoin/code/vllm/tests/.hypothesis/examples'))
rootdir: /home/mgoin/code/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, subtests-0.14.1, asyncio-0.24.0, shard-0.1.2, buildkite-test-collector-0.1.9, timeout-2.3.1, schemathesis-3.39.15, anyio-4.6.2.post1, mock-3.14.0, hypothesis-6.131.0, rerunfailures-14.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 1 item
Running 1 items in this shard: tests/samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half] WARNING 06-17 09:19:58 [config.py:3273] Casting torch.bfloat16 to torch.float16.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 4.90it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO 06-17 09:20:16 [config.py:831] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 06-17 09:20:16 [config.py:3273] Casting torch.bfloat16 to torch.float16.
INFO 06-17 09:20:16 [config.py:1444] Using max model len 1024
INFO 06-17 09:20:16 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1.dev287+g89b1388d8) with config: model='Qwen/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-Audio-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":false,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=False,
INFO 06-17 09:20:18 [cuda.py:336] Using Flash Attention backend.
INFO 06-17 09:20:18 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-17 09:20:18 [model_runner.py:1171] Starting to load model Qwen/Qwen2-Audio-7B-Instruct...
INFO 06-17 09:20:19 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.59it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.49it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.44it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.88it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.68it/s]
INFO 06-17 09:20:22 [default_loader.py:272] Loading weights took 3.00 seconds
INFO 06-17 09:20:22 [model_runner.py:1203] Model loading took 15.6455 GiB and 3.447517 seconds
INFO 06-17 09:20:25 [worker.py:294] Memory profiling takes 2.68 seconds
INFO 06-17 09:20:25 [worker.py:294] the current vLLM instance can use total_gpu_memory (79.19GiB) x gpu_memory_utilization (0.90) = 71.27GiB
INFO 06-17 09:20:25 [worker.py:294] model weights take 15.65GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.51GiB; the rest of the memory reserved for KV Cache is 55.11GiB.
INFO 06-17 09:20:25 [executor_base.py:113] # cuda blocks: 7054, # CPU blocks: 512
INFO 06-17 09:20:25 [executor_base.py:118] Maximum concurrency for 1024 tokens per request: 110.22x
INFO 06-17 09:20:27 [model_runner.py:1513] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s]
FAILED
================================================================================== FAILURES ==================================================================================
__________________________________________________________ test_beam_search_passes_multimodal_data[False-2-64-half] __________________________________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>, dtype = 'half', max_tokens = 64, beam_width = 2
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", MAX_TOKENS)
@pytest.mark.parametrize("beam_width", MM_BEAM_WIDTHS)
def test_beam_search_passes_multimodal_data(
hf_runner,
vllm_runner,
dtype: str,
max_tokens: int,
beam_width: int,
) -> None:
"""Ensure that beam search passes multimodal data through correctly."""
# NOTE - this test is primarily to check that mm data is passed to beams
# correctly. As such, we just need to check one extra modality to make
# sure things pass through properly.
audios = [AudioAsset("mary_had_lamb").audio_and_sample_rate]
model = "Qwen/Qwen2-Audio-7B-Instruct"
audio_seq = "<|audio_bos|><|AUDIO|><|audio_eos|>"
prompts = [
f"<|im_start|>user\n{audio_seq}Can you transcribe this?<|im_end|>\n<|im_start|>assistant\n" #noqa: E501
]
with hf_runner(model, dtype=dtype,
auto_cls=AutoModelForSeq2SeqLM) as hf_model:
audio_token_id = hf_model.config.audio_token_index
eos_token_id = hf_model.tokenizer.eos_token_id # <|im_end|>
hf_outputs = hf_model.generate_beam_search(
prompts,
beam_width=beam_width,
max_tokens=max_tokens,
audios=audios,
)
> with vllm_runner(model, dtype=dtype) as vllm_model:
samplers/test_beam_search.py:102:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
conftest.py:782: in __init__
self.model = LLM(
../vllm/entrypoints/llm.py:262: in __init__
self.llm_engine = LLMEngine.from_engine_args(
../vllm/engine/llm_engine.py:501: in from_engine_args
return engine_cls.from_vllm_config(
../vllm/engine/llm_engine.py:477: in from_vllm_config
return cls(
../vllm/engine/llm_engine.py:268: in __init__
self._initialize_kv_caches()
../vllm/engine/llm_engine.py:426: in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
../vllm/executor/executor_base.py:124: in initialize_cache
self.collective_rpc("initialize_cache",
../vllm/executor/uniproc_executor.py:57: in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
../vllm/utils.py:2690: in run_method
return func(*args, **kwargs)
../vllm/worker/worker.py:335: in initialize_cache
self._warm_up_model()
../vllm/worker/worker.py:365: in _warm_up_model
self.model_runner.capture_model(self.gpu_cache)
../../../venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py:116: in decorate_context
return func(*args, **kwargs)
../vllm/worker/model_runner.py:1658: in capture_model
graph_runner.capture(**capture_inputs)
../vllm/worker/model_runner.py:2059: in capture
self.model(
../../../venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py:1751: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
../../../venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py:1762: in _call_impl
return forward_call(*args, **kwargs)
../vllm/model_executor/models/qwen2_audio.py:389: in forward
inputs_embeds = self.get_input_embeddings(input_ids,
../vllm/model_executor/models/qwen2_audio.py:368: in get_input_embeddings
inputs_embeds = merge_multimodal_embeddings(
../vllm/model_executor/models/utils.py:498: in merge_multimodal_embeddings
return _merge_multimodal_embeddings(
../vllm/model_executor/models/utils.py:411: in _merge_multimodal_embeddings
flattened = _flatten_embeddings(multimodal_embeddings)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
embeddings = []
def _flatten_embeddings(embeddings: NestedTensors) -> torch.Tensor:
"""
Recursively flattens and concatenates NestedTensors on all but the last
dimension.
"""
if isinstance(embeddings, torch.Tensor):
# Flatten all but the last dimension.
return embeddings.flatten(0, -2)
> return torch.cat(tuple(_flatten_embeddings(t) for t in embeddings))
E RuntimeError: torch.cat(): expected a non-empty list of Tensors
../vllm/model_executor/models/utils.py:363: RuntimeError
============================================================================== warnings summary ==============================================================================
../../../venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
tests/samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/librosa/core/intervals.py:15: DeprecationWarning: path is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
with resources.path("librosa.core", "intervals.msgpack") as imsgpack:
tests/samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/audioread/rawread.py:16: DeprecationWarning: 'aifc' is deprecated and slated for removal in Python 3.13
import aifc
tests/samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/audioread/rawread.py:17: DeprecationWarning: 'audioop' is deprecated and slated for removal in Python 3.13
import audioop
tests/samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half]
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/audioread/rawread.py:19: DeprecationWarning: 'sunau' is deprecated and slated for removal in Python 3.13
import sunau
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== short test summary info ===========================================================================
FAILED samplers/test_beam_search.py::test_beam_search_passes_multimodal_data[False-2-64-half] - RuntimeError: torch.cat(): expected a non-empty list of Tensors
======================================================================= 1 failed, 5 warnings in 29.93s =======================================================================
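From the traceback, the empty list reaches `_merge_multimodal_embeddings` during CUDA graph capture, where no multimodal data exists for the dummy decode batch. Below is a minimal sketch of one possible guard, assuming the merge can simply be skipped when no multimodal embeddings are present (the helper name and signature are hypothetical, not the actual vLLM code or fix):

```python
import torch
from typing import Optional, Sequence

def merge_multimodal_embeddings_sketch(
    input_ids: torch.Tensor,          # (num_tokens,)
    inputs_embeds: torch.Tensor,      # (num_tokens, hidden_size) text embeddings
    multimodal_embeddings: Optional[Sequence[torch.Tensor]],
    placeholder_token_id: int,        # e.g. audio_token_index for Qwen2-Audio
) -> torch.Tensor:
    # Guard: during warm-up / CUDA graph capture there is no multimodal data,
    # so the sequence is empty and torch.cat() below would raise.
    if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
        return inputs_embeds
    # Flatten all but the last dimension and scatter into the placeholder slots.
    flattened = torch.cat([e.flatten(0, -2) for e in multimodal_embeddings])
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = flattened.to(dtype=inputs_embeds.dtype)
    return inputs_embeds
```

The real fix may instead avoid building the empty list upstream; the sketch only illustrates why an empty input is fatal at this point.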
📝 History of failing test
This was introduced by #19446
https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/1f2a99b2-fbc9-89fc-a08f-c4d431429893?period=7days&tags=scm.branch%3Amain
I believe it was not caught because the change did not trigger this test in CI.
CC List.