fix device mismatch issue for pe_audio_video model parallelism #42917
base: main
Conversation
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
zucchini-nlp left a comment
cc @eustlb for audio PE
_no_split_modules = [
    "PeAudioVideoEncoderLayer",
    "TimmWrapperForImageClassification",
]
Interesting, timm doesn't support accelerate. Usually we don't add a backbone model, since no_split_modules is unwrapped recursively for all children.
Since timm doesn't support accelerate, this is a possible workaround. Though we should add it in TimmWrapperPreTrainedModel._no_split_modules and let it be re-used in other multimodal LLMs.
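For reference, a minimal sketch of what that relocation could look like (this is not the actual transformers source, just an illustration of the idea):

from transformers import PreTrainedModel

class TimmWrapperPreTrainedModel(PreTrainedModel):
    # Declaring the whole wrapper as un-splittable keeps accelerate's device-map
    # inference from scattering the timm backbone's children across devices.
    # PE (and any other multimodal model wrapping timm) would then pick this up
    # instead of listing "TimmWrapperForImageClassification" itself.
    _no_split_modules = ["TimmWrapperForImageClassification"]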
If I put TimmWrapperForImageClassification into TimmWrapperPreTrainedModel._no_split_modules, it fails here: self.assertSetEqual(set(new_model.hf_device_map.values()), {0, 1}), throwing AssertionError: Items in the second set but not the first:. Apart from this, it also fails for pytest -rA tests/models/pe_audio_video/test_modeling_pe_audio_video.py::PeAudioVideoEncoderTest::test_cpu_offload, with this error:
src/transformers/models/pe_video/modeling_pe_video.py:182: in forward
    vision_encoder_outputs = self.vision_model(pixel_values_videos)
/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1778: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1789: in _call_impl
    return forward_call(*args, **kwargs)
/opt/venv/lib/python3.12/site-packages/accelerate/hooks.py:175: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/timm_wrapper/modeling_timm_wrapper.py:360: in forward
    logits = self.timm_model(pixel_values, **kwargs)
/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1778: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1789: in _call_impl
    return forward_call(*args, **kwargs)
/opt/venv/lib/python3.12/site-packages/accelerate/hooks.py:170: in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
/opt/venv/lib/python3.12/site-packages/accelerate/hooks.py:369: in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
/opt/venv/lib/python3.12/site-packages/accelerate/utils/operations.py:170: in send_to_device
    return honor_type(
/opt/venv/lib/python3.12/site-packages/accelerate/utils/operations.py:82: in honor_type
    return type(obj)(generator)
/opt/venv/lib/python3.12/site-packages/accelerate/utils/operations.py:171: in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tensor = tensor(..., device='meta', size=(288, 3, 14, 14)), device = 0, non_blocking = False, skip_keys = None

    def send_to_device(tensor, device, non_blocking=False, skip_keys=None):
        """
        Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.

        Args:
            tensor (nested list/tuple/dictionary of `torch.Tensor`):
                The data to send to a given device.
            device (`torch.device`):
                The device to send the data to.

        Returns:
            The same data structure as `tensor` with all tensors sent to the proper device.
        """
        if is_torch_tensor(tensor) or hasattr(tensor, "to"):
            # `torch.Tensor.to("npu")` could not find context when called for the first time
            # (see this [issue](https://gitee.com/ascend/pytorch/issues/I8KECW?from=project-issue)).
            if device == "npu":
                device = "npu:0"
            try:
>               return tensor.to(device, non_blocking=non_blocking)
E               NotImplementedError: Cannot copy out of meta tensor; no data!
So can we just skip the model parallelism tests here?
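For context, the failing model-parallelism check is essentially of this shape (a hedged sketch; "local/path/to/pe-audio-video" is a placeholder checkpoint, not a real repository):

from transformers import AutoModel

# Load with an automatically inferred device map spanning both visible GPUs.
new_model = AutoModel.from_pretrained("local/path/to/pe-audio-video", device_map="auto")

# The test expects the weights to be spread across both devices...
print(new_model.hf_device_map)
# ...but with the whole timm wrapper marked as un-splittable, every entry in
# hf_device_map can land on a single device, so this set comparison fails.
assert set(new_model.hf_device_map.values()) == {0, 1}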
It's failing for me even before moving no_split_module under a timm PreTrainedModel, so the issue is not exactly in the location of no_split_module
Looks like the parallelism test is failing because the layers are too big to fit on cuda:0, so accelerate is putting everything on cuda:1. I'd say we can skip the test and add a reason in the description.
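Something along these lines is what happens under the hood (a sketch using accelerate directly; `model` is assumed to be an already-instantiated PE audio-video model and the memory budgets are made up for illustration):

from accelerate import infer_auto_device_map

# With a tight budget on GPU 0, an un-splittable module that does not fit there
# is assigned entirely to GPU 1, together with everything that follows it.
device_map = infer_auto_device_map(
    model,                                   # assumed: the instantiated PE audio-video model
    max_memory={0: "200MB", 1: "4GB"},       # made-up budgets for illustration
    no_split_module_classes=["TimmWrapperForImageClassification"],
)
print(set(device_map.values()))              # can collapse to {1} instead of {0, 1}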
OK, have updated the code.
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@unittest.skip(reason="TimmWrapperModel does not support model parallelism")
def test_model_parallelism(self):
    pass
No, no, I meant to keep the changes and skip the tests. With the proposed diff we can support model parallelism, but the tests fail because of the way they are designed.
Can you revert the previous diff and move no_split_module under timm's PreTrainedModel instead of PE?
Is it OK now?
[For maintainers] Suggested jobs to run (before merge): run-slow: pe_audio_video, pe_video, timm_wrapper
No description provided.