
[BUG] TFT + categorical features seems not to be compatible with DDP in some situations. #1825

Open
@mkuiack


Describe the bug

I've recently been updating package dependencies in my project (Python, PyTorch, Lightning). Without changing anything else in my code or hardware, aside from the Lightning import convention, I now get a RuntimeError when training a TFT model with DDP.
Each rank immediately returns an error similar to this one, but with different shapes:

[rank5]: RuntimeError: [5]: params[0] in this process with sizes [53, 15] appears not to match sizes of the same param in process 0.

I believe this is because the DDP strategy launches a separate process per device, each of which re-runs the code and instantiates its own copy of the model on a subset of the data. However, the pytorch-forecasting implementation of TFT encodes categorical features internally, so the embedding sizes depend on the categories that were seen. If different subsets of the data contain different categorical values, the shapes won't match across ranks.
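To make the suspected mechanism concrete, here is a minimal plain-Python sketch. It does not use pytorch-forecasting itself: `fit_vocab` and `embedding_shape` are hypothetical stand-ins, the data is made up, and the sizing heuristic `round(1.6 * n ** 0.56)` is an assumption about the library's default (it does reproduce the `[53, 15]` shape from the error above). If each rank fits its encoder on a different slice of a column, the resulting embedding shapes differ; fitting once on the full column gives every rank the same shape.

```python
# Illustrative only: plain-Python stand-ins, not pytorch-forecasting code.

def fit_vocab(values):
    """Map each distinct category to an integer index, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


def embedding_shape(n_categories, max_size=100):
    """Assumed sizing heuristic: embedding width grows with the category count."""
    if n_categories > 2:
        dim = min(round(1.6 * n_categories ** 0.56), max_size)
    else:
        dim = 1
    return (n_categories, dim)


# Hypothetical data: two ranks that each see a different slice of one column.
full_column = ["a", "b", "c", "d", "e", "f"]
shard_rank0 = full_column[:4]  # rank 0 sees a, b, c, d
shard_rank1 = full_column[3:]  # rank 1 sees d, e, f

# Encoder fitted per shard: the vocabulary size, and so the shape, can differ.
shape_rank0 = embedding_shape(len(fit_vocab(shard_rank0)))
shape_rank1 = embedding_shape(len(fit_vocab(shard_rank1)))

# Encoder fitted once on the full column: every rank would get the same shape.
shape_shared = embedding_shape(len(fit_vocab(full_column)))

print("rank 0:", shape_rank0)
print("rank 1:", shape_rank1)
print("shared:", shape_shared)
```

If this is what is happening, fitting the categorical encoders on the full dataset before the per-rank processes start (or passing identical pre-fitted encoders to every rank) should make the parameter shapes agree.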

Expected behavior

TFT with categorical variables should support the DDP training strategy.

Additional context

I'm training on a single EC2 node with 8 GPUs.

This works, but is slow:

Trainer( accelerator="gpu", strategy="ddp", devices=1, ...

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

This doesn't work:

Trainer( accelerator="gpu", strategy="ddp", devices=8, ...

----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
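To narrow down which parameters differ between ranks, one option is to have each rank report its parameter shapes and compare them against rank 0. The helper below is a hypothetical sketch that only does the comparison step. In a real run, each report could be built with something like `{name: tuple(p.shape) for name, p in model.named_parameters()}` and printed per rank before training starts. The parameter names and the rank 0 shape here are invented for illustration; only the rank 5 shape comes from the error message.

```python
def find_shape_mismatches(shapes_by_rank, reference_rank=0):
    """Return {param_name: {rank: (reference_shape, rank_shape)}} for every
    parameter whose shape on some rank differs from the reference rank."""
    reference = shapes_by_rank[reference_rank]
    mismatches = {}
    for rank, shapes in shapes_by_rank.items():
        if rank == reference_rank:
            continue
        # Check the union of names so parameters missing on one side are reported too.
        for name in sorted(set(reference) | set(shapes)):
            ref_shape = reference.get(name)
            rank_shape = shapes.get(name)
            if ref_shape != rank_shape:
                mismatches.setdefault(name, {})[rank] = (ref_shape, rank_shape)
    return mismatches


# Hypothetical per-rank reports; names and the rank 0 shape are made up.
reports = {
    0: {"embeddings.store.weight": (51, 14), "output_layer.weight": (7, 16)},
    5: {"embeddings.store.weight": (53, 15), "output_layer.weight": (7, 16)},
}

mismatches = find_shape_mismatches(reports)
for name, per_rank in mismatches.items():
    for rank, (ref_shape, rank_shape) in per_rank.items():
        print(f"{name}: rank 0 has {ref_shape}, rank {rank} has {rank_shape}")
```

If only embedding weights show up in the result, that would support the explanation that the categorical encodings differ per rank.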

Versions

Doesn't work:

python = "~=3.11.0"
pytorch-forecasting = "~=1.2.0"
pytorch-lightning = "==2.0.0"
torch = [
  { version = "==2.5.1+cu118", source = "pytorch-cuda", markers = "sys_platform =='linux' and platform_machine== 'x86_64'" },
  { version = "==2.5.1", source = "picnic", markers = "sys_platform== 'darwin'" },
]

Works:

[tool.poetry.dependencies]
python = "~=3.10.0"
pytorch-forecasting = "~=0.10.2"
pytorch-lightning = "~=1.8.0"
torch = [
  { version = "==1.13.1+cu117", source = "pytorch-cuda", markers = "sys_platform=='linux' and platform_machine == 'x86_64'" },
  { version = "==1.13.1", source = "picnic", markers = "sys_platform == 'darwin'" },
]
