Description
Describe the bug
I've recently been updating the package dependencies in my project (Python, PyTorch, Lightning). Without changing anything else in my code or hardware, aside from the Lightning import convention, I now get RuntimeErrors when training a TFT model with the ddp strategy.
Each rank immediately returns an error similar to this, but with different shapes:
[rank5]: RuntimeError: [5]: params[0] in this process with sizes [53, 15] appears not to match sizes of the same param in process 0.
I believe this is because the ddp strategy re-runs the script in separate processes, each of which instantiates its own copy of the model on a subset of the data. However, the pytorch-forecasting implementation of TFT encodes categorical features internally, so if different subsets of the data contain different categorical values, the resulting parameter shapes won't match across ranks.
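For illustration, here is a minimal standalone sketch (plain PyTorch, not the actual pytorch-forecasting code) of why DDP rejects this setup when each rank derives its embedding sizes from its own slice of the data; the category counts are made up to mirror the [53, 15] shape in the error above:

```python
import torch.nn as nn

# Hypothetical per-rank vocabularies: rank 0 happens to see 53 distinct
# category values in its data subset, while another rank only sees 40.
num_categories_rank0 = 53
num_categories_other = 40

# Each process builds its own embedding from the vocabulary it observed locally.
emb_rank0 = nn.Embedding(num_embeddings=num_categories_rank0, embedding_dim=15)
emb_other = nn.Embedding(num_embeddings=num_categories_other, embedding_dim=15)

print(emb_rank0.weight.shape)  # torch.Size([53, 15]) -- params[0] on rank 0
print(emb_other.weight.shape)  # torch.Size([40, 15]) -- different shape, so DDP
                               # reports "appears not to match sizes of the same
                               # param in process 0" when wrapping the model
```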
Expected behavior
TFT with categorical variables should support the ddp training strategy.
Additional context
I'm training on a single EC2 node with 8 GPUs.
Trainer(accelerator="gpu", strategy="ddp", devices=1, ...)
works but is slow:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
This doesn't work:
Trainer(accelerator="gpu", strategy="ddp", devices=8, ...)
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
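If the root cause is indeed per-rank categorical encoding, one possible mitigation (an untested sketch, with hypothetical column names and dataset parameters) would be to pre-fit the categorical encoders on the full DataFrame and pass them to TimeSeriesDataSet, so every rank builds embeddings with the same vocabulary sizes:

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data.encoders import NaNLabelEncoder

full_df = pd.read_parquet("training_data.parquet")  # hypothetical data source

# Fit each encoder on the complete set of category values up front, before the
# Trainer spawns the DDP worker processes.
encoders = {
    "store_id": NaNLabelEncoder(add_nan=True).fit(full_df["store_id"]),
    "product_id": NaNLabelEncoder(add_nan=True).fit(full_df["product_id"]),
}

training = TimeSeriesDataSet(
    full_df,
    time_idx="time_idx",
    target="sales",
    group_ids=["store_id", "product_id"],
    max_encoder_length=30,
    max_prediction_length=7,
    static_categoricals=["store_id", "product_id"],
    categorical_encoders=encoders,  # same vocabularies on every rank
)
```

I haven't verified whether this resolves the shape mismatch with devices=8, but it at least removes the dependence of the embedding sizes on the local data subset.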
Versions
doesn't work:
python = "~=3.11.0"
pytorch-forecasting = "~=1.2.0"
pytorch-lightning = "==2.0.0"
torch = [
{ version = "==2.5.1+cu118", source = "pytorch-cuda", markers = "sys_platform == 'linux' and platform_machine == 'x86_64'" },
{ version = "==2.5.1", source = "picnic", markers = "sys_platform == 'darwin'" },
]
works:
[tool.poetry.dependencies]
python = "~=3.10.0"
pytorch-forecasting = "~=0.10.2"
pytorch-lightning = "~=1.8.0"
torch = [
{ version = "==1.13.1+cu117", source = "pytorch-cuda", markers = "sys_platform=='linux' and platform_machine == 'x86_64'" },
{ version = "==1.13.1", source = "picnic", markers = "sys_platform == 'darwin'" },
]