
Conversation

@jw1912 (Owner) commented Sep 27, 2025

No description provided.

@jw1912 jw1912 marked this pull request as ready for review September 27, 2025 01:44
@Disservin (Contributor)
Is this ready to be tested?

@jw1912 (Owner, Author) commented Sep 27, 2025

@Disservin it works correctly (verified by Viren on 1x, 2x, 3x, and 4x 4090s), but at the moment it requires a large batch size to be a good speedup, because the gradient accumulation code is basic and suboptimal.
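
For context, a minimal sketch of the basic scheme being described, in plain Python with a toy scalar model (illustrative only, not bullet's actual code): each device accumulates gradients over its own micro-batches locally, then the accumulated gradients are averaged across devices once per optimiser step and every device applies the identical update.

```python
# Minimal sketch (illustrative, not bullet's actual code) of naive
# multi-GPU gradient accumulation: accumulate locally per device,
# exchange gradients once per optimiser step, apply the same update.

def gradient(w, micro_batch):
    # Placeholder for the real backward pass: toy least-squares gradient
    # for a single scalar weight on (input, target) pairs.
    return sum(2 * (w * x - y) * x for x, y in micro_batch) / len(micro_batch)

def train_step(w, batch, num_devices, micro=2, lr=0.005):
    shards = [batch[i::num_devices] for i in range(num_devices)]
    local = []
    for shard in shards:                       # runs in parallel on real GPUs
        acc, steps = 0.0, 0
        for i in range(0, len(shard), micro):  # accumulate locally, no traffic
            acc += gradient(w, shard[i:i + micro])
            steps += 1
        local.append(acc / steps)
    # The only interconnect traffic: one gradient exchange per step, whose
    # size depends on the parameter count, not on the batch size.
    return w - lr * sum(local) / num_devices

w = 0.0
batch = [(x, 3.0 * x) for x in range(1, 17)]
for _ in range(200):
    w = train_step(w, batch, num_devices=4)
print(w)  # converges towards 3.0
```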

@Disservin (Contributor)
So the interconnect is too limiting for it to be worth much?
On my server, PyTorch multi-GPU gives no speedup, but it does for vondele, since his machine has NVLink between the GPUs.

@jw1912 (Owner, Author) commented Sep 27, 2025

@Disservin yes, but if you increase the batch size by several times, the transfers become infrequent enough that it is a speedup even on normal multi-GPU (the transfers are just weight gradients/values, so their size has no dependence on batch size, of course).
It would be cool to get a test on a system with NVLink.
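
To make the amortisation argument concrete, some back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not measurements from this PR): the per-step transfer cost is fixed by the parameter count, so growing the batch shrinks its relative share of the step.

```python
# Back-of-the-envelope model of why a larger batch amortises the transfers.
# All numbers here are illustrative assumptions, not measurements.

params = 3_000_000           # network parameters (assumed)
bytes_per_param = 4          # fp32 gradients
pcie_bw = 12e9               # effective PCIe bandwidth, bytes/s (assumed)
positions_per_sec = 50e6     # single-GPU training throughput (assumed)

# Per optimiser step, each device exchanges its weight gradients once.
# This cost is fixed: it depends on the parameter count, not the batch size.
transfer_s = params * bytes_per_param / pcie_bw

for batch_size in (16_384, 131_072, 1_048_576):
    compute_s = batch_size / positions_per_sec
    overhead = transfer_s / (compute_s + transfer_s)
    print(f"batch {batch_size:>9}: transfer overhead {overhead:6.1%}")
```

Under these assumed numbers, the fixed ~1 ms transfer falls from roughly 75% of the step at the smallest batch to under 5% at the largest; NVLink would instead shrink the fixed cost directly by raising the bandwidth.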

@jw1912 (Owner, Author) commented Sep 27, 2025

I will be trying some alternative strategies to reduce the impact of the interconnect latency.

@jw1912 jw1912 merged commit b06dd9b into main Sep 28, 2025
6 checks passed
@jw1912 jw1912 deleted the multi-gpu branch September 28, 2025 00:09