
Conversation

@jw1912 (Owner) commented Sep 27, 2025

No description provided.

@jw1912 jw1912 marked this pull request as ready for review September 27, 2025 01:44
@Disservin (Contributor)
Is this ready to be tested?

@jw1912 (Owner, Author) commented Sep 27, 2025

@Disservin it works correctly (verified by Viren on 1x, 2x, 3x, and 4x 4090s), but at the moment it requires a large batch size to be a good speedup, because the gradient accumulation code is basic and suboptimal.
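
For context, a minimal sketch of the basic scheme being described, in plain Python with a toy scalar model (illustrative only, not bullet's actual code): each device accumulates gradients over its own micro-batches locally, then the accumulated gradients are averaged across devices once per optimiser step and every device applies the identical update.

```python
# Minimal sketch (illustrative, not bullet's actual code) of naive
# multi-GPU gradient accumulation: accumulate locally per device,
# exchange gradients once per optimiser step, apply the same update.

def gradient(w, micro_batch):
    # Placeholder for the real backward pass: toy least-squares gradient
    # for a single scalar weight on (input, target) pairs.
    return sum(2 * (w * x - y) * x for x, y in micro_batch) / len(micro_batch)

def train_step(w, batch, num_devices, micro=2, lr=0.005):
    shards = [batch[i::num_devices] for i in range(num_devices)]
    local = []
    for shard in shards:                       # runs in parallel on real GPUs
        acc, steps = 0.0, 0
        for i in range(0, len(shard), micro):  # accumulate locally, no traffic
            acc += gradient(w, shard[i:i + micro])
            steps += 1
        local.append(acc / steps)
    # The only interconnect traffic: one gradient exchange per step, whose
    # size depends on the parameter count, not on the batch size.
    return w - lr * sum(local) / num_devices

w = 0.0
batch = [(x, 3.0 * x) for x in range(1, 17)]
for _ in range(200):
    w = train_step(w, batch, num_devices=4)
print(w)  # converges towards 3.0
```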

@Disservin (Contributor)
So the interconnect is too limiting for it to be worth much?
On my server, PyTorch multi-GPU gives no speedup, but it does for vondele, since his machine has NVLink between the GPUs.

@jw1912 (Owner, Author) commented Sep 27, 2025

@Disservin yes, but if you increase the batch size by several times, the transfers become infrequent enough that it is a speedup even on normal multi-GPU (the transfers are just weight gradients/values, so their size has no dependence on batch size, of course).
It would be cool to get a test on a system with NVLink.
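
To make the amortisation argument concrete, some back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not measurements from this PR): the per-step transfer cost is fixed by the parameter count, so growing the batch shrinks its relative share of the step.

```python
# Back-of-the-envelope model of why a larger batch amortises the transfers.
# All numbers here are illustrative assumptions, not measurements.

params = 3_000_000           # network parameters (assumed)
bytes_per_param = 4          # fp32 gradients
pcie_bw = 12e9               # effective PCIe bandwidth, bytes/s (assumed)
positions_per_sec = 50e6     # single-GPU training throughput (assumed)

# Per optimiser step, each device exchanges its weight gradients once.
# This cost is fixed: it depends on the parameter count, not the batch size.
transfer_s = params * bytes_per_param / pcie_bw

for batch_size in (16_384, 131_072, 1_048_576):
    compute_s = batch_size / positions_per_sec
    overhead = transfer_s / (compute_s + transfer_s)
    print(f"batch {batch_size:>9}: transfer overhead {overhead:6.1%}")
```

Under these assumed numbers, the fixed ~1 ms transfer falls from roughly 75% of the step at the smallest batch to under 5% at the largest; NVLink would instead shrink the fixed cost directly by raising the bandwidth.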

@jw1912 (Owner, Author) commented Sep 27, 2025

I will be trying some alternative strategies to reduce the impact of the interconnect latency.

@jw1912 jw1912 merged commit b06dd9b into main Sep 28, 2025
6 checks passed
@jw1912 jw1912 deleted the multi-gpu branch September 28, 2025 00:09