Tags: irobert0126/torchrec
Overlap comms on backward pass (pytorch#2117)
Summary: Pull Request resolved: pytorch#2117. Resolves issues around CUDA streams / NCCL deadlock with autograd by creating a separate stream per pipelined embedding arch.
Reviewed By: sarckk
Differential Revision: D58220332
fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112
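The overlap relies on side CUDA streams. Below is a minimal sketch of that idea, not the actual TrainPipeline code; the helper name `run_on_side_stream` is hypothetical and only illustrates the stream-synchronization calls involved (the real change keeps one stream per pipelined embedding arch).

```python
# Minimal sketch: run comms on a dedicated CUDA stream so they can overlap
# with backward compute on the default stream. Hypothetical helper.
import torch

def run_on_side_stream(fn, *args):
    side_stream = torch.cuda.Stream()
    # Order the side stream after work already queued on the current stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        out = fn(*args)
    # Make subsequent work on the current stream wait for the comms to finish.
    torch.cuda.current_stream().wait_stream(side_stream)
    return out
```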
Fwd-Bwd correctness tests for TBEs, kernels (pytorch#2152)
Summary: Pull Request resolved: pytorch#2152. Adds more tests for kernel coverage, testing inductor compilation and forward-backward numerical correctness.
Reviewed By: TroyGarden, gnahzg
Differential Revision: D58869080
fbshipit-source-id: 002a41d88b2435fbc97bb71509d3bf1afec89251
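The shape of such a forward-backward correctness check can be sketched as below; this is illustrative only (the helper name is assumed, not taken from the test suite), comparing an eager module against its inductor-compiled copy on both outputs and parameter gradients.

```python
# Illustrative sketch (not the actual test): compare eager vs. inductor-compiled
# forward outputs and parameter gradients for a given module and input.
import copy
import torch

def assert_fwd_bwd_close(module: torch.nn.Module, inp: torch.Tensor) -> None:
    eager = module
    compiled = torch.compile(copy.deepcopy(module), backend="inductor")

    out_eager = eager(inp).sum()
    out_compiled = compiled(inp).sum()
    torch.testing.assert_close(out_eager, out_compiled)

    out_eager.backward()
    out_compiled.backward()
    for p_eager, p_compiled in zip(eager.parameters(), compiled.parameters()):
        torch.testing.assert_close(p_eager.grad, p_compiled.grad)
```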
Bump version.txt for 0.8.0 release (pytorch#2121)
Summary: Pull Request resolved: pytorch#2121. Bump version in main branch for 0.8.0 release.
Reviewed By: IvanKobzarev, gnahzg
Differential Revision: D58671454
fbshipit-source-id: 361029726b06b9e580320b1ae3dcf6b86c853db1
Revert _regroup in jagged_tensor (pytorch#2089)
Summary: Pull Request resolved: pytorch#2089. Fixes S422574 by backing out D57500720 and D58001114.
Post: https://fb.workplace.com/groups/gpuinference/permalink/2814805982001385/
Example failed job: f567662663
Reviewed By: xush6528
Differential Revision: D58310586
fbshipit-source-id: 1deacc6318298bf5c18e024560b86250b64a8709
unify seq rw input_dist (pytorch#2051)
Summary: Pull Request resolved: pytorch#2051
- Unify unnecessary branching in the input_dist module.
- fx-wrap some splits to honor non-optional points (see the sketch below).
Reviewed By: jingsh, gnahzg, yumin829928
Differential Revision: D57876357
fbshipit-source-id: 1baeb35e0280f251cf451dc5d65e5a8cab378555
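The fx-wrapping pattern keeps data-dependent split logic out of the traced graph. A hedged sketch of that pattern follows; the helper name is hypothetical, not the actual torchrec function.

```python
# Hypothetical helper illustrating the fx-wrap pattern: symbolic tracing records
# a single call_function node instead of tracing into the split logic.
from typing import List

import torch
import torch.fx

@torch.fx.wrap
def _fx_wrap_split_lengths(lengths: torch.Tensor, sizes: List[int]) -> List[torch.Tensor]:
    return list(torch.split(lengths, sizes))
```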
Sync collectives refactoring (pytorch#2039)
Summary: Pull Request resolved: pytorch#2039. Reland of D57564130.

**What is changed after the revert**: Torch Library cannot be used inside Deploy. Guarded all operator definitions and autograd registrations in comm_ops.py with `not torch._running_with_deploy():`.

**Catching deploy compat on diff test/land**: D57773561

**Previous diff summary:** This diff refactors torchrec sync collectives and addresses issues with the missing wait_tensor() for backward:
- Refactors using the latest Torchrec Library custom-op API with PT2 compatibility.
- Removes non-native functional collectives calls (c10d_functional), as only native ones exist now in pytorch and non-native calls are redispatched to native.
- Adds test cases for compiled-with-noncompiled ranks (in case of compilation failure on one of the ranks).

Issues fixed:
- Sync collectives eager backward did not produce a gradient -> Fixed.
- Support gradient_division in sync collectives and its compilation -> Done.
- Test coverage of sync collectives comparing results with async collectives and compilation.
- Fixed missing wait_tensor (see the sketch below).

The warning:
```
W0520 07:16:25.135696 2546100 Functional.cpp:51] Warning: At the time of process termination, there are still 1 unwaited c10d_functional collective calls. Please review your program to ensure c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective ops before they are used. (function ~WorkRegistry)
ok
```
Reviewed By: ezyang
Differential Revision: D57774293
fbshipit-source-id: 76da888f4b6e876aa1ad170857e7db76ac418122
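The wait_tensor point boils down to explicitly waiting on the async tensor returned by a native functional collective before it is consumed. A minimal sketch, assuming PyTorch's private torch.distributed._functional_collectives module rather than the torchrec comm_ops code itself:

```python
# Sketch only: explicitly wait on the async output of a native functional
# collective so it is never consumed (or dropped) unwaited.
import torch
import torch.distributed as dist
from torch.distributed import _functional_collectives as funcol

def synced_all_reduce(t: torch.Tensor, pg: dist.ProcessGroup) -> torch.Tensor:
    out = funcol.all_reduce(t, reduceOp="sum", group=pg)
    # Without this wait, the process-termination warning quoted above is emitted.
    return funcol.wait_tensor(out)
```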
Fix device propagation, tests for cpu sharding (pytorch#1512)
Summary: Pull Request resolved: pytorch#1512. Fix device propagation and add tests for "cpu".
- DMP's default device is "cpu"; keep it the same. Previously, torchrec/distributed inference code defaulted to "cuda" constants.
- device("cpu") in pytorch can have index 0 or -1 => add logic to use "cpu" instead of "cpu:{rank}" (see the sketch below).
- Introduce testing for device_type "cpu".
- Not all fbgemm ops (all_to_one, sum_to_reduce) have cpu implementations => skip all_to_one_device for cpu and do sum_to_reduce manually.
Reviewed By: IvanKobzarev
Differential Revision: D51309697
Privacy Context Container: L1138451
fbshipit-source-id: f9fcbf723f0508c89ceeb8ee9f4b81541d375e5a
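The "cpu" vs. "cpu:{rank}" distinction can be illustrated with a small hypothetical helper; this is a sketch of the idea, not the code in the diff.

```python
# Hypothetical helper mirroring the fix: CPU devices are not indexed per rank,
# since torch.device("cpu") may report index 0 or -1, while accelerators are.
import torch

def device_for_rank(device_type: str, rank: int) -> torch.device:
    if device_type == "cpu":
        return torch.device("cpu")
    return torch.device(f"{device_type}:{rank}")
```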
KJT methods test coverage with pt2 checks refactoring (pytorch#1988)
Summary: Pull Request resolved: pytorch#1988. Adds dynamo coverage for KJT methods:
- permute
- split
- regroup_as_dict
- getitem
- todict

The split and getitem tests need additional checks (similar to the pre-slice check); those checks are extracted into pt2/utils as pt2_checks_tensor_slice.
Reviewed By: PaulZhang12
Differential Revision: D57220897
fbshipit-source-id: 4a6314e6ddbf7b5e5d8ad25f72aa65906cff28d7
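A dynamo check of this kind can be sketched as below, using KeyedJaggedTensor.permute as the example method; this is illustrative only, and the real tests add the pt2 checks noted above.

```python
# Illustrative sketch: compile a function over a KJT method and compare with eager.
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

def permute_values(kjt: KeyedJaggedTensor) -> torch.Tensor:
    return kjt.permute([1, 0]).values()

kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1", "f2"],
    values=torch.arange(6, dtype=torch.int64),
    lengths=torch.tensor([1, 2, 2, 1]),
)
eager_out = permute_values(kjt)
compiled_out = torch.compile(permute_values)(kjt)
torch.testing.assert_close(eager_out, compiled_out)
```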
add expecttest dependency to allow for pytorch core testing utils (pytorch#1952)
Summary: Pull Request resolved: pytorch#1952. This allows usage of `from torch.testing._internal.common_distributed import spawn_threads_and_init_comms`, aka the threaded process group, for lightweight "distributed" tests. That avoids heavier process-based tests when they are unnecessary.
Reviewed By: henrylhtsang
Differential Revision: D56960671
fbshipit-source-id: a3eef9ce32626126956f2a5d9d92fe613fd48d09
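A lightweight threaded-PG test enabled by this looks roughly like the sketch below; the test body is illustrative, not taken from the source.

```python
# Illustrative sketch of a threaded "distributed" test: ranks run as threads in
# one process instead of separately spawned processes.
import unittest

import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import spawn_threads_and_init_comms

class AllReduceTest(unittest.TestCase):
    @spawn_threads_and_init_comms
    def test_all_reduce_sum(self) -> None:
        t = torch.ones(1) * dist.get_rank()
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        self.assertEqual(t.item(), sum(range(dist.get_world_size())))
```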