Tags: irobert0126/torchrec

v2024.07.01.00

Overlap comms on backward pass (pytorch#2117)

Summary:
Pull Request resolved: pytorch#2117

Resolves issues around CUDA streams / NCCL deadlocks with autograd.

Creates a separate stream per pipelined embedding arch (sketched below).
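
A minimal sketch of the general idea, not the actual TorchRec implementation; `_run_comms_on_private_stream` and its arguments are hypothetical:

```
import torch

def _run_comms_on_private_stream(comms_fn, *args):
    # One dedicated stream per pipelined embedding arch (illustrative only),
    # so its collectives can overlap with compute on the default stream.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        out = comms_fn(*args)
    # Make the default stream wait for the comms before consuming the result.
    torch.cuda.current_stream().wait_stream(stream)
    return out
```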

Reviewed By: sarckk

Differential Revision: D58220332

fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112

v2024.06.24.00

Fwd-Bwd correctness tests for TBEs, kernels (pytorch#2152)

Summary:
Pull Request resolved: pytorch#2152

Adds more tests for kernel coverage, exercising Inductor compilation and forward/backward numerical correctness.
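
A hedged illustration of the forward/backward comparison pattern, not the actual test code; `check_fwd_bwd` and its tolerances are made up:

```
import torch

def check_fwd_bwd(fn, *inputs, atol=1e-4, rtol=1e-4):
    eager_in = [x.detach().clone().requires_grad_(x.requires_grad) for x in inputs]
    comp_in = [x.detach().clone().requires_grad_(x.requires_grad) for x in inputs]

    # Forward: eager vs. Inductor-compiled outputs must match numerically.
    eager_out = fn(*eager_in)
    compiled_out = torch.compile(fn, backend="inductor")(*comp_in)
    torch.testing.assert_close(eager_out, compiled_out, atol=atol, rtol=rtol)

    # Backward: gradients must match as well.
    eager_out.sum().backward()
    compiled_out.sum().backward()
    for e, c in zip(eager_in, comp_in):
        if e.grad is not None:
            torch.testing.assert_close(e.grad, c.grad, atol=atol, rtol=rtol)
```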

Reviewed By: TroyGarden, gnahzg

Differential Revision: D58869080

fbshipit-source-id: 002a41d88b2435fbc97bb71509d3bf1afec89251

v2024.06.17.00

Bump version.txt for 0.8.0 release (pytorch#2121)

Summary:
Pull Request resolved: pytorch#2121

Bump version in main branch for 0.8.0 release

Reviewed By: IvanKobzarev, gnahzg

Differential Revision: D58671454

fbshipit-source-id: 361029726b06b9e580320b1ae3dcf6b86c853db1

v0.8.0-rc1

Update setup and version for release 0.8.0

v2024.06.10.00

Revert _regroup in jagged_tensor (pytorch#2089)

Summary:
Pull Request resolved: pytorch#2089

Fixes S422574.
Backs out D57500720 and D58001114.

Post: https://fb.workplace.com/groups/gpuinference/permalink/2814805982001385/
Example failed job: f567662663

Reviewed By: xush6528

Differential Revision: D58310586

fbshipit-source-id: 1deacc6318298bf5c18e024560b86250b64a8709

v2024.06.03.00

unify seq rw input_dist (pytorch#2051)

Summary:
Pull Request resolved: pytorch#2051

* Unify unnecessary branching in the input_dist module.
* fx-wrap some splits to honor non-optional points (see the sketch below).
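
An illustrative sketch of fx-wrapping a split helper so torch.fx tracing keeps it as a single call node rather than tracing through its branching; the helper name and shapes are hypothetical:

```
import torch
import torch.fx

@torch.fx.wrap
def _split_input_features(features: torch.Tensor, splits: list) -> list:
    # Kept opaque to fx tracing; the data-dependent branching stays inside.
    if len(splits) == 1:
        return [features]
    return list(torch.split(features, splits, dim=0))
```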

Reviewed By: jingsh, gnahzg, yumin829928

Differential Revision: D57876357

fbshipit-source-id: 1baeb35e0280f251cf451dc5d65e5a8cab378555

v2024.05.27.00

Sync collectives refactoring (pytorch#2039)

Summary:
Pull Request resolved: pytorch#2039

Reland of D57564130

**What changed after the revert**:
The Torch Library API cannot be used inside torch::deploy.
All operator definitions and autograd registrations in comm_ops.py are now guarded with `not torch._running_with_deploy()`.
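
The shape of that guard, illustratively; the registered op is a made-up example and assumes a recent PyTorch with `torch.library.custom_op`:

```
import torch

if not torch._running_with_deploy():
    # Only define custom ops / autograd registrations outside torch::deploy,
    # where the Torch Library API is unavailable.
    @torch.library.custom_op("example::identity", mutates_args=())
    def example_identity(x: torch.Tensor) -> torch.Tensor:
        return x.clone()
```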

**Catching deploy compat on diff test/land**: D57773561

**Previous diff summary:**
The diff refactors torchrec sync collectives and addresses missing wait_tensor() calls for backward:
- Refactors using the latest Torch Library custom op API, with PT2 compatibility
- Removes non-native functional collective calls (c10d_functional), since only the native ops exist in PyTorch now and non-native calls are redispatched to them
- Adds test cases for compiled-with-noncompiled ranks (in case of compilation failure on one of the ranks)

Issues fixed:
- Sync collectives eager backward did not produce a gradient -> fixed
- gradient_division support in sync collectives and its compilation -> done
- Test coverage of sync collectives comparing results with async collectives, including under compilation -> added
- Missing wait_tensor -> fixed
The warning:
```
W0520 07:16:25.135696 2546100 Functional.cpp:51] Warning: At the time of process termination, there are still 1 unwaited c10d_functional collective calls. Please review your program to ensure c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective ops before they are used. (function ~WorkRegistry)
ok
```
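
A minimal sketch of explicitly waiting on a functional collective so this warning is not hit; `torch.distributed._functional_collectives` is an internal module and the exact call pattern here is illustrative:

```
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def all_reduce_sum(t: torch.Tensor) -> torch.Tensor:
    out = funcol.all_reduce(t, reduceOp="sum", group=dist.group.WORLD)
    # Explicitly wait before the result is used (or the process exits).
    return funcol.wait_tensor(out)
```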

Reviewed By: ezyang

Differential Revision: D57774293

fbshipit-source-id: 76da888f4b6e876aa1ad170857e7db76ac418122

v2024.05.20.00

Fix device propagation, tests for cpu sharding (pytorch#1512)

Summary:
Pull Request resolved: pytorch#1512

Fix device propagation, add tests for "cpu".

DMP has default device "cpu"; that is kept the same.
Previously, torchrec/distributed inference code defaulted to "cuda" constants.

device("cpu") in pytorch can have index 0 or -1 => adding logic to have "cpu" instead of "cpu:{rank}"

Introduces testing for device_type "cpu".
Not all fbgemm ops (all_to_one, sum_to_reduce) have CPU implementations, so all_to_one_device is skipped for CPU and sum_to_reduce is done manually.

Reviewed By: IvanKobzarev

Differential Revision: D51309697

Privacy Context Container: L1138451

fbshipit-source-id: f9fcbf723f0508c89ceeb8ee9f4b81541d375e5a

v2024.05.13.00

KJT methods test coverage with pt2 checks refactoring (pytorch#1988)

Summary:
Pull Request resolved: pytorch#1988

Adding dynamo coverage for KJT methods:
- permute
- split
- regroup_as_dict
- getitem
- to_dict

The split and getitem tests need additional checks (similar to the pre-slice check).

Those checks were extracted into pt2/utils as pt2_checks_tensor_slice.
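
For reference, an eager-mode illustration of the covered KJT methods (the data here is made up):

```
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1", "f2"],
    values=torch.tensor([1, 2, 3, 4, 5]),
    lengths=torch.tensor([2, 0, 1, 2]),  # two lengths per key -> batch size 2
)

permuted = kjt.permute([1, 0])  # reorder keys
parts = kjt.split([1, 1])       # one KJT per key
as_dict = kjt.to_dict()         # key -> JaggedTensor
f1 = kjt["f1"]                  # getitem returns a JaggedTensor
```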

Reviewed By: PaulZhang12

Differential Revision: D57220897

fbshipit-source-id: 4a6314e6ddbf7b5e5d8ad25f72aa65906cff28d7

v2024.05.06.00

add expecttest dependency to allow for pytorch core testing utils (pytorch#1952)

Summary:
Pull Request resolved: pytorch#1952

This is to allow usage of `from torch.testing._internal.common_distributed import spawn_threads_and_init_comms`,
i.e. the threaded process group, for lightweight "distributed" tests.

That avoids heavier process-based tests when they are unnecessary.
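
A hedged sketch of what such a threaded-pg test can look like; the test class, world size, and assertion are made up:

```
import unittest

import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import spawn_threads_and_init_comms

class ThreadedPGTest(unittest.TestCase):
    @spawn_threads_and_init_comms(world_size=2)
    def test_all_reduce(self):
        # Each "rank" is a thread in the same process; no process spawning needed.
        t = torch.ones(2) * (dist.get_rank() + 1)
        dist.all_reduce(t)
        torch.testing.assert_close(t, torch.full((2,), 3.0))
```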

Reviewed By: henrylhtsang

Differential Revision: D56960671

fbshipit-source-id: a3eef9ce32626126956f2a5d9d92fe613fd48d09