# Tags from torchrec (https://github.com/pytorch/torchrec/releases)

## v2025.08.25.00 (2025-08-23): refactor train_pipeline_tests for tracing (#3314)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3314
# context
* previous work in torchrec train_pipeline refactored the utils file
* tracing was separated from the utils
* here we also move the corresponding tests into a new file
Reviewed By: iamzainhuda
Differential Revision: D80882439
fbshipit-source-id: 572a52f651f381b084be369314fce4bead6e853d
Author: TroyGarden

## v2025.08.18.00 (2025-08-17): ReshardingAPI Host Memory Offloading and BenchmarkReshardingHandler (#3291)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3291
- Implements tensor offloading to host memory inside the resharding API (see the sketch after this list)
- Adds BenchmarkReshardingHandler, which
  - generates random plans
  - calls the DDP reshard API on randomly selected plans
- Adds a reset method to train_pipelines.py
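As a rough sketch of the host-memory offloading idea (this is not the TorchRec resharding code; the function names and the pinned-memory staging are assumptions), a device shard can be staged in pinned CPU memory and copied back asynchronously:

```python
import torch

def offload_to_host(shard: torch.Tensor) -> torch.Tensor:
    # Stage the device shard in pinned host memory so copies in both
    # directions can be issued asynchronously (non_blocking=True).
    host = torch.empty(shard.shape, dtype=shard.dtype, device="cpu", pin_memory=True)
    host.copy_(shard, non_blocking=True)
    return host

def restore_to_device(host: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Copy the staged tensor back onto the target device; the caller must
    # synchronize the stream before reading the result.
    return host.to(device, non_blocking=True)
```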
Reviewed By: aporialiao
Differential Revision: D80366926
fbshipit-source-id: a137da2f36cbacf21f0c28ae83dfc6eabba29901
Author: isururanawaka

## v2025.08.11.00 (2025-08-09): Replace int(...) with torch.sym_int(...) to make it torch.export compatible (#3270)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3270
As title.
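For background (an illustrative sketch, not code from this PR): `int(...)` on a symbolic size materializes a concrete Python int and specializes the exported graph, while `torch.sym_int(...)` keeps the value symbolic so `torch.export` can preserve dynamic shapes:

```python
import torch

class Half(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # int(x.shape[0] / 2) would bake one concrete batch size into the
        # graph; torch.sym_int keeps the size symbolic under torch.export.
        half = torch.sym_int(x.shape[0] / 2)
        return x[:half]

ep = torch.export.export(
    Half(),
    (torch.randn(8, 4),),
    dynamic_shapes=({0: torch.export.Dim("batch")},),
)
```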
Reviewed By: angelayi
Differential Revision: D79912672
fbshipit-source-id: b2586764082fdcb3665f486720a0e9ee97724d09
Author: malaybag

## v2025.08.04.00 (2025-08-02): Add dtype to kt_regroup input (#3250)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3250
Currently `ir_kt_regroup` always returns float32 output, which is incorrect when `KTRegroupAsDict`'s emb_type is set.
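As a generic illustration of this bug class (hypothetical code, not the TorchRec operator): an op that hard-codes a float32 output buffer silently discards the caller's dtype choice, while threading the dtype through preserves it:

```python
import torch

def regroup_hardcoded(values: torch.Tensor) -> torch.Tensor:
    # Buggy pattern: the output buffer is always float32, so half-precision
    # inputs are silently upcast and downstream kernels see the wrong dtype.
    out = torch.empty(values.shape, dtype=torch.float32)
    out.copy_(values)
    return out

def regroup_with_dtype(values: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Fixed pattern: the requested dtype reaches the output buffer.
    out = torch.empty(values.shape, dtype=dtype)
    out.copy_(values)
    return out

x = torch.randn(4, 8, dtype=torch.float16)
assert regroup_hardcoded(x).dtype == torch.float32      # dtype silently lost
assert regroup_with_dtype(x, x.dtype).dtype == torch.float16
```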
Reviewed By: malaybag
Differential Revision: D79422720
fbshipit-source-id: 2e711a7dc186b80226b2631ba94af741e05df4b7
Author: angelayi

## v2025.07.28.00 (2025-07-26): fix EBC optimizer size setting for virtual table (#3239)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3239
Add the optimizer-state processing logic that EBC was missing, mirroring what EC already does (https://fburl.com/code/hvzzs3t4), so that optimizer state does not use the default metadata, which reflects the virtual table size rather than the actual tensor size.
Reviewed By: EddyLXJ
Differential Revision: D78950717
fbshipit-source-id: 45eca79bbff1fe498e2707b51a6845eb603bbdfd
Author: emlin

## v2025.07.21.00 (2025-07-18): YAML config support for pipeline benchmarking (#3180)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3180

Added support for configuring the pipeline benchmark via a YAML file. This makes it easier to reproduce complex configurations without passing long chains of CLI arguments.
An example `.yaml` file looks like:
```yaml
RunOptions:
  world_size: 2

PipelineConfig:
  pipeline: "sparse"
```
Configs can also be listed in a "flat" way:
```yaml
world_size: 2
pipeline: "sparse"
```
To run, pass the `.yaml` file path via the `--yaml_config` flag. Any additional CLI flags override the corresponding YAML values.
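A minimal sketch of that override precedence (an assumed implementation, not the benchmark's actual code; flag names other than `--yaml_config` are illustrative): YAML values become argparse defaults, so explicitly passed flags still win:

```python
import argparse
import yaml  # pyyaml

parser = argparse.ArgumentParser()
parser.add_argument("--yaml_config", type=str, default=None)
parser.add_argument("--world_size", type=int, default=1)
parser.add_argument("--pipeline", type=str, default="base")

# First pass: only to discover the YAML path.
known, _ = parser.parse_known_args()
if known.yaml_config:
    with open(known.yaml_config) as f:
        cfg = yaml.safe_load(f) or {}
    # Accept both the nested layout (RunOptions/PipelineConfig sections)
    # and the flat layout by merging any sub-dicts into one namespace.
    flat = {}
    for key, value in cfg.items():
        if isinstance(value, dict):
            flat.update(value)
        else:
            flat[key] = value
    parser.set_defaults(**flat)

# Second pass: flags given on the command line override the YAML defaults.
args = parser.parse_args()
```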
Reviewed By: iamzainhuda
Differential Revision: D78127340
fbshipit-source-id: 3134144fcd42761f02da81a9438546cae84b4460
Author: SSYernar

## v2025.07.14.00 (2025-07-12): Dynamic 2D sparse parallel (#3177)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3177

We add the ability to set the 2D parallel configuration per module (coined Dynamic 2D parallel). An EBC and an EC can now be sharded differently along the data-parallel dimension, e.g. 4 replicas per EBC shard and 2 replicas per EC shard; previously all modules had to share the same replication factor.
To do this, we introduce a lightweight dataclass that provides per-module configuration, allowing granular control when the user needs it:
```python
@dataclass
class DMPCollectionConfig:
    module: nn.Module  # expected to be the unsharded module
    plan: "ShardingPlan"  # sub-tree-specific sharding plan
    sharding_group_size: int
    use_inter_host_allreduce: bool = False
```
The dataclass provides the context for each module being sharded and, where configured, creates separate process groups and sync logic for each of those modules.
Usage is as follows; suppose we want a different 2D configuration for EmbeddingCollection:
```python
# create plan for model tables over Y world size
# create plan for EmbeddingCollection tables over X world size
ec_config = DMPCollectionConfig(EmbeddingCollection, embedding_collection_plan, sharding_group_size)
model = DMPCollection(
    # pass in default args
    submodule_configs=[ec_config],
)
```
Future work includes:
- making it easier for users to create separate sharding plans per module
- per-table 2D
Reviewed By: liangbeixu
Differential Revision: D76774334
fbshipit-source-id: 27c7e0bc806d8227d784461a197cd8f1c7f6adfc
Author: iamzainhuda

## v2025.07.07.00 (2025-07-07): Fix test more generally (#3165)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3165
D77614983 (https://www.internalfb.com/diff/D77614983) attempted to fix a test, but the same failure still shows up in other tests, so this change fixes it in general.
Reviewed By: huydhn
Differential Revision: D77758554
fbshipit-source-id: bd390081b68fa650f1cfd6d2a93a1fbf206aaff7
Author: exclamaforte

## v2025.06.30.00 (2025-06-29): Revert D76476676: Multisect successfully blamed "D76476676: OSS TorchRec Internal MPZCH modules" for one test failure (#3146)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3146

This diff reverts D76476676 (OSS TorchRec Internal MPZCH modules, by lizhouyu), which causes the following test failure:
Tests affected:
- [cogwheel:cogwheel_tps_basic_test#test_tps_basic_latency](https://www.internalfb.com/intern/test/562950123898458/)
<p>Here's the Multisect link:
<br /><a href="https://www.internalfb.com/multisect/30184635">https://www.internalfb.com/multisect/30184635</a>
<br />Here are the tasks that are relevant to this breakage:
<br />T211534727: 100+ tests, 10+ build rules failing for minimal_viable_ai</p>
The backout may land if someone accepts it. If this diff has been generated in error, you can Commandeer and Abandon it.

Depends on D76476676
Reviewed By: lizhouyu
Differential Revision: D77502155
fbshipit-source-id: eb990251f3276372592c30a7361579e2a3639d6c

## v2025.06.23.00 (2025-06-21): minor refactoring to use list comprehension (#3125)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/3125
# context
* imported from GitHub PR #1602 (https://github.com/pytorch/torchrec/pull/1602)
* rebased on trunk
* an illustrative before/after sketch of the refactor is shown below
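Purely illustrative of the refactor's shape (the actual call sites are in the PR, not shown here):

```python
features = ["f1", "f2", "f3"]

# Before: build the list with an explicit loop.
lengths = []
for name in features:
    lengths.append(len(name))

# After: the same result as a list comprehension.
lengths = [len(name) for name in features]
```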
Reviewed By: spmex
Differential Revision: D77031163
fbshipit-source-id: f65168f6ab0b74eca75b72fd60ec0ea7c762f3dc
Author: TroyGarden