Enable global weight decay to TBE (Backend) (#2498) #2516
Conversation
This pull request was exported from Phabricator. Differential Revision: D56285676
Summary: With the existing implementation for sparse embedding tables with rowwise Adagrad, weight decay is applied only when an ID (and its corresponding embedding row) appears within a training batch. Rows that do not show up are neither updated nor decayed, so the embedding table only receives *local* rather than *global* weight decay. This diff adds an option to compensate by scaling the weight with a `global weight decay` value, using the formula from csmiler below:

```
global_weight_decay = (1 - learning_rate * weight_decay)^(current_iter - prev_iter - 1)
```

where `prev_iter` is the last iteration in which this ID (and its corresponding embedding row) showed up.

---

**Usage:** set

```
optimizer = OptimType.EXACT_ROWWISE_ADAGRAD
weight_decay_mode = WeightDecayMode.DECOUPLE_GLOBAL
```

e.g.,

```
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (E, D, managed_option, ComputeDevice.CUDA)
        for (E, D) in zip(Es, Ds)
    ],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.1,
    eps=0.1,
    output_dtype=output_dtype,
    pooling_mode=pooling_mode,
    weight_decay=0.01,
    weight_decay_mode=WeightDecayMode.DECOUPLE_GLOBAL,
)
```

Relevant diffs: D53866750 D55660277 D55660762

Differential Revision: D56285676
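To make the catch-up behavior concrete, here is a minimal host-side sketch of the compensation under the formula above. It is illustrative only: the actual logic lives inside the TBE optimizer kernels, and the `global_weight_decay_factor` helper, the toy row, and the iteration numbers below are assumptions for the example, not part of the FBGEMM API.

```
import torch

def global_weight_decay_factor(learning_rate, weight_decay, current_iter, prev_iter):
    # Number of iterations in which this row was absent and therefore skipped decay.
    missed = current_iter - prev_iter - 1
    return (1.0 - learning_rate * weight_decay) ** missed

# Toy example: a row that last appeared at iteration 3 and shows up again at iteration 10.
learning_rate, weight_decay = 0.1, 0.01
row = torch.ones(4)  # embedding row before the update
factor = global_weight_decay_factor(learning_rate, weight_decay, current_iter=10, prev_iter=3)
row_caught_up = row * factor  # scale the row before the regular rowwise Adagrad step
print(factor, row_caught_up)
```

With `WeightDecayMode.DECOUPLE_GLOBAL`, TBE tracks the per-row `prev_iter` state internally; the snippet only mirrors the arithmetic of the scaling factor.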
This pull request has been merged in c1f7a66.