Enable global weight decay to TBE (Backend) (#2498) #2516
Conversation
This pull request was exported from Phabricator. Differential Revision: D56285676
Summary: With the existing implementation for sparse embedding tables with rowwise Adagrad, weight decay is applied only when an ID (and its corresponding embedding row) appears within a training batch. Rows that do not show up are neither updated nor decayed, so the embedding table only receives *local* rather than *global* weight decay. This diff adds an option to compensate by scaling the weight with a `global weight decay` value, using the formula from csmiler below:

```
global_weight_decay = (1 - learning_rate * weight_decay)^(current_iter - prev_iter - 1)
```

where `prev_iter` is the last iteration in which this ID (and its corresponding embedding row) showed up.

---

**Usage:** set

```
optimizer = OptimType.EXACT_ROWWISE_ADAGRAD
weight_decay_mode = WeightDecayMode.DECOUPLE_GLOBAL
```

e.g.,

```
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (E, D, managed_option, ComputeDevice.CUDA)
        for (E, D) in zip(Es, Ds)
    ],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.1,
    eps=0.1,
    output_dtype=output_dtype,
    pooling_mode=pooling_mode,
    weight_decay=0.01,
    weight_decay_mode=WeightDecayMode.DECOUPLE_GLOBAL,
)
```

Relevant diffs: D53866750 D55660277 D55660762

Differential Revision: D56285676
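To make the catch-up behavior concrete, here is a minimal host-side sketch of the compensation under the formula above. It is illustrative only: the actual logic lives inside the TBE optimizer kernels, and the `global_weight_decay_factor` helper, the toy row, and the iteration numbers below are assumptions for the example, not part of the FBGEMM API.

```
import torch

def global_weight_decay_factor(learning_rate, weight_decay, current_iter, prev_iter):
    # Number of iterations in which this row was absent and therefore skipped decay.
    missed = current_iter - prev_iter - 1
    return (1.0 - learning_rate * weight_decay) ** missed

# Toy example: a row that last appeared at iteration 3 and shows up again at iteration 10.
learning_rate, weight_decay = 0.1, 0.01
row = torch.ones(4)  # embedding row before the update
factor = global_weight_decay_factor(learning_rate, weight_decay, current_iter=10, prev_iter=3)
row_caught_up = row * factor  # scale the row before the regular rowwise Adagrad step
print(factor, row_caught_up)
```

With `WeightDecayMode.DECOUPLE_GLOBAL`, TBE tracks the per-row `prev_iter` state internally; the snippet only mirrors the arithmetic of the scaling factor.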
This pull request has been merged in c1f7a66.