Permute values for VBE with keyed_jagged_index_select #1682

joshuadeng · 2024-02-05T21:52:04Z

Summary:
In the case of VBE, where the batch size varies per feature, we permute values using the length per key. For inputs with a high pooling factor, this reduces the number of permutes and greatly increases the amount of data that must be copied per permute index.
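A minimal pure-Python sketch of what "permute values via length per key" means, assuming values are laid out contiguously key by key and each key's segment length is the sum of its lengths over its own batch. The function name `permute_values_by_key_length` is hypothetical and is not the fbgemm or TorchRec API. It shows why each permute index copies a whole key segment, so a high pooling factor means few, large copies.

```python
from itertools import accumulate
from typing import List, Sequence


def permute_values_by_key_length(
    values: Sequence[int],
    length_per_key: Sequence[int],
    permute: Sequence[int],
) -> List[int]:
    """Hypothetical helper: reorder the contiguous per-key segments of `values`.

    values:         flattened values, laid out key by key.
    length_per_key: number of values belonging to each key (the sum of that
                    key's lengths over its own, variable, batch size).
    permute:        new key order; permute[i] is the source key for slot i.
    """
    # Exclusive prefix sum: offsets[k] is where key k's segment starts.
    offsets = [0] + list(accumulate(length_per_key))
    out: List[int] = []
    for src_key in permute:
        # One permute index copies one whole key segment.
        out.extend(values[offsets[src_key]:offsets[src_key + 1]])
    return out


# 3 keys holding 2, 1 and 3 values; move key 2 to the front.
print(permute_values_by_key_length([1, 2, 3, 4, 5, 6], [2, 1, 3], [2, 0, 1]))
# [4, 5, 6, 1, 2, 3]
```

The number of permute indices equals the number of keys, while the work per index grows with that key's total length.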

The implementation of `permute_1d_sparse_data` processes 32 indices (of the recat tensor) in one thread block, with each index processed by 32 threads. So, with this few permute indices, we end up using a single thread block to process all of them.
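To make the launch shape concrete, here is a small arithmetic model of the mapping described above (32 indices per block, 32 threads per index). It is only a model under that stated assumption, not the actual `permute_1d_sparse_data` kernel; the names `launch_shape` and `elements_per_thread` are hypothetical, and it assumes the 32 threads of an index stride evenly over that index's segment.

```python
import math
from typing import List, Tuple

# Assumed mapping, taken from the description: one thread block covers
# 32 permute indices, and each index is handled by 32 threads.
INDICES_PER_BLOCK = 32
THREADS_PER_INDEX = 32


def launch_shape(num_indices: int) -> Tuple[int, int]:
    """Return (num_blocks, threads_per_block) under the assumed mapping."""
    num_blocks = math.ceil(num_indices / INDICES_PER_BLOCK)
    return num_blocks, INDICES_PER_BLOCK * THREADS_PER_INDEX


def elements_per_thread(segment_lengths: List[int]) -> List[int]:
    """Values each thread copies per index if its 32 threads stride evenly
    over that index's segment (rounded up)."""
    return [math.ceil(n / THREADS_PER_INDEX) for n in segment_lengths]


# With one permute index per key, e.g. 10 keys fit in a single block.
print(launch_shape(10))                    # (1, 1024)
print(elements_per_thread([100_000, 31]))  # [3125, 1]
```

A key with 100,000 values leaves each of its 32 threads copying thousands of elements, while the whole launch is one block.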

We further improved this implementation by splitting the block into 4 thread blocks of 256 threads each. This allows more threads to process each permute index; however, SM utilization is still low.
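Why SM utilization stays low can be bounded with simple arithmetic: a CUDA thread block always executes on a single SM, so a grid of `num_blocks` blocks can occupy at most `min(num_blocks, num_sms)` SMs at once. The sketch below assumes 108 SMs (the count on an NVIDIA A100); the GPU used for this PR is not stated, and `max_sm_fraction` is a hypothetical helper.

```python
# Assumption: 108 SMs, as on an NVIDIA A100; pass another value for other GPUs.
DEFAULT_NUM_SMS = 108


def max_sm_fraction(num_blocks: int, num_sms: int = DEFAULT_NUM_SMS) -> float:
    """Upper bound on the fraction of SMs a launch can keep busy.

    Each thread block runs on exactly one SM, so at most
    min(num_blocks, num_sms) SMs can be active for this launch.
    """
    return min(num_blocks, num_sms) / num_sms


print(f"1 block:  {max_sm_fraction(1):.1%}")   # 0.9%
print(f"4 blocks: {max_sm_fraction(4):.1%}")   # 3.7%
```

Going from 1 to 4 blocks helps, but under this assumption the launch can still touch under 4% of the SMs, which is the motivation for spreading a single index over many blocks.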

`keyed_jagged_index_select_dim1` parallelizes the work for a single permute index across multiple thread blocks. Note that it only works on CUDA.
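One common way to spread a single large segment across many blocks is to split the *output* into fixed-size chunks, give each chunk to a block, and have each element locate its source segment by binary search over the output offsets. The pure-Python simulation below illustrates that general technique under this assumption; it is not the fbgemm implementation of `keyed_jagged_index_select_dim1`, and the name `chunked_permute` and its `chunk` parameter are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def chunked_permute(
    values: Sequence[int],
    length_per_key: Sequence[int],
    permute: Sequence[int],
    chunk: int,
) -> List[int]:
    """Hypothetical simulation: many 'blocks', each copying `chunk` outputs.

    A block's chunk may start in the middle of one key's segment and end in
    another, so one large segment is shared by several blocks.
    """
    # Start offset of each key's segment in the input.
    in_offsets = [0] + list(accumulate(length_per_key))
    # Start offset of each permuted segment in the output.
    out_offsets = [0] + list(accumulate(length_per_key[k] for k in permute))
    total = out_offsets[-1]
    out = [0] * total
    num_blocks = (total + chunk - 1) // chunk
    for block in range(num_blocks):  # each iteration stands in for a thread block
        start = block * chunk
        for pos in range(start, min(start + chunk, total)):  # stands in for threads
            # bisect_right skips empty segments and finds the owning segment.
            seg = bisect_right(out_offsets, pos) - 1
            src = in_offsets[permute[seg]] + (pos - out_offsets[seg])
            out[pos] = values[src]
    return out


print(chunked_permute([1, 2, 3, 4, 5, 6], [2, 1, 3], [2, 0, 1], chunk=2))
# [4, 5, 6, 1, 2, 3]
```

With a single key of 10 values and `chunk=4`, three simulated blocks share that one permute index, so the amount of parallelism depends on the total data size rather than on the number of keys.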

Differential Revision: D53432094

facebook-github-bot · 2024-02-05T21:52:20Z

This pull request was exported from Phabricator. Differential Revision: D53432094



facebook-github-bot added the CLA Signed label Feb 5, 2024

facebook-github-bot added the fb-exported label Feb 5, 2024

joshuadeng force-pushed the export-D53432094 branch from d6d97ba to f2c6481 February 5, 2024 22:09

joshuadeng force-pushed the export-D53432094 branch from f2c6481 to b96e8d0 February 5, 2024 22:09

facebook-github-bot closed this in f934ec9 Feb 7, 2024
