
Adds threading support in torchrec train pipeline. #1694

Closed
wants to merge 1 commit

Conversation

fenghuizhang

Summary:
Motivation

  • In training, we have mostly focused on optimizing for better GPU utilization. For models that are not GPU bound, we often observe CPU ops taking a nontrivial amount of time. In the existing pipelines, these operations execute on the main thread; depending on the complexity of the model input, they can take tens of milliseconds in our traces.
  • Certain applications are latency sensitive. While our multi-stage pipelines greatly improve throughput, they hurt latency by buffering multiple batches in the pipeline.

In this change we add the capability to load data and copy it to the GPU in a background thread. This reduces iteration latency for the models mentioned above and minimizes the number of batches held in the pipeline.

In this diff, we are adding a new eval (forward-only) sparse-data-dist pipeline with threading enabled.
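The background loading described above can be sketched as a small prefetcher: a worker thread pulls batches from the data iterator and applies a transfer function (for real use, something like `lambda b: b.to("cuda", non_blocking=True)`), while the main thread consumes at most `max_prefetch` buffered batches. Keeping `max_prefetch` at 1 bounds how many batches sit in the pipeline, which is the latency property the summary calls out. The class and parameter names here are illustrative, not TorchRec's actual pipeline API.

```python
import queue
import threading


class BackgroundPrefetcher:
    """Load batches and apply a transfer function (e.g. a host-to-GPU copy)
    in a background thread, overlapping input prep with the main-thread
    forward pass. A minimal sketch of the idea, not TorchRec's pipeline code.
    """

    _SENTINEL = object()

    def __init__(self, data_iter, transfer_fn, max_prefetch=1):
        # max_prefetch=1 keeps at most one batch buffered, bounding latency.
        self._queue = queue.Queue(maxsize=max_prefetch)
        self._thread = threading.Thread(
            target=self._worker, args=(data_iter, transfer_fn), daemon=True
        )
        self._thread.start()

    def _worker(self, data_iter, transfer_fn):
        for batch in data_iter:
            # The device copy happens here, off the main thread; the put()
            # blocks once max_prefetch batches are already buffered.
            self._queue.put(transfer_fn(batch))
        self._queue.put(self._SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is self._SENTINEL:
            raise StopIteration
        return item
```

In a training or eval loop this wraps the dataloader, so the main thread only ever blocks on a batch that is already being prepared in the background.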

Reviewed By: dstaay-fb, leitian, joshuadeng

Differential Revision: D53453429

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 8, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53453429

fenghuizhang pushed a commit to fenghuizhang/torchrec that referenced this pull request Feb 8, 2024
Labels
CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported
3 participants