Ryanunderhill/beamscorer gpu #16272

Merged
merged 31 commits into main from ryanunderhill/beamscorer_gpu
Jun 27, 2023

RyanUnderhill
Contributor

@RyanUnderhill RyanUnderhill commented Jun 7, 2023

Description

Make BeamScorer run on the GPU instead of the CPU.

Brief overview:
* Adds a CUDA implementation of IBeamScorer, 'CudaBeamSearchScorer'.
* Instead of a 'done' flag per beam, there is a single 'not done' variable that is copied to the CPU every iteration.
* Removes some of the extra CPU-side buffers and parameters that are no longer needed.
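
The switch from per-beam 'done' flags to a single 'not done' value can be illustrated with a small host-only sketch. This is not the code from the PR: `PerBeamDoneFlags` and `NotDoneCounter` are hypothetical names, no CUDA is involved, and modeling the 'not done' variable as a counter is an assumption. The point is only that the stop check reads one value instead of scanning a flag per entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the old scheme: one flag per entry, and the stop
// check has to look at every flag (so all of them must be visible to the host).
struct PerBeamDoneFlags {
  std::vector<bool> done;

  explicit PerBeamDoneFlags(std::size_t count) : done(count, false) {}

  void MarkDone(std::size_t index) { done[index] = true; }

  bool AllDone() const {
    for (bool d : done) {
      if (!d) return false;
    }
    return true;
  }
};

// Hypothetical model of the new scheme: the per-entry state stays with the
// scorer, and only one value (assumed here to be a counter) is needed for the
// stop check. In a GPU version, that value is the only thing copied back.
struct NotDoneCounter {
  std::vector<bool> finished;  // would live in device memory in a GPU version
  int not_done_count;          // the single value the host reads each iteration

  explicit NotDoneCounter(std::size_t count)
      : finished(count, false), not_done_count(static_cast<int>(count)) {}

  void MarkDone(std::size_t index) {
    if (!finished[index]) {  // count each entry at most once
      finished[index] = true;
      --not_done_count;
    }
  }

  bool AllDone() const { return not_done_count == 0; }
};
```

In a decoding loop, the host would check `AllDone()` once per iteration and stop generating as soon as it returns true.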

Remaining future optimizations:
CPU-copied beam indices are still used in the non-DecoderMaskedSelfAttention case. An extra kernel could be written so that PickGptPasteState (called from UpdateGptFeeds) no longer needs them.

Motivation and Context

It's faster to keep the work on the GPU to avoid GPU->CPU->GPU copies of data.

@RyanUnderhill RyanUnderhill marked this pull request as ready for review June 20, 2023 06:44
tianleiwu previously approved these changes Jun 23, 2023
tianleiwu previously approved these changes Jun 26, 2023
@RyanUnderhill RyanUnderhill merged commit 1001ec9 into main Jun 27, 2023
@RyanUnderhill RyanUnderhill deleted the ryanunderhill/beamscorer_gpu branch June 27, 2023 22:08
tianleiwu added a commit that referenced this pull request Feb 7, 2025
### Description
* Pass topk_scores to beam scorer in slow topk path.
* Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for slow topk path.

### Motivation and Context

This bug was introduced in #16272.

Beam search uses a fast CUDA kernel when the number of beams is <= 32.
When the beam size exceeds that threshold, a different code path (a
slower CUDA kernel) computes the top-k. In this `slow topk path`,
topk_scores must be passed to the beam scorer, but it was not.

This bug causes incorrect results when num_beams > 32. It was not found
earlier because such large beam sizes are rarely used.
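
The path selection described above can be sketched as follows. This is a minimal host-only illustration, not the ONNX Runtime implementation: `UseFastTopK`, `ScorerInput`, and `BuildScorerInput` are hypothetical names, and the way the `ORT_BEAM_SEARCH_USE_FAST_TOPK` value is interpreted ("0" disables the fast path, anything else or unset leaves it enabled) is an assumption. Only the threshold of 32 beams comes from the commit message.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Threshold stated in the commit message: the fast kernel handles <= 32 beams.
constexpr int kMaxBeamsForFastTopK = 32;

// env_value is the raw value of ORT_BEAM_SEARCH_USE_FAST_TOPK, or nullptr if
// the variable is unset. ASSUMPTION: "0" disables the fast path; any other
// value, or no value at all, leaves it enabled.
bool UseFastTopK(int num_beams, const char* env_value) {
  const bool enabled = (env_value == nullptr) || std::strcmp(env_value, "0") != 0;
  return enabled && num_beams <= kMaxBeamsForFastTopK;
}

// Hypothetical bundle of what the beam scorer receives for one step.
struct ScorerInput {
  bool used_fast_path;
  std::vector<float> topk_scores;
};

// The fix amounts to forwarding topk_scores regardless of which path produced
// them. The bug described above corresponds to the slow path leaving them out.
ScorerInput BuildScorerInput(int num_beams, const char* env_value,
                             const std::vector<float>& topk_scores) {
  ScorerInput input;
  input.used_fast_path = UseFastTopK(num_beams, env_value);
  input.topk_scores = topk_scores;  // forwarded on both paths
  return input;
}
```

A caller could obtain `env_value` with `std::getenv("ORT_BEAM_SEARCH_USE_FAST_TOPK")` from `<cstdlib>`. Setting the variable to force the slow path at a small beam size is one way a test could exercise that path without needing more than 32 beams.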
ashrit-ms pushed a commit that referenced this pull request Feb 11, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025

2 participants