Ryanunderhill/beamscorer gpu #16272

RyanUnderhill · 2023-06-07T16:44:49Z

Description

Make BeamScorer run on the GPU instead of the CPU.

Brief overview:
Adds a CUDA 'CudaBeamSearchScorer' implementation of IBeamScorer
Instead of a 'done' flag per beam, there is a single 'not done' variable that is copied to the CPU every iteration
Removes some of the extra CPU-side buffers and parameters that are no longer needed
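
The single 'not done' variable can be pictured with a small host-side C++ sketch. The names `BeamState` and `CountNotDone` are hypothetical and are not taken from this PR; in the CUDA implementation the value would be produced on the device, so this only illustrates the idea of reading one value per iteration rather than one flag per beam.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-beam state, for illustration only.
struct BeamState {
  bool done = false;
};

// Collapse the individual 'done' flags into one counter.
// The host then only needs to read this single value each iteration:
// zero means everything has finished and the search loop can stop.
int32_t CountNotDone(const std::vector<BeamState>& beams) {
  int32_t not_done = 0;
  for (const BeamState& beam : beams) {
    if (!beam.done) {
      ++not_done;
    }
  }
  return not_done;
}
```

With this shape, the search loop's exit test becomes a comparison of one integer against zero, which is why only a single value has to cross from GPU to CPU.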

Remaining future optimizations:
CPU-copied beam indices are still used in the non-DecoderMaskedSelfAttention case. An extra kernel could be written so that PickGptPastState (called from UpdateGptFeeds) no longer needs the CPU-copied beam indices.
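
As a rough picture of what such a kernel would compute, the sketch below reorders a past-state buffer by beam index on the host. The row-major `[num_beams, block_size]` layout and the name `GatherByBeamIndex` are assumptions made for illustration; the real past-state tensors have more dimensions, and a device version would typically assign one thread per output element.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: copy row beam_indices[i] of `past` into row i of the result.
// `past` is assumed to hold rows of `block_size` floats each.
std::vector<float> GatherByBeamIndex(const std::vector<float>& past,
                                     const std::vector<int32_t>& beam_indices,
                                     std::size_t block_size) {
  std::vector<float> out(beam_indices.size() * block_size);
  for (std::size_t i = 0; i < beam_indices.size(); ++i) {
    const std::size_t src = static_cast<std::size_t>(beam_indices[i]) * block_size;
    for (std::size_t j = 0; j < block_size; ++j) {
      out[i * block_size + j] = past[src + j];
    }
  }
  return out;
}
```

If the indices already live in device memory, a gather of this form can run without first copying them back to the CPU.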

Motivation and Context

Keeping the work on the GPU is faster because it avoids GPU->CPU->GPU copies of data.

onnxruntime/contrib_ops/cpu/transformers/beam_search_scorer.h

onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc

onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_base.h

onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_t5.h

onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_whisper.h

onnxruntime/contrib_ops/cuda/transformers/generation_cuda_impl.cu

onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_gpt.h

onnxruntime/contrib_ops/cpu/transformers/generation_shared.h

### Description

* Pass topk_scores to the beam scorer in the slow topk path.
* Add an environment variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable or disable fast topk.
* Add a test case for the slow topk path.

### Motivation and Context

This bug was introduced in #16272. Beam search uses a fast CUDA kernel when the number of beams is <= 32. When the beam size exceeds that threshold, a different code path (a slower CUDA kernel) computes topk. In this `slow topk path`, topk_scores must be passed to the beam scorer, but they were not. The bug causes incorrect results when num_beams > 32. It was not found earlier because such large beam sizes are rarely used.
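
The path selection described above can be sketched as follows. Only the 32-beam threshold and the variable name `ORT_BEAM_SEARCH_USE_FAST_TOPK` come from the description; the helper names, the constant name, and the parsing rule (unset or anything other than "0" means enabled) are assumptions for illustration and may not match how onnxruntime reads the variable.

```cpp
#include <cstdlib>
#include <cstring>

// Threshold taken from the description: fast kernel when num_beams <= 32.
constexpr int kMaxBeamsForFastTopK = 32;

// Hypothetical helper. `env_value` is the raw value of the environment
// variable, or nullptr when it is unset.
bool UseFastTopK(int num_beams, const char* env_value) {
  // Assumed parsing: only the literal string "0" disables the fast path.
  const bool enabled = (env_value == nullptr) || (std::strcmp(env_value, "0") != 0);
  return enabled && num_beams <= kMaxBeamsForFastTopK;
}

// Convenience wrapper that reads the variable named in the description.
bool UseFastTopKFromEnv(int num_beams) {
  return UseFastTopK(num_beams, std::getenv("ORT_BEAM_SEARCH_USE_FAST_TOPK"));
}
```

Separating the decision from the `getenv` call keeps the rule easy to test for both paths, which matters here because the slow path is rarely exercised.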

RyanUnderhill added 14 commits May 22, 2023 19:11

Stage to merge with main

4717839

Merge branch 'main' of https://github.com/microsoft/onnxruntime into ryanunderhill/beamscorer_gpu

ca052a3

Merge branch 'main' of https://github.com/microsoft/onnxruntime into ryanunderhill/beamscorer_gpu

f7c0227

Stash changes

fd98783

Works in all cases now

75b42ef

Added parallel batch processing

Remove another gpu->cpu copy

3d288df

Remove now unnecessary cpu memory buffer to hold it

Remove unused parameter

7180b5a

Remove unused parameters

eba3c46

Remove file not meant to be part of PR

1cbf45d

Remove file not meant to be in PR

ab7d3ee

Optimized GPU kernels

8a57830

Cuda build issue

8d1059a

Linux build strictness

d9203dc

Remove cudaMemcpyAsync in favor of just writing a single value into pinned CPU memory from the kernel.

3a24d5a

tianleiwu reviewed Jun 14, 2023

onnxruntime/contrib_ops/cpu/transformers/beam_search_scorer.h

RyanUnderhill added 8 commits June 15, 2023 14:01

Type fix

196d94c

Merge with main

6f6110a

Merge branch 'main' of https://github.com/microsoft/onnxruntime into ryanunderhill/beamscorer_gpu

b48e494

Build fixes

8f16382

Fix CPU mistake

2534239

Fix warnings

759469f

Convert T5 & Whisper

5a531b1

Merge with main

50ca52b

RyanUnderhill marked this pull request as ready for review June 20, 2023 06:44

RyanUnderhill added 2 commits June 19, 2023 23:58

Merge conflict

2f3a3b9

Lint

30f4e7a