Add OpenMP-based implementation for TryParallelFor to fix BERT model perf regression; fix Gelu implementation. #3651
Conversation
…perf regression. Fix Gelu implementation.
```cpp
MlasComputeErf(output, output, len);
ym = xm * 0.5f * (ym + 1.0f);
});
concurrency::ThreadPool* tp = context->GetOperatorThreadPool();
```
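The snippet above computes the erf-based Gelu, Gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))). A minimal standalone sketch of that computation over a flat float buffer; `MlasComputeErf` is the real MLAS routine, while the function name here and the use of `std::erf` in its place are illustrative assumptions so the sketch compiles on its own:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch of an erf-based Gelu over a flat buffer; not
// onnxruntime's actual kernel. std::erf stands in for MlasComputeErf.
void GeluSketch(const float* input, float* output, size_t len) {
  const float sqrt2 = std::sqrt(2.0f);
  // Compute erf(x / sqrt(2)) in place, mirroring
  // MlasComputeErf(output, output, len) in the diff above.
  for (size_t i = 0; i < len; ++i) {
    output[i] = std::erf(input[i] / sqrt2);
  }
  // Matches the "ym = xm * 0.5f * (ym + 1.0f)" line in the diff:
  // shift erf from [-1, 1] to [0, 2] and multiply by 0.5 * x.
  for (size_t i = 0; i < len; ++i) {
    output[i] = input[i] * 0.5f * (output[i] + 1.0f);
  }
}
```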
This basically reverts the previous set of changes to bring it in line with BiasGelu. Plus, it was reported that the previous version was throwing some accuracy errors.
BiasGelu uses different logic (no Eigen any more). See here for FastGelu; it needs a slight change for Gelu.
Sorry, I don't quite understand. What was wrong with my previous version, and what was changed? The single-threaded part looks the same to me.
* Update TopK implementation.
  - add faster heap
  - special case k=1
  - update selector for when to use heap and when to use nth_element based on performance testing
  - parallelize if enough work to do
  - reduce templatized code
  - add some extra unit tests

  Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:

  ```
  batches = [10, 25, 50]
  batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
  k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
  ```

  For larger batches (e.g. 50x2048) the speedup is over 20x.
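The commit above chooses between a bounded heap and `std::nth_element` depending on k. A hypothetical illustration of that kind of selector, not onnxruntime's actual code; the crossover ratio is an assumed placeholder that would in practice come from performance testing as the commit describes:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the k largest values in descending order. Assumes k <= data.size().
std::vector<float> TopKSketch(const std::vector<float>& data, size_t k) {
  // Assumed crossover: prefer the heap when k is small relative to n.
  if (k * 4 < data.size()) {
    // Min-heap of the k largest seen so far: O(n log k), good when k << n.
    std::priority_queue<float, std::vector<float>, std::greater<float>> heap;
    for (float v : data) {
      if (heap.size() < k) {
        heap.push(v);
      } else if (v > heap.top()) {
        heap.pop();
        heap.push(v);
      }
    }
    std::vector<float> result(k);
    for (size_t i = k; i-- > 0;) {  // pops smallest-first; fill back-to-front
      result[i] = heap.top();
      heap.pop();
    }
    return result;
  }
  // Otherwise partition with nth_element: O(n) average, better when k ~ n.
  std::vector<float> copy = data;
  std::nth_element(copy.begin(), copy.begin() + k, copy.end(),
                   std::greater<float>());
  copy.resize(k);
  std::sort(copy.begin(), copy.end(), std::greater<float>());
  return copy;
}
```

For k = 1 the heap degenerates to a single running maximum, which is presumably why the commit special-cases it.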
* Remove parameters like --gpu_only --sequence_length. Update bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it.
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* checkin
* fix MSVC build error
* test changes
* split pivot output into multiple tensors
* add horizon tensor
* Support multiple types for non-pivot tensor
* limit horizon tensor type to int32_t as max_horizon type
* work around some conversion warnings for local machine
* support variadic shape for non-pivot input
* dropping all rows is an exception
* fix a bug
* fix the way that generates horizon tensor
* more tests added
* add TypeConstraint() in ONNX_OPERATOR_KERNEL_EX
* update FeaturizersLibrary
* add features to short_grain_dropper for ONNX export
* update FeaturizersLibrary
* fix warnings
* Removes omp to use ThreadPool
* removes unnecessary old OMP code
* rename compute_agg, use ThreadPool::NumThreads

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
Something got messed up in my latest push. I'll close this and open a separate PR.
Closed in favor of #3667
Description: Add an OpenMP-based implementation for TryParallelFor to fix the BERT model perf regression; fix the Gelu implementation.
Motivation and Context
@tianleiwu recently reported that the BERT model regressed in perf after the changes in PR 3153.
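For illustration, a minimal sketch of an OpenMP-backed parallel-for with a serial fallback, in the spirit of the PR title; the name, signature, and fallback behavior are assumptions rather than the PR's actual TryParallelFor:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical OpenMP-based parallel-for sketch; not onnxruntime's
// actual implementation.
void TryParallelForSketch(std::ptrdiff_t total,
                          const std::function<void(std::ptrdiff_t)>& fn) {
#ifdef _OPENMP
  // With OpenMP enabled, split the iteration space across threads.
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < total; ++i) {
    fn(i);
  }
#else
  // Without OpenMP, fall back to a plain serial loop.
  for (std::ptrdiff_t i = 0; i < total; ++i) {
    fn(i);
  }
#endif
}
```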