
Add openmp based implementation for TryParallelFor to fix Bert model perf regression, Fix Gelu implementation. #3651

Closed
wants to merge 11 commits

Conversation

pranavsharma
Contributor

Description: Add openmp based implementation for TryParallelFor to fix Bert model perf regression, Fix Gelu implementation.

Motivation and Context
@tianleiwu recently reported that Bert model performance regressed after the changes in PR 3153.

@pranavsharma pranavsharma requested a review from a team as a code owner April 23, 2020 07:20
```
MlasComputeErf(output, output, len);
ym = xm * 0.5f * (ym + 1.0f);
});
concurrency::ThreadPool* tp = context->GetOperatorThreadPool();
```
Contributor Author


This basically reverts the previous set of changes to bring the implementation in line with BiasGelu. In addition, the previous version was reported to produce accuracy errors.

Contributor


BiasGelu now uses different logic (no Eigen any more). See here for FastGelu; it needs a slight change to work for Gelu.

Member


Sorry, I don't quite understand. What was wrong with my previous change? The single-threaded part looks the same to me.

@pranavsharma pranavsharma requested a review from tracysh April 23, 2020 07:22
skottmckay and others added 10 commits April 23, 2020 13:16
* Update TopK implementation.
  - add faster heap
  - special case k=1
  - update selector for when to use heap and when to use nth_element based on performance testing
  - parallelize if enough work to do
  - reduce templatized code
  - add some extra unit tests.

Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:

```
    batches = [10, 25, 50]
    batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
    k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
```

For larger batches (e.g. 50x2048) the speedup is over 20x.
* Remove parameters like --gpu_only --sequence_length. Update the Bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it. 
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* checkin

* fix MSVC build error

* test changes

* split pivot output into multiple tensors

* add horizon tensor

* Support multiple types for non-pivot tensor

* limit horizon tensor type to int32_t as max_horizon type

* work around some conversion warnings for local machine

* support variadic shape for non-pivot input

* dropping all rows is an exception

* fix a bug

* fix the way that generates horizon tensor

* more tests added

* add TypeConstraint() in ONNX_OPERATOR_KERNEL_EX

* update FeaturizersLibrary
* add features to short_grain_dropper for ONNX export

* update FeaturizersLibrary

* fix warnings
* Removes omp to use ThreadPool

* removes unnecessary old OMP code

* rename compute_agg, use ThreadPool::NumThreads

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
@pranavsharma
Contributor Author

Something got messed up in my latest push. I'll close this and open a separate PR.

@pranavsharma
Contributor Author

Closed in favor of #3667

9 participants