Add OpenMP-based implementation for TryParallelFor to fix BERT model perf regression; fix Gelu implementation. #3651
Conversation
…perf regression. Fix Gelu implementation.
```cpp
MlasComputeErf(output, output, len);
ym = xm * 0.5f * (ym + 1.0f);
});
concurrency::ThreadPool* tp = context->GetOperatorThreadPool();
```
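The snippet above computes the erf-based Gelu, Gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))). A minimal standalone sketch of that computation over a flat float buffer; `MlasComputeErf` is the real MLAS routine, while the function name here and the use of `std::erf` in its place are illustrative assumptions so the sketch compiles on its own:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch of an erf-based Gelu over a flat buffer; not
// onnxruntime's actual kernel. std::erf stands in for MlasComputeErf.
void GeluSketch(const float* input, float* output, size_t len) {
  const float sqrt2 = std::sqrt(2.0f);
  // Compute erf(x / sqrt(2)) in place, mirroring
  // MlasComputeErf(output, output, len) in the diff above.
  for (size_t i = 0; i < len; ++i) {
    output[i] = std::erf(input[i] / sqrt2);
  }
  // Matches the "ym = xm * 0.5f * (ym + 1.0f)" line in the diff:
  // shift erf from [-1, 1] to [0, 2] and multiply by 0.5 * x.
  for (size_t i = 0; i < len; ++i) {
    output[i] = input[i] * 0.5f * (output[i] + 1.0f);
  }
}
```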
This basically reverts the previous set of changes to bring it in line with BiasGelu. Plus, it was reported that the previous version was throwing some accuracy errors.
BiasGelu uses different logic (no Eigen any more). See here for FastGelu; it needs a slight change for Gelu.
Sorry, I don't quite understand. What was wrong with my previous version, and what was changed? The single-threaded part looks the same to me.
* Update TopK implementation.
  - add faster heap
  - special case k=1
  - update selector for when to use heap and when to use nth_element based on performance testing
  - parallelize if enough work to do
  - reduce templatized code
  - add some extra unit tests

  Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:

  ```
  batches = [10, 25, 50]
  batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
  k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
  ```

  For larger batches (e.g. 50x2048) the speedup is over 20x.
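The commit above chooses between a bounded heap and `std::nth_element` depending on k. A hypothetical illustration of that kind of selector, not onnxruntime's actual code; the crossover ratio is an assumed placeholder that would in practice come from performance testing as the commit describes:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the k largest values in descending order. Assumes k <= data.size().
std::vector<float> TopKSketch(const std::vector<float>& data, size_t k) {
  // Assumed crossover: prefer the heap when k is small relative to n.
  if (k * 4 < data.size()) {
    // Min-heap of the k largest seen so far: O(n log k), good when k << n.
    std::priority_queue<float, std::vector<float>, std::greater<float>> heap;
    for (float v : data) {
      if (heap.size() < k) {
        heap.push(v);
      } else if (v > heap.top()) {
        heap.pop();
        heap.push(v);
      }
    }
    std::vector<float> result(k);
    for (size_t i = k; i-- > 0;) {  // pops smallest-first; fill back-to-front
      result[i] = heap.top();
      heap.pop();
    }
    return result;
  }
  // Otherwise partition with nth_element: O(n) average, better when k ~ n.
  std::vector<float> copy = data;
  std::nth_element(copy.begin(), copy.begin() + k, copy.end(),
                   std::greater<float>());
  copy.resize(k);
  std::sort(copy.begin(), copy.end(), std::greater<float>());
  return copy;
}
```

For k = 1 the heap degenerates to a single running maximum, which is presumably why the commit special-cases it.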
* Remove parameters like --gpu_only --sequence_length. Update bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it.
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* checkin
* fix MSVC build error
* test changes
* split pivot output into multiple tensors
* add horizon tensor
* Support multiple types for non-pivot tensor
* limit horizon tensor type to int32_t as max_horizon type
* work around some conversion warnings for local machine
* support variadic shape for non-pivot input
* dropping all rows is an exception
* fix a bug
* fix the way that generates horizon tensor
* more tests added
* add TypeConstraint() in ONNX_OPERATOR_KERNEL_EX
* update FeaturizersLibrary
* add features to short_grain_dropper for ONNX export
* update FeaturizersLibrary
* fix warnings
* Removes omp to use ThreadPool
* removes unnecessary old OMP code
* rename compute_agg, use ThreadPool::NumThreads

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
Something got messed up in my latest push. I'll close this and open a separate PR.
Closed in favor of #3667
Description: Add an OpenMP-based implementation for TryParallelFor to fix the BERT model perf regression; fix the Gelu implementation.
Motivation and Context
@tianleiwu recently reported that the BERT model regressed in perf after the changes in PR 3153.
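For illustration, a minimal sketch of an OpenMP-backed parallel-for with a serial fallback, in the spirit of the PR title; the name, signature, and fallback behavior are assumptions rather than the PR's actual TryParallelFor:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical OpenMP-based parallel-for sketch; not onnxruntime's
// actual implementation.
void TryParallelForSketch(std::ptrdiff_t total,
                          const std::function<void(std::ptrdiff_t)>& fn) {
#ifdef _OPENMP
  // With OpenMP enabled, split the iteration space across threads.
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < total; ++i) {
    fn(i);
  }
#else
  // Without OpenMP, fall back to a plain serial loop.
  for (std::ptrdiff_t i = 0; i < total; ++i) {
    fn(i);
  }
#endif
}
```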