-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Port all relevant AVX hardware intrinsics C# APIs from SIMD native algorithms #691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
29 commits
Select commit
Hold shift + click to select a range
0dab6d1
Implemented AVX intrinsics
briancylui 3d76fb1
Implemented performance tests for AVX intrinsics, with some fixes to …
briancylui 6a51bd8
Changes to perf tests in response to feedback
briancylui 1b2cea9
Fixes across multiple files to make unit tests and perf tests work fo…
briancylui f471726
Implemented new AVX intrinsics that do not involve matrix operations,…
briancylui 41d65f5
Implemented perf tests for AVX via CpuMathUtils class
briancylui 8c34e87
Implemented switching logic for Vector128/256Alignment between SSE an…
briancylui ddeb655
Changed perf tests to reveal SSE and AVX intrinsics perf separately
briancylui c776fb0
Fixed access modifiers of private fields
briancylui df09fe3
Implemented all unit tests for AVX intrinsics that do not involve mat…
briancylui a9c481f
Implemented unit tests for AVX intrinsics
briancylui c692a6f
Fixed errors on the RffTransform.CfltAlign const-expression requirement
briancylui 40528e4
Fixed Debug errors by making RffTransform.CfltAlign read-only
briancylui 4d7d8ef
Fixed errors by making CfltAlign static (and read-only)
briancylui 75e4cde
Developed two unit tests for netcoreapp and netstandard to deal with …
briancylui a763059
Kept only the most recent unit tests which are sufficient for both ne…
briancylui 8bc8cc8
Respond to PR feedback: Style changes
briancylui f1664fa
Respond to PR feedback: More style changes
briancylui 26ed884
Implemented class inheritance in perf tests to reduce overlapping code
briancylui 31de895
Respond to PR feedback: Changed Sse/AvxIntrinsics from public to inte…
briancylui f07afb2
Respond to PR feedback: Used env vars to determine whether to use AVX…
briancylui 9a9d272
Respond to PR feedback: Included 0 into consideration for parsing env…
briancylui c249d88
Respond to PR feedback: env vars, InternalsVisibleTo, and abstract
briancylui f606432
Respond to PR feedback: Added new comparer class specifically for MatMul
briancylui 27ad829
Respond to PR feedback: Changes to intrinsics
briancylui 0fd78a6
Respond to PR comment: Makes alignment checking consistent in externa…
briancylui 3380ded
Respond to PR feedback: Refactored Sse/AvxIntrinsics helper functions
briancylui b8d63cc
Made two Sse/AvxIntrinsics helper functions about AlignedArray inline…
briancylui 32a3704
Respond to PR feedback: styles for Vector256Alignment and Avx.GetLowe…
briancylui File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ensure that using 256-bit operations is a net-win for the entire application before "enabling" it by default. For certain workloads, it can downclock the processor and while this generally results in a higher throughput for the specific 256-bit workload, it can potentially cause a perf hit for any non 256-bit workloads that get executed before the processor goes back to its "highest" frequency level.
As per 15.26 Skylake Server Power Management (from the "Intel® 64 and IA-32 Architectures Optimization Reference Manual")

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same logic applies, but to a lesser extent, to AVX-256 instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@briancylui - there is one end-to-end benchmark in the repo today - https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks. Can you try running it with/without your changes and post the results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to @tannergooding 's concern, what will be the end-to-end scenarios or applications to drive the optimizations? https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks lists one end-to-end benchmark, but it seems iris.txt is a small data set. Do we have more end-to-end scenarios to test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are currently formulating more benchmarks to add to the repo. See #711.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eerhardt: Feel free to check out the perf results for end-to-end perf scenarios on briancylui#7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding suggested earlier about checking if the input array length is greater than a certain
AvxLengthLimit
along with checkingAvx.IsSupported
. I could do some perf tests that accept aParam
that is the length of the input array to see what valueAvxLengthLimit
should take. Same forSseLengthLimit
.I think one of the problems about AVX that impacts perf right now is that we haven't implemented the "double-compute" to take care of alignment that @tannergooding mentioned in briancylui#2 (bottom item). I look forward to implementing it after this PR, since it would change the layout of every function in a significant way.