-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify and rewrite reference packing kernels. #610
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
These are defined in sub-configuration-specific header files, which are only included by reference kernels.
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that mostly these changes make the cntx_t code oblivious to BLAS level, but some l3-specific functions remain for compatibility.
1. The generic gemm kernel breaks on armsve because there is no compile-time MR/NR. The refernce gemm kernels has been modified to detect this and fallback to a "dumb" version. 2. For some reason, adding an optimization for writing back full microtiles in row-major storage to the reference gemm kernel results in a segfault for armv7a/gcc-9.3. I can't tell if I'm doing something wrong of if there is a compiler bug. This optimization has been removed for the time being.
…vailable as macros. The array of reference packing kernels (0--31) are replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
So what was the source of the breakage in #609? |
Nevermind, just saw your email. |
NOT ready to merge yet, there are problems when broadcast packing is used. |
BTW, we should update the |
That document will need many updates when I'm done with it.
|
- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks. - *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
# Conflicts: # ref_kernels/3/bli_gemmtrsm_ref.c # ref_kernels/ind/bli_gemmtrsm1m_ref.c
@fgvanzee this should be ready to merge now, whenever 0.9.0 is out. |
Cool. Thanks for the update. |
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
fgvanzee
added a commit
that referenced
this pull request
Jul 11, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Jul 14, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Jul 31, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Aug 4, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Aug 7, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Aug 7, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201)
fgvanzee
added a commit
that referenced
this pull request
Aug 22, 2023
Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_<subconfig>() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. - (cherry picked from commit ae10d94) Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. - (cherry picked from commit 667f201) Added 'kappa == 1' branches to bli_packm_cxk_ref.c. Details: - Modified bli_packm_cxk_ref function templates so that separate loops that perform copys (instead of scal2s) are instantiated for cases where kappa == 1. - Guarded 'GCC push_options' and 'GCC optimize ("O2")' pragmas with BLIS_FORCE_ROLL_PACKM_REF_KERNEL to preserve the previous behavior whereby loop unrolling is forced off for the packm reference kernels.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The array of reference packing kernels (0--31) are replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. IMPORTANT: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
Depends on #609.