Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify and rewrite reference packing kernels. #610

Merged
merged 17 commits into from
Apr 7, 2022
Merged

Conversation

devinamatthews
Copy link
Member

@devinamatthews devinamatthews commented Feb 9, 2022

The array of reference packing kernels (0--31) are replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. IMPORTANT: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.

Depends on #609.

These are defined in sub-configuration-specific header files, which are only included by reference kernels.
The gemm reference kernel now uses the configuration-dependent BLIS_MR_x/BLIS_NR_x macros to control unrolling, rather than fixed values. This fixes #259 and replaces PR #547.
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that mostly these changes make the cntx_t code oblivious to BLAS level, but some l3-specific functions remain for compatibility.
1. The generic gemm kernel breaks on armsve because there is no
   compile-time MR/NR. The refernce gemm kernels has been modified
   to detect this and fallback to a "dumb" version.
2. For some reason, adding an optimization for writing back full
   microtiles in row-major storage to the reference gemm kernel
   results in a segfault for armv7a/gcc-9.3. I can't tell if I'm
   doing something wrong of if there is a compiler bug. This
   optimization has been removed for the time being.
…vailable as macros.

The array of reference packing kernels (0--31) are replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
@fgvanzee
Copy link
Member

So what was the source of the breakage in #609?

@fgvanzee
Copy link
Member

So what was the source of the breakage in #609?

Nevermind, just saw your email.

@devinamatthews
Copy link
Member Author

NOT ready to merge yet, there are problems when broadcast packing is used.

@fgvanzee
Copy link
Member

BTW, we should update the KernelsHowTo documentation to reflect the changes to the cntx_t APIs.

@devinamatthews
Copy link
Member Author

devinamatthews commented Feb 15, 2022 via email

- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks.
- *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
# Conflicts:
#	ref_kernels/3/bli_gemmtrsm_ref.c
#	ref_kernels/ind/bli_gemmtrsm1m_ref.c
@devinamatthews
Copy link
Member Author

@fgvanzee this should be ready to merge now, whenever 0.9.0 is out.

@fgvanzee
Copy link
Member

Cool. Thanks for the update.

devinamatthews and others added 5 commits March 15, 2022 12:52
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
@fgvanzee fgvanzee merged commit ae10d94 into master Apr 7, 2022
@devinamatthews devinamatthews deleted the mr_nr_macros branch April 11, 2022 17:42
fgvanzee added a commit that referenced this pull request Jul 11, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Jul 14, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Jul 31, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Aug 4, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Aug 7, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Aug 7, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)
fgvanzee added a commit that referenced this pull request Aug 22, 2023
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm,
    trsm, gemmtrsm). This replication factor is controlled by
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
    replication factor of 1 has been tested. armsve also needs testing
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
- (cherry picked from commit ae10d94)

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
- (cherry picked from commit 667f201)

Added 'kappa == 1' branches to bli_packm_cxk_ref.c.

Details:
- Modified bli_packm_cxk_ref function templates so that separate loops
  that perform copys (instead of scal2s) are instantiated for cases
  where kappa == 1.
- Guarded 'GCC push_options' and 'GCC optimize ("O2")' pragmas with
  BLIS_FORCE_ROLL_PACKM_REF_KERNEL to preserve the previous behavior
  whereby loop unrolling is forced off for the packm reference kernels.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants