Squash-merge 'pr' into 'squash'. #457
Merged
Details: - Fixed the Makefile in test/3 so that it no longer incorrectly labels the matlab output variables from Eigen-linked hemm, herk, trmm, and trsm driver output as "vendor". (The gemm drivers were already correctly outputting matlab variables containing the "eigen" label.)
Details: - Updated matlab scripts in test/3/matlab to optionally plot/display Eigen performance curves. Whether Eigen is plotted is determined by a new boolean function parameter, with_eigen. - Updated runme.m scratchpad to reflect the latest invocations of the plot_panel_4x5() function (with Eigen plotting enabled).
Details: - Updated the Haswell, SkylakeX, and Epyc performance graphs in docs/graphs to report on Eigen implementations, where applicable. Specifically, Eigen implements all level-3 operations sequentially; however, of those operations, it only provides multithreaded gemm. Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are omitted. Thanks to Sameer Agarwal for his help configuring and using Eigen. - Updated docs/Performance.md to note the new implementation tested. - CREDITS file update.
Details: - Added/updated a few more details, mostly regarding Eigen.
Details: - Updated the level-3 performance graphs in docs/graphs with new Eigen results, this time using a development version cloned from their git mirror on March 27, 2019 (version 3.3.90). Performance is improved over 3.3.7, though still noticeably short of BLIS/MKL in most cases. - Very minor updates to docs/Performance.md and matlab scripts in test/3/matlab.
Details: - Renamed kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c to kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c. This follows the naming convention used by other kernel sets, most notably haswell.
Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such pointer is assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time. - Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files.
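A minimal sketch of the void_fp idea described above, for illustration only. The kernel type scalv_ker_ft and the functions my_dscalv(), query_kernel(), and scale_with_queried_kernel() are hypothetical names, not BLIS APIs. Unlike the commit, which relies on gcc's silent conversion, the sketch uses explicit casts so that it compiles quietly on other compilers too. It still assumes a POSIX-like platform where void* and function pointers are interchangeable.

```c
#include <stddef.h>

/* Hypothetical kernel signature, loosely modeled on a scalv-style kernel. */
typedef void (*scalv_ker_ft)(double alpha, double *x, size_t n);

/* As in the commit: void_fp is simply an alias for void*, not a true
   function-pointer type. */
typedef void *void_fp;

/* Hypothetical kernel: scales x by alpha in place. */
static void my_dscalv(double alpha, double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Stores a function pointer in a void_fp. This conversion is guaranteed by
   POSIX but not by ISO C. */
void_fp query_kernel(void)
{
    return (void_fp)my_dscalv;
}

/* Recovers the type-accurate function pointer and calls it. */
void scale_with_queried_kernel(double alpha, double *x, size_t n)
{
    scalv_ker_ft ker = (scalv_ker_ft)query_kernel();
    ker(alpha, x, n);
}
```

Calling scale_with_queried_kernel(2.0, x, 3) on x = {1, 2, 3} doubles each element in place.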
Details: - Added more details and clarifying language to implications of 1m and the recycling of microkernels between microarchitectures.
Details: - Fixed a minor bug in flatten-headers.py whereby the script, upon encountering a #include directive for the root header file, would erroneously recurse and inline the contents of that root header. The script has been modified to avoid recursion into any headers that share the same name as the root-level header that was passed into the script. (Note: this bug didn't actually manifest in BLIS, so it's merely a precaution for usage of flatten-headers.py in other contexts.)
Details: - Changed the default installation prefix from $HOME/lib to /usr/local. - Modified the way configure internally handles the prefix, libdir, includedir, and sharedir (and also added an --exec-prefix option). The defaults to these variables are set as follows: prefix: /usr/local exec_prefix: ${prefix} libdir: ${exec_prefix}/lib includedir: ${prefix}/include sharedir: ${prefix}/share The key change, aside from the addition of exec_prefix and its use to define the default to libdir, is that the variables are substituted into config.mk with quoting that delays evaluation, meaning the substituted values may contain unevaluated references to other variables (namely, ${prefix} and ${exec_prefix}). This more closely follows GNU conventions, including those used by GNU autoconf, and also allows make to override any one of the variables *after* configure has already been run (e.g. during 'make install'). - Updates to build/config.mk.in pursuant to above changes. - Updates to output of 'configure --help' pursuant to above changes. - Updated docs/BuildSystem.md to reflect the new default installation prefix, as well as mention EXECPREFIX and SHAREDIR. - Changed the definitions of the UNINSTALL_OLD_* variables in the top-level Makefile to use $(wildcard ...) instead of 'find'. This was motivated by the new way of handling prefix and friends, which leads to the 'find' command being run on /usr/local (by default), which can take a while and almost never yields any benefit (since the user will very rarely use the uninstall-old targets). - Removed periods from the end of descriptive output statements (i.e., non-verbose output) since those statements often end with file or directory paths, which get confusing to read when punctuated by a period. - Trivial change to 'make showconfig' output. - Removed my name from 'configure --help'. (Many have contributed to it over the years.) 
- In configure script, changed the default state of threading_model variable from 'no' to 'off' to match that of debug_type, where there are similarly more than two valid states. ('no' is still accepted if given via the --enable-debug= option, though it will be standardized to 'off' prior to config.mk being written out.) - Minor variable name change in flatten-headers.py that was intended for 32812ff. - CREDITS file update.
Details: - Somehow the variable name change (root_file_name -> root_inputname) in flatten-headers.py mentioned in the commit log entry for 89a70cc didn't make it into the actual commit. This commit applies that change.
Details: - Added preprocessor branches to test/3/test_gemm.c to explicitly support row-stored matrices. Column-stored matrices are also still supported (and are the default for now). (This is mainly residual work left over from the initial integration of Eigen into the test drivers, so if we ever want to test Eigen with row-stored matrices, the code will be ready to use, even if it is not yet integrated into the Makefile in test/3.)
Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. 
In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. - sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. 
This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. - Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. 
(They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.
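The lazy stride requests mentioned above ("0, 0" for column storage, "-1, -1" for row storage) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual bli_adjust_strides() logic: the strides_t type and resolve_strides() function are hypothetical, and the sketch ignores details such as leading-dimension alignment and degenerate dimensions.

```c
/* Hypothetical pair of row and column strides. */
typedef struct {
    long rs; /* distance between consecutive rows    */
    long cs; /* distance between consecutive columns */
} strides_t;

/* Resolves a lazy stride request for an m-by-n matrix:
     (0, 0)   -> column storage: rs = 1, cs = m
     (-1, -1) -> row storage:    rs = n, cs = 1
   Any other pair is assumed to be explicit and is returned unchanged. */
strides_t resolve_strides(long m, long n, long rs, long cs)
{
    strides_t s = { rs, cs };

    if (rs == 0 && cs == 0) {
        s.rs = 1;
        s.cs = m;
    } else if (rs == -1 && cs == -1) {
        s.rs = n;
        s.cs = 1;
    }
    return s;
}
```

For a 4x3 matrix, the column-storage request resolves to strides (1, 4) and the row-storage request to (3, 1).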
Details: - Fixed an incorrectly-named macro guard that is intended to allow disabling of the sup framework via the configure option --disable-sup-handling. In this case, the preprocessor macro, BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).
Details: - Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long conditional that would determine whether to allow the last iteration to be merged with the second-to-last iteration. Functionally, the macros were not needed, and they ended up causing problems when building configuration families such as intel64 and x86_64.
Details: - Fixed an obscure bug in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel that only affected the beta == 0, column-storage output case. Thanks to the BLAS test drivers for catching this bug. - Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if k = 0, when the correct action would be to scale by beta (and then return). Thanks to the BLAS test drivers for catching this bug. - Changed the sup threshold behavior such that the sup implementation only kicks in if a matrix dimension is strictly less than (rather than less than or equal to) the threshold in question. - Initialize all thresholds to zero (instead of 10) by default in ref_kernels/bli_cntx_ref.c. This, combined with the above change to threshold testing, means that calls to BLIS or BLAS with one or more matrix dimensions of zero will no longer trigger the sup implementation. - Added disabled debugging output to frame/3/bli_l3_sup.c (for future use, perhaps).
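The revised threshold rule can be sketched as a small predicate. This is an illustration of the rule as stated above, not the code in frame/3/bli_l3_sup.c. The sup_thresh_t type and sup_accepts() function are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical container for the three per-dimension thresholds. */
typedef struct {
    long m;
    long n;
    long k;
} sup_thresh_t;

/* Returns true if the sup path should take the problem, i.e. if at least
   one dimension is strictly less than its threshold. Otherwise the problem
   would fall through to the conventional implementation. */
bool sup_accepts(long m, long n, long k, sup_thresh_t t)
{
    return m < t.m || n < t.n || k < t.k;
}
```

With all thresholds at zero, no non-negative dimension can be strictly less than its threshold, so even a problem with a zero dimension is rejected. Under the earlier less-than-or-equal rule, that same problem would have been accepted.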
Details: - Added #ifndef _POSIX_C_SOURCE #define _POSIX_C_SOURCE 200809L #endif to bli_system.h so that an application that uses BLIS (specifically, an application that #includes blis.h) does not need to remember to #define the macro itself (either on the command line or in the code that includes blis.h) in order to activate things like the pthreads. Thanks to Christos Psarras for reporting this issue and suggesting this fix. - Commented out #include <sys/time.h> in bli_system.h, since I don't think this header is used/needed anymore. - Comment update to function macro for bli_?normiv_unb_var1() in frame/util/bli_util_unb_var1.c.
Details: - Commented out redundant setting of LIBBLIS_LINK within all driver- level Makefiles. This variable is already set within common.mk, and so the only time it should be overridden is if the user wants to link to a different copy of libblis. - Very minor changes to build/gen-make-frags/gen-make-frag.sh. - Whitespace and inconsequential quoting change to configure. - Moved top-level 'windows' directory into a new 'attic' directory.
Details: - Increased the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 80 to 180 for both the haswell and zen subconfigurations. This is less about the m dimension in particular and more about facilitating a smoother performance transition when m = n = k.
Details: - Documented the BLIS environment variables that were set (e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and threading configuration in order to achieve the parallelism reported on in docs/Performance.md.
Details: - Attempted a fix to issue #313, which reports that when building only a shared library (i.e., static library build is disabled), running the BLAS test drivers can fail because those drivers provide their own local version of xerbla_() as a clever (albeit still rather hackish) way of checking the error codes that result from the individual tests. This local xerbla_() function is never found at link-time because the BLAS test drivers' Makefile imports BLIS compilation flags via the get-user-cflags-for() function, which currently conveys the -fvisibility=hidden flag, which hides symbols unless they are explicitly annotated for export. The -fvisibility=hidden flag was only ever intended for use when building BLIS (not for applications), and so the attempted solution here is to omit the symbol export flag(s) from get-user-cflags-for() by storing the symbol export flag(s) to a new BUILD_SYMFLAGS variable instead of appending it to the subconfigurations' CMISCFLAGS variable (which is returned by every get-*-cflags-for() function). Thanks to M. Zhou for reporting this issue and also to Isuru Fernando for suggesting the fix. - Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly created BUILD_SYMFLAGS. - Fixed typo in entry for --export-shared flag in 'configure --help' text.
Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k.
Details: - Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown. - Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup. - Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab. - Updated README.md to mention and refer to the new PerformanceSmall.md document.
Details: - Added performance analysis to "Comments" section of both Kaby Lake and Epyc sections. - Added emphasis to certain passages.
Details: - Updated ReleaseNotes.md in preparation for next version. - CREDITS file update.
Details: - Since GEMM kernel prefers row-storage, if input C matrix is in col-major order, entire operation is transposed. In that case uplo(c) needs to be toggled before kernel-variant selection. - disabled "bli_gemmsup_ref_var1n2m_opt_cases" inside gemmtsup. - Updated version number to 2.2.1 Change-Id: I0a85df1141fc4a98d98ea4e0c3d42db8602fa69b
Details: - BLIS test application throws an error when built with dynamic library as "Undefined reference to bli_abort". This happens because bli_abort is hidden and cannot be linked from outside. Annotating prototype with BLIS_EXPORT_BLAS to make it public. Change-Id: I0d7aec046e8871ba6491024694ed06f883b005ac AMD Internal: [CPUPL-1030]
…nels Change-Id: Ib309aba0cb08161877fd1a720ed65222d3b303f3
Details: - Since C is triangular, in order to maintain load balance among threads, we need to use weighted range partitioning. Change-Id: I03d8ff71ac7af843acd787f1389b5907b56453ee
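To illustrate why a triangular C calls for weighted partitioning, here is a rough sketch that splits the columns of an n-by-n lower triangle so that each thread receives a similar number of stored elements rather than a similar number of columns. It is not the BLIS partitioning routine. The function names and the choice to partition by columns are assumptions made for the example.

```c
#include <stddef.h>

/* Number of stored elements in column j of an n-by-n lower triangle. */
static size_t col_weight(size_t n, size_t j)
{
    return n - j;
}

/* Computes the half-open column range [*start, *end) assigned to thread
   tid out of nt threads (nt >= 1, tid < nt). Boundaries are placed where
   the running element count first reaches each thread's share of the
   total, so adjacent threads' ranges meet without gaps or overlap. */
void weighted_range(size_t n, size_t nt, size_t tid,
                    size_t *start, size_t *end)
{
    size_t total = n * (n + 1) / 2;
    size_t lo_target = total * tid / nt;
    size_t hi_target = total * (tid + 1) / nt;
    size_t acc = 0;
    size_t j = 0;

    /* Skip the columns that belong to lower-numbered threads. */
    while (j < n && acc < lo_target) {
        acc += col_weight(n, j);
        ++j;
    }
    *start = j;

    /* Take columns until this thread's cumulative target is reached. */
    while (j < n && acc < hi_target) {
        acc += col_weight(n, j);
        ++j;
    }
    *end = j;
}
```

For n = 8 and two threads, this yields columns [0, 3) with 21 elements and [3, 8) with 15 elements. An even split of four columns each would give 26 and 10.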
Details: - Unlike default path, storage scheme of C is not always row-major in SUP. - Whenever C is col-major, the temporary buffer 'ct' is also chosen to be col-major. - Since update routines only support row-major order, a transpose is induced for c and ct buffers before passing them to update routine. Change-Id: I3fea10860f39632df7540c9399786e7aa1cfba37
Details: - If there are any zero rows or columns along the edges of MCxNC block of C, shrink the dimensions to avoid "no-op" iterations. - For lower-triangle kernel variant, added a flag to determine if a block that is strictly below triangle is reached. Once such block is reached, the flag is set and all the blocks that are below it are strictly below the diagonal and flag is used to make decision. - For upper-triangle kernel-variant, whenever a block that is strictly below the triangle is reached, break the for loop and go for next iteration of JR loop because all the blocks below it will also be strictly below diagonal and are filled with zeroes which requires no computation. Change-Id: I606b0f900509aab6ed7ff30cefee9d7207b7b010
The testsuite covers all combinations of upper, lower, transpose and API formats. AMD Internal: [CPUPL-1021] Change-Id: I2a1d79eba1dcaf4217fd9c2c346bd6173b80a782
Details: - Problem: If row major, the first four elements of the last column of output matrix C were not updated. If col major, the first four elements of the last row of output matrix C were not updated. - Solution: Updating elements after computation is done on right offset in bli_dgemmsup_rv_haswell_asm_5x8() Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae
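The early exit described for the upper-triangle variant can be sketched as below. This is a simplified illustration, not the gemmtsup variant code. The function names are hypothetical, and the two loops only loosely stand in for the JR loop and the loop nested inside it.

```c
#include <stdbool.h>
#include <stddef.h>

/* A block with first row i0 and columns [j0, j0 + nr) lies strictly below
   the diagonal when its smallest row index exceeds its largest column
   index. Requires nr >= 1. */
bool block_strictly_below_diag(size_t i0, size_t j0, size_t nr)
{
    return i0 > j0 + nr - 1;
}

/* Counts the mr-by-nr blocks of an n-by-n matrix that an upper-triangle
   update would have to visit. Once a block strictly below the diagonal is
   found, every block beneath it in the same column panel is too, so the
   inner loop stops and the outer loop moves on. */
size_t count_upper_blocks(size_t n, size_t mr, size_t nr)
{
    size_t visited = 0;

    for (size_t j0 = 0; j0 < n; j0 += nr) {
        for (size_t i0 = 0; i0 < n; i0 += mr) {
            if (block_strictly_below_diag(i0, j0, nr))
                break;
            ++visited;
        }
    }
    return visited;
}
```

With n = 4 and 2x2 blocks, three of the four blocks are visited, because the bottom-left block lies strictly below the diagonal.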
…ixed" This reverts commit 725bf5a. Reason for revert: <INSERT REASONING HERE> Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8
GEMMT changes porting on to Windows AMD Internal : [CPUPL-1061] Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a
Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used since I have not found time to review and vet them. - Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. 
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.
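For readers unfamiliar with gemmt, the following naive reference shows what the operation computes: C := beta*C + alpha*A*B, restricted to one triangle of a square C. It is a plain triple loop for illustration, not the BLIS implementation (which reuses the herk variant logic). The names tri_t and gemmt_ref(), the column-major layout, and the absence of transposition parameters are all simplifying assumptions.

```c
#include <stddef.h>

/* Hypothetical selector for which triangle of C is updated. */
typedef enum { TRI_LOWER, TRI_UPPER } tri_t;

/* Naive gemmt on column-major operands: A is n-by-k, B is k-by-n, and C is
   n-by-n. Only elements in the selected triangle (diagonal included) are
   read or written. The other triangle is left untouched. */
void gemmt_ref(tri_t uplo, size_t n, size_t k, double alpha,
               const double *a, size_t lda,
               const double *b, size_t ldb,
               double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        size_t i_begin = (uplo == TRI_LOWER) ? j : 0;
        size_t i_end   = (uplo == TRI_LOWER) ? n : j + 1;

        for (size_t i = i_begin; i < i_end; ++i) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i + p * lda] * b[p + j * ldb];

            /* With beta == 0, C is overwritten rather than scaled, so any
               stale contents of C are never read. */
            double scaled = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
            c[i + j * ldc] = scaled + alpha * ab;
        }
    }
}
```

With A = [1; 2], B = [3 4], alpha = 1, beta = 0, and the lower triangle selected, the result is C(0,0) = 3, C(1,0) = 6, C(1,1) = 8, while C(0,1) keeps its previous value.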
Details: - Removed 'zen' and 'zen2' subconfigurations from the 'amd64' umbrella configuration defined in the configuration registry. I'm not sure why this was a problem to begin with, but the SDE test on Travis CI is now failing and this may be a way around it.
Details: - Removed a few flags that slipped into the recent merge of #448 which *may* be causing breakage. This commit moves amd_config.mk back to the state it is in, more or less, in the 'master' branch.
Details: - Disabled registration of copyv, setv, and swapv kernels in the 'zen' subconfiguration as part of ongoing debug efforts via Travis CI and AppVeyor.
Details: - Added 'zen' and 'zen2' subconfigs back to the 'amd64' family after realizing their absence had (likely) caused a problem with the kernel-to-config map wherein level-1v zen kernels were being compiled with the 'skx' subconfig's compiler flags.
Details: - Disabled testing of gemmt in the testsuite in an attempt to debug a Travis CI failure for the 'cortexa15' subconfig.
Details: - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This is probably not the source of the cortexa15 bug showing up in Travis CI, but it's worth fixing now while I'm looking at it.
Details: - Re-enabled gemmt testing in the testsuite after establishing that forgoing the gemmt tests circumvents the cortexa15 failures in Travis CI.
Details: - Disabled registration of the bli_sgemm_armv7a_int_4x4() kernel and sgemm blocksizes in bli_cntx_init_cortexa15.c. This is part of my continued attempt to isolate the cause of the cortexa15 failures in Travis CI.
Details: - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This very likely contributed to the sudden Travis CI test failures for the cortexa15 subconfig when running the gemmt test module. It turns out that the gemmt module verifies its computation using gemm with beta set to zero, which, on a cortexa15 system, caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. If C contained non-numeric values such as NaN, this would have resulted in a false failure. Thus, the new code in 33f75df was not the cause, per se, of the breakage. Rather, it simply introduced a test that exposed breakage that had been introduced long ago. - Re-registered the troublesome sgemm ukernel mentioned above in bli_cntx_init_cortexa15.c.
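The failure mode described above comes down to IEEE arithmetic: NaN multiplied by zero is still NaN, so a kernel that always computes beta*C propagates garbage from an uninitialized C even when beta is zero. The sketch below contrasts the two behaviors on a flat array. It is not the armv7a microkernel, and both function names are hypothetical.

```c
#include <stddef.h>

/* Unconditional update: c := beta*c + alpha*ab. If c holds NaN (e.g. from
   uninitialized memory), the result is NaN even when beta == 0. */
void update_c_naive(size_t len, float alpha, const float *ab,
                    float beta, float *c)
{
    for (size_t i = 0; i < len; ++i)
        c[i] = beta * c[i] + alpha * ab[i];
}

/* Update with explicit beta == 0 handling: in that case c is overwritten
   and its previous contents are never read. */
void update_c(size_t len, float alpha, const float *ab,
              float beta, float *c)
{
    if (beta == 0.0f) {
        for (size_t i = 0; i < len; ++i)
            c[i] = alpha * ab[i];
    } else {
        for (size_t i = 0; i < len; ++i)
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```

Given a C entry that starts out as NaN, update_c_naive() with beta = 0 leaves it NaN, while update_c() produces alpha*ab as expected.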
Details: - Reverted temporary disabling of copyv, setv, and swapv kernels in the 'zen' subconfig in 7116c4c.
Details: - Re-enabling these kernels for the 'zen' subconfig seems to have pissed off a clang build in AppVeyor. Le sigh.
Due to a minor conflict in my attempted squash-merge of 'amd' into 'pr', I was unable to squash. Reattempting.