Parallelization of the PC loop #537
Comments
Since the main thread takes the "real" C and original …
@hominhquan by broadcast I just mean pointer broadcast. Only the main thread includes the beta*C part.
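In case it helps to see the shape of that, here is a minimal OpenMP sketch (the function name, arguments, and the `calloc` workspace are made up for illustration; the real code would use BLIS's thread communicators and memory pool):

```c
/* Minimal OpenMP sketch of the "pointer broadcast" (made-up names; BLIS has
   its own thread communicators and memory pool). The chief of each thread
   group picks the buffer -- the "real" C for the main group, temporary
   workspace otherwise -- and broadcasts that pointer to the rest of its
   group. Only the main group keeps beta, so beta*C is applied exactly once. */
#include <omp.h>
#include <stdlib.h>

double* pc_group_c( double* C, int m, int n,
                    int group_id, int tid_in_group,
                    double beta, double* beta_out,
                    double** bcast_slot )    /* one shared slot per group */
{
    if ( tid_in_group == 0 )                 /* group chief */
        *bcast_slot = ( group_id == 0 )
                      ? C                                            /* real C */
                      : calloc( ( size_t )m * n, sizeof( double ) ); /* temp C */

    /* A full-team barrier stands in for a per-group broadcast/barrier here. */
    #pragma omp barrier

    *beta_out = ( group_id == 0 ) ? beta : 0.0;
    return *bcast_slot;                      /* every thread in the group uses
                                                the same C buffer */
}
```

Here `bcast_slot` is just a stand-in for whatever per-group shared storage the communicator provides; the point is only that the broadcast is of a pointer, and that `beta` survives only in the main group.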
@fgvanzee I welcome your comments.
Sure, though I'm going to need some time to think about this and how it compares and contrasts with my own ideas.
@devinamatthews Can you explain the thinking behind this?
I'm reconsidering this now, but my thinking was that a) you only want to parallelize over k if n is smallish anyway, and b) you only want to check out a block for C and do a reduction once. But, on second thought, I don't think there's really a downside to b), and maybe you do still want to parallelize over k for some bizarre situation like m=20, n=5000, k=10000.
I think if @rvdg were here, he would likely say that this isn't bizarre at all. We want to have maximum flexibility in the dimensions along which we parallelize, and oftentimes that means parallelizing two dimensions simultaneously, especially when you have large numbers of cores. And if only one of m and n is large (and k is also large), that means (potentially) parallelizing k.
Agreed. Drop that requirement.
I think I know how to solve this for generalized problems. It builds upon the idea of carouseling, which @tlrmchlsmth developed and prototyped years ago. The key feature of that solution is that it avoids using large workspace, but it will also require …

(Historical aside: I found Tyler's original implementation to be a bit rough around the edges. I also did not fully understand it at the time, and those two things contributed to it going unmerged. But the idea, which was delightfully simple, stuck in my head.)

So, I think it's time to reimplement Tyler's carouseling. I'll likely start off in the …
@fgvanzee let's talk about this; IIRC carouseling doesn't take full advantage of parallelism in both dimensions.
I’d like to be part of that discussion
Tuesday then.
The thing about carouseling is that it doesn't actually increase the number of FMA instructions that you can execute in parallel. Instead, the advantage comes from not messing up your block sizes. So instead of just parallelizing along the …

Anyway, I think there is probably enough interesting stuff here to explore the tradeoffs between the two approaches and situations where each does better than the other!
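To make the rotation part concrete, here is a toy sketch of the schedule as I read the carousel idea (illustration only; it just prints who touches which panel at each step, and may not match Tyler's prototype exactly): each of the `nt_pc` groups keeps its own k block, and at step `s` group `g` updates C panel `(g + s) % nt_pc`, so no two groups ever write the same part of C at the same time and no temporary C is needed.

```c
/* Toy illustration of the carousel schedule only -- no real math is done. */
#include <stdio.h>

int main( void )
{
    const int nt_pc = 4;                  /* hypothetical number of groups along pc */

    for ( int s = 0; s < nt_pc; s++ )     /* carousel steps */
    {
        for ( int g = 0; g < nt_pc; g++ ) /* what each group does during step s */
            printf( "step %d: group %d updates C panel %d using its k block %d\n",
                    s, g, ( g + s ) % nt_pc, g );
        /* (a real implementation would put a barrier between steps) */
    }
    return 0;
}
```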
Many applications (e.g. in machine learning) end up with matrix products that have small(ish) `m` and `n` dimensions, but large `k`. For these cases, one would need to parallelize the `pc` loop to take advantage of threading in an efficient way. An example application is #536. The downside of parallelizing the `pc` loop is that extra workspace is required to store the partial C matrices, and then these matrices have to be reduced (in parallel probably).

To support parallelizing this loop, I propose the following plan (a sketch of steps 2 and 3 follows the list):

1. Choose a maximum size `max_size_c` of the temporary C matrices that we are willing to store. This could be a constant (e.g. 8MiB) or it could depend on the cache blocking parameters in some way (e.g. not larger than the sum of the A block and B panel sizes).
2. Determine the largest `nt_pc` (which is a factor of `nt`?) for which `(nt_pc-1)*m*n <= max_size_c`, and then assign the remaining ways of parallelism based on `nt -> nt/nt_pc`. Alternatively, the current algorithm could be extended to handle `nt_ic`, `nt_jc`, and `nt_pc` simultaneously and also obey the `(nt_pc-1)*m*n <= max_size_c` restriction. If the user sets `BLIS_PC_NT` then that can be used as-is (without obeying the size restriction?).
3. In `bli_gemm_blk_var3.c`: each thread group (along the `pc` direction), except the "main" thread group, should check out a block from the existing pool for C, which is of size `m*n`. The main thread takes the "real" C and original `beta` as-is; all other threads get a chunk of the C buffer and set `beta = 0`. Then proceed with the `pc` loop as in `var1` and `var2`. Finally, reduce the partial C matrices: parallelize over `m` and `n`, and use e.g. `axpym`.
4. If the temporary `C` should be at most of size `m_c` along the `m` dimension, then the reduction step must happen in `bli_gemm_blk_var1.c`.
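To make steps 2 and 3 above concrete, here is a rough standalone sketch in plain C/OpenMP. It is only an illustration of the shape of the proposal, not the actual `bli_gemm_blk_var3.c` logic: `choose_nt_pc` and `gemm_pc_par` are invented names, it assumes double precision and column-major storage, it uses `calloc` instead of checking blocks out of a pool, and the final reduction is serialized rather than parallelized over `m` and `n` with `axpym`.

```c
/* Illustrative sketch only (invented names, not BLIS internals). */
#include <omp.h>
#include <stdlib.h>

/* Largest factor of nt whose extra C workspace fits within max_size_c bytes. */
static int choose_nt_pc( int nt, int m, int n, size_t max_size_c )
{
    for ( int cand = nt; cand > 1; cand-- )
        if ( nt % cand == 0 &&
             ( size_t )( cand - 1 ) * m * n * sizeof( double ) <= max_size_c )
            return cand;
    return 1;
}

/* C = alpha*A*B + beta*C with the k ("pc") dimension split across nt_pc threads. */
static void gemm_pc_par( int m, int n, int k, double alpha,
                         const double* A, int lda,
                         const double* B, int ldb,
                         double beta, double* C, int ldc, int nt_pc )
{
    #pragma omp parallel num_threads( nt_pc )
    {
        int tid = omp_get_thread_num();
        int k0  = ( int )( ( long long )k *   tid       / nt_pc );  /* my k range */
        int k1  = ( int )( ( long long )k * ( tid + 1 ) / nt_pc );

        /* The main thread keeps the real C and beta; everyone else gets
           workspace (here: calloc) and uses beta = 0. */
        double* C_use    = C;
        double  beta_use = beta;
        int     ldc_use  = ldc;
        if ( tid != 0 )
        {
            C_use    = calloc( ( size_t )m * n, sizeof( double ) );
            beta_use = 0.0;
            ldc_use  = m;
        }

        /* Partial product over this thread's slice of k (stand-in for the pc loop). */
        for ( int j = 0; j < n; j++ )
            for ( int i = 0; i < m; i++ )
            {
                double cij = beta_use * C_use[ i + j * ldc_use ];
                for ( int p = k0; p < k1; p++ )
                    cij += alpha * A[ i + p * lda ] * B[ p + j * ldb ];
                C_use[ i + j * ldc_use ] = cij;
            }

        #pragma omp barrier   /* all partial C matrices are ready */

        /* Reduction into the real C (axpym-like). Serialized here for brevity;
           the plan above would parallelize this over m and n. */
        if ( tid != 0 )
        {
            #pragma omp critical
            {
                for ( int j = 0; j < n; j++ )
                    for ( int i = 0; i < m; i++ )
                        C[ i + j * ldc ] += C_use[ i + j * ldc_use ];
            }
            free( C_use );
        }
    }
}
```

For a shape like m=20, n=5000, k=10000 on 8 cores, one would call something like `gemm_pc_par( m, n, k, 1.0, A, m, B, k, 1.0, C, m, choose_nt_pc( 8, m, n, 8u << 20 ) )`; with an 8MiB cap, `choose_nt_pc` returns 8 here since 7*20*5000*8 bytes is about 5.3MiB. The control flow (main thread keeps C and `beta`, everyone else accumulates into workspace with `beta = 0`, then a sum at the end) is the part that mirrors the plan.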