Better kernels for row sum and tr(AB^T) #3954
Conversation
Oh... was going to ask someone from NVidia to review this, but I see you are from NVidia. Will check it over for my own edification.
// Reduce a matrix 'data' to a row vector 'dots'
template <EnumTransformReduce TransReduceType, typename Real, int unroll_count>
__global__ void _strided_reduction_fused_kernel(Real * __restrict__ dots, const Real * __restrict__ data,
                                                void * __restrict__ scratch, const MatrixDim d,
Why is scratch void* and not Real*?
scratch is a workspace that is split into two portions: one to store the intermediate results of type Real, and the other to store unsigned int flags that guarantee ordering of operations across CUDA blocks. Since all of this is internal to the kernel, I thought it made more sense to expose it as a single void * argument rather than have two separate Real * and unsigned int * arguments.
#pragma unroll
for (int u = 0; u < unroll_count; ++u) {
  thread_data += A_vals[u] * B_vals[u];
I'm curious why you made this a separate loop. Why is this faster than just doing, in the loop, thread_data += A[idx_A] * B[idx_B]?
Assuming no compiler optimizations, having only one loop with thread_data += A[idx_A] * B[idx_B] would mean sequential dependencies between the load and the compute instructions. Having them in two separate loops means you have multiple independent loads into separate registers, followed by multiple independent compute ops that depend only on the corresponding load op, thus allowing for better instruction-level parallelism. Of course, compilers are much smarter and can usually do this automatically, especially with #pragma unroll, but I typically like to structure the code so it helps the compiler and also makes the performance reasoning for unrolling this loop easier to understand.
Thanks. If you are assuming that sizeof(unsigned int) <= sizeof(Real), it might be good to add a compile-time assertion; see KALDI_COMPILE_TIME_ASSERT. Better than a segfault.
In src/cudamatrix/cu-kernels.cu:
> + int j = rowStart;
+ for (; j < Nmod; j += unroll_stride) {
+ Real A_vals[unroll_count];
+ Real B_vals[unroll_count];
+
+ #pragma unroll
+ for (int u = 0; u < unroll_count; ++u) {
+ int idx_A = colStart + (j + u*stride) * dA.stride;
+ int idx_B = colStart + (j + u*stride) * B_stride;
+ A_vals[u] = A[idx_A];
+ B_vals[u] = B[idx_B];
+ }
+
+ #pragma unroll
+ for (int u = 0; u < unroll_count; ++u) {
+ thread_data += A_vals[u] * B_vals[u];
@akshaysubr I had to revert this in #3964 because it caused a test failure! I suspect a bug.
I'm not sure that issue stems from a bug in this code. I've noticed this failure as well, and it always happens when the true value is exactly 0. In such cases, the test will always fail, as in #3949. The thread linked above shows the same: the absolute error is at the machine-precision level, which I suspect comes from the fact that the new kernels use atomics, so you don't have guaranteed ordering of summations. One potential solution is to add 1 to both values in the test, to avoid taking the relative difference of a 0 value.
OK, I had a careful look at the code. It looks to me like the two things we're multiplying together have the property that the elements of the sum are always zero, so the elements of measured_objf_change should all reduce to the form 0 + 0 + ... + 0. Does the test failure show up for you, and is it possible for you to let it hit it, and then print out the matrices (or their elementwise product) and see if there are any nonzero elements in those sums? If there are no nonzero elements, I think it would be a bug in the new TraceMatMat code.
I took a look at the test code. You are right that the test failures happen when the predicted_objf_change variable ends up being 0 + 0 + ... + 0. However, the measured_objf_change is not a sum of zeros but instead a difference between two nominally equal non-zero values:
objf_change = new_objf - baseline_objf;
measured_objf_change(i) = objf_change;
Since new_objf and baseline_objf come from two separate TraceMatMat calls, they can differ at the machine-epsilon level, and hence the test fails. I was able to fix this in two ways:
1. Add 1.0 to both predicted_objf_change and measured_objf_change before comparing them, since this avoids the 0 issue with ApproxEqual, and
2. Restructure objf_change to also be a sum of zeros, as:
CuMatrix<BaseFloat> delta_output(perturbed_output);
delta_output.AddMat(-1.0, output);
objf_change = TraceMatMat(delta_output, output_deriv, kTrans);
My concern with this approach is that it is still relying on exact floating-point arithmetic in ComputeLstmNonlinearity and might be deferring the issue to a future date.
Oh, I see. Yeah, the 2nd option is what I want.
The problem with adding 1.0 is that those values are normally of the order 10^-6, so you'd have to set the delta very tiny to get failures when needed. The test uses a statistical method to ensure exact-zero comparisons never happen; that's why it compares a vector of 5 (independent) elements. The exact-zero thing is a special case for the LSTM test.
Dan
Added fused kernel _strided_reduction_fused_kernel to compute sums of matrix columns, plus kernels _trace_mat_mat_trans_atomic to compute the trace of a product of matrices and _frobenius_norm_atomic for when the two matrices are the same. I've compared the performance and accuracy of the new kernels against the old ones for many different problem sizes and for the swbd/s5c example. They are just as accurate (apart from some floating-point round-off errors arising from using atomics) and are faster in all cases. The matrix column reduction kernel is 3-5x faster and speeds up the training pipeline by 30%. The tr(AB^T) kernel is ~25% faster in general and ~2x faster when A == B, which is the case that matters most in the training pipeline.