
Better kernels for row sum and tr(AB^T) #3954

Merged Feb 26, 2020 (22 commits)

Conversation

@akshaysubr (Contributor):
Added a fused kernel _strided_reduction_fused_kernel to compute sums of matrix columns, and kernels to compute the trace of a product of matrices: _trace_mat_mat_trans_atomic for the general case and _frobenius_norm_atomic for when the two matrices are the same.

I've compared the performance and accuracy of the new kernels against the old ones for many different problem sizes and for the swbd/s5c example. They are just as accurate (apart from some floating-point round-off differences arising from the use of atomics) and faster in all cases. The matrix column reduction kernel is 3-5x faster and speeds up the training pipeline by 30%. The tr(AB^T) kernel is ~25% faster in general and ~2x faster when A == B, which is the case that matters most in the training pipeline.

akshaysubr and others added 22 commits November 7, 2019 11:17
@danpovey (Contributor):
Oh... was going to ask someone from NVidia to review this, but I see you are from NVidia. Will check it over for my own edification.

// Reduce a matrix 'data' to a row vector 'dots'
template <EnumTransformReduce TransReduceType, typename Real, int unroll_count>
__global__ void _strided_reduction_fused_kernel(Real * __restrict__ dots, const Real * __restrict__ data,
void * __restrict__ scratch, const MatrixDim d,
@danpovey (Contributor), Feb 25, 2020:

Why is scratch void* and not Real*?

@akshaysubr (Contributor, Author):

scratch is a workspace that is split into two portions: one stores the intermediate results of type Real, and the other stores unsigned int flags that guarantee ordering of operations across CUDA blocks. Since all of this is internal to the kernel, I thought it made more sense to expose it as a single void * argument rather than have two separate Real * and unsigned int * arguments.


#pragma unroll
for (int u = 0; u < unroll_count; ++u) {
thread_data += A_vals[u] * B_vals[u];
@danpovey (Contributor), Feb 25, 2020:

I'm curious why you made this a separate loop. Why is this faster than just doing, in the loop, thread_data += A[idx_A] * B[idx_B]?

@akshaysubr (Contributor, Author):

Assuming no compiler optimizations, a single loop with thread_data += A[idx_A] * B[idx_B] would create sequential dependencies between the load and the compute instructions. Splitting them into two loops yields multiple independent loads into separate registers, followed by multiple independent compute ops that each depend only on their corresponding load, allowing better instruction-level parallelism. Of course, compilers are much smarter and can usually do this automatically, especially with #pragma unroll, but I typically like to structure the code so it helps the compiler and also makes the performance reasoning for unrolling this loop easier to understand.

@danpovey (Contributor) commented Feb 26, 2020 via email

@danpovey danpovey merged commit 673c6fc into kaldi-asr:master Feb 26, 2020
danpovey added a commit that referenced this pull request Feb 29, 2020
@danpovey (Contributor) commented Mar 1, 2020:

@akshaysubr I had to revert this in #3964 because it caused a test failure! I suspect a bug.

@danpovey (Contributor) commented Mar 1, 2020:

@akshaysubr (Contributor, Author):

I'm not sure that issue stems from a bug in this code. I've noticed this failure as well, and it always happens when the true value is exactly 0. In such cases the test will always fail, as in #3949. The thread linked above shows the same:
LOG ([5.5.655~1-91ff6]:UnitTestLstmNonlinearity():cu-math-test.cc:305) ... predicted_objf_change= [ 0 0 0 0 0 ] , measured_objf_change= [ -2.86102e-06 9.53674e-07 9.53674e-07 -1.90735e-06 9.53674e-07 ]

The absolute error is at machine-precision level, which I suspect comes from the fact that the new kernels use atomics, so there is no guaranteed ordering of the summations. One potential solution is to add 1 to both values in the test, so the relative difference is not taken against a true value of 0.

@danpovey (Contributor) commented Mar 1, 2020:

OK, I had a careful look at the code.

It looks to me like the two things we're multiplying together have the property that the elements of the sum are always zero, so the elements of measured_objf_change should all reduce to the form 0 + 0 + 0 + ... + 0.

Does the test failure show up for you, and is it possible for you to let it hit it, and then print out the matrices (or their elementwise product) and see if there are any nonzero elements in those sums? If there are no nonzero elements, I think it would be a bug in the new TraceMatMat code.

@akshaysubr (Contributor, Author) commented Mar 2, 2020:

I took a look at the test code. You are right that the test failures happen when the predicted_objf_change variable ends up being 0 + 0 + 0 + ... + 0. However, measured_objf_change is not a sum of zeros but instead, a difference between two nominally equal non-zero values:

objf_change = new_objf - baseline_objf;
measured_objf_change(i) = objf_change;

Since new_objf and baseline_objf come from two separate TraceMatMat calls, they can differ at the machine-epsilon level, and hence the test fails. I was able to fix this in two ways:

  1. Adding 1.0 to both predicted_objf_change and measured_objf_change before comparing them, since this avoids the 0 issue with ApproxEqual; and
  2. Restructuring objf_change to also be a sum of zeros, as:
CuMatrix<BaseFloat> delta_output(perturbed_output);
delta_output.AddMat(-1.0, output);
objf_change = TraceMatMat(delta_output, output_deriv, kTrans);

My concern with this approach is that it still relies on exact floating-point arithmetic in ComputeLstmNonlinearity and might just be deferring the issue to a future date.

@danpovey (Contributor) commented Mar 3, 2020 via email
