Better kernels for row sum and tr(AB^T) #3954
Conversation
Oh... was going to ask someone from NVidia to review this, but I see you are from NVidia. Will check it over for my own edification.
// Reduce a matrix 'data' to a row vector 'dots'
template <EnumTransformReduce TransReduceType, typename Real, int unroll_count>
__global__ void _strided_reduction_fused_kernel(Real * __restrict__ dots, const Real * __restrict__ data,
                                                void * __restrict__ scratch, const MatrixDim d,
Why is scratch void* and not Real*?
scratch is a workspace that is split into two portions: one to store the intermediate results of type Real, and the other to store unsigned int flags that guarantee ordering of operations across CUDA blocks. Since all of this is internal to the kernel, I thought it made more sense to expose it as a single void * argument rather than have two separate Real * and unsigned int * arguments.
#pragma unroll
for (int u = 0; u < unroll_count; ++u) {
  thread_data += A_vals[u] * B_vals[u];
I'm curious why you made this a separate loop. Why is this faster than just doing, in the loop, thread_data += A[idx_A] * B[idx_B]?
Assuming no compiler optimizations, having only one loop with thread_data += A[idx_A] * B[idx_B] would mean sequential dependencies between the load and the compute instructions. Having them in two separate loops means you have multiple independent loads into separate registers, followed by multiple independent compute ops that depend only on the corresponding load op, thus allowing for better instruction-level parallelism. Of course, compilers are much smarter and can usually do this automatically, especially with #pragma unroll, but I typically like to structure the code so it helps the compiler and also makes the performance reasoning for unrolling this loop easier to understand.
Thanks. If you are assuming that sizeof(unsigned int) <= sizeof(Real), it might be good to add a compile-time assertion; see KALDI_COMPILE_TIME_ASSERT. Better than a segfault.
In src/cudamatrix/cu-kernels.cu:
> + int j = rowStart;
+ for (; j < Nmod; j += unroll_stride) {
+ Real A_vals[unroll_count];
+ Real B_vals[unroll_count];
+
+ #pragma unroll
+ for (int u = 0; u < unroll_count; ++u) {
+ int idx_A = colStart + (j + u*stride) * dA.stride;
+ int idx_B = colStart + (j + u*stride) * B_stride;
+ A_vals[u] = A[idx_A];
+ B_vals[u] = B[idx_B];
+ }
+
+ #pragma unroll
+ for (int u = 0; u < unroll_count; ++u) {
+ thread_data += A_vals[u] * B_vals[u];
@akshaysubr I had to revert this in #3964 because it caused a test failure! I suspect a bug.
I'm not sure that issue stems from a bug in this code. I've noticed this failure as well, and it always happens when the true value is exactly 0. In such cases, the test will always fail, as in #3949. The thread linked above shows the same: the absolute error is at the machine-precision level, which I suspect comes from the fact that the new kernels use atomics, so you don't have guaranteed ordering of summations. One potential solution is to add 1 to both values in the test, to avoid taking the relative difference of a 0 value.
OK, I had a careful look at the code. It looks to me like the two things we're multiplying together have the property that the elements of the sum are always zero, so the elements of measured_objf_change should all reduce to the form 0 + 0 + ... + 0. Does the test failure show up for you, and is it possible for you to let it hit it, and then print out the matrices (or their elementwise product) and see if there are any nonzero elements in those sums? If there are no nonzero elements, I think it would be a bug in the new TraceMatMat code.
I took a look at the test code. You are right that the test failures happen when the predicted_objf_change variable ends up being 0 + 0 + ... + 0. However, the measured_objf_change is not a sum of zeros but instead a difference between two nominally equal non-zero values:
objf_change = new_objf - baseline_objf;
measured_objf_change(i) = objf_change;
Since new_objf and baseline_objf come from two separate TraceMatMat calls, they can differ at the machine-epsilon level, and hence the test fails. I was able to fix this in two ways:
1. Add 1.0 to both predicted_objf_change and measured_objf_change before comparing them, since this avoids the 0 issue with ApproxEqual, and
2. Restructure objf_change to also be a sum of zeros, as:
CuMatrix<BaseFloat> delta_output(perturbed_output);
delta_output.AddMat(-1.0, output);
objf_change = TraceMatMat(delta_output, output_deriv, kTrans);
My concern with this approach is that it is still relying on exact floating-point arithmetic in ComputeLstmNonlinearity and might be deferring the issue to a future date.
Oh, I see. Yeah, the 2nd option is what I want.
The problem with adding 1.0 is that those values are normally of the order 10^-6, so you'd have to set the delta very tiny to get failures when needed. The test uses a statistical method to ensure exact-zero comparisons never happen; that's why it compares a vector of 5 (independent) elements. The exact-zero thing is a special case for the LSTM test.
Dan
Added fused kernel _strided_reduction_fused_kernel to compute sums of matrix columns, plus kernels _trace_mat_mat_trans_atomic to compute the trace of a product of matrices and _frobenius_norm_atomic for when the two matrices are the same. I've compared the performance and accuracy of the new kernels against the old ones for many different problem sizes and for the swbd/s5c example. They are just as accurate (apart from some floating-point round-off errors arising from using atomics) and are faster in all cases. The matrix column reduction kernel is 3-5x faster and speeds up the training pipeline by 30%. The tr(AB^T) kernel is ~25% faster in general and ~2x faster when A == B, which is the case that matters most in the training pipeline.