
Optimized sequence reverse operator #6946

Merged: 1 commit merged into apache:master on Jul 17, 2017

Conversation

@mkolod (Contributor) commented Jul 7, 2017

@piiswrong @mli @szha This implementation provides some degree of parallelism on the GPU. The version it replaces would do 1 kernel launch per DType instance (int, float) to be copied, which would make perf really bad on the GPU (kernel launch latency and no parallelism). In Sockeye, this would take up 20% of the timeline. With this change, the time consumed went down to less than 0.5%. I also added tests, since this operator didn't have either CPU or GPU tests.

@piiswrong (Contributor)

Would it be better to remove the use of IndexTensorToVector? It's copying GPU memory to the CPU.

I think you can read the indices directly in ReverseKernel.

@fhieber (Contributor) commented Jul 7, 2017

The test should also include a case where a vector of indices is passed to SequenceReverse:
mx.sym.SequenceReverse(data=data, sequence_length=indices, use_sequence_length=True)

This is an important use case for batches of sentences with variable sequence length.

To me it's not clear whether the updated code still contains this functionality.

    template <typename xpu, typename DType>
    class SequenceReverseOp : public Operator {
     public:
      explicit SequenceReverseOp(SequenceReverseParam p) { this->param_ = p; }
      void sequence_reverse(const mshadow::Tensor<xpu, 3, DType> data,
                            const mshadow::Tensor<xpu, 3, DType> &out,
    -                       std::vector<index_t> indices, OpReqType req) {
    +                       std::vector<index_t> indices, OpReqType req,
Review comment (Contributor):
the indices don't seem to be used anymore in the updated code.

Reply from @mkolod (Contributor Author):
Yep, good catch, will fix that, thanks!

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Sounds good. And yes, currently the padding is also reversed, so "this is a test PAD" becomes "PAD test a is this". I assume you want to preserve the logic that would generate "test a is this PAD", correct? If so, I can certainly do it. Please confirm that this is the concern you are raising here. I can also add a test for that case.

@fhieber (Contributor) commented Jul 7, 2017

Hi Marek,

yes, if sequence_length=[3] and use_sequence_length=True, reversing [1 2 3 0] should yield [3 2 1 0] and not [0 3 2 1]

(assuming 0 is the padding symbol in this example)

This is important for properly concatenating states from a forward and reverse RNN, for example.
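
For concreteness, a minimal runnable sketch of the behavior fhieber is asking for, using the NDArray form of the operator; the exact call here is illustrative, not taken from the PR's tests:

```python
import mxnet as mx

# Time-major layout: (max_seq_len=4, batch_size=1, feature_dim=1).
data = mx.nd.array([1, 2, 3, 0]).reshape((4, 1, 1))
lengths = mx.nd.array([3])  # only the first 3 time steps are real; the trailing 0 is padding

out = mx.nd.SequenceReverse(data, sequence_length=lengths, use_sequence_length=True)
print(out.asnumpy().ravel())  # expected: [3. 2. 1. 0.], i.e. the padding stays in place
```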

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Sounds good, I will make that change, and add tests to verify the correct padding behavior.

@fhieber (Contributor) commented Jul 7, 2017

@mkolod great, thanks a ton for working on this, btw!
On a slightly related note, would it be hard to have the operator support specifying an axis to reverse (similar to mx.sym.reverse but with default_axis=0)? Currently, all Sequence* operators require time-major input, which is a bit limiting.

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Good question. I'm pretty sure that it would be possible. On the GPU, the perf always depends on the memory access pattern, with 128-byte contiguous accesses being preferred. But yes, an axis change should be doable. That would require another rewrite compared to the current PR, so it would probably be best left for a follow-up PR if we want to quickly optimize at least the time-major case. As far as I remember, there are some things that are only implemented completely for either time-major or batch-major in MxNet anyway, one example being batch-dot. This is especially true for sequence models (TNC vs. NTC); the axis issue is probably less pronounced for images (NCHW vs. NHWC). But yes, I agree that supporting axis selection in as many operators as possible is a good next step.
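
As an aside, until such an axis parameter exists, a common workaround for batch-major (NTC) input is to transpose around the operator. A minimal sketch, with placeholder symbol names:

```python
import mxnet as mx

data_ntc = mx.sym.Variable('data')      # batch-major input: (batch, time, channels)
seq_len = mx.sym.Variable('seq_len')    # per-example lengths: (batch,)

data_tnc = mx.sym.transpose(data_ntc, axes=(1, 0, 2))   # NTC -> TNC (time-major)
rev_tnc = mx.sym.SequenceReverse(data=data_tnc,
                                 sequence_length=seq_len,
                                 use_sequence_length=True)
rev_ntc = mx.sym.transpose(rev_tnc, axes=(1, 0, 2))     # back to batch-major
```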

@mkolod (Contributor Author) commented Jul 12, 2017

@piiswrong I think I just addressed your concern about not using IndexTensorToVector to remove the unnecessary device-to-host memcpy. Please let me know if the new push that I just made satisfactorily addresses this issue.

@fhieber @tdomhan @KellenSunderland @sbodenstein I included tests for cases of both even and uneven padding (different padding amounts per example in the batch). The tests run on both CPU and GPU. I ran them locally on a Pascal GPU (P100) on Linux, hopefully the CI will pass everywhere else.

BTW, I used clang-format with the "Google" setting since it seems that "make lint" uses cpplint, which generally follows that layout. I also ran the linter locally prior to submission.
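
For readers not following the diff, here is a NumPy reference of the property such tests can check: each sequence is reversed only up to its own length, and the padding stays in place. This is a sketch of the expected semantics, not the actual test code from the PR:

```python
import numpy as np

def np_sequence_reverse(data, lengths):
    """Reference: reverse each sequence up to its own length; padding stays in place.
    data is time-major with shape (max_seq_len, batch_size, feature_dim)."""
    out = data.copy()
    for b, length in enumerate(lengths):
        out[:length, b, :] = data[:length, b, :][::-1]
    return out

# Uneven padding: a batch of two sequences with lengths 3 and 2, max length 4.
data = np.arange(4 * 2 * 1, dtype=np.float32).reshape(4, 2, 1)
print(np_sequence_reverse(data, [3, 2]).squeeze(-1))
```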

         is provided as an argument to the Map() call as data used
         to determine the offset.
      */
      mxnet_op::Kernel<ReverseKernel, xpu>::Launch(
Review comment (Contributor):
I would launch a single kernel with size batch_size*max_seq_len
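
For illustration only, a rough Python rendering of the indexing that a single flattened launch over batch_size * max_seq_len elements implies: one work item per (time, batch) cell, with reads mirrored inside each sequence's valid prefix. This is a sketch of the idea, not the actual ReverseKernel, and the names are hypothetical:

```python
def reverse_source_index(flat_idx, batch_size, lengths):
    # Time-major layout: flat_idx enumerates batch_size * max_seq_len output cells.
    seq_pos, batch = divmod(flat_idx, batch_size)
    length = lengths[batch]
    # Within the valid prefix, read from the mirrored position; padding copies through.
    src_pos = length - 1 - seq_pos if seq_pos < length else seq_pos
    return src_pos, batch
```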

Reply from @mkolod (Contributor Author):
@piiswrong OK, done, thanks for pointing it out!

@mkolod (Contributor Author) commented Jul 12, 2017

@piiswrong It seems that one step of the pipeline blew up due to a timeout. Is there a way you guys can restart this step manually?

Pull request #6946 updated
Connecting to https://api.github.com using lxn2/******
Loading trusted files from base branch master at 908b3c5047cf3bc21f9ef59cd8d4ccd255f06cf6 rather than 6da21815579b9cc9959146bdef4670f89ba42f97
Obtained Jenkinsfile from 908b3c5047cf3bc21f9ef59cd8d4ccd255f06cf6
[Pipeline] stage
[Pipeline] { (Sanity Check)
[Pipeline] timeout
Timeout set to expire in 1 hr 0 min
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
Waiting for next available executor on mxnetlinux
Cancelling nested steps due to timeout
[Pipeline] // node
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] End of Pipeline

GitHub has been notified of this commit’s build result

ERROR: Queue task was cancelled
Finished: FAILURE

@piiswrong merged commit a8804a1 into apache:master on Jul 17, 2017
@piiswrong (Contributor)

Thanks. Merged

@mkolod (Contributor Author) commented Jul 17, 2017

Thanks @piiswrong !!!

@fhieber (Contributor) commented Jul 28, 2017

@mkolod @piiswrong
I managed to run a quick test with Sockeye, which uses SequenceReverse in its bi-directional RNN encoders, on different versions of mxnet: 0.10.0, latest master, master@8144361, and master@a8804a1b7. The latter two correspond to mxnet just before this PR was merged and right after.

The table shows wall time, in seconds, after 2k batches and after 10k batches on a single Tesla K80 (AWS p2.xlarge):

| MXNet version | wall-time @ 2k batches (s) | wall-time @ 10k batches (s) |
|---|---|---|
| 0.10.0 | 1357.811677 | 7224.344484 |
| before PR (814436) | 1389.247020 | 7370.476751 |
| after PR (a8804a) | 1686.342626 | 8827.216907 |
| latest | 1667.805436 | 8783.761625 |

This is obviously not a detailed profile, but it's pretty clear that the revised SequenceReverse operator slows things down quite a bit. Any idea what's going on there?

@mkolod (Contributor Author) commented Jul 28, 2017

@fhieber @piiswrong I'll look into it today and will let you know what I find.

@mkolod (Contributor Author) commented Jul 28, 2017

@fhieber @piiswrong

I tried to reproduce the slowdown and was unable to. I compared the top of tree (TOT) to the TOT with the commit in question (a8804a) unapplied, on the following machine:

  • Core i7-5930K CPU @ 3.50GHz (6 cores)
  • 32 GB RAM
  • Tesla P100 (Pascal)

While I don't see a slowdown, the speed-up is now very minor after the various refactorings done during code review for this PR. Initially, the operator went from 20% of the execution time (before version 1 of this PR) down to 0.5%. Unfortunately, since padding must be left in place, which requires branching, and due to some other changes since the first commit, the gain is now 2% at best.

The Sockeye parameters I used were as follows (on the WMT15 DE-EN dataset):

python3 -m sockeye.train \
        --source data/train2.de \
        --target data/train2.en \
        --validation-source data/valid2.de \
        --validation-target data/valid2.en \
        --output models \
        --num-words 50000 \
        --rnn-num-layers 2 \
        --rnn-num-hidden 512 \
        --num-embed-source 512 \
        --num-embed-target 512 \
        --max-seq-len 50 \
        --batch-size 64 \
        --bucket-width 10 \
        --checkpoint-frequency 20000 \
        --max-num-checkpoint-not-improved 8 \
        --dropout 0.3 \
        --optimizer adam \
        --clip-gradient 1.0 \
        --learning-rate-scheduler-type plateau-reduce \
        --learning-rate-reduce-factor 0.5 \
        --learning-rate-reduce-num-not-improved 3 \
        --attention-type dot \
        --learning-rate-half-life 10 \
        --monitor-bleu 10000 \
        --seed 42 \
        --use-fused-rnn \
        --device-ids 1 \
        --max-updates 2000

| MXNet version | 1k batches | 2k batches |
|---|---|---|
| TOT | 225s | 420s |
| TOT with commit a8804a unapplied | 221s | 419s |

So I can still see a speed-up (on Pascal), but it's much smaller than with the original commit, due to the different design: a single MxNet kernel launch (though not a single CUDA launch), keeping the padding un-reversed as discussed earlier in this thread, and so on.

I wonder if there is a different way to improve perf further, e.g. via a raw kernel rather than the magic macros, or whether the current implementation is the best it can be from a performance perspective while using things such as MSHADOW_XINLINE. Maybe I'm missing some step that happens behind the scenes; since this isn't an actual CUDA kernel written the traditional way, I may have missed some opportunities for perf improvement via these wrappers. In any case, I can't reproduce the slowdown on my end.

@fhieber (Contributor) commented Jul 30, 2017

Thanks for diving deep into this once more! Could the differences between our runs be related to the type of GPU?
In any case, I will rerun your benchmark setup to get another data point.

Guneet-Dhillon pushed a commit to Guneet-Dhillon/mxnet that referenced this pull request Sep 13, 2017