
Optimized sequence reverse operator #6946

Merged: 1 commit merged into apache:master on Jul 17, 2017

Conversation

@mkolod (Contributor) commented Jul 7, 2017

@piiswrong @mli @szha This implementation provides some degree of parallelism on the GPU. The version it replaces would do 1 kernel launch per DType instance (int, float) to be copied, which would make perf really bad on the GPU (kernel launch latency and no parallelism). In Sockeye, this would take up 20% of the timeline. With this change, the time consumed went down to less than 0.5%. I also added tests, since this operator didn't have either CPU or GPU tests.

@piiswrong (Contributor)

Would it be better to remove the use of IndexTensorToVector? It's copying GPU memory to the CPU.

I think you can read the indices directly in ReverseKernel.

@fhieber (Contributor) commented Jul 7, 2017

The test should also include a case where a vector of indices is passed to SequenceReverse:
mx.sym.SequenceReverse(data=data, sequence_length=indices, use_sequence_length=True)

This is an important use case for batches of sentences with variable sequence length.

To me it's not clear whether the updated code still contains this functionality.

    template <typename xpu, typename DType>
    class SequenceReverseOp : public Operator {
     public:
      explicit SequenceReverseOp(SequenceReverseParam p) { this->param_ = p; }
      void sequence_reverse(const mshadow::Tensor<xpu, 3, DType> data,
                            const mshadow::Tensor<xpu, 3, DType> &out,
    -                       std::vector<index_t> indices, OpReqType req) {
    +                       std::vector<index_t> indices, OpReqType req,
Review comment (Contributor):
the indices don't seem to be used anymore in the updated code.

Reply from @mkolod (Contributor Author):
Yep, good catch, will fix that, thanks!

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Sounds good. And yes, currently the padding is also reversed, so "this is a test PAD" becomes "PAD test a is this". I assume you want to preserve the logic that would generate "test a is this PAD", correct? If so, I can certainly do it. Please confirm that this is the concern you are raising here. I can also add a test for that case.

@fhieber (Contributor) commented Jul 7, 2017

Hi Marek,

yes, if sequence_length=[3] and use_sequence_length=True, reversing [1 2 3 0] should yield [3 2 1 0] and not [0 3 2 1]

(assuming 0 is the padding symbol in this example)

This is important for properly concatenating states from a forward and reverse RNN, for example.
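
For concreteness, a minimal runnable sketch of the behavior fhieber is asking for, using the NDArray form of the operator; the exact call here is illustrative, not taken from the PR's tests:

```python
import mxnet as mx

# Time-major layout: (max_seq_len=4, batch_size=1, feature_dim=1).
data = mx.nd.array([1, 2, 3, 0]).reshape((4, 1, 1))
lengths = mx.nd.array([3])  # only the first 3 time steps are real; the trailing 0 is padding

out = mx.nd.SequenceReverse(data, sequence_length=lengths, use_sequence_length=True)
print(out.asnumpy().ravel())  # expected: [3. 2. 1. 0.], i.e. the padding stays in place
```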

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Sounds good, I will make that change, and add tests to verify the correct padding behavior.

@fhieber (Contributor) commented Jul 7, 2017

@mkolod great, thanks a ton for working on this, btw!
On a slightly related note, would it be hard to have the operator support specifying an axis to reverse (similar to mx.sym.reverse but with default_axis=0)? Currently, all Sequence* operators require time-major input, which is a bit limiting.

@mkolod (Contributor Author) commented Jul 7, 2017

@fhieber Good question. I'm pretty sure that it would be possible. On the GPU, the perf always depends on the memory access pattern, with 128-byte contiguous accesses being preferred. But yes, an axis change should be doable. That would require another rewrite compared to the current PR, so it would probably be best left for a follow-up PR if we want to quickly optimize at least the time-major case. As far as I remember, there are some things that are only implemented completely for either time-major or batch-major in MxNet anyway, one example being batch-dot. This is especially true for sequence models (TNC vs. NTC); the axis issue is probably less pronounced for images (NCHW vs. NHWC). But yes, I agree that supporting axis selection in as many operators as possible is a good next step.
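
As an aside, until such an axis parameter exists, a common workaround for batch-major (NTC) input is to transpose around the operator. A minimal sketch, with placeholder symbol names:

```python
import mxnet as mx

data_ntc = mx.sym.Variable('data')      # batch-major input: (batch, time, channels)
seq_len = mx.sym.Variable('seq_len')    # per-example lengths: (batch,)

data_tnc = mx.sym.transpose(data_ntc, axes=(1, 0, 2))   # NTC -> TNC (time-major)
rev_tnc = mx.sym.SequenceReverse(data=data_tnc,
                                 sequence_length=seq_len,
                                 use_sequence_length=True)
rev_ntc = mx.sym.transpose(rev_tnc, axes=(1, 0, 2))     # back to batch-major
```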

@mkolod (Contributor Author) commented Jul 12, 2017

@piiswrong I think I just addressed your concern about not using IndexTensorToVector to remove the unnecessary device-to-host memcpy. Please let me know if the new push that I just made satisfactorily addresses this issue.

@fhieber @tdomhan @KellenSunderland @sbodenstein I included tests for cases of both even and uneven padding (different padding amounts per example in the batch). The tests run on both CPU and GPU. I ran them locally on a Pascal GPU (P100) on Linux, hopefully the CI will pass everywhere else.

BTW, I used clang-format with the "Google" setting since it seems that "make lint" uses cpplint, which generally follows that layout. I also ran the linter locally prior to submission.
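
For readers not following the diff, here is a NumPy reference of the property such tests can check: each sequence is reversed only up to its own length, and the padding stays in place. This is a sketch of the expected semantics, not the actual test code from the PR:

```python
import numpy as np

def np_sequence_reverse(data, lengths):
    """Reference: reverse each sequence up to its own length; padding stays in place.
    data is time-major with shape (max_seq_len, batch_size, feature_dim)."""
    out = data.copy()
    for b, length in enumerate(lengths):
        out[:length, b, :] = data[:length, b, :][::-1]
    return out

# Uneven padding: a batch of two sequences with lengths 3 and 2, max length 4.
data = np.arange(4 * 2 * 1, dtype=np.float32).reshape(4, 2, 1)
print(np_sequence_reverse(data, [3, 2]).squeeze(-1))
```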

         is provided as an argument to the Map() call as data used
         to determine the offset.
      */
      mxnet_op::Kernel<ReverseKernel, xpu>::Launch(
Review comment (Contributor):
I would launch a single kernel with size batch_size*max_seq_len
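
For illustration only, a rough Python rendering of the indexing that a single flattened launch over batch_size * max_seq_len elements implies: one work item per (time, batch) cell, with reads mirrored inside each sequence's valid prefix. This is a sketch of the idea, not the actual ReverseKernel, and the names are hypothetical:

```python
def reverse_source_index(flat_idx, batch_size, lengths):
    # Time-major layout: flat_idx enumerates batch_size * max_seq_len output cells.
    seq_pos, batch = divmod(flat_idx, batch_size)
    length = lengths[batch]
    # Within the valid prefix, read from the mirrored position; padding copies through.
    src_pos = length - 1 - seq_pos if seq_pos < length else seq_pos
    return src_pos, batch
```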

Reply from @mkolod (Contributor Author):
@piiswrong OK, done, thanks for pointing it out!

@mkolod (Contributor Author) commented Jul 12, 2017

@piiswrong It seems that one step of the pipeline blew up due to a timeout. Is there a way you guys can restart this step manually?

Pull request #6946 updated
Connecting to https://api.github.com using lxn2/******
Loading trusted files from base branch master at 908b3c5047cf3bc21f9ef59cd8d4ccd255f06cf6 rather than 6da21815579b9cc9959146bdef4670f89ba42f97
Obtained Jenkinsfile from 908b3c5047cf3bc21f9ef59cd8d4ccd255f06cf6
[Pipeline] stage
[Pipeline] { (Sanity Check)
[Pipeline] timeout
Timeout set to expire in 1 hr 0 min
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
Waiting for next available executor on mxnetlinux
Cancelling nested steps due to timeout
[Pipeline] // node
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] End of Pipeline

GitHub has been notified of this commit’s build result

ERROR: Queue task was cancelled
Finished: FAILURE

@piiswrong merged commit a8804a1 into apache:master on Jul 17, 2017
@piiswrong (Contributor)

Thanks. Merged

@mkolod (Contributor Author) commented Jul 17, 2017

Thanks @piiswrong !!!

@fhieber (Contributor) commented Jul 28, 2017

@mkolod @piiswrong
I managed to run a quick test with Sockeye, which uses SequenceReverse in its bi-directional RNN encoders, on different versions of mxnet: 0.10.0, latest master, master@8144361, and master@a8804a1b7. The latter two correspond to mxnet just before this PR was merged and right after.

The table shows wall time, in seconds, after 2k batches and after 10k batches on a single Tesla K80 (AWS p2.xlarge):

| MXNet version | wall-time @ 2k batches (s) | wall-time @ 10k batches (s) |
|---|---|---|
| 0.10.0 | 1357.811677 | 7224.344484 |
| before PR (814436) | 1389.247020 | 7370.476751 |
| after PR (a8804a) | 1686.342626 | 8827.216907 |
| latest | 1667.805436 | 8783.761625 |

This is obviously not a detailed profile, but it's pretty clear that the revised SequenceReverse operator slows things down quite a bit. Any idea what's going on there?

@mkolod (Contributor Author) commented Jul 28, 2017

@fhieber @piiswrong I'll look into it today and will let you know what I find.

@mkolod (Contributor Author) commented Jul 28, 2017

@fhieber @piiswrong

I tried to reproduce the slowdown and was unable to. I compared the top of tree (TOT) to the TOT with the commit in question (a8804a) unapplied, on the following machine:

  • Core i7-5930K CPU @ 3.50GHz (6 cores)
  • 32 GB RAM
  • Tesla P100 (Pascal)

While I don't see a slowdown, the speed-up is now very minor after the various refactorings done during code review for this PR. Initially, the operator went from 20% of the execution time (before version 1 of this PR) down to 0.5%. Unfortunately, since padding must be left in place, which requires branching, and due to some other changes since the first commit, the gain is now 2% at best.

The Sockeye parameters I used were as follows (on the WMT15 DE-EN dataset):

python3 -m sockeye.train \
        --source data/train2.de \
        --target data/train2.en \
        --validation-source data/valid2.de \
        --validation-target data/valid2.en \
        --output models \
        --num-words 50000 \
        --rnn-num-layers 2 \
        --rnn-num-hidden 512 \
        --num-embed-source 512 \
        --num-embed-target 512 \
        --max-seq-len 50 \
        --batch-size 64 \
        --bucket-width 10 \
        --checkpoint-frequency 20000 \
        --max-num-checkpoint-not-improved 8 \
        --dropout 0.3 \
        --optimizer adam \
        --clip-gradient 1.0 \
        --learning-rate-scheduler-type plateau-reduce \
        --learning-rate-reduce-factor 0.5 \
        --learning-rate-reduce-num-not-improved 3 \
        --attention-type dot \
        --learning-rate-half-life 10 \
        --monitor-bleu 10000 \
        --seed 42 \
        --use-fused-rnn \
        --device-ids 1 \
        --max-updates 2000

| MXNet version | 1k batches | 2k batches |
|---|---|---|
| TOT | 225s | 420s |
| TOT with commit a8804a unapplied | 221s | 419s |

So I can still see a speed-up (on Pascal), but it's much smaller than with the original commit, due to the different design: a single MxNet kernel launch (though not a single CUDA launch), keeping the padding un-reversed as discussed earlier in this thread, and so on.

I wonder if there is a different way to improve perf further, e.g. via a raw kernel rather than the magic macros, or whether the current implementation is the best it can be from a performance perspective while using things such as MSHADOW_XINLINE. Maybe I'm missing some step that happens behind the scenes; since this isn't an actual CUDA kernel written the traditional way, I may have missed some opportunities for perf improvement via these wrappers. In any case, I can't reproduce the slowdown on my end.

@fhieber (Contributor) commented Jul 30, 2017

Thanks for diving deep into this once more! Could the differences between our runs be related to the type of GPU?
In any case, I will rerun your benchmark setup to get another data point.

Guneet-Dhillon pushed a commit to Guneet-Dhillon/mxnet that referenced this pull request Sep 13, 2017