Conversation
Would it be better to remove the use of IndexTensorToVector? I think you can read the indices in ReverseKernel.
The test should also include a case where a vector of indices is passed to SequenceReverse: this is an important use case for batches of sentences with variable sequence length. To me it's not clear whether the updated code still contains this functionality.
src/operator/sequence_reverse-inl.h
Outdated
template <typename xpu, typename DType>
class SequenceReverseOp : public Operator {
 public:
  explicit SequenceReverseOp(SequenceReverseParam p) { this->param_ = p; }
  void sequence_reverse(const mshadow::Tensor<xpu, 3, DType> data,
                        const mshadow::Tensor<xpu, 3, DType> &out,
                        std::vector<index_t> indices, OpReqType req) {
                        std::vector<index_t> indices, OpReqType req,
the indices don't seem to be used anymore in the updated code.
Yep, good catch, will fix that, thanks!
@fhieber Sounds good. And yes, currently the padding is also reversed, so "this is a test PAD" becomes "PAD test a is this." I assume you want to preserve the logic that would generate "test a is this PAD", correct? If so, I can certainly do it. Please confirm if this is the concern you are raising here. I can also add a test for that case.
Hi Marek, yes, if sequence_length= (assuming 0 is the padding symbol in this example). This is important for properly concatenating states from a forward and reverse RNN, for example.
@fhieber Sounds good, I will make that change, and add tests to verify the correct padding behavior.
@mkolod great, thanks a ton for working on this, btw!
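To make the padding-preserving behavior concrete, here is a minimal usage sketch (mine, not code from the PR), assuming the standard `mx.nd.SequenceReverse` call with `use_sequence_length=True` and time-major input of shape `(max_seq_len, batch_size, feature_dim)`:

```python
import mxnet as mx
import numpy as np

# Two sequences of length 4 and 2 in time-major layout (max_seq_len, batch_size, feature_dim);
# 0 is the padding symbol, as in the example above.
data = mx.nd.array(np.array([[[1], [5]],
                             [[2], [6]],
                             [[3], [0]],
                             [[4], [0]]], dtype=np.float32))
lengths = mx.nd.array([4, 2])

out = mx.nd.SequenceReverse(data, sequence_length=lengths, use_sequence_length=True)
print(out.asnumpy()[:, :, 0].T)
# Expected: [[4. 3. 2. 1.]   <- fully reversed
#            [6. 5. 0. 0.]]  <- only the first 2 steps reversed, padding left in place
```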
@fhieber Good question. I'm pretty sure that it would be possible. On the GPU, the perf always depends on the memory access pattern, with 128-byte contiguous accesses being preferred. But yes, the axis change should be doable. That would require another rewrite compared to the current PR, and so would probably be best left for another PR if we want to quickly optimize time-major at least. As far as I remember though, there are some things that are only implemented completely for either time-major or batch-major in MxNet anyway, one example being batch-dot. This is especially true for sequence models (TNC vs. NTC); maybe the axis issue is less pronounced in terms of support for images (NCHW vs. NHWC). But yes, I agree, supporting axis selection for as many operators as possible is a good next step.
@piiswrong I think I just addressed your concern about not using IndexTensorToVector, to remove the unnecessary device-to-host memcpy. Please let me know if the new push that I just made satisfactorily addresses this issue.

@fhieber @tdomhan @KellenSunderland @sbodenstein I included tests for cases of both even and uneven padding (different padding amounts per example in the batch). The tests run on both CPU and GPU. I ran them locally on a Pascal GPU (P100) on Linux; hopefully the CI will pass everywhere else.

BTW, I used clang-format with the "Google" setting, since it seems that "make lint" uses cpplint, which generally follows that layout. I also ran the linter locally prior to submission.
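For reference, a test along these lines can be sketched by checking the operator against a NumPy reference that reverses only the valid prefix of each sequence; the shapes and helper below are illustrative, not the code actually added in the PR:

```python
import numpy as np
import mxnet as mx

def np_sequence_reverse(data, lengths):
    # Reference: reverse only the first lengths[b] time steps of sequence b,
    # leaving any trailing padding untouched.
    out = data.copy()
    for b, length in enumerate(lengths):
        length = int(length)
        out[:length, b] = data[:length, b][::-1]
    return out

max_len, batch, dim = 5, 3, 2
data = np.random.uniform(size=(max_len, batch, dim)).astype(np.float32)
lengths = np.array([5, 3, 1], dtype=np.float32)  # uneven padding per example

expected = np_sequence_reverse(data, lengths)
got = mx.nd.SequenceReverse(mx.nd.array(data),
                            sequence_length=mx.nd.array(lengths),
                            use_sequence_length=True).asnumpy()
assert np.allclose(got, expected)
```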
src/operator/sequence_reverse-inl.h
Outdated
       is provided as an argument to the Map() call as data used
       to determine the offset.
    */
    mxnet_op::Kernel<ReverseKernel, xpu>::Launch(
I would launch a single kernel with size batch_size*max_seq_len
@piiswrong OK, done, thanks for pointing it out!
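For intuition, the index arithmetic of a single launch over `batch_size * max_seq_len` work items can be sketched in plain Python; this is illustrative pseudocode, not the actual `ReverseKernel`, and the flat layout `i = t * batch_size + b` is an assumption made for the example:

```python
def reverse_source_position(i, batch_size, lengths):
    """For flat work-item index i (laid out as i = t * batch_size + b),
    return the (time step, batch) position the output element should copy from."""
    t = i // batch_size   # output time step
    b = i % batch_size    # batch index
    seq_len = int(lengths[b])
    # Positions inside the valid sequence are mirrored; padding positions copy themselves.
    src_t = seq_len - 1 - t if t < seq_len else t
    return src_t, b
```

Each work item then copies `data[src_t, b, :]` to `out[t, b, :]`, so the whole reversal runs as one launch rather than one copy per element.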
@piiswrong It seems that one step of the pipeline blew up due to a timeout. Is there a way you guys can restart this step manually?
Thanks. Merged
Thanks @piiswrong!!!
@mkolod @piiswrong The table shows wall time after 2k batches and 10k batches in seconds on a single Tesla K80 (AWS p2.xlarge)
This is obviously not a detailed profile, but it's pretty clear that the revised SequenceReverse operator slows things down quite a bit. Any idea what's going on there?
@fhieber @piiswrong I'll look into it today and will let you know what I find.
I tried to repro the slowdown and was unable to. I compared the top of tree (TOT) to TOT with the commit in question (a8804a) unapplied, on the following machine:
While I don't see a slowdown, the speed-up is very minor after the various refactorings done during the code review for this PR. Initially, the difference was between 20% of the execution time (prior to version 1 of the PR) and 0.5% of the execution time. Unfortunately, since padding cannot be inverted and requires branching, and due to some other changes since the first commit, the gain is now 2% at best. The Sockeye parameters I used were as follows (on the WMT15 DE-EN dataset):
So, I can still see a speed-up (on Pascal), but it's much smaller than in the original commit, due to a different design: a single MxNet kernel launch (though not a single CUDA launch), honoring padding non-reversal as discussed earlier in the ticket, etc. I wonder if there is a different way to improve perf further, e.g. via a raw kernel rather than the magic macros, or if the current implementation is the best it can be from a performance perspective while using things such as MSHADOW_XINLINE. Maybe I'm missing some magic step here that is happening behind the scenes, since it's not an actual CUDA kernel written the traditional way, so maybe I missed some opportunities for perf improvement via these wrappers. In any case, I can't repro the slowdown on my end.
Thanks for diving deep into this once more! Could the differences between our runs be related to the type of GPU?
@piiswrong @mli @szha This implementation provides some degree of parallelism on the GPU. The version it replaces would do 1 kernel launch per DType instance (int, float) to be copied, which would make perf really bad on the GPU (kernel launch latency and no parallelism). In Sockeye, this would take up 20% of the timeline. With this change, the time consumed went down to less than 0.5%. I also added tests, since this operator didn't have either CPU or GPU tests.