This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

slice operator supporting arbitrary values of step #8558

Merged (2 commits) on Nov 8, 2017

Conversation

reminisce
Contributor

Description

  • Re-implemented slice operator using Kernel::Launch
  • Added support for arbitrary values of step along with begin and end, i.e. slice(data, begin, end, step), where step is an optional parameter.
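For illustration, here is a minimal NumPy sketch of the semantics `slice(data, begin, end, step)` provides (the `slice_ref` helper and the NumPy framing are mine, not part of the PR; per axis the behavior matches Python's `begin:end:step` slicing):

```python
import numpy as np

def slice_ref(data, begin, end, step):
    # Reference semantics: apply begin:end:step independently along each axis.
    return data[tuple(slice(b, e, s) for b, e, s in zip(begin, end, step))]

data = np.arange(16).reshape(4, 4)
# Rows with step 2; columns reversed from index 3 down to 0.
out = slice_ref(data, begin=(0, 3), end=(4, None), step=(2, -1))
# Same result as data[0:4:2, 3::-1]
```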

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • The kernel implementation is based on the parallelization approach of the slice operator in mshadow, with additional support for arbitrary values of step.
  • This operator will be used in Continued Work on Advanced Indexing #8246 to support slicing an NDArray with step != 1. Currently, that functionality is realized via the gather_nd op, which carries the heavy overhead of constructing an index NDArray from the slices.
  • Mini-benchmark comparing slice_v1 (mshadow version) and slice (Kernel::Launch version):
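To illustrate the overhead mentioned above, a hedged NumPy sketch of the gather-style approach (illustrative only, not the actual #8246 code): with step != 1 the index arrays must be materialized before gathering, whereas a direct strided slice needs no index construction at all.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# gather_nd-style: build explicit index arrays from the slice spec first.
rows = np.arange(0, 4, 2)               # step != 1 along axis 0
gathered = data[np.ix_(rows, np.arange(4))]

# Direct strided slice: no index array is ever created.
sliced = data[0:4:2, :]
```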

Hardware
p2.xlarge (4 omp threads)

Commit
ba9de66

GPU build
mx.nd.slice_v1: 10000 repeats costs 0.561281 seconds
mx.nd.slice: 10000 repeats costs 0.562561 seconds

CPU-only build
mx.nd.slice_v1: 10000 repeats costs 1.049587 seconds
mx.nd.slice: 10000 repeats costs 1.130866 seconds

Benchmark script

 import mxnet as mx
 import numpy as np
 import time
 from mxnet.test_utils import same
 
 #ctx = mx.gpu(0)
 ctx = mx.cpu(0)
 
 repeat = 10000
 shape = (16, 16, 16, 16)
 a = mx.nd.arange(np.prod(shape), ctx=ctx).reshape(shape=shape)
 begin = (None, 1, 2)
 end = (shape[0], shape[1], shape[2])
 
 # warm up
 for i in range(100):
     c = mx.nd.slice_v1(a, begin=begin, end=end)
     b = mx.nd.slice(a, begin=begin, end=end)
 
 mx.nd.waitall()
 start = time.time()
 for i in range(repeat):
     c = mx.nd.slice_v1(a, begin=begin, end=end)
 mx.nd.waitall()
 elapsed = time.time() - start
 print("mx.nd.slice_v1: %d repeats costs %f seconds" % (repeat, elapsed))
 
 start = time.time()
 for i in range(repeat):
     b = mx.nd.slice(a, begin=begin, end=end)
 mx.nd.waitall()
 elapsed = time.time() - start
 print("mx.nd.slice: %d repeats costs %f seconds" % (repeat, elapsed))
 
 assert same(c.asnumpy(), b.asnumpy())

@piiswrong @eric-haibin-lin @anirudh2290 @rahul003 @cjolivier01

} else if (s < 0) {
e = -1;
}
CHECK_LE(e, len) << "slicing with end[" << i << "]="
Member

In these CHECK_XXX macros, is a function call made for every <<, even if the check succeeds? If so, does this affect performance?

Contributor Author

It's not called if the check succeeds. See the code:
https://github.com/dmlc/dmlc-core/blob/595d02c0e87be8a0846700462b6f45f1b1031e39/include/dmlc/logging.h#L89
All the << calls happen only when _check_err holds a non-empty string.
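In Python terms, the pattern described above looks roughly like this (a sketch of the idea, not the actual dmlc-core code): the comparison itself is cheap, and the message operands are evaluated only on the failure path.

```python
def check_le(lhs, rhs, make_message):
    """Sketch of a lazy CHECK_LE: message construction is deferred."""
    if lhs <= rhs:          # fast path: make_message is never called
        return
    raise AssertionError(make_message())

# The (potentially expensive) formatting runs only when the check fails.
check_le(3, 10, lambda: "slicing with end=%d exceeds len=%d" % (3, 10))
```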

stride *= dshape[k];
}
KERNEL_ASSIGN(igrad[irow*data_last_dim_size+j*step_last_dim+begin_last_dim],
req, ograd[ograd_offset++]);
Member

nit: spaces around operators

@cjolivier01
Member

For CPU build:
mx.nd.slice_v1: 10000 repeats costs 1.049587 seconds
mx.nd.slice: 10000 repeats costs 1.130866 seconds

Any chance you can break this down into forward and backward times? Is it known why this is slower than mshadow version?

@reminisce
Contributor Author

reminisce commented Nov 6, 2017

@cjolivier01 The time is for forward only, no backward (the backward kernel is essentially the same as the forward one). It includes shape/dtype inference and the forward function calls. A few factors may contribute to the extra runtime in the new implementation:

  1. The infer-shape function is more complicated than the previous version, since it must support step, while the previous version assumed step=1.
  2. The kernel needs to handle step as well.
  3. The kernel needs to copy the shapes of data, out, and step, in addition to the shapes of begin and end copied in the mshadow version.

I can investigate which part accounts for the extra ~10% runtime.

NNVM_REGISTER_OP(slice)
.add_alias("_sparse_slice")
.add_alias("crop")
NNVM_REGISTER_OP(slice_v1)
Contributor

don't need slice_v1

Contributor Author

Removed.

@reminisce
Contributor Author

@cjolivier01 I tried replacing the infer-shape function and forward kernel with the same ones as in mshadow, but the gap is still about 10%. That suggests the difference lies between the mshadow and Kernel::Launch execution paths themselves. Any insights?

SHAPE_ASSIGN_CHECK(*out_attrs, 0, GetSliceShape(param, dshape));
return true;
}

inline bool SliceForwardInferStorageType(const nnvm::NodeAttrs& attrs,
Member

Does the slice implementation for CSR support arbitrary step sizes? Do we want to update the infer-storage function to fall back appropriately?

Contributor Author

Added fallback check now. Please review again.
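A sketch of the fallback rule in Python (illustrative only; the real check lives in the C++ infer-storage function): a CSR input keeps CSR storage only for trivial steps, and anything else dispatches to the dense implementation.

```python
def slice_keeps_csr(step):
    # Hypothetical predicate: CSR slicing is assumed to support only
    # step == 1 (or unspecified) on every axis; any other step falls
    # back to the default (dense) storage.
    return all(s is None or s == 1 for s in step)

# step=(None, 1) keeps CSR; step=(2,) must fall back to dense.
```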

Implement slice op backward

Change parallelization approach for slicing

Add unit test

Fix lint and typo

Fix doc

Fix doc

Remove slice_v1

Address cr

Remove slice_v1 in .cu

Change step data type

Add fallback for slicing csr with non-trivial step
@piiswrong piiswrong merged commit 70b68b1 into apache:master Nov 8, 2017
cjolivier01 pushed a commit to cjolivier01/mxnet that referenced this pull request Nov 9, 2017
* Implement slice op forward supporting arbitrary step value

* Add error handling for take op infer shape
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 3, 2017
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018