Batch Norm rewrite without mshadow, 1D, 2D, 3D, float16, float32, float64 as well as operator gtest framework #5936
include/mxnet/tensor_blob.h
Outdated
@@ -184,7 +184,7 @@ class TBlob {
mshadow::Shape1(shape_.Size()), stream);
}
/*! \brief return number of dimension of the tensor inside */
- inline int ndim(void) const {
+ inline index_t ndim(void) const {
return shape_.ndim();
Cast instead of changing the return type.
Is there a reason it is int? It can never be negative.
The size() calls return index_t, btw.
Other code may rely on the return type of ndim(), so we should check more carefully before making the change.
ok
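The option the thread settles on (keep the `int` return type and cast at the call site) can be sketched as follows. This is only an illustration under stated assumptions: `ShapeLike` and `CountElements` are hypothetical names, not the real `TBlob`/`TShape` API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shape class; NOT the real mxnet::TShape.
class ShapeLike {
 public:
  explicit ShapeLike(std::vector<std::size_t> dims) : dims_(std::move(dims)) {}
  // Return type stays int, so existing callers are unaffected.
  int ndim() const { return static_cast<int>(dims_.size()); }
  std::size_t operator[](std::size_t i) const { return dims_[i]; }

 private:
  std::vector<std::size_t> dims_;
};

// Caller-side cast: the unsigned loop index is compared against an
// explicitly converted value, so no signed/unsigned comparison occurs.
inline std::size_t CountElements(const ShapeLike& shape) {
  std::size_t total = 1;
  const std::size_t n = static_cast<std::size_t>(shape.ndim());
  for (std::size_t i = 0; i < n; ++i) {
    total *= shape[i];
  }
  return total;
}
```

The cast keeps the change local to the code that needs an unsigned value, which is the property the reviewer was asking for.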
src/operator/operator_common.h
Outdated
@@ -309,6 +309,34 @@ inline void ParamParser(nnvm::NodeAttrs* attrs) {
attrs->parsed = std::move(param);
}

/*! \brief Callback class to allow for convenient development and testing */
template<typename Type>
class Callbacker {
?
It is for calling back into the unit testing framework during unit tests. I can move it to just batchnorm if you like.
Is a callback necessary? Can you do it in the test code instead?
The callback exists so that tests can inspect intermediate values or complex matrices (print them, for instance) without having to include test_util.h (for example) in the main build tree. When not under test, it has no overhead (production code doesn't make callbacks).
OK, let's move this to batchnorm, or better, remove it and only test public methods so that no callback is needed.
ok
@@ -0,0 +1,366 @@
/*!
batch_norm_v1-inl.h is basically the old batch_norm-inl.h
@@ -0,0 +1,89 @@
/*!
batch_norm_v1.cc is basically the old batch_norm.cc
Please also add Python tests similar to those for the conv and pool refactors.
@@ -0,0 +1,19 @@
/*!
batch_norm_v1.cu is basically the old batch_norm.cu
tests/jenkins/run_test_ubuntu.sh
Outdated
@@ -19,7 +19,7 @@ fi
cp make/config.mk .
echo "USE_CUDA=1" >> config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >> config.mk
- echo "USE_CUDNN=1" >> config.mk
+ echo "USE_CUDNN=0" >> config.mk
?
One of the Jenkins scripts should build the organic (non-CUDNN) GPU operators rather than always using CUDNN.
The MKL test is already doing that.
ok
Unit tests in Python are required.
- Verify that the new implementation's results are the same as the old one's.
- Verify that the new implementation's CPU and GPU results are the same.
- Verify that the new implementation produces the expected results for the cases that were not supported before.
src/operator/batch_norm-inl.h
Outdated
* \sa OpReqType, OpContext
*/
#ifdef MXNET_USE_CUDA
void doForward(mshadow::Stream<gpu> *stream,
Function names should use CamelCase style.
ok
src/operator/batch_norm-inl.h
Outdated
const size_t itemCount = inputData.Size() / channelCount;

// Avoid multiple dptr() call within the channel loop
Dtype *inputDataPtr = inputData.dptr<Dtype>();
Use DType instead of Dtype to conform with the convention in MXNet.
src/operator/batch_norm.cu
Outdated
#define DeviceTensor3 DeviceTensor<Dtype, 3>

template <typename Dtype, typename accreal>
static void BatchNormalization_updateOutput(mshadow::Stream<gpu> *s,
Avoid mixed styles of the function name.
ok
src/operator/batch_norm-inl.h
Outdated
/*! \brief inverse standard deviation <-> variance
* Note that these aren't entirely reversible due to eps
**/
#define VARIANCE_TO_INVSTD(__var$, __eps$) (1.0/sqrt((__var$) + Dtype(__eps$)))
Just curious: why is a '$' sign needed at the end of every variable?
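One way to avoid macro parameter naming altogether is to use inline function templates, whose parameters are ordinary scoped names. The sketch below is an illustration only: `VarianceToInvStd` and `InvStdToVariance` are hypothetical names, not functions from this PR, and it assumes `DType` is `float` or `double` so that `std::sqrt` applies directly.

```cpp
#include <cmath>

// Hypothetical inline equivalent of VARIANCE_TO_INVSTD.
template <typename DType>
inline DType VarianceToInvStd(DType var, DType eps) {
  return DType(1) / std::sqrt(var + eps);
}

// Hypothetical inline equivalent of INVSTD_TO_VARIANCE.
// The round trip recovers the variance only approximately, because
// eps is added and subtracted in finite-precision arithmetic.
template <typename DType>
inline DType InvStdToVariance(DType invstd, DType eps) {
  return DType(1) / (invstd * invstd) - eps;
}
```

A half-precision type would need its own conversion before the square root, which this sketch does not cover.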
src/operator/batch_norm-inl.h
Outdated
#define INVSTD_TO_VARIANCE(__invstd$, __eps$) ((1.0 / ((__invstd$) * (__invstd$))) - (__eps$))

/*! \brief Compute the variance of each input channel, as well as update moving mean/variants */
template<typename DType, typename Shape>
Why is Shape needed as a template argument?
This way it works with both mshadow::TShape and nnvm::TShape equally, without causing extra overhead.
src/operator/batch_norm-inl.h
Outdated
/*! \brief Batch normalization operator */
template<typename xpu, typename Dtype, typename AccType>
class BatchNormOp : public Operator
, public Callbacker<Operator> {
Can we avoid multiple inheritance?
Why?
It is convenient because I can use dynamic_cast in the unit tests to check whether the interface exists. Otherwise, every other operator would have to be changed to add a function returning a pointer to the callback interface, etc.
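The dynamic_cast probe described here can be sketched as follows. All type names are hypothetical stand-ins chosen for illustration; they are not the `Operator` or `Callbacker` classes from this PR.

```cpp
// Hypothetical base class standing in for an operator type.
struct OperatorBase {
  virtual ~OperatorBase() = default;
};

// Hypothetical test-only interface an operator may opt in to.
struct TestCallbackInterface {
  virtual ~TestCallbackInterface() = default;
  virtual int CallbackCount() const = 0;
};

// An operator that exposes the interface through a second base class.
struct InstrumentedOp : OperatorBase, TestCallbackInterface {
  int CallbackCount() const override { return 1; }
};

// An operator that does not expose the interface.
struct PlainOp : OperatorBase {};

// Test-side probe: the cross-cast yields nullptr when the interface is absent,
// so operators that never opted in need no changes at all.
inline bool SupportsCallbacks(OperatorBase* op) {
  return dynamic_cast<TestCallbackInterface*>(op) != nullptr;
}
```

The cost of this design is the second base class on each instrumented operator, which is what the reviewer's question about multiple inheritance is weighing.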
src/operator/batch_norm-inl.h
Outdated
class BatchNormOp : public Operator
, public Callbacker<Operator> {
typedef ::nnvm::TShape TShape;
typedef ::mxnet::TBlob TBlob;
I think TShape and TBlob are visible in this scope, right? Why are these typedefs needed?
They sometimes get mixed up with mshadow::TShape/TBlob. The typedefs are there because of compile issues.
src/operator/batch_norm-inl.h
Outdated

/*! \brief Compute the mean of each input channel */
template<typename Shape>
static inline void computeMean(const Dtype *in_data,
It seems you compute the mean and variance without using gemm, which has been optimized internally. According to your benchmark results, the new CPU implementation is slower than the old one. Is there a good reason?
How can we apply gemm here?
For mean: we create an array [1/size .... ]T and multiply with the original in_data?
Will gemm consume more space?
Is gemm doing better than openmp?
New to MXNet, thank you for your help.
@yuruofeifei Yes. Caffe uses gemm/gemv to implement computing mean and variance. I guess it's faster than putting omp there manually. In addition, CPU and GPU can share the same piece of code if using gemm. Currently, this code is for CPU only. It requires a little more memory to store the vector, but gives faster speed (my guess) and easy code maintenance in return.
Eric instructed me not to use BLAS for this as Caffe does, but to base it on Torch.
In addition, gemm does superfluous multiplies when calculating the mean.
src/operator/batch_norm-inl.h
Outdated
const std::vector<TBlob> &aux_states);
#endif // MXNET_USE_CUDA

void doBackward(mshadow::Stream<cpu> *stream,
When MXNET_USE_CUDA=1, this function is compiled. But it shouldn't be, right?
It is required. One can always select not to use a GPU even when the code is compiled with GPU enabled
src/operator/batch_norm-inl.h
Outdated

/*! \brief Fast-foreach when you don't care about the position other than channel */
template<typename Shape, typename OnData>
static inline void forEachFast(Dtype *in_data, const Shape& shape,
Is there a need for this function?
I think we can just pass the in_data array with correct channel offset.
It is my understanding that data is arranged like this:
[batch item 1][channel 1][spatial data (d,r,w)][channel 2][spatial data (d,r,w)][batch item 2]
[batch item 2][channel 1][spatial data (d,r,w)][channel 2][spatial data (d,r,w)][batch item 2]
In which case you still need to loop across the batch item, correct?
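The layout argument can be made concrete with a small sketch. It assumes a contiguous (batch, channel, spatial) layout as described above; `ChannelSum` and its parameters are hypothetical names for illustration and are not the PR's `forEachFast`.

```cpp
#include <cstddef>
#include <vector>

// Sum of one channel over all batch items and spatial positions,
// assuming a contiguous (batch, channel, spatial) layout.
inline double ChannelSum(const std::vector<double>& data,
                         std::size_t batch, std::size_t channels,
                         std::size_t spatial, std::size_t channel) {
  double sum = 0.0;
  for (std::size_t n = 0; n < batch; ++n) {
    // Start of this channel's block inside batch item n.
    const std::size_t offset = (n * channels + channel) * spatial;
    for (std::size_t i = 0; i < spatial; ++i) {
      sum += data[offset + i];
    }
  }
  return sum;
}
```

Because a channel's values form one contiguous run per batch item rather than a single run, a pointer with a channel offset is not enough on its own; the outer loop over batch items is still required.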
src/operator/batch_norm-inl.h
Outdated
const size_t channelCount = ishape[1];
CHECK(oshape.Size() == channelCount);

forEachFast(in_data, ishape,
How about using std::accumulate for the mean and variance?
And using std::transform for converting the data back?
- Same question as above
- Once the omp loop is inserted across channels, there's not much left to put into the binary operation. In addition, this adds a lot of function-call overhead in debug mode for what's a pretty simple calculation.
What do you think?
.gitignore
Outdated
ps-lite
nnvm
!src/nnvm
#dmlc-core
?
I put those there accidentally in the caffe data iter commit. I am removing them.
Are you removing this?
Yes, removed.
@@ -11,6 +11,23 @@

#if MXNET_USE_CUDA

/*! \brief Macros/inlines to assist CLion to parse Cuda files (*.cu, *.cuh) */
#ifdef __JETBRAINS_IDE__
?
This allows the CLion IDE to parse CUDA code. It has no net effect on anything else.
One last change in tests: initialize gamma and beta to random values (instead of ones and zeros) to make sure that fix_gamma works the same way as before.
Already checking varying gamma/beta per our offline discussion.
src/operator/batch_norm.cc
Outdated
}
}

DO_BIND_DISPATCH(CreateOp, param, (*in_type)[0]);
Pass in_shape to CreateOp here and remove mkl_off.
ok
…at64 as well as operator gtest framework (apache#5936) * Batch Norm rewrite without mshadow as well as operator gtest framework * performance testing * lint fixes * use CUDNN for this test * remove superfluous omp define * Fix file names in comments * build, run, clean gtest works (although a test is failing) * CR comments * Adjust timing tests for more strenuous sample * Remove temp resource allocation * DeviceTensor3 added, forEachFast not yet converted * DeviceTensor3 version working * DeviceTensor3 working * . * Fix for use_global_stats * fixed bug with testing suite for double (Float64) * python unit tests working for batchnorm * python unit tests * Update documentation for mxnet.initializer.Mixed (apache#5937) * Update documentation for SVMOutput. (apache#5931) * Update documentation for SVMOutput. * Update doc for SVMOutput - fix formatting. * Adding install instruction for Ubuntu-CPU-Python (apache#5885) * edit ndarray API docs (apache#5806) * edit docs in broadcast_reduce_op * edit docs in broadcast_reduce_op * minor change * lint fix * fix * mx.nd.ones * mx.nd.repeat * mx.nd.reverse * add example in repeat * optimizer update * fix nanprod * fix optimizer_op api doc * fix reduce_op api doc * fix nd.ones api doc * mx.nd.repeat doc change * Update broadcast_reduce_op.h * Symbol docs fixes (apache#5930) * symbol docs minor formatting changes * deepcopy, infer_shape, infer_shape_partial docs modified * Few more small fixes * arithmetic functions fixes * some more modifications * changes after review * small change * grad function note added * More API Doc Edits (apache#5886) * edit activation doc * doc l2_normalization * edit MakeLoss doc * edit blockgrad doc * blockgrad fileline fix * edit MakeLoss doc cont. * doc change 'tensor' to 'multidimensional array' * l2normalization doc improve * makeloss doc improve, blockgrad doc improve * fix doc in activation, l2_normalization, make_loss * fix minor grammar * use .describe to avoid build failure. 
* Update documentation for mxnet.image.imdecode (apache#5957) * Update documentation for mxnet.image.imdecode * Update documentation for mxnet.image.imdecode (clarify that we need OpenCV and not the CV2 Python library) * Fix script by adding path to Dockerfile (apache#5958) * Clean install script * Add test for pip installations * Remove debug statements & comments * Make test runnable as script and from framework * Fix path to Dockerfiles * Putting failing cases at the end * Update doc for Custom operator. (apache#5875) * Update doc for Custom operator. * Update doc for Custom operator. * Fix formating in doc for Custom operator. * Fix formating in doc for Custom operator. * Minor change to ndarray.Custom documentation. * Minor edit in doc for Custom operator. * Minor change to doc for Custom operator. Data is 'NDArray-or-Symbol'. * Minor formatting change for Custom operator documentation. * For Custom operator doc, move example into ndarray_doc.py. * Minor change in Custom operator documentation * Improve the doc of pick + Update dmlc-core (apache#5946) * Add PickParam to fix the docstring and the initial value for axis * Update dmlc-core * Update dmlc-core * Image docs modified (apache#5973) * imageIter doc modified * edited imageiter * ADD missing Libri_sample.json, FIX minor bugs in speech_recognition example (apache#5962) * [KVStore] Add support for other data types (apache#5818) * Fix kvstore type * Fix lint * Parse inputs to DataDesc * Make module support dtype * Fix lint * Add default dtype in Comm * Fix lint * Revert rename * [cpp-package] Add C++ basic tutorial and build instruction (apache#5971) * Add C++ basic tutorial and build instruction * Remove binaries * Fix lint * Avoid sign-compare * Update documentation for mxnet.metric.np (apache#5977) * Getting rid of identity (apache#5935) * Activation ops (apache#5938) * [Ops] Add op: 'relu' * Add op: 'sigmoid' * Introduce 'kernel_launch_op' * Add tests and describe; move it to elemwise_unary_op * Fix GPU 
version * Convert caffe AbsVal to mx.symbol.abs in caffe converter (apache#5984) * Correction to LSTMCell docstring (apache#5986) * [Module] fix input_grads order (apache#5980) * fix input_grads order + update dmlc-core * set label to be optional * update env_var doc (apache#5964) * Adjusting make, Callback removed * batch norm gpu testing * Batch Norm rewrite without mshadow as well as operator gtest framework * performance testing * lint fixes * use CUDNN for this test * remove superfluous omp define * Fix file names in comments * build, run, clean gtest works (although a test is failing) * CR comments * Adjust timing tests for more strenuous sample * Remove temp resource allocation * rearrange source into cc and cu files * lint fixes * Trigger build * Use latest mshadow * temporarily revert channel position parameter field * Add more tests for batchnorm * Add more tests for batchnorm * test_operator_gpu working for all types * Compiles after AccReal * Compiles after AccReal * All tests working * All tests working * build, run, clean gtest works (although a test is failing) * vc++ requires explicit int type for omp for loop * Repair cpp-package * signed/unsigned fixed in cuda file * lint fixes in tests and cpp-package directories * more lint * use IsWriting() helper * Fall-through for unsupported MKL shapes/types * Fall-through for unsupported MKL shapes/types * cleaner mkl_off approach * Warning only whem MKL is requested * Warning only whem MKL is requested * lint * .. * python problem fixed * python problem fixed * Merge branch 'batchnorm' into batchnorm_pr # Conflicts: # src/operator/batch_norm.cc # src/operator/batch_norm.cu # tests/cpp/operator/batchnorm_test.cc * lint fix * lint fix * lint fix * lint fix * lint fix * Fix visual c++ compile problem * . * . 
* All unit tests pass again * lint fix * fix strange compile errors in CUDNN batchnorm header * FInish using flags instead of bools * lint * Fix timing pass count for forward pass * Fix R script install roxygen problem * code formatting, addition of doc strings is causing IDE to add spaces before the calls * removed commented * cr comments * Change back to compilable code * For CPU mode, store as invstd * move testing code around a little * lint fix * Use AccReal in some places to avoid fp16 problems * Fix minor invstd problem in cuda version * remove unused scale param * add permutation unit test, handle cudnn doesn't like 3D * . * lint * . * Remove mkl_off * lint fix and time cudnn when enabled
Note that batch_norm.cu and batch_norm-inl.h are almost entirely new code.
However, GitHub does not render their diffs by default. Click "View" to see them.
test_op.h (all new code, along with test_util.h and test_perf.h) is also not shown for the same reason.
@piiswrong
Performance (bs=128, c=3, h=28, w=28):
OLD CPU
BatchNormV1Prop 2D: Timing [Forward] 2828.16 ms, avg: 5.65631 ms X 500 passes
BatchNormV1Prop 2D: Timing [Backward] 20908.4 ms, avg: 41.8169 ms X 500 passes
NEW CPU
BatchNormProp 2D: Timing [Forward] 788.777 ms, avg: 1.57755 ms X 500 passes
BatchNormProp 2D: Timing [Backward] 322.013 ms, avg: 0.644026 ms X 500 passes
OLD GPU
BatchNormV1Prop 2D: Timing [Forward] 5.365 ms, avg: 0.01073 ms X 500 passes
BatchNormV1Prop 2D: Timing [Backward] 15.483 ms, avg: 0.030966 ms X 500 passes
NEW GPU
BatchNormProp 2D: Timing [Forward] 3.514 ms, avg: 0.007028 ms X 500 passes
BatchNormProp 2D: Timing [Backward] 4.787 ms, avg: 0.009574 ms X 500 passes