[Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++) #5491

@qingqing01

Compile time comparison between NVCC and G++

  1. Conclusion:

    • NVCC is slower than G++ by more than 1 minute per file. For example, for elementwise_mul_op the compile time is 13s (G++) vs. 1m41s (NVCC with four gencodes).
    • The more CUDA gencodes (gencode: sm_xx) are specified, the slower NVCC compiles: 1m41s with four gencodes vs. 34s with only sm_35.
  2. Experiment 1: elementwise_mul_op, this op uses Eigen to compute

    • G++

      • compile :
      time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++   -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c    -o CMakeFiles/elementwise_mul_op.dir/elementwise_mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cc
      • time:
      real	0m13.116s
      user	0m12.264s
      sys	0m0.849s
    • NVCC

      • gencode: sm_30, sm_35, sm_50, sm_52
      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include 
-I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      
      • time:
      real	1m41.708s
      user	1m30.757s
      sys	0m10.992s
    • NVCC, gencode: only sm_35

      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_35,code=sm_35 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle 
-I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      • time:
      real	0m34.035s
      user	0m30.629s
      sys	0m3.414s
  3. Experiment 2: mul_op, this op uses math::matmul to compute.

    • G++
      • compile :
        time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++   -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c    -o CMakeFiles/mul_op.dir/mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cc
      • time:
      real	0m11.383s
      user	0m10.568s
      sys	0m0.825s
    • NVCC, gencode: sm_30, sm_35, sm_50, sm_52
      • compile:
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/mul_op.dir//./mul_op_generated_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build 
-I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      
      • time:
      real	1m31.902s
      user	1m21.839s
      sys	0m10.026s
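
To put the measurements above side by side, here is a minimal sketch that converts the "real" times into seconds and computes how many times slower each NVCC run is than the G++ run for the same operator. The helper names (parse_time, REAL_TIMES, slowdown) are made up for this illustration and are not part of Paddle; the numbers are copied from the two experiments.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '1m41.708s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times copied from Experiment 1 and Experiment 2.
# The dictionary layout and the configuration labels are illustrative only.
REAL_TIMES = {
    ("elementwise_mul_op", "g++"): "0m13.116s",
    ("elementwise_mul_op", "nvcc, 4 gencodes"): "1m41.708s",
    ("elementwise_mul_op", "nvcc, sm_35 only"): "0m34.035s",
    ("mul_op", "g++"): "0m11.383s",
    ("mul_op", "nvcc, 4 gencodes"): "1m31.902s",
}


def slowdown(op, config):
    """Ratio of the given configuration's real time to the G++ real time."""
    baseline = parse_time(REAL_TIMES[(op, "g++")])
    return parse_time(REAL_TIMES[(op, config)]) / baseline


if __name__ == "__main__":
    for op, config in REAL_TIMES:
        if config != "g++":
            print(f"{op:20s} {config:18s} {slowdown(op, config):.1f}x slower than g++")
```

With these numbers, NVCC with four gencodes comes out roughly 7.8x (elementwise_mul_op) and 8.1x (mul_op) slower than G++, and roughly 2.6x slower with sm_35 only. These are single runs on one machine, so the ratios are indicative rather than precise.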

The .cu operators which can be compiled by G++

The following .cu operators can be compiled by G++, since the CUDA kernels they depend on are already compiled in the math libraries (the paddle/operators/math/ directory). Operators that only call cuDNN can also be compiled by G++.

batch_norm_op.cu
concat_op.cu
conv2d_transpose_cudnn_op.cu
conv_cudnn_op.cu
conv_op.cu
conv_transpose_op.cu
fill_constant_batch_size_like_op.cu
fill_constant_op.cu
fill_zeros_like_op.cu
gru_op.cu
linear_chain_crf_op.cu
lstm_op.cu
matmul_op.cu
mul_op.cu
nccl_op.cu
nccl_op_test.cu
pool_cudnn_op.cu
pool_op.cu
pool_with_index_op.cu
sequence_conv_op.cu
reshape_op.cu
sequence_concat_op.cu
sequence_softmax_op.cu
softmax_op.cu
split_op.cu

However, having different compile rules for different operators is a little confusing for developers.

Also related to #5413
