This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Can't compile v1.x from source with CUDA 11.2 support #19992

Closed
lgg opened this issue Mar 8, 2021 · 8 comments

@lgg

lgg commented Mar 8, 2021

I can successfully compile from the master branch, but that produces mxnet==2.0.0.

Why

I want a 1.8.x or 1.x version of MXNet with CUDA 11.2 support.

v1.x branch and steps

I checked that the last stable build was on commit 9a2a50229d49586dfde8caa708b17c72a90b9727.

Error output:

Environment

uname -a
Linux ml-dev 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
user@ml-dev:~/cudnn$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal
user@ml-dev:~/cudnn$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0
----------Python Info----------
Version      : 3.8.5
Compiler     : GCC 9.3.0
Build        : ('default', 'Jan 27 2021 15:41:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet
Commit hash file "/home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet/libmxnet.so']
Build features:
✔ CUDA
✖ CUDNN
✖ NCCL
✖ TENSORRT
✖ CUTENSOR
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ DIST_KVSTORE
✔ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-5.8.0-44-generic-x86_64-with-glibc2.29
system       : Linux
node         : ml-dev
release      : 5.8.0-44-generic
version      : #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Stepping:                        10
CPU MHz:                         4313.191
CPU max MHz:                     4700,0000
CPU min MHz:                     800,0000
BogoMIPS:                        7399.70
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1,5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0024 sec, LOAD: 0.4733 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1401 sec, LOAD: 0.3012 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)>, DNS finished in 0.08941888809204102 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0665 sec, LOAD: 0.3682 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0349 sec, LOAD: 0.6479 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.05992579460144043 sec.
----------Environment----------
@lgg
Author

lgg commented Mar 8, 2021

I also tried:

  • git checkout --recurse-submodules 1.8.0
  • git checkout --recurse-submodules v1.8.x

Each time, I cleared the build folder with rm -rf.

But it also failed at the same step as above:

  • [ 39%] Building CXX object CMakeFiles/mxnet_static.dir/src/nnvm/tvm_bridge.cc.o

Also, 1.8.0 and v1.8.x don't include https://github.com/apache/incubator-mxnet/blob/v1.x/config/distribution/linux_cu112.cmake,
but the only differences are the CUDA (nvcc) version (and path) and CUDA_ARCH.

For 1.8.0 and v1.8.x I removed the 8.6 entry from the CUDA arch list in my config.cmake.
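
The edit described above can be scripted. This is a minimal sketch, assuming config.cmake carries the arch list as a semicolon-separated quoted string on a line that mentions MXNET_CUDA_ARCH (as the linked linux_cu112.cmake appears to do). The helper name drop_cuda_arch is hypothetical.

```shell
#!/usr/bin/env bash
# drop_cuda_arch ARCH FILE
# Remove one entry (e.g. 8.6) from the semicolon-separated list on the
# MXNET_CUDA_ARCH line of FILE. The variable name and line layout are
# assumptions about the config file, not taken from the thread.
drop_cuda_arch() {
  local arch="$1" file="$2"
  # Escape dots so "8.6" is matched literally by the regex.
  local pat="${arch//./\\.}"
  # First expression: entry in the middle or at the end of the list.
  # Second expression: entry at the very start of the list.
  sed -i.bak -E \
    -e "/MXNET_CUDA_ARCH/ s/;${pat}([;\"])/\1/" \
    -e "/MXNET_CUDA_ARCH/ s/\"${pat};/\"/" \
    "$file"
}
```

For example, `drop_cuda_arch 8.6 config.cmake` rewrites the file in place and keeps the original as config.cmake.bak.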

@TristonC
Contributor

The errors are in the TVM source code. Which version of gcc or llvm did you use to build MXNet?

@lgg
Author

lgg commented Mar 11, 2021

@TristonC no llvm was used.

user@ml-dev:~$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

user@ml-dev:~$ cmake --version
cmake version 3.16.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
user@ml-dev:~$ g++ --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@TristonC
Contributor

TristonC commented Mar 11, 2021

@lgg Could you try setting the C++ standard to 14 or 17? The standard is currently set to 11 in CMakeLists.txt (17 is used in the master branch).

project(mxnet C CXX)
set(CMAKE_CXX_STANDARD 17)
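
Because the standard is set inside CMakeLists.txt, passing -DCMAKE_CXX_STANDARD on the cmake command line may not take effect, so editing the file is the more direct route. This is a minimal sketch of that edit, assuming the file contains a line of the form set(CMAKE_CXX_STANDARD 11). The helper name set_cxx_standard is hypothetical.

```shell
#!/usr/bin/env bash
# set_cxx_standard STD FILE
# Rewrite the number on the "set(CMAKE_CXX_STANDARD <n>)" line of FILE.
# Lines such as set(CMAKE_CXX_STANDARD_REQUIRED ON) are left untouched,
# because the pattern requires whitespace right after the variable name.
set_cxx_standard() {
  local std="$1" file="$2"
  sed -i.bak -E \
    "s/^([[:space:]]*set\(CMAKE_CXX_STANDARD[[:space:]]+)[0-9]+/\1${std}/" \
    "$file"
}
```

Usage would be `set_cxx_standard 17 CMakeLists.txt`, followed by removing the build folder and re-running cmake so the cached configuration is regenerated.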

@lgg
Author

lgg commented Mar 16, 2021

@TristonC thank you! Now I get this strange error:

  • git checkout --recurse-submodules 9a2a50229d4
    • I also tried:
    • git checkout --recurse-submodules 1.8.0
    • git checkout --recurse-submodules v1.8.x
    • these produce the same error about missing nnvm headers
  • mkdir build && cd build
  • cmake ..
  • cmake --build . --parallel 10 > build.log 2>&1

https://gist.github.com/lgg/4f0028e47034310eed596ffec2132ebe#file-build_1-8-0_after_fix-log

https://gist.githubusercontent.com/lgg/4f0028e47034310eed596ffec2132ebe/raw/50f6289cc6e416249a8e24256ffa5994cac3695e/build_1.8.0_after_fix.log

[ 40%] Building CXX object 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/shuffle/jit_uni_shuffle.cpp.o
[ 40%] Built target dnnl_cpu_x64
Scanning dependencies of target dnnl
[ 40%] Linking CXX static library libdnnl.a
[ 40%] Built target dnnl
Scanning dependencies of target compat_libs
[ 40%] Generating libmkldnn.a
[ 40%] Built target compat_libs
Scanning dependencies of target mxnet_static
[ 40%] Building CXX object CMakeFiles/mxnet_static.dir/src/api/_api_internal/_api_internal.cc.o
[ 40%] Building CXX object CMakeFiles/mxnet_static.dir/src/api/operator/numpy/np_init_op.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/api/operator/numpy/np_tensordot_op.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/api/operator/utils.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/base.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/c_api/c_api_error.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/c_api/c_api.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/c_api/c_api_executor.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/c_api/c_api_function.cc.o
[ 41%] Building CXX object CMakeFiles/mxnet_static.dir/src/c_api/c_api_ndarray.cc.o
/home/user/mxnet1.8/mxnet1.8_real/src/c_api/c_api_error.cc:25:10: fatal error: nnvm/c_api.h: No such file or directory
   25 | #include <nnvm/c_api.h>
      |          ^~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:141: CMakeFiles/mxnet_static.dir/src/c_api/c_api_error.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/runtime/packed_func.h:37,
                 from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/runtime/registry.h:49,
                 from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/api_registry.h:31,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/_api_internal/_api_internal.cc:25:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/ndarray.h:33:10: fatal error: nnvm/node.h: No such file or directory
   33 | #include <nnvm/node.h>
      |          ^~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:63: CMakeFiles/mxnet_static.dir/src/api/_api_internal/_api_internal.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/runtime/packed_func.h:37,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/numpy/../utils.h:29,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/numpy/np_tensordot_op.cc:24:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/ndarray.h:33:10: fatal error: nnvm/node.h: No such file or directory
   33 | #include <nnvm/node.h>
      |          ^~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:89: CMakeFiles/mxnet_static.dir/src/api/operator/numpy/np_tensordot_op.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/runtime/packed_func.h:37,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/numpy/../utils.h:29,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/numpy/np_init_op.cc:24:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/ndarray.h:33:10: fatal error: nnvm/node.h: No such file or directory
   33 | #include <nnvm/node.h>
      |          ^~~~~~~~~~~~~
In file included from /home/user/mxnet1.8/mxnet1.8_real/include/mxnet/runtime/packed_func.h:37,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/utils.h:29,
                 from /home/user/mxnet1.8/mxnet1.8_real/src/api/operator/utils.cc:24:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/ndarray.h:33:10: fatal error: nnvm/node.h: No such file or directory
   33 | #include <nnvm/node.h>
      |          ^~~~~~~~~~~~~
compilation terminated.
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:102: CMakeFiles/mxnet_static.dir/src/api/operator/utils.cc.o] Error 1
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:76: CMakeFiles/mxnet_static.dir/src/api/operator/numpy/np_init_op.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/src/c_api/c_api_executor.cc:25:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/base.h:35:10: fatal error: nnvm/op.h: No such file or directory
   35 | #include "nnvm/op.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:154: CMakeFiles/mxnet_static.dir/src/c_api/c_api_executor.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/src/base.cc:25:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/base.h:35:10: fatal error: nnvm/op.h: No such file or directory
   35 | #include "nnvm/op.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:115: CMakeFiles/mxnet_static.dir/src/base.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/src/c_api/c_api_ndarray.cc:26:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/base.h:35:10: fatal error: nnvm/op.h: No such file or directory
   35 | #include "nnvm/op.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:180: CMakeFiles/mxnet_static.dir/src/c_api/c_api_ndarray.cc.o] Error 1
In file included from /home/user/mxnet1.8/mxnet1.8_real/src/c_api/c_api.cc:39:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/base.h:35:10: fatal error: nnvm/op.h: No such file or directory
   35 | #include "nnvm/op.h"
      |          ^~~~~~~~~~~
compilation terminated.
In file included from /home/user/mxnet1.8/mxnet1.8_real/src/c_api/c_api_function.cc:26:
/home/user/mxnet1.8/mxnet1.8_real/include/mxnet/base.h:35:10: fatal error: nnvm/op.h: No such file or directory
   35 | #include "nnvm/op.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:128: CMakeFiles/mxnet_static.dir/src/c_api/c_api.cc.o] Error 1
make[2]: *** [CMakeFiles/mxnet_static.dir/build.make:167: CMakeFiles/mxnet_static.dir/src/c_api/c_api_function.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:469: CMakeFiles/mxnet_static.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
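
A log like the one above is easier to read once it is reduced to the distinct missing headers. This is a small sketch, assuming GCC's usual "fatal error: <header>: No such file or directory" wording. The helper name missing_headers is hypothetical.

```shell
#!/usr/bin/env bash
# missing_headers [LOGFILE...]
# Print each header that the compiler reported as missing, once.
# Reads the named files, or stdin when no file is given.
missing_headers() {
  sed -n -E \
    's/.*fatal error: ([^:]+): No such file or directory.*/\1/p' "$@" |
    sort -u
}
```

For the portion of the log shown above, `missing_headers build.log` reduces the output to nnvm/c_api.h, nnvm/node.h and nnvm/op.h, which points to one missing include directory rather than many unrelated failures.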

@TristonC
Contributor

@lgg Did you git clone with --recursive? It seems some source files are missing from your local repo.
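
One way to check this is to look at the prefix characters in the output of git submodule status: a leading "-" marks a submodule that was never initialized. This is a minimal sketch. The helper name list_uninitialized_submodules is hypothetical.

```shell
#!/usr/bin/env bash
# list_uninitialized_submodules
# Read `git submodule status --recursive` output on stdin and print the
# paths of submodules whose line starts with "-" (not initialized).
list_uninitialized_submodules() {
  # Field 1 is the prefixed commit hash, field 2 is the submodule path.
  awk '/^-/ { print $2 }'
}
```

Inside the repository, `git submodule status --recursive | list_uninitialized_submodules` prints nothing when every submodule is checked out. If it does print paths, `git submodule update --init --recursive` fetches the missing ones.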

@lgg
Author

lgg commented Mar 16, 2021

@TristonC yes. I will try again from scratch now.

In my folder from the previous attempt:

f.golovin@ml-dev:~/mxnet1.8/mxnet1.8_real$ git status
On branch v1.8.x
Your branch is up to date with 'origin/v1.8.x'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   .gitmodules
	modified:   3rdparty/openmp

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   3rdparty/ps-lite (new commits, modified content)
	modified:   3rdparty/tvm (new commits, modified content)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	include/mkldnn/oneapi/

@lgg
Author

lgg commented Mar 16, 2021

@TristonC thank you! Your answer helped!

@lgg lgg closed this as completed Mar 16, 2021