This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Open
Description
NDArray.clip() is very slow in imperative execution on GPU (roughly 3x slower than ReLU).
More details below.
Environment info (Required)
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 10.0.1
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.4.1
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit hash file "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform : Linux-4.4.0-1092-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-XXX-XX-X-XXX
release : 4.4.0-1092-aws
version : #103-Ubuntu SMP Tue Aug 27 10:21:48 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.984
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.08
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.5233 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1289 sec, LOAD: 0.4413 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2271 sec, LOAD: 0.5561 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0098 sec, LOAD: 0.4055 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0145 sec, LOAD: 0.3227 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0135 sec, LOAD: 0.0799 sec.
----------Environment----------
I'm using Python
Build info (Required if built from source)
N/A
Error Message:
Running GluonCV resnet18_v2 on ImageNet:
- Imperative, with ndarray.clip(0,6) as activation: throughput ~900 samples/sec.
This is ~3x slower than each of the following:
- Imperative, with ReLU activation (original version): throughput ~3000 samples/sec.
- Hybrid, with ReLU activation (original version): throughput ~3000 samples/sec.
- Hybrid, with ndarray.clip(0,6) as activation: throughput ~3000 samples/sec.
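For context, clip(0, 6) used as an activation computes min(max(x, 0), 6), often called ReLU6, while ReLU computes max(x, 0). Both are simple elementwise operations, which is why the gap above is surprising. A plain-Python sketch of the two follows; the function names are illustrative only and are not MXNet APIs.

```python
def relu(x):
    # ReLU: negative inputs become 0, everything else passes through.
    return max(x, 0.0)


def clip_0_6(x):
    # clip(0, 6): same as ReLU, but additionally capped at 6.
    return min(max(x, 0.0), 6.0)


values = [-2.0, 0.5, 6.0, 9.0]
print([relu(v) for v in values])      # [0.0, 0.5, 6.0, 9.0]
print([clip_0_6(v) for v in values])  # [0.0, 0.5, 6.0, 6.0]
```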
Minimum reproducible example / Steps to reproduce
- Start an AWS p3.8xlarge with Deep Learning AMI (Ubuntu) Version 24.1 machine
- Activate the mxnet env: source activate mxnet_p36
- Install gluoncv: pip install gluoncv
- Download train_imagenet.py from gluoncv: https://gluon-cv.mxnet.io/_downloads/3bb06a6d6d085b1bb501b30aaf6c21c5/train_imagenet.py (source: https://gluon-cv.mxnet.io/model_zoo/classification.html#imagenet )
- Modify line 257 ( https://github.com/dmlc/gluon-cv/blob/745ed855d769534eb2e23f0c136cd5f1bc9b60b7/gluoncv/model_zoo/resnet.py#L257 ) in /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluoncv/model_zoo/resnet.py, replacing
x = F.Activation(x, act_type='relu')
with
x = x.clip(a_min=0, a_max=6)
- Run:
python train_imagenet.py --rec-train /home/ubuntu/path/to/train.rec --rec-train-idx /home/ubuntu/path/to/train.idx --rec-val /home/ubuntu/path/to/val.rec --rec-val-idx /home/ubuntu/path/to/val.idx --model resnet18_v2 --mode imperative --lr 0.4 --lr-mode cosine --num-epochs 120 --batch-size 256 --num-gpus 4 -j 30 --warmup-epochs 5 --use-rec --save-dir params_resnet18_v2
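The full training run above needs the ImageNet record files. A smaller check might time the two activations in isolation. The sketch below is an assumption about how that could be done and is not part of the original report: `throughput` and `compare_clip_vs_relu` are hypothetical helper names, the tensor shape is arbitrary, and it has not been verified to reproduce the ~3x gap. It times forward plus backward, since imperative training runs both, and it calls `mx.nd.waitall()` so that asynchronously queued GPU work is included in the measurement.

```python
import time


def throughput(step, sync, batch_size, n_iter=50, n_warmup=5):
    """Return samples/sec for `step`; `sync` must block until queued work is done."""
    for _ in range(n_warmup):
        step()
    sync()  # make sure warm-up work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(n_iter):
        step()
    sync()  # include all asynchronously launched work in the timing
    elapsed = time.perf_counter() - start
    return batch_size * n_iter / elapsed


def compare_clip_vs_relu(batch_size=256, feature_shape=(64, 56, 56)):
    """Time relu vs. clip(0, 6) imperatively on GPU 0 (requires MXNet and a GPU)."""
    import mxnet as mx  # imported lazily so `throughput` has no hard dependency

    # Arbitrary activation-sized tensor; the shape is an assumption, not from the report.
    x = mx.nd.random.uniform(-10, 10, shape=(batch_size,) + feature_shape, ctx=mx.gpu(0))
    x.attach_grad()

    def make_step(activation):
        def step():
            with mx.autograd.record():
                y = activation(x)
            y.backward()  # the backward pass is part of imperative training too
        return step

    activations = {
        "relu": lambda t: mx.nd.relu(t),
        "clip(0, 6)": lambda t: t.clip(a_min=0, a_max=6),
    }
    return {
        name: throughput(make_step(fn), mx.nd.waitall, batch_size)
        for name, fn in activations.items()
    }
```

On a GPU machine, calling `compare_clip_vs_relu()` returns a dict mapping each activation name to its measured samples/sec. A large gap between the two entries would point at the operator itself rather than at the rest of the training script.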
What have you tried to solve it?
N/A
Might be related to #11683