This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

NDArray.clip() is very slow in imperative execution on GPU. #16220

Open
@igolan

Description

NDArray.clip() is very slow in imperative execution on GPU (about 3x slower than ReLU).
Details are below.

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.4.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit hash file "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform     : Linux-4.4.0-1092-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-XXX-XX-X-XXX
release      : 4.4.0-1092-aws
version      : #103-Ubuntu SMP Tue Aug 27 10:21:48 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.08
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.5233 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1289 sec, LOAD: 0.4413 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2271 sec, LOAD: 0.5561 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0098 sec, LOAD: 0.4055 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0145 sec, LOAD: 0.3227 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0135 sec, LOAD: 0.0799 sec.
----------Environment----------


I'm using Python

Build info (Required if built from source)

N/A

Error Message:

There is no error message; the problem is throughput. Training GluonCV resnet18_v2 on ImageNet:
Imperative, with ndarray.clip(0,6) as activation: throughput ~900 samples/sec.
This is about 3x slower than each of the other three configurations:
Imperative, with ReLU activation (original version): throughput ~3000 samples/sec.
Hybrid, with ReLU activation (original version): throughput ~3000 samples/sec.
Hybrid, with ndarray.clip(0,6) as activation: throughput ~3000 samples/sec.
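
To make the comparison explicit, here is a small sketch that computes the slowdown of each configuration relative to imperative ReLU. The numbers are the approximate throughputs quoted above; the dictionary layout and the `slowdown` helper are only illustrative.

```python
# Approximate throughputs quoted in this report (samples/sec).
throughput = {
    ("imperative", "clip"): 900,
    ("imperative", "relu"): 3000,
    ("hybrid", "relu"): 3000,
    ("hybrid", "clip"): 3000,
}

# Imperative ReLU is the unmodified baseline.
baseline = throughput[("imperative", "relu")]


def slowdown(config):
    """Return how many times slower `config` is than imperative ReLU."""
    return baseline / throughput[config]


for config, rate in throughput.items():
    mode, act = config
    print(f"{mode:<10} {act:<4} {rate:>5} samples/sec  slowdown x{slowdown(config):.2f}")
```

Only the imperative clip configuration stands out (roughly x3.33); hybrid clip matches the ReLU baseline, which suggests the cost is tied to imperative execution rather than to the clip arithmetic itself.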

Minimum reproducible example / Steps to reproduce

  1. Start an AWS p3.8xlarge with Deep Learning AMI (Ubuntu) Version 24.1 machine
  2. Activate mxnet env: source activate mxnet_p36
  3. Install gluoncv: pip install gluoncv
  4. Download train_imagenet.py from gluoncv: https://gluon-cv.mxnet.io/_downloads/3bb06a6d6d085b1bb501b30aaf6c21c5/train_imagenet.py (source: https://gluon-cv.mxnet.io/model_zoo/classification.html#imagenet )
  5. Edit /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluoncv/model_zoo/resnet.py at line 257 ( https://github.com/dmlc/gluon-cv/blob/745ed855d769534eb2e23f0c136cd5f1bc9b60b7/gluoncv/model_zoo/resnet.py#L257 ): replace x = F.Activation(x, act_type='relu') with x = x.clip(a_min=0, a_max=6)
  6. Run:
    python train_imagenet.py --rec-train /home/ubuntu/path/to/train.rec --rec-train-idx /home/ubuntu/path/to/train.idx --rec-val /home/ubuntu/path/to/val.rec --rec-val-idx /home/ubuntu/path/to/val.idx --model resnet18_v2 --mode imperative --lr 0.4 --lr-mode cosine --num-epochs 120 --batch-size 256 --num-gpus 4 -j 30 --warmup-epochs 5 --use-rec --save-dir params_resnet18_v2
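
The full training run above needs ImageNet and four GPUs. A smaller check that might help isolate the operator is to time `clip` against `relu` directly in imperative mode. The following is a minimal sketch, not a confirmed reproduction: the tensor shape, iteration counts, and the `time_op` helper are assumptions, it measures the forward pass only (no autograd recording), and it may not show the same gap as the training script.

```python
import time

try:
    import mxnet as mx
except ImportError:  # lets the timing helper be used without MXNet installed
    mx = None


def time_op(op, sync, warmup=10, iters=50):
    """Average seconds per call of `op`; `sync` blocks until queued work is done."""
    for _ in range(warmup):
        op()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        op()
    sync()  # MXNet executes asynchronously, so wait before stopping the clock
    return (time.perf_counter() - start) / iters


def main():
    if mx is None:
        print("MXNet is not installed; skipping the benchmark.")
        return
    # Use the first GPU if one is visible, otherwise fall back to CPU.
    ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
    # Hypothetical activation-sized tensor; the shape is an assumption.
    x = mx.nd.random.uniform(-10, 10, shape=(64, 64, 56, 56), ctx=ctx)

    t_clip = time_op(lambda: x.clip(a_min=0, a_max=6), mx.nd.waitall)
    t_relu = time_op(lambda: mx.nd.relu(x), mx.nd.waitall)

    print(f"context: {ctx}")
    print(f"clip: {t_clip * 1e3:.3f} ms/call")
    print(f"relu: {t_relu * 1e3:.3f} ms/call")
    print(f"clip / relu: {t_clip / t_relu:.2f}")


if __name__ == "__main__":
    main()
```

If the forward-only timings come out similar, the slowdown may lie in the backward pass or in per-call overhead during training, in which case the same helper could be wrapped around a forward and backward step under `mx.autograd.record()`.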

What have you tried to solve it?

N/A

Might be related to #11683
