`NDArray.clip()` is very slow in imperative execution on GPU.

## Description
`NDArray.clip()` is very slow in imperative execution on GPU (about 3x slower than ReLU).
More details are given below.

## Environment info (Required)

```
----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.4.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit hash file "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform     : Linux-4.4.0-1092-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-XXX-XX-X-XXX
release      : 4.4.0-1092-aws
version      : #103-Ubuntu SMP Tue Aug 27 10:21:48 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.08
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.5233 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1289 sec, LOAD: 0.4413 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2271 sec, LOAD: 0.5561 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0098 sec, LOAD: 0.4055 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0145 sec, LOAD: 0.3227 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0135 sec, LOAD: 0.0799 sec.
----------Environment----------


```

I'm using Python

## Build info (Required if built from source)
N/A

## Error Message:
Running GluonCV resnet18_v2 on ImageNet:
- Imperative, with `ndarray.clip(0,6)` as activation: throughput ~900 samples/sec.
- Imperative, with ReLU activation (original version): throughput ~3000 samples/sec.
- Hybrid, with ReLU activation (original version): throughput ~3000 samples/sec.
- Hybrid, with `ndarray.clip(0,6)` as activation: throughput ~3000 samples/sec.

Only the imperative `clip` configuration is slow: it is about 3x slower than each of the other three.


## Minimum reproducible example / Steps to reproduce
1. Start an AWS p3.8xlarge with Deep Learning AMI (Ubuntu) Version 24.1 machine
2. Activate mxnet env: `source activate mxnet_p36`
3. Install gluoncv: `pip install gluoncv`
4. Download train_imagenet.py from gluoncv: https://gluon-cv.mxnet.io/_downloads/3bb06a6d6d085b1bb501b30aaf6c21c5/train_imagenet.py (source: https://gluon-cv.mxnet.io/model_zoo/classification.html#imagenet )
5. In /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluoncv/model_zoo/resnet.py , modify line 257 ( https://github.com/dmlc/gluon-cv/blob/745ed855d769534eb2e23f0c136cd5f1bc9b60b7/gluoncv/model_zoo/resnet.py#L257 ): replace `x = F.Activation(x, act_type='relu')` with `x = x.clip(a_min=0, a_max=6)`
6. Run:
`python train_imagenet.py --rec-train /home/ubuntu/path/to/train.rec --rec-train-idx /home/ubuntu/path/to/train.idx --rec-val /home/ubuntu/path/to/val.rec --rec-val-idx /home/ubuntu/path/to/val.idx --model resnet18_v2 --mode imperative --lr 0.4 --lr-mode cosine --num-epochs 120 --batch-size 256 --num-gpus 4 -j 30 --warmup-epochs 5 --use-rec --save-dir params_resnet18_v2`
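
The steps above need ImageNet and a multi-GPU machine. A smaller check that might isolate the operator is sketched below. It is not part of the original measurements and has not been confirmed to reproduce the slowdown. `time_op` and `compare_clip_vs_relu` are hypothetical helper names, the tensor shape is an arbitrary assumption, and a GPU build of MXNet 1.x is assumed. It times a forward and backward pass in imperative mode for each activation, calling `mx.nd.waitall()` so that asynchronously queued GPU work is included in the timing.

```python
import time


def time_op(op, sync, n_warmup=10, n_iter=100):
    """Return mean wall-clock seconds per call of op().

    sync() is called before starting and before stopping the timer so that
    asynchronously queued work is finished when the clock is read.
    """
    for _ in range(n_warmup):
        op()
    sync()
    start = time.perf_counter()
    for _ in range(n_iter):
        op()
    sync()
    return (time.perf_counter() - start) / n_iter


def compare_clip_vs_relu(shape=(256, 64, 56, 56), n_iter=100):
    """Time ReLU vs. clip(0, 6), forward + backward, imperatively on GPU 0.

    The shape is an assumed activation-map size, not taken from the report.
    """
    import mxnet as mx  # imported here so time_op stays dependency-free

    x = mx.nd.random.uniform(-1, 7, shape=shape, ctx=mx.gpu(0))
    x.attach_grad()

    def step(activation):
        with mx.autograd.record():
            y = activation(x)
        y.backward()

    t_relu = time_op(lambda: step(lambda a: mx.nd.Activation(a, act_type='relu')),
                     mx.nd.waitall, n_iter=n_iter)
    t_clip = time_op(lambda: step(lambda a: a.clip(a_min=0, a_max=6)),
                     mx.nd.waitall, n_iter=n_iter)
    return t_relu, t_clip
```

Calling `compare_clip_vs_relu()` returns the two mean times in seconds; a `t_clip` much larger than `t_relu` would point at the operator itself rather than at the training script.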


## What have you tried to solve it?
N/A

Might be related to #11683
