
[Numpy] Gradient of np.prod is wrong when there are zeros in the array #18078

Open
ghost opened this issue Apr 16, 2020 · 8 comments

@ghost commented Apr 16, 2020

Description

I investigated the MXNet C++ code and noticed that the gradient of the input array [[x, y]] is computed as [[(x*y)/x, (x*y)/y]]. Therefore, if y is zero, the second element of the gradient is always nan (0/0). Is this a bug?
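
For illustration, here is a minimal plain-NumPy sketch of that divide-by-the-input trick (it only mimics the computation, it is not the actual MXNet kernel):

import numpy as np

a = np.array([4.0, 0.0])
out = np.prod(a)  # forward result: 0.0
with np.errstate(invalid="ignore"):
    grad = out / a  # the trick: (x*y)/x and (x*y)/y
print(grad)  # [ 0. nan] -- 0/0 produces nan for the zero entry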

Error Message

No error message

To Reproduce

This is the Python program I ran:

import mxnet as mx

# Product over axis 1 of a 1x2 input.
sym = mx.symbol.prod(data=mx.symbol.Variable("in"), axis=1)
# Bind the input [[4, 0]] together with an array that will hold its gradient.
executor = sym.bind(mx.cpu(0), {"in": mx.nd.array([[4, 0]])}, {"in": mx.nd.array([[9, 9]])})
out = executor.forward()[0]
executor.backward(mx.nd.ones(out.shape))
print('-- Gradient:')
print(executor.grad_dict['in'])

I get this output:

-- Gradient:
[[ 0. nan]]
<NDArray 1x2 @cpu(0)>

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python


----------Python Info----------
Version      : 3.7.7
Compiler     : Clang 11.0.0 (clang-1100.0.33.17)
Build        : ('default', 'Mar 10 2020 15:43:03')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /usr/local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /Users/zibi22/MXNet-1.6.0/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-18.7.0-x86_64-i386-64bit
system       : Darwin
node         : zibi22maclap
release      : 18.7.0
version      : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0756 sec, LOAD: 0.9306 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0006 sec, LOAD: 0.8133 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3825 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0006 sec, LOAD: 0.1266 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0008 sec, LOAD: 0.1243 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0009 sec, LOAD: 0.3131 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0731 sec, LOAD: 0.9831 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.00061798095703125 sec.
@ghost ghost added the Bug label Apr 16, 2020
@sxjscience (Member)

@zleyk22 Yes, it's a bug, and it also shows up through the numpy (mx.np) interface:

import mxnet as mx
mx.npx.set_np()
a = mx.np.array([1,0])
a.attach_grad()
with mx.autograd.record():
    b = mx.np.prod(a)
    b.backward()
print(a.grad)

Output:

[ 0. nan]
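
(For reference, the mathematically correct gradient here is [0. 1.], since each entry should be the product of the other elements.)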

@ghost (Author) commented Apr 16, 2020 via email

@sxjscience (Member)

@zleyk22 I think no one is currently looking at this issue. Would you like to try to solve it? Thanks for pointing out the problem!

@ghost (Author) commented Apr 16, 2020 via email

@sxjscience sxjscience assigned ghost Apr 16, 2020
@sxjscience (Member)

@zleyk22 I can think of two possible ways to solve this problem:

  1. Use two cumulative sums of logs:
    • $[\log a_0,\ \log a_0 + \log a_1,\ \ldots,\ \log a_0 + \log a_1 + \cdots + \log a_{n-3}]$
    • $[\log a_2 + \log a_3 + \cdots + \log a_{n-1},\ \log a_3 + \cdots + \log a_{n-1},\ \ldots,\ \log a_{n-1}]$
      Then sum these two cumulative sums elementwise. Here I think we should take the log-sum approach to avoid the overflow/underflow problem of multiplying many numbers. (This is also the algorithm used to solve https://leetcode.com/problems/product-of-array-except-self/; a sketch of the prefix/suffix-product idea is given after this list.)
  2. Detect zeros and give them special treatment: find the positions of the zeros and overwrite the gradient at those positions with the correct value.
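
For concreteness, here is a minimal NumPy sketch of the prefix/suffix-product idea behind option 1 (written with plain products rather than log-sums to keep it short; prod_grad is an illustrative helper, not MXNet code):

import numpy as np

def prod_grad(a, out_grad=1.0):
    # Gradient of np.prod(a) w.r.t. a: each entry is the product of all
    # the other entries, computed without ever dividing by a[i].
    n = a.shape[0]
    prefix = np.ones(n)  # prefix[i] = a[0] * ... * a[i-1]
    suffix = np.ones(n)  # suffix[i] = a[i+1] * ... * a[n-1]
    for i in range(1, n):
        prefix[i] = prefix[i - 1] * a[i - 1]
        suffix[n - 1 - i] = suffix[n - i] * a[n - i]
    return out_grad * prefix * suffix

print(prod_grad(np.array([4.0, 0.0])))  # [0. 4.] instead of [0. nan]
print(prod_grad(np.array([1.0, 0.0])))  # [0. 1.]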

@sxjscience sxjscience changed the title A possible bug when computing a gradient vector [Numpy] Gradient of np.prod is wrong when there are zeros in the array Apr 16, 2020
@leezu leezu added the v2.0 label May 7, 2020
@yzhliu (Member) commented May 13, 2020

@zleyk22 Are you working on this?

@ghost (Author) commented May 13, 2020 via email

@matteosal (Contributor) commented Jun 2, 2020

@yzhliu
@zleyk22 and I spent some time investigating this and managed to track it down to this line.
OP::Map(data[i], DType(out[out_idx])) computes DType(out[out_idx]) / data[i], which returns nan when data[i] is zero (our input is mx.nd.array([[4, 0]])).

The logic behind this computation is that (unless zeros are involved) dividing x*y*z by e.g. x gives you y*z, which is the desired gradient. But this trick doesn't work when zeros are involved; in that case, the code should instead recompute the product of all the other elements, leaving x out.
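
For illustration, this is roughly what "recompute the product leaving x out" means in plain NumPy (an illustrative sketch only, not the MXNet kernel):

import numpy as np

a = np.array([4.0, 0.0])
# Gradient of prod(a): for each position, the product of all the other entries.
grad = np.array([np.prod(np.delete(a, i)) for i in range(a.size)])
print(grad)  # [0. 4.], whereas the divide-by-input trick gives [0. nan]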

But it seems that this infrastructure is quite rigid and only allows reusing the previously computed output together with a single input element at a time. So fixing this bug would apparently require some changes to this system, which is shared by more than one operator. Given our limited familiarity with the codebase, we don't think we can proceed further without a bit of guidance and help.

Can you suggest a way forward? Thanks!
