MultiStepParametricLIFNode cupy backend bug #151

Closed

fangwei123456 opened this issue Dec 10, 2021 · 7 comments
Labels
bug Something isn't working

Comments

fangwei123456 (Owner) commented Dec 10, 2021

grad_reciprocal_tau[0] = sdata[0];

Shared memory is allocated per thread block, so all threads in a block have access to the same shared memory.

This reduction only sums over the threads within each block and never reduces over blocks.

We did not find this bug earlier because we compared gradients between the cupy and torch backends with a small number of neurons, so the grid contained only one block.
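The failure mode can be seen from a minimal sketch of this kind of reduction (hypothetical kernel and buffer names, not the actual SpikingJelly code): writing sdata[0] straight into grad_reciprocal_tau[0] keeps only one block's partial sum whenever gridDim.x > 1, whereas an atomicAdd (or a second reduction pass over the per-block results) accumulates over all blocks.

// Sketch only: each block reduces its own partial sums in shared memory.
__global__ void reduce_grad_reciprocal_tau(const float* grad_per_neuron,
                                           float* grad_reciprocal_tau,
                                           const int neuron_num)
{
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const int idx = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (idx < neuron_num) ? grad_per_neuron[idx] : 0.0f;
    __syncthreads();

    // Tree reduction over the threads of this block only.
    for (int s = blockDim.x >> 1; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        // Buggy pattern: grad_reciprocal_tau[0] = sdata[0];
        // keeps only one block's partial sum when gridDim.x > 1.
        // Accumulating atomically sums over all blocks:
        atomicAdd(grad_reciprocal_tau, sdata[0]);
    }
}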

from spikingjelly.clock_driven import surrogate, neuron, neuron_kernel
import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)

# neuron_kernel.save_cuda_codes()


def check_multi_step_neuron_output_and_grad(device, multi_step_neuron, *neu_args, **neu_kwargs):
    @torch.no_grad()
    def max_error(x, y):
        return (x - y).abs().max().item()

    def fbptt(m, x: torch.Tensor):
        x = x.detach()
        x.requires_grad_(True)
        m(x)
        (m.spike_seq * m.v_seq ** 2).sum().backward()
        ret = {
            'spike_seq': m.spike_seq.detach().clone(),
            'v_seq': m.v_seq.detach().clone(),
            'x.grad': x.grad.clone()
        }
        for i, param in enumerate(m.parameters()):
            ret[f'param_{i}.grad'] = param.grad.detach().clone()
            param.grad.zero_()
        x.grad.zero_()
        m.reset()
        return ret

    shape = [63, 127]
    for hard_reset in [True, False]:
        for detach_reset in [False, True]:
            x = (torch.rand(shape, device=device) - 0.5) * 3.
            for dtype in ['fp32', 'fp16']:
                if dtype == 'fp32':
                    x = x.float()
                if dtype == 'fp16':
                    x = x.half()
                print(f'hard_reset={hard_reset}, detach_reset={detach_reset}, dtype={dtype}')
                model = multi_step_neuron(v_reset=0. if hard_reset else None, detach_reset=detach_reset, *neu_args,
                                          **neu_kwargs)
                # print(model)
                model.to(device)
                model.backend = 'torch'
                y_torch = fbptt(model, x)

                model.backend = 'cupy'
                y_cupy = fbptt(model, x)

                for key in y_torch.keys():
                    # if key == 'spike_seq' and max_error(y_torch[key], y_cupy[key]) == 1.:
                    #     err = y_torch['v_seq'] - y_cupy['v_seq']
                    #     print(err)

                    print(key, 'max error', max_error(y_torch[key], y_cupy[key]))
                print('\n')

device = 'cuda:0'
print('Sigmoid sg')
check_multi_step_neuron_output_and_grad(device, neuron.MultiStepParametricLIFNode, surrogate_function=surrogate.Sigmoid(), init_tau=1.9)

When we change shape = [63, 127] to shape = [63, 4097], so that the 4097 neurons per time step span more than one thread block, the gradient error appears:

hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 1.7881393432617188e-07
x.grad max error 5.960464477539063e-08
param_0.grad max error 382.3267822265625
fangwei123456 added the bug (Something isn't working) label Dec 10, 2021
Yanqi-Chen (Collaborator) commented:

A warning should be added to tell users to avoid MultiStepParametricLIFNode in previous versions.

fangwei123456 (Owner, Author) commented:

A warning should be added to tell users to avoid MultiStepParametricLIFNode in previous versions.

I will add a "bug list" to the readme to document previous bugs.

fangwei123456 added a commit that referenced this issue Dec 10, 2021
fangwei123456 (Owner, Author) commented:

In the current version:

Sigmoid sg
hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 1.7881393432617188e-07
x.grad max error 5.960464477539063e-08
param_0.grad max error 0.0

fangwei123456 (Owner, Author) commented:

In the current version https://github.com/fangwei123456/spikingjelly/tree/4381767d0a09c2dc6f66537b68f461222b6a795e :

from spikingjelly.clock_driven import neuron_kernel, neuron
device = 'cuda:0'
neuron_kernel.check_multi_step_neuron_output_and_grad(device, neuron.MultiStepParametricLIFNode)

The gradients of fp16 are wrong:

hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.00390625


hard_reset=True, detach_reset=False, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf


hard_reset=True, detach_reset=True, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.00390625


hard_reset=True, detach_reset=True, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf


hard_reset=False, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 8.940696716308594e-08
param_0.grad max error 0.0


hard_reset=False, detach_reset=False, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.000732421875
param_0.grad max error inf


hard_reset=False, detach_reset=True, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.0078125


hard_reset=False, detach_reset=True, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf

fangwei123456 reopened this Dec 28, 2021
fangwei123456 (Owner, Author) commented:

The first problem is:

const int stride = neuron_num >> 1;

for (int stride = threadx >> 1; stride > 0; stride = stride >> 1)

The loop variable shadows the outer constant, so the second 'stride' should be renamed.
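A sketch of the shadowing problem (the surrounding code is paraphrased, not the actual kernel): the reduction loop declares its own stride, so every use of stride inside the loop body silently refers to the loop counter rather than the outer constant.

// Sketch only.
const int stride = neuron_num >> 1;  // outer constant used for indexing

for (int stride = threadx >> 1; stride > 0; stride = stride >> 1)
{
    // Inside this body, `stride` is the loop counter; the outer
    // constant can no longer be referenced.
}

// Renaming the loop variable removes the shadowing:
for (int s = threadx >> 1; s > 0; s >>= 1)
{
    // Here `stride` still refers to the outer constant.
}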

fangwei123456 (Owner, Author) commented:

I find that this problem is caused by too many neurons: the accumulated gradients exceed the range of half precision (fp16).
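For illustration, a minimal CUDA sketch (hypothetical, not the SpikingJelly kernel, single-threaded loops for clarity) of why a long half-precision accumulation can return inf, and the usual remedy of accumulating in float:

#include <cuda_fp16.h>

// Naive accumulation in half: once the running sum passes 65504
// (the largest finite fp16 value) it overflows to inf.
__global__ void sum_half_naive(const __half* grad, __half* out, const int n)
{
    __half acc = __float2half(0.0f);
    for (int i = 0; i < n; i++)
    {
        acc = __hadd(acc, grad[i]);
    }
    out[0] = acc;
}

// Safer: accumulate in float and cast back to half at the end.
__global__ void sum_half_fp32_acc(const __half* grad, __half* out, const int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
    {
        acc += __half2float(grad[i]);
    }
    out[0] = __float2half(acc);
}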

fangwei123456 (Owner, Author) commented:

29b1bb3
