MultiStepParametricLIFNode cupy backend bug #151

Closed

fangwei123456 opened this issue Dec 10, 2021 · 7 comments
Labels
bug Something isn't working

Comments

fangwei123456 (Owner) commented Dec 10, 2021

grad_reciprocal_tau[0] = sdata[0];

Shared memory is allocated per thread block, so all threads in a block have access to the same shared memory.

This reduction only sums over the threads within each block and never reduces over blocks.

We did not find this bug earlier because we compared gradients between the cupy and torch backends with a small number of neurons, so the grid contained only one block.
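The failure mode can be seen from a minimal sketch of this kind of reduction (hypothetical kernel and buffer names, not the actual SpikingJelly code): writing sdata[0] straight into grad_reciprocal_tau[0] keeps only one block's partial sum whenever gridDim.x > 1, whereas an atomicAdd (or a second reduction pass over the per-block results) accumulates over all blocks.

// Sketch only: each block reduces its own partial sums in shared memory.
__global__ void reduce_grad_reciprocal_tau(const float* grad_per_neuron,
                                           float* grad_reciprocal_tau,
                                           const int neuron_num)
{
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const int idx = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (idx < neuron_num) ? grad_per_neuron[idx] : 0.0f;
    __syncthreads();

    // Tree reduction over the threads of this block only.
    for (int s = blockDim.x >> 1; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        // Buggy pattern: grad_reciprocal_tau[0] = sdata[0];
        // keeps only one block's partial sum when gridDim.x > 1.
        // Accumulating atomically sums over all blocks:
        atomicAdd(grad_reciprocal_tau, sdata[0]);
    }
}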

from spikingjelly.clock_driven import surrogate, neuron, neuron_kernel
import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)

# neuron_kernel.save_cuda_codes()


def check_multi_step_neuron_output_and_grad(device, multi_step_neuron, *neu_args, **neu_kwargs):
    @torch.no_grad()
    def max_error(x, y):
        return (x - y).abs().max().item()

    def fbptt(m, x: torch.Tensor):
        x = x.detach()
        x.requires_grad_(True)
        m(x)
        (m.spike_seq * m.v_seq ** 2).sum().backward()
        ret = {
            'spike_seq': m.spike_seq.detach().clone(),
            'v_seq': m.v_seq.detach().clone(),
            'x.grad': x.grad.clone()
        }
        for i, param in enumerate(m.parameters()):
            ret[f'param_{i}.grad'] = param.grad.detach().clone()
            param.grad.zero_()
        x.grad.zero_()
        m.reset()
        return ret

    shape = [63, 127]
    for hard_reset in [True, False]:
        for detach_reset in [False, True]:
            x = (torch.rand(shape, device=device) - 0.5) * 3.
            for dtype in ['fp32', 'fp16']:
                if dtype == 'fp32':
                    x = x.float()
                if dtype == 'fp16':
                    x = x.half()
                print(f'hard_reset={hard_reset}, detach_reset={detach_reset}, dtype={dtype}')
                model = multi_step_neuron(v_reset=0. if hard_reset else None, detach_reset=detach_reset, *neu_args,
                                          **neu_kwargs)
                # print(model)
                model.to(device)
                model.backend = 'torch'
                y_torch = fbptt(model, x)

                model.backend = 'cupy'
                y_cupy = fbptt(model, x)

                for key in y_torch.keys():
                    # if key == 'spike_seq' and max_error(y_torch[key], y_cupy[key]) == 1.:
                    #     err = y_torch['v_seq'] - y_cupy['v_seq']
                    #     print(err)

                    print(key, 'max error', max_error(y_torch[key], y_cupy[key]))
                print('\n')

device = 'cuda:0'
print('Sigmoid sg')
check_multi_step_neuron_output_and_grad(device, neuron.MultiStepParametricLIFNode, surrogate_function=surrogate.Sigmoid(), init_tau=1.9)

When we change shape = [63, 127] to shape = [63, 4097], so that the 4097 neurons per time step span more than one thread block, the gradient error appears:

hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 1.7881393432617188e-07
x.grad max error 5.960464477539063e-08
param_0.grad max error 382.3267822265625
fangwei123456 added the bug (Something isn't working) label Dec 10, 2021
Yanqi-Chen (Collaborator) commented:

A warning should be added to tell users to avoid MultiStepParametricLIFNode in previous versions.

fangwei123456 (Owner, Author) commented:

A warning should be added to tell users to avoid MultiStepParametricLIFNode in previous versions.

I will add a "bug list" to the readme to document previous bugs.

fangwei123456 added a commit that referenced this issue Dec 10, 2021
fangwei123456 (Owner, Author) commented:

In the current version:

Sigmoid sg
hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 1.7881393432617188e-07
x.grad max error 5.960464477539063e-08
param_0.grad max error 0.0

fangwei123456 (Owner, Author) commented:

In the current version https://github.com/fangwei123456/spikingjelly/tree/4381767d0a09c2dc6f66537b68f461222b6a795e :

from spikingjelly.clock_driven import neuron_kernel, neuron
device = 'cuda:0'
neuron_kernel.check_multi_step_neuron_output_and_grad(device, neuron.MultiStepParametricLIFNode)

The gradients of fp16 are wrong:

hard_reset=True, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.00390625


hard_reset=True, detach_reset=False, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf


hard_reset=True, detach_reset=True, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.00390625


hard_reset=True, detach_reset=True, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf


hard_reset=False, detach_reset=False, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 8.940696716308594e-08
param_0.grad max error 0.0


hard_reset=False, detach_reset=False, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.000732421875
param_0.grad max error inf


hard_reset=False, detach_reset=True, dtype=fp32
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 1.1920928955078125e-07
param_0.grad max error 0.0078125


hard_reset=False, detach_reset=True, dtype=fp16
spike_seq max error 0.0
v_seq max error 0.0
x.grad max error 0.0009765625
param_0.grad max error inf

fangwei123456 reopened this Dec 28, 2021
fangwei123456 (Owner, Author) commented:

The first problem is:

const int stride = neuron_num >> 1;

for (int stride = threadx >> 1; stride > 0; stride = stride >> 1)

The loop variable shadows the outer constant, so the second 'stride' should be renamed.
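A sketch of the shadowing problem (the surrounding code is paraphrased, not the actual kernel): the reduction loop declares its own stride, so every use of stride inside the loop body silently refers to the loop counter rather than the outer constant.

// Sketch only.
const int stride = neuron_num >> 1;  // outer constant used for indexing

for (int stride = threadx >> 1; stride > 0; stride = stride >> 1)
{
    // Inside this body, `stride` is the loop counter; the outer
    // constant can no longer be referenced.
}

// Renaming the loop variable removes the shadowing:
for (int s = threadx >> 1; s > 0; s >>= 1)
{
    // Here `stride` still refers to the outer constant.
}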

fangwei123456 (Owner, Author) commented:

I find that this problem is caused by too many neurons: the accumulated gradients exceed the range of half precision (fp16).
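For illustration, a minimal CUDA sketch (hypothetical, not the SpikingJelly kernel, single-threaded loops for clarity) of why a long half-precision accumulation can return inf, and the usual remedy of accumulating in float:

#include <cuda_fp16.h>

// Naive accumulation in half: once the running sum passes 65504
// (the largest finite fp16 value) it overflows to inf.
__global__ void sum_half_naive(const __half* grad, __half* out, const int n)
{
    __half acc = __float2half(0.0f);
    for (int i = 0; i < n; i++)
    {
        acc = __hadd(acc, grad[i]);
    }
    out[0] = acc;
}

// Safer: accumulate in float and cast back to half at the end.
__global__ void sum_half_fp32_acc(const __half* grad, __half* out, const int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
    {
        acc += __half2float(grad[i]);
    }
    out[0] = __float2half(acc);
}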

fangwei123456 (Owner, Author) commented:

29b1bb3
