
Conversation

Ishan-Kumar2 (Contributor)

Description:
In _test_create_supervised_trainer, a test for gradient_accumulation == 1 was added which computes the result using the trainer and compares it with the output of plain backprop.

While this works fine in most cases, it fails when AMP is turned on because of the precision drop when casting to fp16 and back to fp32. This causes the test to fail (see the linked CI failure).
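
For context, the size of the precision drop is easy to see outside the test (a quick illustration, not code from this PR): fp16 resolves only about three decimal digits, while the assertions compare values rounded to four decimal places after the autocast forward pass.

import torch

# fp16 machine epsilon is ~9.8e-4, i.e. roughly 3 decimal digits of precision,
# so errors picked up in the fp16 forward pass can flip the 4th decimal place
# that the test rounds to.
print(torch.finfo(torch.float16).eps)              # 0.0009765625
print(torch.tensor(0.1133).half().float().item())  # 0.11328125 after an fp16 round trip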

In this PR I have passed None as amp_mode so that the gradient_accumulation == 1 case is not tested with AMP.

The other alternative I can think of would be to change this test to work the same way as the gradient_accumulation != 1 case and use larger input values so that AMP precision does not affect the result as much.

@sdesrozis what do you think?

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

sdesrozis (Contributor) commented Oct 4, 2021

I'm quite surprised by the lack of precision here. In these tests, we don't need very high precision.

Anyway, we must check AMP, so I would prefer we don't disable the test.

Why not compute the results analytically in the gradient accumulation case, just as in the case without it? Removing the bias should help keep the code concise with the same coverage.
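
For a bias-free linear model the expected values are indeed easy to write down in closed form; a rough sketch of what the analytic reference could look like (illustrative constants, not the final test code):

# With y_hat = w * x, MSE loss and SGD(lr=0.1), one step on a single sample
# (x, y) gives:
#   loss  = (y - w * x) ** 2
#   w_new = w + 2 * lr * x * (y - w * x)
lr = 0.1
w = 0.0
for x, y in [(1.0, 1.0)] * 5:            # e.g. feeding identical (1.0, 1.0) pairs
    expected_loss = (y - w * x) ** 2
    w = w + 2 * lr * x * (y - w * x)     # expected weight after this step
    print(expected_loss, w)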

We need to fix this as fast as possible. I'll work on it when I'm available, unless you're done before then.

Ishan-Kumar2 (Contributor, Author)

@sdesrozis I removed the bias part, which does take care of the other failures. One failure still remains, in test_create_supervised_trainer_on_cuda_amp_scaler, which is odd because the trainer outputs the new weight as zero. That means no update is taking place, even though the loss is the same in both cases.

============================= test session starts ==============================
platform linux -- Python 3.7.12, pytest-4.3.1, py-1.10.0, pluggy-0.7.1
rootdir: /content/ignite, inifile: setup.cfg
plugins: typeguard-2.7.1, xdist-1.27.0, forked-1.3.0, cov-2.9.0
collected 24 items                                                             

tests/ignite/engine/test_create_supervised.py .........Fs..ss.........   [100%]

=================================== FAILURES ===================================
______________ test_create_supervised_trainer_on_cuda_amp_scaler _______________

    @pytest.mark.skipif(LooseVersion(torch.__version__) < LooseVersion("1.6.0"), reason="Skip if < 1.6.0")
    @pytest.mark.skipif(not torch.cuda.is_available(), reason="Skip if no GPU")
    def test_create_supervised_trainer_on_cuda_amp_scaler():
        model_device = trainer_device = "cuda"
        _test_create_supervised_trainer_wrong_accumulation(
            model_device=model_device, trainer_device=trainer_device, amp_mode="amp"
        )
    
        _test_create_supervised_trainer(
>           model_device=model_device, trainer_device=trainer_device, amp_mode="amp", scaler=True,
        )

tests/ignite/engine/test_create_supervised.py:443: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

gradient_accumulation_steps = 3, loss = 0.0045, weight = 0.054, bias = 0.1133
model_device = 'cuda', trainer_device = 'cuda', trace = False, amp_mode = 'amp'
scaler = True

    def _test_create_supervised_trainer(
        gradient_accumulation_steps=3,
        loss=0.0045,
        weight=0.0540,
        bias=0.1133,
        model_device: Optional[str] = None,
        trainer_device: Optional[str] = None,
        trace: bool = False,
        amp_mode: str = None,
        scaler: Union[bool, "torch.cuda.amp.GradScaler"] = False,
    ):
        trainer, model = _default_create_supervised_trainer(
            gradient_accumulation_steps=gradient_accumulation_steps,
            model_device=model_device,
            trainer_device=trainer_device,
            trace=trace,
            amp_mode=amp_mode,
            scaler=scaler,
        )
    
        x = torch.tensor([[0.1], [0.3], [0.7], [0.9], [1.3]])
        y = torch.tensor([[0.3], [0.5], [0.9], [1.3], [0.3]])
        data = [(_x, _y) for _x, _y in zip(x, y)]
    
        if model_device == trainer_device or ((model_device == "cpu") ^ (trainer_device == "cpu")):
            state = trainer.run(data)
    
            assert round(state.output[-1], 4) == loss, state.output[-1]
            assert round(model.weight.data[0, 0].item(), 4) == weight, model.weight.item()
            assert round(model.bias.item(), 4) == bias, model.bias.item()
    
            if amp_mode == "amp":
                assert state.output[0].dtype is torch.half
                if scaler and isinstance(scaler, bool):
                    assert hasattr(state, "scaler")
                else:
                    assert not hasattr(state, "scaler")
    
            # Test for Gradient Accumulation Turned Off
            trainer, model = _default_create_supervised_trainer(
                model_device=model_device, trainer_device=trainer_device, trace=trace, amp_mode=amp_mode, scaler=scaler, bias=False,
            )
            x = torch.tensor([[1.0], [1.0], [1.0], [1.0], [1.0]])
            data = [(_x, _y) for _x, _y in zip(x, x)]
    
            for i in range(len(data)):
                original_weights = model.weight.data[0, 0].item()
                state = trainer.run([data[i]])
                if i==0:
                    continue
                print(i)
                assert round(state.output[-1],2) == pytest.approx(round((1 - (original_weights)) ** 2,2)), state.output[-1]
>               assert round(model.weight.data[0, 0].item(),2) == pytest.approx(
                    round(original_weights + 2 * 0.1 * (1 - (original_weights)),2)
                ), model.weight.item()
E               AssertionError: 0.0
E               assert 0.0 == 0.2 ± 2.0e-07
E                +  where 0.0 = round(0.0, 2)
E                +    where 0.0 = <built-in method item of Tensor object at 0x7f7eb7691690>()
E                +      where <built-in method item of Tensor object at 0x7f7eb7691690> = tensor(0., device='cuda:0').item
E                +  and   0.2 ± 2.0e-07 = <function approx at 0x7f7f23220ef0>(0.2)
E                +    where <function approx at 0x7f7f23220ef0> = pytest.approx
E                +    and   0.2 = round((0.0 + ((2 * 0.1) * (1 - 0.0))), 2)

tests/ignite/engine/test_create_supervised.py:125: AssertionError
----------------------------- Captured stdout call -----------------------------
1
================ 1 failed, 20 passed, 3 skipped in 2.60 seconds ================

vfdev-5 mentioned this pull request on Oct 5, 2021

sdesrozis (Contributor) left a comment:

CI works now

vfdev-5 merged commit 47961cd into pytorch:master on Oct 5, 2021
Ishan-Kumar2 deleted the fix_grad_acc_test branch on October 5, 2021 at 15:50