
Conversation

Ishan-Kumar2 (Contributor)

Description:
In _test_create_supervised_trainer, a test for gradient_accumulation == 1 was added which computes the result using the trainer and compares it with the output of plain backprop.

While this works fine in most cases, it fails when AMP is turned on because of the precision drop when casting to fp16 and back to fp32. This causes the test to fail (see the linked CI failure).
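
For context, the size of the precision drop is easy to see outside the test (a quick illustration, not code from this PR): fp16 resolves only about three decimal digits, while the assertions compare values rounded to four decimal places after the autocast forward pass.

import torch

# fp16 machine epsilon is ~9.8e-4, i.e. roughly 3 decimal digits of precision,
# so errors picked up in the fp16 forward pass can flip the 4th decimal place
# that the test rounds to.
print(torch.finfo(torch.float16).eps)              # 0.0009765625
print(torch.tensor(0.1133).half().float().item())  # 0.11328125 after an fp16 round trip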

In this PR I have passed None as amp_mode so that the gradient_accumulation == 1 case is not tested with AMP.

The other alternative I can think of would be to change this test to work the same way as the gradient_accumulation != 1 case and use larger input values so that AMP precision does not affect the result as much.

@sdesrozis what do you think?

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

sdesrozis (Contributor) commented Oct 4, 2021

I'm quite surprised by the lack of precision here. In these tests, we don't need very high precision.

Anyway, we must check AMP, so I would prefer we don't disable the test.

Why not compute the results analytically in the gradient accumulation case, just as in the case without it? Removing the bias should help keep the code concise with the same coverage.
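
For a bias-free linear model the expected values are indeed easy to write down in closed form; a rough sketch of what the analytic reference could look like (illustrative constants, not the final test code):

# With y_hat = w * x, MSE loss and SGD(lr=0.1), one step on a single sample
# (x, y) gives:
#   loss  = (y - w * x) ** 2
#   w_new = w + 2 * lr * x * (y - w * x)
lr = 0.1
w = 0.0
for x, y in [(1.0, 1.0)] * 5:            # e.g. feeding identical (1.0, 1.0) pairs
    expected_loss = (y - w * x) ** 2
    w = w + 2 * lr * x * (y - w * x)     # expected weight after this step
    print(expected_loss, w)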

We need to fix this as fast as possible. I'll work on it when I'm available, unless you're done before then.

Ishan-Kumar2 (Contributor, Author)

@sdesrozis I removed the bias part, which does take care of the other failures. One failure still remains, in test_create_supervised_trainer_on_cuda_amp_scaler, which is odd because the trainer outputs the new weight as zero. That means no update is taking place, even though the loss is the same in both cases.

============================= test session starts ==============================
platform linux -- Python 3.7.12, pytest-4.3.1, py-1.10.0, pluggy-0.7.1
rootdir: /content/ignite, inifile: setup.cfg
plugins: typeguard-2.7.1, xdist-1.27.0, forked-1.3.0, cov-2.9.0
collected 24 items                                                             

tests/ignite/engine/test_create_supervised.py .........Fs..ss.........   [100%]

=================================== FAILURES ===================================
______________ test_create_supervised_trainer_on_cuda_amp_scaler _______________

    @pytest.mark.skipif(LooseVersion(torch.__version__) < LooseVersion("1.6.0"), reason="Skip if < 1.6.0")
    @pytest.mark.skipif(not torch.cuda.is_available(), reason="Skip if no GPU")
    def test_create_supervised_trainer_on_cuda_amp_scaler():
        model_device = trainer_device = "cuda"
        _test_create_supervised_trainer_wrong_accumulation(
            model_device=model_device, trainer_device=trainer_device, amp_mode="amp"
        )
    
        _test_create_supervised_trainer(
>           model_device=model_device, trainer_device=trainer_device, amp_mode="amp", scaler=True,
        )

tests/ignite/engine/test_create_supervised.py:443: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

gradient_accumulation_steps = 3, loss = 0.0045, weight = 0.054, bias = 0.1133
model_device = 'cuda', trainer_device = 'cuda', trace = False, amp_mode = 'amp'
scaler = True

    def _test_create_supervised_trainer(
        gradient_accumulation_steps=3,
        loss=0.0045,
        weight=0.0540,
        bias=0.1133,
        model_device: Optional[str] = None,
        trainer_device: Optional[str] = None,
        trace: bool = False,
        amp_mode: str = None,
        scaler: Union[bool, "torch.cuda.amp.GradScaler"] = False,
    ):
        trainer, model = _default_create_supervised_trainer(
            gradient_accumulation_steps=gradient_accumulation_steps,
            model_device=model_device,
            trainer_device=trainer_device,
            trace=trace,
            amp_mode=amp_mode,
            scaler=scaler,
        )
    
        x = torch.tensor([[0.1], [0.3], [0.7], [0.9], [1.3]])
        y = torch.tensor([[0.3], [0.5], [0.9], [1.3], [0.3]])
        data = [(_x, _y) for _x, _y in zip(x, y)]
    
        if model_device == trainer_device or ((model_device == "cpu") ^ (trainer_device == "cpu")):
            state = trainer.run(data)
    
            assert round(state.output[-1], 4) == loss, state.output[-1]
            assert round(model.weight.data[0, 0].item(), 4) == weight, model.weight.item()
            assert round(model.bias.item(), 4) == bias, model.bias.item()
    
            if amp_mode == "amp":
                assert state.output[0].dtype is torch.half
                if scaler and isinstance(scaler, bool):
                    assert hasattr(state, "scaler")
                else:
                    assert not hasattr(state, "scaler")
    
            # Test for Gradient Accumulation Turned Off
            trainer, model = _default_create_supervised_trainer(
                model_device=model_device, trainer_device=trainer_device, trace=trace, amp_mode=amp_mode, scaler=scaler, bias=False,
            )
            x = torch.tensor([[1.0], [1.0], [1.0], [1.0], [1.0]])
            data = [(_x, _y) for _x, _y in zip(x, x)]
    
            for i in range(len(data)):
                original_weights = model.weight.data[0, 0].item()
                state = trainer.run([data[i]])
                if i==0:
                    continue
                print(i)
                assert round(state.output[-1],2) == pytest.approx(round((1 - (original_weights)) ** 2,2)), state.output[-1]
>               assert round(model.weight.data[0, 0].item(),2) == pytest.approx(
                    round(original_weights + 2 * 0.1 * (1 - (original_weights)),2)
                ), model.weight.item()
E               AssertionError: 0.0
E               assert 0.0 == 0.2 ± 2.0e-07
E                +  where 0.0 = round(0.0, 2)
E                +    where 0.0 = <built-in method item of Tensor object at 0x7f7eb7691690>()
E                +      where <built-in method item of Tensor object at 0x7f7eb7691690> = tensor(0., device='cuda:0').item
E                +  and   0.2 ± 2.0e-07 = <function approx at 0x7f7f23220ef0>(0.2)
E                +    where <function approx at 0x7f7f23220ef0> = pytest.approx
E                +    and   0.2 = round((0.0 + ((2 * 0.1) * (1 - 0.0))), 2)

tests/ignite/engine/test_create_supervised.py:125: AssertionError
----------------------------- Captured stdout call -----------------------------
1
================ 1 failed, 20 passed, 3 skipped in 2.60 seconds ================

vfdev-5 mentioned this pull request on Oct 5, 2021

sdesrozis (Contributor) left a comment:

CI works now

vfdev-5 merged commit 47961cd into pytorch:master on Oct 5, 2021
Ishan-Kumar2 deleted the fix_grad_acc_test branch on October 5, 2021 at 15:50