(Flaky?) CI failures on GHA latest + Buildkite #359

Closed
ToucheSir opened this issue Oct 30, 2021 · 12 comments · Fixed by #441

Comments

@ToucheSir
Member

The Appveyor badge in the readme should be swapped out for a Buildkite one as well.

@DhairyaLGandhi
Member

Is it flaky or have many of the tests been consistently failing? GPU tests and some convolution tests come to mind.

@ToucheSir
Member Author

At least on the couple of commits I checked, different AD tests were failing between the different CI services, which seems like a fun problem 😅

@ToucheSir
Member Author

After some more digging, these failures really do seem random. I've yet to see all 3 conv spatial autodiff suites fail in one run, but it's not uncommon to see tests in 2/3 fail. Perhaps the tolerances are too strict?
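
For context on the tolerance question: Julia's `≈` is `isapprox`, whose default relative tolerance for `Float64` is `sqrt(eps(Float64))`, roughly 1.5e-8. Here is a minimal sketch of how a looser `rtol` changes the outcome. The values are made up for illustration and are not taken from the NNlib test suite.

```julia
a = 1.0
b = 1.0 + 1e-6   # hypothetical value with a relative error of about 1e-6

default_rtol = sqrt(eps(Float64))    # default rtol used by isapprox for Float64
strict = isapprox(a, b)              # same as a ≈ b, uses the default rtol
loose  = isapprox(a, b; rtol=1e-5)   # explicitly loosened tolerance

println((default_rtol, strict, loose))
```

If failing comparisons differ by far more than any sensible `rtol`, loosening the tolerance would only hide the mismatch.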

@ToucheSir
Member Author

Does anyone know how many threads Buildkite runs by default? I looked back through test failures on merge commits on GHA, and they are limited to the config with 2 threads (e.g. 1, 2, 3). My hunch is that Buildkite uses more threads by default and thus runs into issues more frequently.
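
One way to answer this on any CI service is to log the thread counts at the start of the test run. This is a small sketch, assuming Julia 1.6 or later (where `BLAS.get_num_threads` is available); the helper name `thread_report` is hypothetical.

```julia
using LinearAlgebra

# Hypothetical helper: collect the Julia and BLAS thread counts in one place.
function thread_report()
    return (julia_threads = Threads.nthreads(),
            blas_threads  = BLAS.get_num_threads())
end

report = thread_report()
println("Julia threads: ", report.julia_threads)
println("BLAS threads:  ", report.blas_threads)
```

The Julia thread count is fixed at startup via `julia --threads=N` or the `JULIA_NUM_THREADS` environment variable, while the BLAS count can be changed at runtime with `BLAS.set_num_threads`.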

@ToucheSir
Member Author

Ok, I've been able to replicate this semi-frequently locally with 4/8 threads (on 4 cores). Running the depthwise conv fuzz tests is pretty damning:

[ Info: Starting Depthwise Convolutional fuzzing tests; this can take a few minutes...
.......................fuzzing: Test Failed at /home/brianc/projects/julia-dev/NNlib.jl/test/conv.jl:635
  Expression: dw_direct ≈ dw_im2col
   Evaluated: [0.18744640672662355] ≈ [0.00810922314783625]
Stacktrace:
 [1] macro expansion
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:635 [inlined]
 [2] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [3] top-level scope
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:565
fuzzing: Test Failed at /home/brianc/projects/julia-dev/NNlib.jl/test/conv.jl:635
  Expression: dw_direct ≈ dw_im2col
   Evaluated: [0.18744640672662355] ≈ [0.1856732181254305]
Stacktrace:
 [1] macro expansion
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:635 [inlined]
 [2] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [3] top-level scope
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:565
fuzzing: Test Failed at /home/brianc/projects/julia-dev/NNlib.jl/test/conv.jl:635
  Expression: dw_direct ≈ dw_im2col
   Evaluated: [0.18744640672662355] ≈ [0.022131711476828686]
Stacktrace:
 [1] macro expansion
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:635 [inlined]
 [2] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [3] top-level scope
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:565
fuzzing: Test Failed at /home/brianc/projects/julia-dev/NNlib.jl/test/conv.jl:635
  Expression: dw_direct ≈ dw_im2col
   Evaluated: [0.18744640672662355] ≈ [0.17519710699882415]
Stacktrace:
 [1] macro expansion
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:635 [inlined]
 [2] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [3] top-level scope
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:565
fuzzing: Test Failed at /home/brianc/projects/julia-dev/NNlib.jl/test/conv.jl:635
  Expression: dw_direct ≈ dw_im2col
   Evaluated: [0.18744640672662355] ≈ [0.16531469524979486]
Stacktrace:
 [1] macro expansion
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:635 [inlined]
 [2] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [3] top-level scope
   @ ~/projects/julia-dev/NNlib.jl/test/conv.jl:565
... etc.

AFAICT, multithreaded CI has been failing frequently (but not on every run) since #242. That PR introduced gradcheck tests in the first place, so it's likely that the direct and im2col implementations had diverged for some time beforehand. It appears #235 fixed some major issues with the latter, but evidently that was not enough to ensure consistent behaviour.
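
For readers unfamiliar with the two code paths being compared, here is a toy sketch of the idea behind such a cross-check: a direct loop and an im2col-style formulation of the same 1D cross-correlation should agree. This is not NNlib's code; the function names `xcorr_direct` and `xcorr_im2col` are hypothetical, and the sketch assumes no padding and stride 1.

```julia
# Direct 1D cross-correlation: slide the kernel over the input and accumulate.
function xcorr_direct(x::Vector{Float64}, w::Vector{Float64})
    n_out = length(x) - length(w) + 1
    y = zeros(n_out)
    for i in 1:n_out, k in 1:length(w)
        y[i] += x[i + k - 1] * w[k]
    end
    return y
end

# im2col-style: unroll every input window into a row, then do one matrix-vector product.
function xcorr_im2col(x::Vector{Float64}, w::Vector{Float64})
    n_out = length(x) - length(w) + 1
    col = [x[i + k - 1] for i in 1:n_out, k in 1:length(w)]  # n_out × length(w) matrix
    return col * w
end

x = collect(1.0:5.0)
w = [1.0, 0.0, -1.0]
y_direct = xcorr_direct(x, w)
y_im2col = xcorr_im2col(x, w)
println(y_direct ≈ y_im2col)
```

When two formulations of the same operation disagree by a large factor, as in the log above, the cause is more likely indexing or buffer handling than floating-point rounding.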

@ToucheSir
Member Author

ToucheSir commented Nov 3, 2021

Ok, a couple more interesting observations for the folks who actually understand these routines (i.e. not me) to work with.

  1. Fuzzing fails with 1 Julia runtime thread and 1 BLAS thread set. So unless the GEMM implementations in NNlib are somehow ignoring that configuration, this appears to be an algorithmic issue.
  2. Not all fuzz tests fail. I haven't combed through the whole set (still running them), but I have yet to see https://github.com/FluxML/NNlib.jl/blob/master/test/conv.jl#L633-L635 or https://github.com/FluxML/NNlib.jl/blob/master/test/conv.jl#L640-L642 fail.
  3. Not all test configurations fail, either. Since one . is printed for every iteration of https://github.com/FluxML/NNlib.jl/blob/master/test/conv.jl#L583-L587, searching over https://github.com/FluxML/NNlib.jl/blob/master/test/conv.jl#L597-L599 is probably not required.
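
To see which parameter combinations fail, each loop level can be wrapped in a named `@testset` so that the `Test` summary is broken down per configuration. A minimal sketch, with a hypothetical parameter grid and a placeholder check standing in for the real comparisons in `test/conv.jl`:

```julia
using Test

# Hypothetical helper: only run a kernel against an input of the same dimensionality.
same_rank(x_size, w_size) = length(x_size) == length(w_size)

@testset "fuzzing" begin
    # The outer loop's name interpolates its parameters, so each combination
    # gets its own row in the summary.
    @testset "x_size=$x_size, C_in=$C_in" for x_size in ((3,), (3, 3)), C_in in (1, 3)
        @testset "w_size=$w_size" for w_size in ((1,), (1, 3))
            if same_rank(x_size, w_size)
                # Placeholder assertion; the real tests compare conv implementations.
                @test all(w_size .<= x_size)
            end
        end
    end
end
```

In this sketch, combinations skipped by the `if` contain no `@test` calls and so contribute nothing to the pass and fail counts.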

Edit: I annotated the outer 2 loops so that Test could keep track of each set of parameters. Here's the full output:

Test Summary:                            |  Pass  Fail  Total
Convolution                              | 42728  3784  46512
  fuzzing                                | 42728  3784  46512
    x_size=(1,), C_in=1, batch=1         |   216          216
    x_size=(1,), C_in=1, batch=5         |   216          216
    x_size=(1,), C_in=3, batch=1         |   198    18    216
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    18     9     27
      w_size=(3,), C_mult=4              |    18     9     27
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(1,), C_in=3, batch=5         |   198    18    216
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    18     9     27
      w_size=(3,), C_mult=4              |    18     9     27
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(3,), C_in=1, batch=1         |   270          270
    x_size=(3,), C_in=1, batch=5         |   270          270
    x_size=(3,), C_in=3, batch=1         |   240    30    270
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    33    12     45
      w_size=(3,), C_mult=4              |    33    12     45
      w_size=(7,), C_mult=1              |     6     3      9
      w_size=(7,), C_mult=4              |     6     3      9
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(3,), C_in=3, batch=5         |   240    30    270
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    33    12     45
      w_size=(3,), C_mult=4              |    33    12     45
      w_size=(7,), C_mult=1              |     6     3      9
      w_size=(7,), C_mult=4              |     6     3      9
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(7,), C_in=1, batch=1         |   360          360
    x_size=(7,), C_in=1, batch=5         |   360          360
    x_size=(7,), C_in=3, batch=1         |   312    48    360
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    54    18     72
      w_size=(3,), C_mult=4              |    54    18     72
      w_size=(7,), C_mult=1              |    21     6     27
      w_size=(7,), C_mult=4              |    21     6     27
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(7,), C_in=3, batch=5         |   312    48    360
      w_size=(1,), C_mult=1              |    81           81
      w_size=(1,), C_mult=4              |    81           81
      w_size=(3,), C_mult=1              |    54    18     72
      w_size=(3,), C_mult=4              |    54    18     72
      w_size=(7,), C_mult=1              |    21     6     27
      w_size=(7,), C_mult=4              |    21     6     27
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(1, 3), C_in=1, batch=1       |  1530         1530
    x_size=(1, 3), C_in=1, batch=5       |  1530         1530
    x_size=(1, 3), C_in=3, batch=1       |  1332   198   1530
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   207    63    270
      w_size=(1, 3), C_mult=4            |   207    63    270
      w_size=(3, 4), C_mult=1            |    84    36    120
      w_size=(3, 4), C_mult=4            |    84    36    120
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(1, 3), C_in=3, batch=5       |  1332   198   1530
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   207    63    270
      w_size=(1, 3), C_mult=4            |   207    63    270
      w_size=(3, 4), C_mult=1            |    84    36    120
      w_size=(3, 4), C_mult=4            |    84    36    120
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(3, 3), C_in=1, batch=1       |  1710         1710
    x_size=(3, 3), C_in=1, batch=5       |  1710         1710
    x_size=(3, 3), C_in=3, batch=1       |  1420   290   1710
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   195    75    270
      w_size=(1, 3), C_mult=4            |   195    75    270
      w_size=(3, 4), C_mult=1            |   100    50    150
      w_size=(3, 4), C_mult=4            |   100    50    150
      w_size=(7, 4), C_mult=1            |    40    20     60
      w_size=(7, 4), C_mult=4            |    40    20     60
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(3, 3), C_in=3, batch=5       |  1420   290   1710
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   195    75    270
      w_size=(1, 3), C_mult=4            |   195    75    270
      w_size=(3, 4), C_mult=1            |   100    50    150
      w_size=(3, 4), C_mult=4            |   100    50    150
      w_size=(7, 4), C_mult=1            |    40    20     60
      w_size=(7, 4), C_mult=4            |    40    20     60
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(12, 3), C_in=1, batch=1      |  1830         1830
    x_size=(12, 3), C_in=1, batch=5      |  1830         1830
    x_size=(12, 3), C_in=3, batch=1      |  1484   346   1830
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   187    83    270
      w_size=(1, 3), C_mult=4            |   187    83    270
      w_size=(3, 4), C_mult=1            |   100    50    150
      w_size=(3, 4), C_mult=4            |   100    50    150
      w_size=(7, 4), C_mult=1            |    80    40    120
      w_size=(7, 4), C_mult=4            |    80    40    120
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(12, 3), C_in=3, batch=5      |  1484   346   1830
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   187    83    270
      w_size=(1, 3), C_mult=4            |   187    83    270
      w_size=(3, 4), C_mult=1            |   100    50    150
      w_size=(3, 4), C_mult=4            |   100    50    150
      w_size=(7, 4), C_mult=1            |    80    40    120
      w_size=(7, 4), C_mult=4            |    80    40    120
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(20, 17), C_in=1, batch=1     |  2880         2880
    x_size=(20, 17), C_in=1, batch=5     |  2880         2880
    x_size=(20, 17), C_in=3, batch=1     |  2320   560   2880
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   279    96    375
      w_size=(1, 3), C_mult=4            |   279    96    375
      w_size=(3, 4), C_mult=1            |   276    99    375
      w_size=(3, 4), C_mult=4            |   276    99    375
      w_size=(7, 4), C_mult=1            |   230    85    315
      w_size=(7, 4), C_mult=4            |   230    85    315
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(20, 17), C_in=3, batch=5     |  2320   560   2880
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |   375          375
      w_size=(1, 1), C_mult=4            |   375          375
      w_size=(1, 3), C_mult=1            |   279    96    375
      w_size=(1, 3), C_mult=4            |   279    96    375
      w_size=(3, 4), C_mult=1            |   276    99    375
      w_size=(3, 4), C_mult=4            |   276    99    375
      w_size=(7, 4), C_mult=1            |   230    85    315
      w_size=(7, 4), C_mult=4            |   230    85    315
      w_size=(1, 1, 1), C_mult=1         |              No tests
      w_size=(1, 1, 1), C_mult=4         |              No tests
      w_size=(1, 1, 3), C_mult=1         |              No tests
      w_size=(1, 1, 3), C_mult=4         |              No tests
      w_size=(3, 4, 3), C_mult=1         |              No tests
      w_size=(3, 4, 3), C_mult=4         |              No tests
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(1, 1, 3), C_in=1, batch=1    |   672          672
    x_size=(1, 1, 3), C_in=1, batch=5    |   672          672
    x_size=(1, 1, 3), C_in=3, batch=1    |   626    46    672
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |   105    15    120
      w_size=(1, 1, 3), C_mult=4         |   105    15    120
      w_size=(3, 4, 3), C_mult=1         |    16     8     24
      w_size=(3, 4, 3), C_mult=4         |    16     8     24
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(1, 1, 3), C_in=3, batch=5    |   626    46    672
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |   105    15    120
      w_size=(1, 1, 3), C_mult=4         |   105    15    120
      w_size=(3, 4, 3), C_mult=1         |    16     8     24
      w_size=(3, 4, 3), C_mult=4         |    16     8     24
      w_size=(7, 3, 2), C_mult=1         |              No tests
      w_size=(7, 3, 2), C_mult=4         |              No tests
    x_size=(3, 5, 4), C_in=1, batch=1    |   816          816
    x_size=(3, 5, 4), C_in=1, batch=5    |   816          816
    x_size=(3, 5, 4), C_in=3, batch=1    |   700   116    816
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |    90    30    120
      w_size=(1, 1, 3), C_mult=4         |    90    30    120
      w_size=(3, 4, 3), C_mult=1         |    60    24     84
      w_size=(3, 4, 3), C_mult=4         |    60    24     84
      w_size=(7, 3, 2), C_mult=1         |     8     4     12
      w_size=(7, 3, 2), C_mult=4         |     8     4     12
    x_size=(3, 5, 4), C_in=3, batch=5    |   700   116    816
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |    90    30    120
      w_size=(1, 1, 3), C_mult=4         |    90    30    120
      w_size=(3, 4, 3), C_mult=1         |    60    24     84
      w_size=(3, 4, 3), C_mult=4         |    60    24     84
      w_size=(7, 3, 2), C_mult=1         |     8     4     12
      w_size=(7, 3, 2), C_mult=4         |     8     4     12
    x_size=(20, 17, 14), C_in=1, batch=1 |  1344         1344
    x_size=(20, 17, 14), C_in=1, batch=5 |  1344         1344
    x_size=(20, 17, 14), C_in=3, batch=1 |  1104   240   1344
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |   144    48    192
      w_size=(1, 1, 3), C_mult=4         |   144    48    192
      w_size=(3, 4, 3), C_mult=1         |   144    48    192
      w_size=(3, 4, 3), C_mult=4         |   144    48    192
      w_size=(7, 3, 2), C_mult=1         |    72    24     96
      w_size=(7, 3, 2), C_mult=4         |    72    24     96
    x_size=(20, 17, 14), C_in=3, batch=5 |  1104   240   1344
      w_size=(1,), C_mult=1              |              No tests
      w_size=(1,), C_mult=4              |              No tests
      w_size=(3,), C_mult=1              |              No tests
      w_size=(3,), C_mult=4              |              No tests
      w_size=(7,), C_mult=1              |              No tests
      w_size=(7,), C_mult=4              |              No tests
      w_size=(1, 1), C_mult=1            |              No tests
      w_size=(1, 1), C_mult=4            |              No tests
      w_size=(1, 3), C_mult=1            |              No tests
      w_size=(1, 3), C_mult=4            |              No tests
      w_size=(3, 4), C_mult=1            |              No tests
      w_size=(3, 4), C_mult=4            |              No tests
      w_size=(7, 4), C_mult=1            |              No tests
      w_size=(7, 4), C_mult=4            |              No tests
      w_size=(1, 1, 1), C_mult=1         |   192          192
      w_size=(1, 1, 1), C_mult=4         |   192          192
      w_size=(1, 1, 3), C_mult=1         |   144    48    192
      w_size=(1, 1, 3), C_mult=4         |   144    48    192
      w_size=(3, 4, 3), C_mult=1         |   144    48    192
      w_size=(3, 4, 3), C_mult=4         |   144    48    192
      w_size=(7, 3, 2), C_mult=1         |    72    24     96
      w_size=(7, 3, 2), C_mult=4         |    72    24     96
ERROR: LoadError: Some tests did not pass: 42728 passed, 3784 failed, 0 errored, 0 broken.

@ToucheSir
Member Author

@staticfloat would you have any insight into why this might be happening? I have basically zero understanding of how the im2col implementation works, unfortunately.

@staticfloat
Contributor

@ToucheSir, thanks for the detailed info. I think I found the error: #367

@CarloLucibello
Member

Does anyone understand the Buildkite error we are having?
https://buildkite.com/julialang/nnlib-dot-jl/builds/306#2ff2a604-e66e-4ffc-99d4-abfc5e20be32/559-675

@staticfloat
Contributor

My first guess is that it might be a bug due to the / in my branch name.

@ToucheSir
Member Author

ToucheSir commented Dec 5, 2021

Interestingly, Buildkite was perfectly happy on the PR, but failed on the following test after merging (https://buildkite.com/julialang/nnlib-dot-jl/builds/309#8c0bd06b-f65d-40d6-bb1a-7a70bd1600f3/245-444):

gradtest((x, w) -> depthwiseconv(x, w, dcdims), x, w)

Coincidentally, this is the second instance in under a week of Buildkite being happy on a PR but not during/after integration. Granted, the other instance was with a different layer and on a GPU path, so there may be no link whatsoever.

@ToucheSir
Member Author

After spending some time (unsuccessfully) trying to make sense of FluxML/Flux.jl#1804, I wonder if these two issues are related. The common factor is the ∇(depthwise)conv_* methods, which by default dispatch to im2col instead of direct implementations. ConvTranspose's forward pass calls ∇conv_data, for example.

As a side note, not being able to repro these issues reliably/at all outside of Buildkite runs is a huge pain 😞. The only changed variables I can think of compared to my local machines are Zen 2 vs Skylake/Haswell and 12c/24t vs 4c/8t, but alas both require a whole new set of hardware...
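
One way to probe the im2col hypothesis above on a local machine might be to compute the same data gradient through both CPU backends and compare the results. The following is only a minimal sketch: the array sizes are made up for illustration (they are not taken from the failing CI runs), and it assumes the installed NNlib version still provides the non-exported `NNlib.∇conv_data_im2col` and `NNlib.∇conv_data_direct` backend functions.

```julia
using NNlib

# Illustrative sizes only (assumption, not from the failing runs).
x = rand(Float64, 20, 17, 3, 5)   # width × height × C_in × batch
w = rand(Float64, 3, 4, 3, 2)     # kernel width × kernel height × C_in × C_out
cdims = DenseConvDims(x, w)

# Upstream gradient with the same shape as the forward-pass output.
dy = rand(Float64, size(conv(x, w, cdims))...)

# Compute the gradient w.r.t. the input through each backend.
dx_im2col = NNlib.∇conv_data_im2col(dy, w, cdims)
dx_direct = NNlib.∇conv_data_direct(dy, w, cdims)

println("threads = ", Threads.nthreads())
println("backends agree: ", isapprox(dx_im2col, dx_direct; rtol=1e-8))
```

Running the script under different thread counts (for example with `JULIA_NUM_THREADS=1` and then `JULIA_NUM_THREADS=8`) would show whether any disagreement depends on threading. A mismatch that only shows up with more threads would be consistent with, though not proof of, a problem in the im2col path.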
