(Flaky?) CI failures on GHA latest + Buildkite #359
Comments
Is it flaky or have many of the tests been consistently failing? GPU tests and some convolution tests come to mind. |
At least on the couple of commits I checked, different AD tests were failing between the different CI services. Which seems like a fun problem 😅 |
After some more digging, these failures really do seem random. I've yet to see all 3 conv spatial autodiff suites fail in one run, but it's not uncommon to see tests in 2/3 fail. Perhaps the tolerances are too strict? |
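To make the tolerance question concrete, here is a minimal, language-agnostic sketch (in Python, not the project's actual test code) of the kind of check an autodiff test performs: compare an analytic gradient with a central finite-difference estimate under relative and absolute tolerances. The helper names, the toy function, and the default `rtol`/`atol` values are all assumptions chosen for illustration.

```python
import math

def finite_difference_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def max_abs_error(analytic, numeric):
    """Largest elementwise absolute difference between two gradients."""
    return max(abs(a - n) for a, n in zip(analytic, numeric))

def grads_match(analytic, numeric, rtol=1e-5, atol=1e-8):
    """True if every component agrees within the given tolerances (assumed values)."""
    return all(
        math.isclose(a, n, rel_tol=rtol, abs_tol=atol)
        for a, n in zip(analytic, numeric)
    )

# Toy function standing in for a real layer: f(x) = sum(x_i^2), gradient 2x.
def f(x):
    return sum(v * v for v in x)

x = [0.5, -1.25, 2.0]
analytic = [2 * v for v in x]
numeric = finite_difference_grad(f, x)
```

If the tolerances were the culprit, one would typically expect `max_abs_error` to sit just above the threshold and to be the same on every run with the same inputs. Errors that are large, or that change between runs on identical inputs, would point to something other than tolerances.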
Does anyone know how many threads Buildkite runs by default? I had a look back through test failures on merge commits on GHA, and they are solely limited to the config with 2 threads (e.g. 1, 2, 3). My hunch is that Buildkite is using more threads by default and thus running into issues more frequently. |
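One way to check a thread-count hunch like this is to rerun the same test command many times under different `JULIA_NUM_THREADS` values and count the non-zero exits. The sketch below is a hypothetical harness, not something used in this thread: the function names are made up, and the Julia invocation shown in the trailing comment is an assumption about how the suite would be launched locally.

```python
import os
import subprocess

def count_failures(cmd, threads, runs=10):
    """Run cmd `runs` times with JULIA_NUM_THREADS=threads; return the number of non-zero exits."""
    env = dict(os.environ, JULIA_NUM_THREADS=str(threads))
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures

def failure_table(cmd, thread_counts, runs=10):
    """Map each thread count to the number of failed runs observed."""
    return {n: count_failures(cmd, n, runs) for n in thread_counts}

# Hypothetical usage (assumes `julia` is on PATH and the package is the active project):
# cmd = ["julia", "--project", "-e", "using Pkg; Pkg.test()"]
# print(failure_table(cmd, [1, 2, 4, 8]))
```

A failure count that stays at zero with one thread and grows with the thread count would support the hunch. A flat count across thread counts would argue against it.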
Ok, I've been able to replicate this semi-frequently locally with 4/8 threads (on 4 cores). Running the depthwise conv fuzz tests is pretty damning:
AFAICT, multithreaded CI has been consistently (but not always) failing since #242. That PR introduced |
Ok, a couple more interesting observations for the folks who actually understand these routines (i.e. not me) to work with.
Edit: I annotated the outer 2 loops so that
|
@staticfloat would you have any insight into why this might be happening? I have basically zero understanding of how the im2col implementation works, unfortunately. |
@ToucheSir, thanks for the detailed info, I think I found the error: #367 |
Does anyone understand the Buildkite error we are seeing? |
My first guess is that it might be a bug due to the |
Interestingly, Buildkite was perfectly happy on the PR, but failed on Line 739 in 2436b32
Coincidentally, this is the second instance in under a week of Buildkite being happy on a PR but not during/after integration. Granted the other instance was with a different layer and on a GPU path, so there may be no link whatsoever. |
After spending some time (unsuccessfully) trying to make sense of FluxML/Flux.jl#1804, I wonder if these two issues are related. The common factor is the
As a side note, not being able to repro these issues reliably/at all outside of Buildkite runs is a huge pain 😞. The only changed variables I can think of compared to my local machines are Zen 2 vs Skylake/Haswell and 12c/24t vs 4c/8t, but alas both require a whole new set of hardware... |
The Appveyor badge in the readme should be swapped out for a Buildkite one as well.