
unbreak CI for now #1822

Closed · ToucheSir wants to merge 3 commits into master from bc/hack-unbreak-ci

Conversation

ToucheSir (Member):

After too long failing to repro #1804, getting CI back online for bors is the highest priority.

@@ -46,7 +46,8 @@ function gpu_gradtest(name::String, layers::Vector, x_cpu = nothing, args...; te
   # test
   if test_cpu
     if VERSION >= v"1.7" && layer === GroupedConvTranspose && args[end] == selu
-      @test_broken y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
+      # FIXME revisit this after CUDA deps on CI are updated
+      @test y_gpu ≈ y_cpu rtol=2 atol=2

Member:

This doesn't really help here, since the error bounds are quite high and the broken test is already specific to ConvTranspose + selu. Can we specify the kind of failure we expect, say that we expect the test to fail but not error?

ToucheSir (Member, Author) · Jan 4, 2022:

Nope, because the test doesn't always fail! It was a choice between this and skipping the test entirely.
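
For context, a minimal standalone sketch (the testset and values are hypothetical, not part of the Flux suite) of why @test_broken cannot express an intermittently passing comparison: an unexpected pass of a @test_broken is itself reported as a failure.

using Test

@testset "flaky comparison" begin
    # Hypothetical stand-in for the GPU result: sometimes matches, sometimes not.
    y_ref = 1.0f0
    y_flaky = rand(Bool) ? 1.0f0 : 2.0f0

    # Records Broken when the comparison fails, but reports an
    # "Unexpected Pass" failure whenever it happens to succeed,
    # so a test that only fails intermittently still breaks CI.
    @test_broken y_flaky ≈ y_ref rtol=1f-3 atol=1f-3
end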

Member:

I'd rather avoid high error tolerances, since they aren't very helpful in the real world, and it's unlikely an error would be raised in this code path (although I'd rather retain the test). Something that doesn't error yet gives inaccurate answers would be hard to debug! Can we compare against a standard (say TensorFlow/PyTorch) implementation?

ToucheSir (Member, Author):

We could, but that doesn't address the issue of very high variance in results. The main problem is that we (or at least I) can't figure out where that variance is coming from. It could be something deep within the bowels of cuDNN, and since the forward pass of ConvTranspose is literally 1 cuDNN call + broadcasted bias add + broadcasted activation, there'd be very little we could do about that.

All that said, I'm happy to change it to a @test_skip if you feel that's more appropriate.
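
For reference, the GPU-vs-CPU comparison under discussion looks roughly like the sketch below. Shapes and hyperparameters are illustrative, and it assumes a Flux version where ConvTranspose accepts a groups keyword, which is what the test suite's GroupedConvTranspose case exercises.

using Flux  # selu comes along via NNlib's re-exported activations

# Hypothetical input in WHCN layout and a grouped transposed convolution,
# roughly matching the GroupedConvTranspose + selu case in gpu_gradtest.
x_cpu = rand(Float32, 8, 8, 4, 2)
layer = ConvTranspose((3, 3), 4 => 4, selu; groups = 4)

y_cpu = layer(x_cpu)
y_gpu = gpu(layer)(gpu(x_cpu))   # forward pass: one cuDNN call + bias add + activation

# The comparison the test performs; loosening to rtol = atol = 2 accepts
# almost any output, which is the concern raised above.
isapprox(cpu(y_gpu), y_cpu; rtol = 1f-3, atol = 1f-3)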

Member:

Is the issue that we might see red CI spuriously if this test passes as-is on master? I don't think we've encountered that very frequently with the current setup, right?

I'm fairly certain the underlying issue is in CUDA/cuDNN, and that would be pretty much out of our hands. To fix this we'd need Julia kernels, which might not be the worst idea, but seeing as the motivation is to fix one combination of conv and activation, it's fair to say that would be low priority with little overall benefit.

Member:

If we are going to keep this test with such a wide tolerance, it would be good to know which CUDA deps the comment refers to, and how we will be alerted once this combination works to a decent degree again.

ToucheSir (Member, Author):

That's the thing: I don't know, because I couldn't repro anything. It may well be that the CUDA deps are a red herring and the problem lies elsewhere (say, with GPUCompiler + the new LLVM + compiler changes on Julia 1.7). I've added a commit with the last CUDA.versioninfo() output I got out of Buildkite.
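
For reference, that environment dump can be captured with something like the following minimal sketch, assuming CUDA.jl is available in the CI environment:

using CUDA

# Prints the CUDA toolkit and driver versions, library versions such as
# cuDNN, and the visible devices; useful to archive from a Buildkite run
# when chasing environment-dependent failures.
CUDA.versioninfo()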

ToucheSir (Member, Author):

Thoughts? Concerns? Again, I'm happy to turn this into a @test_skip with the previous tolerance to get the PR merged.

Member:

Is there a functional difference here between @test_skip and the adjusted tolerances? They are so wide that I would expect a 200% relative tolerance to be tantamount to skipping the test entirely. So let's just change to @test_skip and merge.
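
A minimal, self-contained sketch of the difference being weighed here (placeholder values, not actual Flux outputs):

using Test

y_cpu = Float32[1.0, 2.0, 3.0]
y_gpu = Float32[1.5, 2.5, 3.5]   # pretend the GPU result drifted badly

@testset "skip vs. loose tolerance" begin
    # With rtol = atol = 2, even a 50% relative error passes,
    # so this assertion checks almost nothing.
    @test y_gpu ≈ y_cpu rtol=2 atol=2

    # @test_skip never evaluates the comparison; it is recorded as Broken,
    # which keeps the case visible in the test summary without failing CI.
    @test_skip y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
end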

ToucheSir (Member, Author):

Done.

This will show up in the summary as well for us to keep track of.
ToucheSir (Member, Author):

bors r+

bors bot added a commit that referenced this pull request Jan 12, 2022
1822: unbreak CI for now r=ToucheSir a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 12, 2022

Build failed:

ToucheSir (Member, Author):

What changed in the last 48 hours that we now have new failures despite no changes to Flux, Zygote or IRTools and no related changes to NNlib???

mcabbott (Member):

bors r+

bors bot added a commit that referenced this pull request Jan 13, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 13, 2022

Build failed:


mcabbott commented Jan 13, 2022

Needs FluxML/NNlib.jl#375 I think, or #1830.

Although Bors will probably still fail thanks to a red flag, in this bizarre system.

mcabbott (Member):

bors r+

bors bot added a commit that referenced this pull request Jan 14, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 14, 2022

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

{"message":"1 review requesting changes and 1 approving review by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

bors bot added a commit that referenced this pull request Jan 15, 2022
1836: Try using latest cu(DNN) binaries r=ToucheSir a=ToucheSir

Possible alternative to #1822.

Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
bors bot added a commit that referenced this pull request Jan 15, 2022
1835: Document disabling GPUs r=DhairyaLGandhi a=DhairyaLGandhi

From the discussion in #1834

1836: Try using latest cu(DNN) binaries r=DhairyaLGandhi a=ToucheSir

Possible alternative to #1822.

Co-authored-by: Dhairya Gandhi <dhairya@juliacomputing.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
ToucheSir (Member, Author):

Seems to be superseded by #1836, no hacks needed.

ToucheSir closed this Jan 15, 2022
ToucheSir deleted the bc/hack-unbreak-ci branch January 15, 2022 23:53
mcabbott mentioned this pull request Jan 16, 2022
DhairyaLGandhi (Member):

Apologies for the delayed review request; I only saw this around the time #1836 was opened.
