
unbreak CI for now #1822

Closed · ToucheSir wants to merge 3 commits into master from bc/hack-unbreak-ci

Conversation

ToucheSir (Member):

After too long failing to repro #1804, getting CI back online for bors is the highest priority.

@@ -46,7 +46,8 @@ function gpu_gradtest(name::String, layers::Vector, x_cpu = nothing, args...; te
   # test
   if test_cpu
     if VERSION >= v"1.7" && layer === GroupedConvTranspose && args[end] == selu
-      @test_broken y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
+      # FIXME revisit this after CUDA deps on CI are updated
+      @test y_gpu ≈ y_cpu rtol=2 atol=2

Member:

This doesn't really help here, since the error bounds are quite high and the broken test is already specific to ConvTranspose + selu. Can we specify the kind of failure we expect, say that we expect the test to fail but not error?

ToucheSir (Member, Author) · Jan 4, 2022:

Nope, because the test doesn't always fail! It was a choice between this and skipping the test entirely.
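
For context, a minimal standalone sketch (the testset and values are hypothetical, not part of the Flux suite) of why @test_broken cannot express an intermittently passing comparison: an unexpected pass of a @test_broken is itself reported as a failure.

using Test

@testset "flaky comparison" begin
    # Hypothetical stand-in for the GPU result: sometimes matches, sometimes not.
    y_ref = 1.0f0
    y_flaky = rand(Bool) ? 1.0f0 : 2.0f0

    # Records Broken when the comparison fails, but reports an
    # "Unexpected Pass" failure whenever it happens to succeed,
    # so a test that only fails intermittently still breaks CI.
    @test_broken y_flaky ≈ y_ref rtol=1f-3 atol=1f-3
end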

Member:

I'd rather avoid high error tolerances, since they aren't very helpful in the real world, and it's unlikely an error would be raised in this code path (although I'd rather retain the test). Something that doesn't error yet gives inaccurate answers would be hard to debug! Can we compare against a standard (say TensorFlow/PyTorch) implementation?

ToucheSir (Member, Author):

We could, but that doesn't address the issue of very high variance in results. The main problem is that we (or at least I) can't figure out where that variance is coming from. It could be something deep within the bowels of cuDNN, and since the forward pass of ConvTranspose is literally 1 cuDNN call + broadcasted bias add + broadcasted activation, there'd be very little we could do about that.

All that said, I'm happy to change it to a @test_skip if you feel that's more appropriate.
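
For reference, the GPU-vs-CPU comparison under discussion looks roughly like the sketch below. Shapes and hyperparameters are illustrative, and it assumes a Flux version where ConvTranspose accepts a groups keyword, which is what the test suite's GroupedConvTranspose case exercises.

using Flux  # selu comes along via NNlib's re-exported activations

# Hypothetical input in WHCN layout and a grouped transposed convolution,
# roughly matching the GroupedConvTranspose + selu case in gpu_gradtest.
x_cpu = rand(Float32, 8, 8, 4, 2)
layer = ConvTranspose((3, 3), 4 => 4, selu; groups = 4)

y_cpu = layer(x_cpu)
y_gpu = gpu(layer)(gpu(x_cpu))   # forward pass: one cuDNN call + bias add + activation

# The comparison the test performs; loosening to rtol = atol = 2 accepts
# almost any output, which is the concern raised above.
isapprox(cpu(y_gpu), y_cpu; rtol = 1f-3, atol = 1f-3)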

Member:

Is the issue that we might see red CI spuriously if this test passes as-is on master? I don't think we've encountered that very frequently with the current setup, right?

I'm fairly certain the underlying issue is in CUDA/cuDNN, and that would be pretty much out of our hands. To fix this we'd need Julia kernels, which might not be the worst idea, but seeing as the motivation is to fix one combination of conv and activation, it's fair to say that would be low priority with little overall benefit.

Member:

If we are going to keep this test with such a wide tolerance, it would be good to know which CUDA deps the comment refers to, and how we will be alerted once this combination works to a decent degree again.

ToucheSir (Member, Author):

That's the thing: I don't know, because I couldn't repro anything. It may well be that the CUDA deps are a red herring and the problem lies elsewhere (say, with GPUCompiler + the new LLVM + compiler changes on Julia 1.7). I've added a commit with the last CUDA.versioninfo() output I got out of Buildkite.
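
For reference, that environment dump can be captured with something like the following minimal sketch, assuming CUDA.jl is available in the CI environment:

using CUDA

# Prints the CUDA toolkit and driver versions, library versions such as
# cuDNN, and the visible devices; useful to archive from a Buildkite run
# when chasing environment-dependent failures.
CUDA.versioninfo()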

ToucheSir (Member, Author):

Thoughts? Concerns? Again, I'm happy to turn this into a @test_skip with the previous tolerance to get the PR merged.

Member:

Is there a functional difference here between @test_skip and the adjusted tolerances? They are so wide that I would expect a 200% relative tolerance to be tantamount to skipping the test entirely. So let's just change to @test_skip and merge.
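
A minimal, self-contained sketch of the difference being weighed here (placeholder values, not actual Flux outputs):

using Test

y_cpu = Float32[1.0, 2.0, 3.0]
y_gpu = Float32[1.5, 2.5, 3.5]   # pretend the GPU result drifted badly

@testset "skip vs. loose tolerance" begin
    # With rtol = atol = 2, even a 50% relative error passes,
    # so this assertion checks almost nothing.
    @test y_gpu ≈ y_cpu rtol=2 atol=2

    # @test_skip never evaluates the comparison; it is recorded as Broken,
    # which keeps the case visible in the test summary without failing CI.
    @test_skip y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
end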

ToucheSir (Member, Author):

Done.

This will show up in the summary as well for us to keep track of.
ToucheSir (Member, Author):

bors r+

bors bot added a commit that referenced this pull request Jan 12, 2022
1822: unbreak CI for now r=ToucheSir a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 12, 2022

Build failed:

ToucheSir (Member, Author):

What changed in the last 48 hours that we now have new failures despite no changes to Flux, Zygote or IRTools and no related changes to NNlib???

mcabbott (Member):

bors r+

bors bot added a commit that referenced this pull request Jan 13, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 13, 2022

Build failed:


mcabbott commented Jan 13, 2022

Needs FluxML/NNlib.jl#375 I think, or #1830.

Although Bors will probably still fail thanks to a red flag, in this bizarre system.

mcabbott (Member):

bors r+

bors bot added a commit that referenced this pull request Jan 14, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>

bors bot commented Jan 14, 2022

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

{"message":"1 review requesting changes and 1 approving review by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

bors bot added a commit that referenced this pull request Jan 15, 2022
1836: Try using latest cu(DNN) binaries r=ToucheSir a=ToucheSir

Possible alternative to #1822.

Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
bors bot added a commit that referenced this pull request Jan 15, 2022
1835: Document disabling GPUs r=DhairyaLGandhi a=DhairyaLGandhi

From the discussion in #1834

1836: Try using latest cu(DNN) binaries r=DhairyaLGandhi a=ToucheSir

Possible alternative to #1822.

Co-authored-by: Dhairya Gandhi <dhairya@juliacomputing.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
ToucheSir (Member, Author):

Seems to be superseded by #1836, no hacks needed.

ToucheSir closed this Jan 15, 2022
ToucheSir deleted the bc/hack-unbreak-ci branch January 15, 2022 23:53
mcabbott mentioned this pull request Jan 16, 2022
DhairyaLGandhi (Member):

Apologies for the delayed review request; I only saw this around the time #1836 was opened.
