unbreak CI for now #1822
Conversation
test/cuda/layers.jl (outdated)

```diff
@@ -46,7 +46,8 @@ function gpu_gradtest(name::String, layers::Vector, x_cpu = nothing, args...; te
   # test
   if test_cpu
     if VERSION >= v"1.7" && layer === GroupedConvTranspose && args[end] == selu
-      @test_broken y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
+      # FIXME revisit this after CUDA deps on CI are updated
+      @test y_gpu ≈ y_cpu rtol=2 atol=2
```
This doesn't really help here, since the error bounds are pretty high and the broken test is already specific to `ConvTranspose` + `selu`. Can we specify the kind of failure we expect? Say, that we expect the test to fail but not error?
Nope, because the test doesn't always fail! It was a choice between this and skipping the test entirely.
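For reference, a minimal sketch of the three options being weighed here, using stand-in scalar values rather than the real layer outputs (the actual test compares CPU and GPU `ConvTranspose` outputs):

```julia
using Test

# Stand-in values; pretend the GPU result drifted away from the CPU one.
y_cpu = 2.0f0
y_gpu = 2.5f0

# Option 1 (this PR): keep the comparison, but with very wide tolerances.
@test y_gpu ≈ y_cpu rtol=2 atol=2

# Option 2: `@test_broken` records Broken when the comparison fails, but an
# unexpected-pass Error when it succeeds -- unusable for an intermittent failure.
@test_broken y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3

# Option 3: `@test_skip` never evaluates the expression; it only shows up as
# Broken in the test summary.
@test_skip y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
```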
I'd rather avoid having high error tolerances, since that isn't very helpful in the real world, and it's unlikely an error would be raised in this code path (although I'd rather retain the test). Something that doesn't error but gives inaccurate answers would be hard to debug! Can we compare against a standard (say TF/PyTorch) implementation?
We could, but that doesn't address the issue of very high variance in results. The main problem is that we (or at least I) can't figure out where that variance is coming from. It could be something deep within the bowels of cuDNN, and since the forward pass of `ConvTranspose` is literally one cuDNN call + a broadcasted bias add + a broadcasted activation, there'd be very little we could do about that.
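Roughly, a sketch of that structure (not Flux's actual implementation; `convtranspose_forward` is a hypothetical helper, and `bias` is assumed to already be reshaped so it broadcasts along the channel dimension):

```julia
using NNlib  # ∇conv_data is the transposed-convolution primitive

# Hypothetical helper illustrating the structure described above:
# one conv call, then a broadcasted bias add and a broadcasted activation.
function convtranspose_forward(x, weight, bias, σ, cdims::DenseConvDims)
    y = ∇conv_data(x, weight, cdims)  # on GPU arrays this ends up as a single cuDNN call
    return σ.(y .+ bias)              # broadcasted bias add + activation
end
```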
All that said, I'm happy to change it to a `@test_skip` if you feel that's more appropriate.
Is the issue that we might spuriously see red CI if this test passes as-is on master? I don't think we've encountered that very frequently with the current setup, right?

I'm fairly certain that the underlying issue is in CUDA/cuDNN, and that would be pretty much out of our hands at that point. To fix this we'd need Julia kernels, which might not be the worst idea, but seeing as the motivation is to fix one combination of conv and activation, it's fair to say it would be low priority with little overall benefit.
If we are to keep this test with such a wide tolerance, it would be good to know which CUDA deps the comment refers to and how we would be alerted once this combination works to a decent degree again.
That's the thing, I don't know, because I couldn't repro anything. It may well be that the CUDA deps are a red herring and the problem lies elsewhere (say with GPUCompiler + the new LLVM + compiler changes on Julia 1.7). I've added a commit with the last `CUDA.versioninfo()` output I got out of Buildkite.
Thoughts? Concerns? Again, I'm happy to turn this into a `@test_skip` with the previous tolerance to get the PR merged.
Is there a functional difference here between `@test_skip` and the adjusted tolerances? They are so wide that I would expect a 200% relative tolerance to be tantamount to skipping any testing. So let's just change to `@test_skip` and merge.
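To make that concrete: with `rtol=2`, `isapprox` accepts essentially any pair of finite values (illustrative numbers only):

```julia
# |1 - 100| = 99 <= max(atol, rtol * max(|1|, |100|)) = max(2, 200), so this "passes".
isapprox(1.0f0, 100.0f0; rtol=2, atol=2)  # true

# Even with atol = 0 the check is vacuous, since |x - y| <= |x| + |y| <= 2 * max(|x|, |y|)
# always holds for finite inputs.
isapprox(1.0f0, -100.0f0; rtol=2, atol=0)  # true
```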
Done.
This will show up in the summary as well for us to keep track of.
bors r+

Build failed:

What changed in the last 48 hours that we now have new failures despite no changes to Flux, Zygote or IRTools and no related changes to NNlib???

bors r+

Build failed:

Needs FluxML/NNlib.jl#375 I think, or #1830. Although Bors will probably still fail thanks to a red flag, in this bizarre system.
bors r+ |
This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried. Additional information: {"message":"1 review requesting changes and 1 approving review by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}
1835: Document disabling GPUs r=DhairyaLGandhi a=DhairyaLGandhi
From the discussion in #1834

1836: Try using latest cu(DNN) binaries r=DhairyaLGandhi a=ToucheSir
Possible alternative to #1822.

Co-authored-by: Dhairya Gandhi <dhairya@juliacomputing.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Seems to be superseded by #1836, no hacks needed.
Apologies for the delayed review request; I only saw this around the time #1836 was opened.
After too long failing to repro #1804, getting CI back online for bors is the highest priority.