Buildkite CI failures with grad test of ConvTranspose + selu #1804

Closed
ToucheSir opened this issue Dec 8, 2021 · 3 comments

Comments

@ToucheSir
Member

ToucheSir commented Dec 8, 2021

One example: https://buildkite.com/julialang/flux-dot-jl/builds/1914#c62d9761-ab7f-415a-b995-51552eb2b1e5

I could not repro this locally on 2 separate GPU machines, using master and 2 threads (the same number Buildkite uses). Oddly, it has succeeded twice (once on trying and once on staging), but whenever it fails, it fails with the same result, which makes me suspect some environmental discrepancy between workers.

@ToucheSir
Member Author

ToucheSir commented Dec 20, 2021

Despite pulling the exact inputs and weights from failing Buildkite runs, I was not able to replicate this locally. If someone has a znver2 machine they can test on, everything needed should be in https://gist.github.com/ToucheSir/32fd6688d3932c9f498c78a42a0ea017. @DhairyaLGandhi and possibly @maleadt, is there any way to replicate the CI environment and run tests against that? I don't think it's necessary to have a GPU attached, because the inconsistencies are coming from the CPU calculation. Edit: I read the test output backwards; the discrepancy is in the GPU results.

@ToucheSir
Member Author

ToucheSir commented Jan 4, 2022

I could get the same summed result by exporting the pre-selu output and loading that locally, but I still have not been able to generate a similar output myself, and I wasn't able to coax a container image into downloading the same dependency versions. What's confusing is that the tests for ConvTranspose + identity activation do pass on CI, yet the differing output for the selu version appears before the activation function is applied. Either way, #1822 is now up as a stopgap measure until some brave soul decides to revisit this in the future.
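For anyone revisiting this, here is a minimal CPU-only sketch of how the pre-activation output of a ConvTranspose + selu layer can be separated from the activation step. It is not the failing test itself: the input size, kernel size, and channel counts are made-up assumptions, the weights are random rather than the ones from the gist, and it assumes a Flux version where `ConvTranspose` exposes `weight` and `bias` fields and accepts `(weight, bias, σ)` positionally.

```julia
using Flux  # assumes Flux.jl is installed; runs on the CPU, no GPU needed

# Hypothetical shapes for illustration; the real failing inputs live in the gist.
x = randn(Float32, 5, 5, 1, 1)              # width x height x channels x batch
layer = ConvTranspose((3, 3), 1 => 1, selu)

# Rebuild the layer with the same weights and bias but an identity activation,
# so its output is what the selu layer computes before the activation.
pre = ConvTranspose(layer.weight, layer.bias, identity)

y_pre = pre(x)     # pre-activation output
y     = layer(x)   # full ConvTranspose + selu output

# If the transposed convolution itself is consistent, applying selu elementwise
# to the pre-activation output should reproduce the layer output.
@show y ≈ selu.(y_pre)

# Summed values, comparable across machines when x and the weights are fixed.
@show sum(y_pre) sum(y)
```

Running the same comparison with fixed inputs and weights on two machines, or on CPU and GPU, would show whether a mismatch is already present in `y_pre` or only appears after `selu`.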

bors bot added a commit that referenced this issue Jan 12, 2022
1822: unbreak CI for now r=ToucheSir a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
bors bot added a commit that referenced this issue Jan 13, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
bors bot added a commit that referenced this issue Jan 14, 2022
1822: unbreak CI for now r=mcabbott a=ToucheSir

After too long failing to repro #1804, getting CI back online for bors is the highest priority.


Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
@ToucheSir
Member Author

Fixed by #1836, ref. #1836 (comment).
