Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove threading from all ∇*conv_filter and re-enable old tests #441

Merged
merged 1 commit into from
Nov 4, 2022

Conversation

ToucheSir
Copy link
Member

@ToucheSir ToucheSir commented Oct 25, 2022

This should hopefully fix persistent threading-related issues on CI, While also slightly reducing memory usage when taking gradients. The fix applied is the same as #235. I've taken the opportunity to not over-allocate memory for convs and depthwise convs at the same time.

Fixes #359, which has been causing CI failures for the past two years...

@ToucheSir ToucheSir marked this pull request as draft October 25, 2022 04:40
@ToucheSir
Copy link
Member Author

Ok, a couple of findings:

  1. BLAS threads are not a culprit, since setting them == 1 still repros this easily.
  2. julia -t 2 is enough to repro this consistently. What I don't understand is why Buildkite and running locally are way more consistent than on GHA despite all 3 using 2 threads. Maybe resource contention on the latter leads to less effective parallelism?
  3. At least on paper, threads should not be writing to overlapping memory regions for depthwise convs. Is it possible that there's a bug?
  4. On the opposite end, could it be that the error comes from how FiniteDifferences interacts with threading? I don't understand its internals at all, but it feels to me like the Zygote version should give fully deterministic results.

Going to sleep on this and try seeing if both gradients are self-consistent across multiple runs tomorrow. If anybody has any insights on this, please do post them!

@ToucheSir ToucheSir changed the title Trying to isolate depthwise conv threading failures on CI Remove threading from all ∇*conv_filter and re-enable old tests Nov 2, 2022
@ToucheSir ToucheSir changed the title Remove threading from all ∇*conv_filter and re-enable old tests Remove threading from all ∇*conv_filter and re-enable old tests Nov 2, 2022
This should hopefully fix persistent threading-related issues on CI,
While also slightly reducing memory usage when taking gradients.
@ToucheSir ToucheSir marked this pull request as ready for review November 2, 2022 03:49
@ToucheSir
Copy link
Member Author

So used to seeing broken CI here that the green looks unnatural! This is good to go, barring any objections about the threading removal.

@CarloLucibello
Copy link
Member

Very nice to see all greens here!
Should we open an issue tracking where we stand with threaded convolutions (nowhere I guess at this point) and lay down some possible future plans? E.g. we might reconsider wrapping again NNPACK but in a more robust and maintainable way, explore oneDNN or others

@ToucheSir ToucheSir merged commit 7087954 into master Nov 4, 2022
@ToucheSir ToucheSir deleted the bc/dw-conv-debug branch November 4, 2022 13:59
@ToucheSir
Copy link
Member Author

I recently bumped #74 which talked about oneDNN. Just changed the title to reflect that (it was made before the rename). For threading in general, we could do a tracking issue as well for more than conv I think? One thing that would be nice is to add benchmarks now that we have that capability, per #395 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(Flaky?) CI failures on GHA latest + Buildkite
2 participants