
Update core.async version #2

Closed
pschorf wants to merge 4 commits into aredington:master from pschorf:master

Conversation

@pschorf
Contributor

pschorf commented Oct 6, 2016

Two changes:

  • Update the core.async version
  • Use the profile from the project if defined

@aredington
Owner

I'll take a look at this tonight; I was out of town over the weekend and just now have time. Just wanna do a quick test drive before merging.

Thanks for your contributions!!

@aredington
Owner

Hi @pschorf, I've tried using your branch to parallel test ring-core as a good reference, and I think the changes are provoking a parallel test invocation to hang indefinitely.

My test procedure is to install your branch, clobbering the published version of parallel-test, then to clone ring's master branch and drop into ring-core. I've modified the ring-core project.clj to add the following two entries:

  :plugins [[com.holychao/parallel-test "0.3.0"]]
  :parallel-test {:categorizer (fn [meta] (if (:serial meta) :serial :parallel))
                  :pools {:serial (constantly 1)
                          :parallel (fn [] (.availableProcessors (Runtime/getRuntime)))}
                  :sequence [:serial]}

I then invoke lein parallel-test. When I use your branch it hangs. When I use my master it runs to completion. Could you take a look and see if you can reproduce?

@pschorf
Contributor Author

pschorf commented Oct 17, 2016

Hm, I haven't been able to reproduce it, but on our internal build we do get a hang occasionally (with the new core.async or the current master).

If you can reproduce it again, would it be possible to get a thread dump so I can see what's going on? I'd be happy to dig into the code a bit.

Thanks for your help with the PR!

@pschorf
Contributor Author

pschorf commented Oct 17, 2016

I was able to reproduce it after a while. I'll run with my proposed fix for a few hours to make sure it doesn't hang again before updating the PR.

@aredington
Owner

Great! If you can find what provokes the deadlock, we could try to specifically reproduce those deadlocking conditions and then ensure that the fix properly prevents the deadlock from occurring.

@pschorf
Contributor Author

pschorf commented Oct 17, 2016

From the thread dump, I saw that the main thread was blocked on the alts!! in parallel-test-vars, which wasn't super helpful.

My theory on the deadlock is that when all of the tests have been scheduled, the coordinator go-loop (which is reading off of close-chan) exits. Say when this happens, all of the testers are running. Then one of the testers puts its result on close-chan and reads the last test off of open-chan. When all the other threads finish, close-chan fills up with their values, so that first thread will block on its >!. Since no one is reading that channel anymore, the tester thread never exits.

My change was to use put! instead of >!, but you could also increase the buffer for close-chan. I like put! just because it doesn't seem to be particularly important that the operation block, and it frees the thread up to start a new test.
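
To make that trade-off concrete, here is a minimal sketch of the two behaviors on a full, unread channel (the channel below is a hypothetical stand-in, not the actual parallel-test close-chan): the parking >! never completes, while put! enqueues the value and returns right away.

  (require '[clojure.core.async :refer [chan go >! put!]])

  ;; Stand-in for close-chan after the coordinator loop has exited: its
  ;; one-slot buffer is full and nothing will ever read from it again.
  (def full-chan (chan 1))
  (put! full-chan :already-buffered)

  ;; Parking put: this go block never finishes, because the buffer is
  ;; full and there is no consumer, so the worker is stuck here forever.
  (go (>! full-chan :result)
      (println "never printed"))

  ;; Async put: returns immediately, queueing the value for delivery if
  ;; a consumer ever appears, leaving the worker free to take new work.
  (put! full-chan :result)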

@pschorf
Contributor Author

pschorf commented Oct 17, 2016

I'm going to leave my parallel test of ring going overnight just to check, but I wanted to update the PR in case you had any better testing ideas.

@pschorf
Contributor Author

pschorf commented Oct 18, 2016

I didn't see it fail again for me. Can you still reproduce it?

@pschorf
Contributor Author

pschorf commented Nov 2, 2016

Hi, just checking in to see if you had a chance to look at this PR recently.

@aredington
Owner

Hi, I'll try to take a look tonight. Thanks for your patience, and sorry for not getting to this sooner!

@aredington
Owner

Ok I tested this and it still provokes deadlocks. I will see if I can isolate what specific change in the PR provokes the failure.

@aredington
Owner

Not terribly surprising, but the deadlocks arise because of upgrading the core.async version.

@aredington
Owner

Pretty sure this is caused by <!! being used in the body of com.holychao.parallel-test.runner/test-ns-vars, which is called from within a go-loop.

@pschorf
Contributor Author

pschorf commented Nov 4, 2016

I added that change to the PR, but I still haven't been able to reproduce the deadlock in ring-core. Can you try again with the change?

@aredington
Owner

I'll try and dig up some time this weekend to mess with it, but the last two commits (ee8dbc6 and e22235f) should not be included. Switching to <! means that when a read blocks, it can park that job and have some other thread pick it up; clojure.test fixtures by their nature can only be relied upon if all the tests execute in the same call stack as the fixture. Using <!! is the right call in the loop body, but we probably need to move from the go-loop on line 106 to (async/thread (loop ...)) so that it spawns a dedicated thread that can safely sleep on a channel pull.
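
For illustration, a minimal sketch of that worker shape, with hypothetical spawn-worker and run-test names (only open-chan and close-chan come from the discussion above): the loop lives inside async/thread rather than a go-loop, so blocking reads occupy only this dedicated thread and every test runs in the worker's own call stack, which is what clojure.test fixtures need.

  (require '[clojure.core.async :refer [thread <!! >!!]])

  ;; Hypothetical worker: a real OS thread running an ordinary loop, so a
  ;; blocking <!! parks only this thread and never occupies a go-block
  ;; dispatch thread. Every test taken here executes in this call stack.
  (defn spawn-worker [open-chan close-chan run-test]
    (thread
      (loop []
        (when-let [test-var (<!! open-chan)]    ; nil once open-chan closes
          (>!! close-chan (run-test test-var))  ; blocking put is safe here too
          (recur)))))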

I think the reason I'm seeing the problems is:

  • I'm running on a 2011 MacBook Pro, which only has 8 cores.
  • The new version of core.async uses one of the async workers to do some amount of dispatch work; when all of the async workers hit <!!, the entire core.async system grinds to a halt.
  • As you increase the core count, the odds of all of the workers blocking on reads and starving out the dispatch worker drop.

I did a bad thing when I used a blocking read (<!!) in an async worker; it used to work fine, but now it's important to get a clean solution that solves the problem properly.
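
As a standalone illustration of that failure mode (none of this is parallel-test code), this sketch blocks every thread in core.async's default go dispatch pool with <!!; once the pool is exhausted, no other go block can make progress:

  (require '[clojure.core.async :refer [chan go <!!]])

  ;; go blocks share a small fixed dispatch pool, 8 threads by default.
  (def never-delivered (chan))

  ;; Tie up all 8 dispatch threads with blocking reads (the "bad thing").
  (dotimes [_ 8]
    (go (<!! never-delivered)))

  ;; This go block may never get a dispatch thread to run on:
  (go (println "starved"))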

@pschorf
Contributor Author

pschorf commented Nov 4, 2016

Thanks, sorry for all of the back and forth. If you don't have time to look at it this weekend, I can try to take another crack at it on Monday.

@aredington
Owner

No worries, it's valuable to keep it up to date and I'm happy to know that someone has found this useful enough to contribute back.

Parallelism is a hard enough problem to tackle that it's important to be deliberative in our approach to it. :)

@aredington
Owner

I incorporated two of the commits in 068dcf7 and made the necessary changes to how workers are spawned. This did not appear to provoke the deadlocking problem I was getting locally after multiple tests.

aredington closed this Dec 30, 2016
@aredington
Owner

btw, thanks for all your work @pschorf, sorry for the long turnaround time :)

@pschorf
Contributor Author

pschorf commented Dec 30, 2016

Thank you! Glad you were able to figure out the issue.

@aredington
Owner

I'll go ahead and bump the version and get it into clojars once @amoe can confirm this fixes the problem locally.

@dposada

dposada commented Jun 14, 2017

@aredington Any progress on getting these changes into clojars?

@aredington
Owner

@dposada I just cut 0.3.1 and it's on Clojars.
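
For anyone following along later, picking up the release should just be a matter of bumping the plugin coordinate shown earlier in the thread, along these lines:

  :plugins [[com.holychao/parallel-test "0.3.1"]]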
