Conversation
|
I'll take a look at this tonight, was out of town over the weekend and just now have time. Just wanna do a quick test drive before merging. Thanks for your contributions!! |
|
Hi @pschorf, I've tried using your branch to parallel test ring-core as a good reference case, and I think the changes are causing the parallel test invocation to hang indefinitely. My test procedure is to install your branch, clobbering the published version of parallel-test, then to clone ring's master branch and drop into ring-core. I've modified the ring-core project.clj to add the following two entries: I then invoke lein parallel-test. When I use your branch it hangs; when I use my master it runs to completion. Could you take a look and see if you can reproduce? |
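For reference, the sort of project.clj change involved looks roughly like the following. This is a hypothetical sketch only: the two entries referenced above are not reproduced here, and the plugin coordinates and version numbers are assumptions.

```clojure
;; Hypothetical sketch only -- not the two actual entries from the report above.
;; The plugin coordinates and all version numbers are assumptions.
(defproject ring/ring-core "1.6.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]]
  ;; pull in the parallel-test Leiningen plugin so `lein parallel-test` is available
  :plugins [[com.holychao/parallel-test "0.3.0"]])
```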
|
Hm, I haven't been able to reproduce it, but on our internal build we do get a hang occasionally (with the new core.async or the current master). If you can reproduce it again, would it be possible to get a thread dump so I can see what's going on? I'd be happy to dig into the code a bit. Thanks for your help with the PR! |
|
I was able to reproduce it after a while, I'll run with my proposed fix for a few hours to make sure it doesn't hang again before updating the PR. |
|
Great! If you can pin down what provokes the deadlock, we could try to provoke those conditions specifically and then ensure that the fix properly prevents the deadlock from occurring. |
|
From the thread dump, I saw that the main thread was blocked on the alts!! in parallel-test-vars, which wasn't super helpful. My theory on the deadlock is that when all of the tests have been scheduled, the coordinator go-loop (which is reading off of close-chan) exits. Say that when this happens, all of the testers are still running. Then one of the testers puts its result on close-chan and reads the last test off of open-chan. When all of the other threads finish, there will be as many values on close-chan as there are tester threads, so the first thread will block on its >!. Since no one is reading that channel anymore, the tester thread never exits. My change was to use put! instead of >!, but you could also increase the buffer for close-chan. I like put! just because it doesn't seem to be particularly important that the operation block, and it frees the thread up to start a new test. |
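As a minimal, self-contained illustration of the put!-versus-blocking-put point (using the blocking >!! outside a go block for simplicity; this is a sketch, not the runner's actual code):

```clojure
(require '[clojure.core.async :refer [chan >!! put!]])

;; Sketch only: a results channel with a small buffer and no reader left.
(let [close-chan (chan 1)]       ; small buffer, and no one is reading anymore
  (>!! close-chan :result-1)     ; fills the buffer
  ;; A second blocking put here would park the tester thread forever:
  ;; (>!! close-chan :result-2)
  ;; put! enqueues the value asynchronously and returns immediately,
  ;; leaving the thread free to pick up the next test:
  (put! close-chan :result-2))
```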
|
I'm going to leave my parallel test of ring going overnight just to check, but I wanted to update the PR in case you had any better testing ideas. |
|
I didn't see it fail again for me. Can you still reproduce it? |
|
Hi, just checking in to see if you had a chance to look at this MR recently. |
|
Hi, I'll try to take a look tonight. Thanks for your patience, and sorry for not getting to this sooner! |
|
Ok I tested this and it still provokes deadlocks. I will see if I can isolate what specific change in the PR provokes the failure. |
|
Not terribly surprising, but the deadlocks arise because of upgrading the core.async version. |
|
Pretty sure this is caused by <!! being used in the body of com.holychao.parallel-test.runner/test-ns-vars, which is called from inside a go-loop |
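To sketch why that matters (illustrative only, not the runner's code): a blocking take inside a go block occupies one of core.async's small, fixed pool of go threads instead of parking, so enough workers doing this can starve the pool and hang the run.

```clojure
(require '[clojure.core.async :refer [chan close! go-loop put! <!!]])

;; Sketch of the anti-pattern: <!! inside a go block blocks a pooled go thread
;; rather than parking the go block itself.
(let [work (chan 2)]
  (put! work :a)
  (put! work :b)
  (close! work)
  (go-loop []
    (when-some [job (<!! work)]   ; should be <! (a parking take) inside a go block
      (prn :ran job)
      (recur))))
```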
|
I added that change to the PR, but I still haven't been able to reproduce the deadlock in ring-core. Can you try again with the change? |
|
I'll try to dig up some time this weekend to mess with it, but the last two commits (ee8dbc6 and e22235f) should not be included. Switching to <! means that when a read would block, the job parks and may be resumed by some other thread, and clojure.test fixtures by their nature can only be relied upon if all the tests execute in the same call stack as the fixture. Using <!! is the right call in the loop body, but we probably need to move from the go-loop on line 106 to (async/thread (loop ...)) so that it spawns a dedicated thread that can safely sleep on a channel pull. I think the reason I'm seeing the problems is:
I did a bad thing by using a blocking read (<!!) in an async worker; it used to work fine, but now it's important to get a clean solution that solves the problem properly. |
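The shape of the suggested restructuring would be roughly the following; this is a hedged sketch with assumed channel and function names, not the actual change:

```clojure
(require '[clojure.core.async :as async])

;; Sketch with assumed names: run each worker on its own real thread so a
;; blocking take is safe and every test (with its fixtures) executes in a
;; single call stack.
(defn start-worker [open-chan close-chan run-test]
  (async/thread
    (loop []
      (when-some [test-var (async/<!! open-chan)]    ; blocking take on a real thread
        (async/put! close-chan (run-test test-var))  ; report the result without blocking
        (recur)))))
```

async/thread runs the loop body on its own real thread (returning a channel for its result), so the blocking <!! can't starve the go-block pool and each test plus its fixtures stays on one call stack.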
|
Thanks, sorry for all of the back and forth. If you don't have time to look at it this weekend, I can try to take another crack at it on Monday. |
|
No worries, it's valuable to keep it up to date and I'm happy to know that someone has found this useful enough to contribute back. Parallelism is a hard enough problem to tackle that it's important to be deliberative in our approach to it. :) |
|
I incorporated two of the commits in 068dcf7 and made the necessary changes to how workers are spawned. After multiple test runs locally, this did not appear to provoke the deadlocking problem I was seeing. |
|
btw, thanks for all your work @pschorf, sorry for the long turnaround time :) |
|
Thank you! Glad you were able to figure out the issue. |
|
I'll go ahead and bump the version and get it into clojars once @amoe can confirm this fixes the problem locally. |
|
@aredington Any progress on getting these changes into clojars? |
|
@dposada I just cut 0.3.1 and it's on Clojars |
Two changes: