Conversation
|
I'll take a look at this tonight, was out of town over the weekend and just now have time. Just wanna do a quick test drive before merging. Thanks for your contributions!! |
|
Hi @pschorf, I've tried using your branch to parallel test ring-core as a good reference case, and I think the changes are causing the parallel test invocation to hang indefinitely. My test procedure is to install your branch, clobbering the published version of parallel-test, then to clone ring's master branch and drop into ring-core. I've modified the ring-core project.clj to add the following two entries: I then invoke lein parallel-test. When I use your branch it hangs; when I use my master it runs to completion. Could you take a look and see if you can reproduce? |
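For reference, the sort of project.clj change involved looks roughly like the following. This is a hypothetical sketch only: the two entries referenced above are not reproduced here, and the plugin coordinates and version numbers are assumptions.

```clojure
;; Hypothetical sketch only -- not the two actual entries from the report above.
;; The plugin coordinates and all version numbers are assumptions.
(defproject ring/ring-core "1.6.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]]
  ;; pull in the parallel-test Leiningen plugin so `lein parallel-test` is available
  :plugins [[com.holychao/parallel-test "0.3.0"]])
```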
|
Hm, I haven't been able to reproduce it, but on our internal build we do get a hang occasionally (with the new core.async or the current master). If you can reproduce it again, would it be possible to get a thread dump so I can see what's going on? I'd be happy to dig into the code a bit. Thanks for your help with the PR! |
|
I was able to reproduce it after a while, I'll run with my proposed fix for a few hours to make sure it doesn't hang again before updating the PR. |
|
Great! If you can pin down what provokes the deadlock, we could try to provoke those conditions specifically and then ensure that the fix properly prevents the deadlock from occurring. |
|
From the thread dump, I saw that the main thread was blocked on the alts!! in parallel-test-vars, which wasn't super helpful. My theory on the deadlock is that when all of the tests have been scheduled, the coordinator go-loop (which is reading off of close-chan) exits. Say that when this happens, all of the testers are still running. Then one of the testers puts its result on close-chan and reads the last test off of open-chan. When all of the other threads finish, there will be as many values on close-chan as there are tester threads, so the first thread will block on its >!. Since no one is reading that channel anymore, the tester thread never exits. My change was to use put! instead of >!, but you could also increase the buffer for close-chan. I like put! just because it doesn't seem to be particularly important that the operation block, and it frees the thread up to start a new test. |
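As a minimal, self-contained illustration of the put!-versus-blocking-put point (using the blocking >!! outside a go block for simplicity; this is a sketch, not the runner's actual code):

```clojure
(require '[clojure.core.async :refer [chan >!! put!]])

;; Sketch only: a results channel with a small buffer and no reader left.
(let [close-chan (chan 1)]       ; small buffer, and no one is reading anymore
  (>!! close-chan :result-1)     ; fills the buffer
  ;; A second blocking put here would park the tester thread forever:
  ;; (>!! close-chan :result-2)
  ;; put! enqueues the value asynchronously and returns immediately,
  ;; leaving the thread free to pick up the next test:
  (put! close-chan :result-2))
```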
|
I'm going to leave my parallel test of ring going overnight just to check, but I wanted to update the PR in case you had any better testing ideas. |
|
I didn't see it fail again for me. Can you still reproduce it? |
|
Hi, just checking in to see if you had a chance to look at this MR recently. |
|
Hi, I'll try to take a look tonight. Thanks for your patience, and sorry for not getting to this sooner! |
|
Ok I tested this and it still provokes deadlocks. I will see if I can isolate what specific change in the PR provokes the failure. |
|
Not terribly surprising, but the deadlocks arise because of upgrading the core.async version. |
|
Pretty sure this is caused by <!! being used in the body of com.holychao.parallel-test.runner/test-ns-vars, which is called from inside a go-loop |
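To sketch why that matters (illustrative only, not the runner's code): a blocking take inside a go block occupies one of core.async's small, fixed pool of go threads instead of parking, so enough workers doing this can starve the pool and hang the run.

```clojure
(require '[clojure.core.async :refer [chan close! go-loop put! <!!]])

;; Sketch of the anti-pattern: <!! inside a go block blocks a pooled go thread
;; rather than parking the go block itself.
(let [work (chan 2)]
  (put! work :a)
  (put! work :b)
  (close! work)
  (go-loop []
    (when-some [job (<!! work)]   ; should be <! (a parking take) inside a go block
      (prn :ran job)
      (recur))))
```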
|
I added that change to the PR, but I still haven't been able to reproduce the deadlock in ring-core. Can you try again with the change? |
|
I'll try to dig up some time this weekend to mess with it, but the last two commits (ee8dbc6 and e22235f) should not be included. Switching to <! means that when a read would block, the job parks and may be resumed by some other thread, and clojure.test fixtures by their nature can only be relied upon if all the tests execute in the same call stack as the fixture. Using <!! is the right call in the loop body, but we probably need to move from the go-loop on line 106 to (async/thread (loop ...)) so that it spawns a dedicated thread that can safely sleep on a channel pull. I think the reason I'm seeing the problems is:
I did a bad thing by using a blocking read (<!!) in an async worker; it used to work fine, but now it's important to get a clean solution that solves the problem properly. |
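The shape of the suggested restructuring would be roughly the following; this is a hedged sketch with assumed channel and function names, not the actual change:

```clojure
(require '[clojure.core.async :as async])

;; Sketch with assumed names: run each worker on its own real thread so a
;; blocking take is safe and every test (with its fixtures) executes in a
;; single call stack.
(defn start-worker [open-chan close-chan run-test]
  (async/thread
    (loop []
      (when-some [test-var (async/<!! open-chan)]    ; blocking take on a real thread
        (async/put! close-chan (run-test test-var))  ; report the result without blocking
        (recur)))))
```

async/thread runs the loop body on its own real thread (returning a channel for its result), so the blocking <!! can't starve the go-block pool and each test plus its fixtures stays on one call stack.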
|
Thanks, sorry for all of the back and forth. If you don't have time to look at it this weekend, I can try to take another crack at it on Monday. |
|
No worries, it's valuable to keep it up to date and I'm happy to know that someone has found this useful enough to contribute back. Parallelism is a hard enough problem to tackle that it's important to be deliberative in our approach to it. :) |
|
I incorporated two of the commits in 068dcf7 and made the necessary changes to how workers are spawned. After multiple test runs locally, this did not appear to provoke the deadlocking problem I was seeing. |
|
btw, thanks for all your work @pschorf, sorry for the long turnaround time :) |
|
Thank you! Glad you were able to figure out the issue. |
|
I'll go ahead and bump the version and get it into clojars once @amoe can confirm this fixes the problem locally. |
|
@aredington Any progress on getting these changes into clojars? |
|
@dposada I just cut 0.3.1 and it's on Clojars |
Two changes: