Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes sending to a Channel will segfault #4748

Closed
johnnovak opened this issue Sep 10, 2016 · 24 comments
Closed

Sometimes sending to a Channel will segfault #4748

johnnovak opened this issue Sep 10, 2016 · 24 comments

Comments

@johnnovak
Copy link
Contributor

johnnovak commented Sep 10, 2016

This is happening quite regularly on Windows and the results are catastrophic... I'm actually using a shared channel for this, so I guess it's either a bug in the channel implementation or maybe a GC bug? I'm sending to my shared channel from multiple threads, my understanding is that that's OK as channels use locking internally. The same code has been working flawlessly on Linux and OS X so far.

I am happy to share my full source with the brave person who's willing to investigate this. Isolating the root cause would be extremely hard as I have really no idea where the problem is coming from, and changing anything might actually make it go away.

channels.nim(202)        send
channels.nim(166)        rawSend
alloc.nim(650)           alloc0
alloc.nim(644)           alloc
alloc.nim(481)           rawAlloc
alloc.nim(437)           getSmallChunk
alloc.nim(428)           getBigChunk
alloc.nim(399)           splitChunk
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
@johnnovak
Copy link
Contributor Author

johnnovak commented Sep 10, 2016

Another one... As the exact place of the segfault seems to be random, I have a theory that the GC attempts to free the pointer to my shared channel that I passed in in the thread args. Although I have no idea why, because it was allocated using sharedAlloc0....

workerpool.nim(366)      shutdown
workerpool.nim(248)      sendCmd
channels.nim(202)        send
channels.nim(166)        rawSend
alloc.nim(650)           alloc0
alloc.nim(644)           alloc
alloc.nim(481)           rawAlloc
alloc.nim(437)           getSmallChunk
alloc.nim(428)           getBigChunk
alloc.nim(407)           splitChunk
alloc.nim(337)           listAdd
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@johnnovak
Copy link
Contributor Author

Btw, compiling with -d:useSysAssert and -d:useGcAssert will make the code fail immediately. Although I'm not too sure if these errors are valid or not because I'm getting incRef: interiorPtr even if I deleted the whole body of my thread proc and just echoed a string...

[GCASSERT] incRef: interiorPtr
[GCASSERT] incRef: interiorPtr
[GCASSERT] incRef: interiorPtr
Traceback (most recent call last)
workerpool.nim(71)       threadProc
assign.nim(96)           genericAssign
assign.nim(85)           genericAssignAux
assign.nim(24)           genericAssignAux
assign.nim(21)           genericAssignAux
assign.nim(85)           genericAssignAux
assign.nim(24)           genericAssignAux
assign.nim(21)           genericAssignAux
assign.nim(91)           genericAssignAux
gc.nim(285)              unsureAsgnRef
gc.nim(117)              incRef
[GCASSERT] incRef: interiorPtr
[GCASSERT] incRef: interiorPtr
Traceback (most recent call last)
workerpool.nim(71)       threadProc
assign.nim(96)           genericAssign
assign.nim(85)           genericAssignAux
assign.nim(24)           genericAssignAux
assign.nim(21)           genericAssignAux
assign.nim(85)           genericAssignAux
assign.nim(24)           genericAssignAux
assign.nim(21)           genericAssignAux
assign.nim(91)           genericAssignAux
gc.nim(285)              unsureAsgnRef
gc.nim(117)              incRef

@endragor
Copy link
Contributor

I've had similar issues on OS X before I realized that you mustn't share channels as objects between threads. Make sure the channel is either in a global variable or passed as a ptr between threads (use allocShared0 to allocate the memory for it). Also, make sure Thread object is never copied - it causes similar errors, and compiler doesn't check that, even though it could.

You can find an example of how to properly pass Channel as an argument in #4689.

@johnnovak
Copy link
Contributor Author

johnnovak commented Sep 10, 2016

Thanks man, I appreciate your help. I was aware of the issues you described and checked out your code example. I was already using channels allocated on the shared heap and I was passing only pointers to the thread proc.

In my desperation I tried to create the threads like in your example (currently I'm just creating normal GC'd objects and I'm passing the pointer, that should work and it does). To my surprise, then the whole thing consistently starting crashing right at startup on all platforms, but to my understanding it should not make any difference at all

Anyway, I'm totally confused now, if you could take a look at my code I'd appreciate it, maybe you can spot something that's wrong.

https://github.com/johnnovak/nim-raytracer/blob/master/src/workerpool.nim

To run the tests that sometimes crash on Windows:

cd test
nim c -r --threads:on workerpooltest.nim

@endragor
Copy link
Contributor

endragor commented Sep 10, 2016

It seems you do copy Thread object and that's what causes these problems. Try either changing Worker to be ptr Thread or change WorkerPool to be ref object.

@johnnovak
Copy link
Contributor Author

johnnovak commented Sep 11, 2016

Thanks for the suggestion, I have changed WorkerPool and WorkerArgs to be ref objects and Worker to be ptr Thread, but I'm still having the exact same issue (always works on Linux, randomly crashes on Windows).

I suspect there's something wrong with the Channel implementation that breaks it on Windows under some circumstances. Or maybe I'm using too meany features together in an unsupported way that breaks the runtime... who knows... :/

@Araq
Copy link
Member

Araq commented Sep 11, 2016

well without any code snippet I cannot fix anything.

@johnnovak
Copy link
Contributor Author

johnnovak commented Sep 11, 2016

Hi Araq, you can have a look at the whole source here. Since you would have to debug it anyway, I don't think there's much point in trying to slim it down. I also have a suspicion that maybe I'm using too many features that aren't really compatible with each other...

https://github.com/johnnovak/nim-raytracer/

You can also run the tests that will fail randomly on Windows:

cd test
nim c -r --threads:on workerpooltest.nim

Actually, I would really appreciate it if you could tell me whether I'm doing something stupid in there that is guaranteed to blow up... I'm especially suspicious about those two closures I used for the callback procs (just search for Closure).

UPDATE: I just ran the tests on OS X at work and sometimes it segfaults inside the closures:

Traceback (most recent call last)
workerpool.nim(89)       threadProc
workerpool.nim(78)       ack
workerpool.nim(203)      ackClosure
workerpool.nim(131)      ackCb
workerpool.nim(121)      sendEvent
unittest.nim(295)        eventCb
unittest.nim(184)        checkpoint
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Still a bit strange that it crashed with a different error on different operating systems. Anyway, any advice is appreciated, I've been struggling with this for days now... Thanks!

EDIT: fixed repo URL

@Araq
Copy link
Member

Araq commented Sep 13, 2016

Oh yeah, right now the T that you can pass to a thread needs to be free of GC'ed memory (including your closures). I'm working on a fix, but perhaps this information enables you to come up with a workaround.

@johnnovak
Copy link
Contributor Author

Ah, that explains the random crashes then. Thanks for the info, I'm looking forward to the fix. Hopefully that will fix all my problems.

@johnnovak
Copy link
Contributor Author

@Araq Just wondering, any updates on this?

@Araq
Copy link
Member

Araq commented Oct 6, 2016

I wanted to retest on Windows, sorry. Can you retest? Lots of bugs wrt channels have been fixed for 0.15.

@johnnovak
Copy link
Contributor Author

johnnovak commented Oct 6, 2016

I could only retest on OS X at the moment because I'm at work, but the issue still persists with the latest dev version. Just keep rerunning the tests as described in #4748 (comment)

Some test runs will succeed, but sooner of later you'll get random errors as described in the above comment (basically, it still fails inside the closure).

But as you hinted earlier, this might not even be a channel bug, more like a GC/thread bug:

Oh yeah, right now the T that you can pass to a thread needs to be free of GC'ed memory (including your closures).

So this definitely needs more work...

EDIT: I've removed "on Windows" from the title because it's actually happening on all platforms (just a bit more likely on Windows).

@johnnovak johnnovak changed the title Sometimes sending to a Channel will segfault on Windows Sometimes sending to a Channel will segfault Oct 6, 2016
@Araq
Copy link
Member

Araq commented Oct 7, 2016

So this definitely needs more work...

Oh no, that particular bug was fixed too.

Ah, yes, channels do predate 'closure', that could explain everything. :-)

@johnnovak
Copy link
Contributor Author

johnnovak commented Oct 8, 2016

So are you gonna fix it? :)

EDIT: Btw, there's definitely some improvement with the latest dev version. I had to use object instead of ref object previously for some stuff that's being used inside the worker thread because refs were causing instant crashes. That's fixed now, so there's progress :)

@Araq
Copy link
Member

Araq commented Feb 2, 2017

Allocator and channels bugfixes mean this could now work. Please retest.

@johnnovak
Copy link
Contributor Author

Issue is still there:

Traceback (most recent call last)
workerpool.nim(89)       threadProc
workerpool.nim(78)       ack
workerpool.nim(203)      ackClosure
workerpool.nim(131)      ackCb
workerpool.nim(121)      sendEvent
unittest.nim(295)        eventCb
unittest.nim(184)        checkpoint
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

It would be best if you tested this with the code I provided earlier, then you can replicate it exactly and debug it.

@endragor
Copy link
Contributor

endragor commented Feb 7, 2017

@johnnovak Could you please duplicate the link you provided? There are multiple links above and I'm not sure which one is up to date.

The error you show is different from what was there in the past. At the moment, judging by the stacktrace, the problem seems to be that you use unittest module from non-main thread which is not the expected usage. That should be easy to fix, but are you sure you are doing what was intended?

@johnnovak
Copy link
Contributor Author

@endragor You can find all the info on how to replicate this in this post: #4748 (comment)

If you check out that post you'll see that message I got now is the same to the letter. But yeah, sometimes I was able to get different messages randomly, you're right in that...

The workerpool does not only fail in the unit tests, but also in actual use sometimes. So I doubt that it's only an unit test issue. Anyway, it would be good if you could have a look at the code as well, maybe you'll be able to spot something. I spent weeks on this and I just gave up at this point... I figured the language is just broken so I'll wait until it gets fixed properly.

@endragor
Copy link
Contributor

endragor commented Feb 8, 2017

@johnnovak Could you try to redesign your tests so that all calls to unittest module are only performed in the main thread? I've looked a bit more at it, and unittest is simply not designed to be invoked for multiple threads, and I think that makes sense. So best we can do is raise a proper error if unittest is invoked from non-main thread.

That seems to be the only problem left, because there seems to be no more corruption crashes like you had in the past:

channels.nim(202)        send
channels.nim(166)        rawSend
alloc.nim(650)           alloc0
alloc.nim(644)           alloc
alloc.nim(481)           rawAlloc
alloc.nim(437)           getSmallChunk
alloc.nim(428)           getBigChunk
alloc.nim(399)           splitChunk
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

(this is the stacktrace from the description of the issue).

But you'll know for sure when unittest is only used from main thread.

@vegansk
Copy link
Contributor

vegansk commented Feb 9, 2017

Oops, I think, I get the same problem on windows. Preparing the test...

Traceback (most recent call last)
logclient.nim(158)       zenith_action
client_impl.nim(87)      logZenithAction
either.nim(162)          tryS
client_impl.nim(88)      :anonymous
channels.nim(222)        send
channels.nim(194)        rawSend
channels.nim(139)        storeAux
channels.nim(70)         storeAux
channels.nim(63)         storeAux
channels.nim(76)         storeAux
channels.nim(139)        storeAux
channels.nim(65)         storeAux
channels.nim(63)         storeAux
channels.nim(158)        storeAux

@johnnovak
Copy link
Contributor Author

@vegansk Have you had any success in isolation the issue?

@Araq
Copy link
Member

Araq commented Jul 1, 2018

No feedback. Closing.

@Araq Araq closed this as completed Jul 1, 2018
@FedericoCeratto
Copy link
Member

I had a similar error involving sendImpl -> rawSend -> storeAux leading to SIGSEGV: Illegal storage access. (Attempt to read from nil?) when the channel was not opened with open()

chr-1x added a commit to chr-1x/Nim that referenced this issue Jan 20, 2020
This PR revives nim-lang#8611 since the channels module still does not have an
example of usage. I attempted to fix the outstanding issue that caused
the previous PR to be closed, namely implementing the example on top of
threads instead of threadpool. I also added an additional example of
using sharedAlloc0 to create channel pointers based on the discussion in
 nim-lang#4748. Please let me know if there is anything that needs to be
changed!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants