"panic: send on closed channel" during indexing #195
This is definitely a bug, but I'd like to try and reproduce it (the line numbers in the trace don't seem to correspond with the code in your bleve-bench repo).
I haven't been able to reproduce it yet, though I may ask @steveyen to help me review the relevant sections of code later today. Can you let me know which bleve commit you're on, and if possible provide the datasource file you used for testing? Thanks.
Let me see if I can reproduce it. I force-pushed a couple of times last night; that's why the line numbers are off.
I tried a couple more times, but have not managed to reproduce this. Perhaps it was something about the earlier state of my code, but unfortunately I no longer have the code. :-(
@steveyen helped me review the code. We did theorize one way it could happen. If you were to add a document to a batch after you had already started executing the batch, there could be a data race between where we compute how many items we expect to read from a result channel and how many items we actually push onto the work channel. Obviously your code would not do this by design, but I can see how it could happen during development. The next question is what should be done to prevent this. Right now batches are a use-once model, so calls to Index/Delete/SetInternal/DeleteInternal after executing the batch are not correct. Unfortunately, not all of these operations return an error today. One option is to modify the signatures to return an error. Another option is to have the methods acquire a lock, in which case any calls to modify the batch during execution would block. At first the second option looks like a hack that provides safety without letting the user know there is a problem... But I'd like to consider a future where we can reuse batches by calling a Reset() method, which could avoid some unnecessary allocations.
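For illustration, here is a minimal sketch of the misuse theorized above (this is not bleve's internal code; the document contents are made up): the batch is executed in one goroutine while the caller keeps mutating the same Batch object.

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// In-memory index, just for the sketch.
	index, err := bleve.NewMemOnly(bleve.NewIndexMapping())
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	batch := index.NewBatch()
	if err := batch.Index("doc-1", map[string]interface{}{"body": "first"}); err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go func() {
		// Executing the batch computes how many analysis results to expect,
		// then closes the result channel after reading exactly that many.
		if err := index.Batch(batch); err != nil {
			log.Println(err)
		}
		close(done)
	}()

	// Mutating the batch after execution has started races with that count
	// and can end in "panic: send on closed channel". Create a new batch
	// (or wait for execution to finish) instead.
	_ = batch.Index("doc-2", map[string]interface{}{"body": "second"})
	<-done
}
```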
Actually, this might make sense. When I first coded this I forgot to ask for a new batch after I started executing the previous one, so I was re-using batch objects. This has been fixed now in my code.
Confirmed I was able to cause the same crash in a test case. Evaluating our options...
To summarize what I put into the commit message: the Batch object was never intended to be thread-safe, and this particular panic was actually a manifestation of inadvertently mutating the batch from multiple goroutines at the same time. To address this we have done two things:
Is this the same as #41?
I just posted to the same effect on blevesearch/blevex#41. The root cause of the panic is the same, but the reason we get into this state is likely different and unique to the automatic index batcher. The pattern bleve uses is to figure out how many items are going to be analyzed, build a channel, pass it to the analysis queue, and read off EXACTLY the expected number of items. We then close the channel and move on. However, in this case another analyzed document was written to the channel AFTER we had already closed it. As we observed in this issue, this can happen if you start to execute a batch and then keep mutating the batch without waiting for the batch execution to finish. I'm not very familiar with the automatic index batcher, so I'm reviewing it now to see if I spot anything.
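A generic, self-contained sketch of the count-then-close pattern described above (plain Go channels, not bleve's actual analysis queue), showing why a send that arrives after the close would panic:

```go
package main

import "fmt"

func main() {
	docs := []string{"a", "b", "c"}
	expected := len(docs) // count computed up front, before analysis starts

	results := make(chan string)
	for _, d := range docs {
		go func(d string) {
			results <- d + "-analyzed"
		}(d)
	}

	// Read off EXACTLY the expected number of items, then close and move on.
	for i := 0; i < expected; i++ {
		fmt.Println(<-results)
	}
	close(results)

	// If another document were pushed after this point (e.g. because the
	// batch was mutated while executing), its send would hit the closed
	// channel and trigger "panic: send on closed channel".
}
```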
Yeah, I'm going to have to modify this IndexBatcher a bit. I'm not too familiar with the pattern it implements, however. Do you know why we do this whole song 'n dance of blocking the callers up to the period interval? Why not just fill a buffer and write batches up to buffer_size? Or fill a buffer and then batch the contents of the buffered operations every X interval?
Or in other words: it's not clear to me why you'd want to block your callers at all. Certainly in my case I'd rather not, in order to get higher write throughput.
At Couchbase we do something closer to what you've described. We just build batches of the sizes we like, execute them, and recycle them. It's straightforward and works well for us. The user who contributed this wanted to try and hide the batching entirely and just expose a simpler API that didn't do any explicit batching. It sounds like you should just avoid using this automatic batcher.
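As a rough illustration of that approach, here is a minimal sketch assuming a fixed batch size (batchSize, indexAll, and the document shapes are illustrative, not from this thread): fill a batch up to a chosen size, execute it, then start a fresh batch rather than reusing the old one.

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

const batchSize = 100 // illustrative threshold

// indexAll builds batches up to batchSize, executes them, and starts a
// fresh batch each time instead of reusing the executed one.
func indexAll(index bleve.Index, docs map[string]interface{}) error {
	batch := index.NewBatch()
	for id, doc := range docs {
		if err := batch.Index(id, doc); err != nil {
			return err
		}
		if batch.Size() >= batchSize {
			if err := index.Batch(batch); err != nil {
				return err
			}
			batch = index.NewBatch()
		}
	}
	// Flush whatever is left over.
	if batch.Size() > 0 {
		return index.Batch(batch)
	}
	return nil
}

func main() {
	index, err := bleve.NewMemOnly(bleve.NewIndexMapping())
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	docs := map[string]interface{}{
		"doc-1": map[string]interface{}{"body": "hello"},
		"doc-2": map[string]interface{}{"body": "world"},
	}
	if err := indexAll(index, docs); err != nil {
		log.Fatal(err)
	}
}
```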
So... I'm able to repro this with this very simple naive batching indexer:
Crash:
Which is at exactly the same place. I don't see how this can happen if I have a mutex guard around all batch operations.
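Since the snippet and crash trace above are not included here, the following is only a hypothetical reconstruction of the kind of mutex-guarded, Reset()-based batching indexer being discussed; every name and interval in it is illustrative.

```go
package main

import (
	"sync"
	"time"

	"github.com/blevesearch/bleve"
)

// batchingIndexer guards all batch mutations and the flush with one mutex,
// and reuses the same Batch object via Reset() after each flush.
type batchingIndexer struct {
	mu    sync.Mutex
	index bleve.Index
	batch *bleve.Batch
}

func newBatchingIndexer(index bleve.Index, flushEvery time.Duration) *batchingIndexer {
	b := &batchingIndexer{index: index, batch: index.NewBatch()}
	go func() {
		for range time.Tick(flushEvery) {
			_ = b.flush()
		}
	}()
	return b
}

func (b *batchingIndexer) Index(id string, doc interface{}) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.batch.Index(id, doc)
}

func (b *batchingIndexer) flush() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.batch.Size() == 0 {
		return nil
	}
	err := b.index.Batch(b.batch)
	// Reusing the batch via Reset() is the pattern the later comments
	// associate with the crash; allocating a new batch here instead
	// reportedly avoids it.
	b.batch.Reset()
	return err
}

func main() {
	index, err := bleve.NewMemOnly(bleve.NewIndexMapping())
	if err != nil {
		panic(err)
	}
	defer index.Close()

	bi := newBatchingIndexer(index, 100*time.Millisecond)
	_ = bi.Index("doc-1", map[string]interface{}{"body": "hello"})
	time.Sleep(time.Second) // give the background flush a chance to run
}
```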
Can we please get some more eyes on this and consider reopening? I've crawled through my simple indexer with a fine-tooth comb and can't understand where the problem lies.
I've added a main function and here is a complete example, but it does not crash for me. Maybe you're configuring the batcher differently, or using it differently; can you take a look?
My code is identical. The test environment is what is different. My conditions are:
This crashes at the same spot quite consistently.
New find: re-creating the batch instead of calling Reset() avoids the crash.
Note the lines:
I don't think it's new information, from my understanding. Our understanding of this crash is as follows:
1. When a batch is executed, we compute how many documents will be analyzed, and therefore how many results we expect to read from the result channel before closing it.
2. Each document in the batch is pushed onto the analysis work channel, and its analyzed result is written back to the result channel.
This crash happens when the count computed in 1 and the number of documents written in 2 are not the same. In the original issue, this happened because the user would continue to modify a batch after it had been executed. That is clearly not supported, and thus the issue was closed. In this case, we still don't understand why these counts differ, but this seems to still be the only way we would end up closing the channel prematurely and getting the panic. Not resetting batches avoids this entirely, but presumably that is unacceptable for other reasons. The question remains: why do we see this problem, even though the locking looks correct and the batch reuse looks to be as intended?
@mschoch What I don't understand is how this is even remotely possible while all operations occur within the lock.
If this is what's happening, then my solution is a "hack" but it works, and the IndexBatcher in blevex is just plain wrong.
@prologic, so I've tried to reproduce this issue a good number of times with the test snippet you shared, but haven't been able to so far. Also, as we've noted earlier, it's not quite obvious from the code why this panic seems to occur only with the batch.Reset(), as all operations occur within the lock. Could you perhaps give more tips on how you've managed to reproduce the issue with your test? I'm not quite familiar with hey (https://github.com/rakyll/hey), and how you've used it to reproduce the issue.
Can I provide someone with the source to my project where I can still reliably repro this panic? (I haven't published this project and am not ready to do so yet...) Alternatively, I *could* try to create an MVP that repros this, but I worry it may be specific to my code and the load patterns I put on the system.
At this point, any test case that we can set up and run to reliably reproduce this issue could prove very useful. Cheers.
While running tests described at http://www.philipotoole.com/increasing-bleve-performance-sharding/ I hit the panic below 3 times.