-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Close LRU by database path for deleted database/index #2130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| "enable_database_recovery", false), | ||
| case RecoveryEnabled of | ||
| true -> | ||
| clouseau_rpc:close_lru(DbName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an inconsistency here: clouseau:close_lru/1 is called in the DB event handler, while the rename message is processed by the dreyfus_index_manager gen_server (after casting to it). As I mentioned over in the partner clouseau PR, this opens up a race condition.
I think we ought to do one or the other. In general, I'd say offload work to the caller if you can, but in this case the caller is our couch_event listener, so that might not be smart (we could end up backlogging that). However, pushing more traffic through dreyfus_index_manager is also not ideal... :)
In terms of the race condition, it probably makes the most sense to move both clouseau RPCs into dreyfus_index_manager:handle_cast/3, so that the gen_server serialises the close/rename and other open (aka get_index) operations.
|
@jiangphcn Thanks, I think this approach is more promising, but I think we need to address race windows on both sides: here by moving the two operations together, and in clouseau by adding some defensive logic to force the writer closed on an IOError. |
|
@theburge thanks for your valuable comments on these two PRs and your patience. I already logic to force to close writer when getting IOException cloudant-labs/clouseau@ff6d496 Regarding to combine two RPC calls (
Back to race condition scenario, I am thinking that it is possible to have |
theburge
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
(Assuming the CI failure is spurious)
@jiangphcn and I agreed to try this approach and see how it performs in practice, rather than do a bigger rework to eliminate the remaining races.
5a3de26 to
29cad0a
Compare
Overview
The problem was originally exposed when updating search index with the following error:
Finally, it turns out that the soft-deletion path is missing a call to evict the shard from LRU (and therefore release the associated locks). This causes that lock file was left in working directory and search index failed to be updated.
To address this, the fix is to close the corresponding index in LRU before index is to be renamed/deleted.
Testing recommendations
@theburge made excellent troubleshooting and analysis, and also wrote below script to reproduce problem.
repro2.sh.txt
The problem doesn't occur after using the fix here and cloudant-labs/clouseau#23
Related Issues or Pull Requests
cloudant-labs/clouseau#23
Checklist
rel/overlay/etc/default.ini