Close LRU by database path for deleted database/index #2130

jiangphcn · 2019-08-22T09:30:22Z

Overview

The problem was originally exposed when updating search index with the following error:

{error, "*** eval: {'EXIT',{{badmatch,{error,<<"Lock obtain timed out: NativeFSLock@/srv/search_index/shards/20000000-3fffffff/user1/db1.1509715496/ec8bc1dc690a15d795e65f7dce409e2c/write.lock">>}},\n                   [{erl_eval,expr,3,[]}]}}"}

Finally, it turns out that the soft-deletion path is missing a call to evict the shard from LRU (and therefore release the associated locks). This causes that lock file was left in working directory and search index failed to be updated.

To address this, the fix is to close the corresponding index in LRU before index is to be renamed/deleted.

Testing recommendations

@theburge made excellent troubleshooting and analysis, and also wrote below script to reproduce problem.

repro2.sh.txt

The problem doesn't occur after using the fix here and cloudant-labs/clouseau#23

Related Issues or Pull Requests

cloudant-labs/clouseau#23

Checklist

Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

theburge · 2019-08-22T11:34:08Z

src/dreyfus/src/dreyfus_index_manager.erl

        "enable_database_recovery", false),
    case RecoveryEnabled of
        true ->
+            clouseau_rpc:close_lru(DbName),


There's an inconsistency here: clouseau:close_lru/1 is called in the DB event handler, while the rename message is processed by the dreyfus_index_manager gen_server (after casting to it). As I mentioned over in the partner clouseau PR, this opens up a race condition.

I think we ought to do one or the other. In general, I'd say offload work to the caller if you can, but in this case the caller is our couch_event listener, so that might not be smart (we could end up backlogging that). However, pushing more traffic through dreyfus_index_manager is also not ideal... :)

In terms of the race condition, it probably makes the most sense to move both clouseau RPCs into dreyfus_index_manager:handle_cast/3, so that the gen_server serialises the close/rename and other open (aka get_index) operations.

theburge · 2019-08-22T11:36:47Z

@jiangphcn Thanks, I think this approach is more promising, but I think we need to address race windows on both sides: here by moving the two operations together, and in clouseau by adding some defensive logic to force the writer closed on an IOError.

jiangphcn · 2019-10-25T13:39:23Z

@theburge thanks for your valuable comments on these two PRs and your patience. I already logic to force to close writer when getting IOException cloudant-labs/clouseau@ff6d496

Regarding to combine two RPC calls (close_lru and rename) into one, I think that it is good idea. But I am getting some difficulty to combine them because they are sent to different process reference (two services in clouseau side)

rpc({main, clouseau()}, {close_lru_by_path, DbName}). i.e. IndexCleanupService in clouseau
gen_server:cast({cleanup, clouseau()}, {rename, DbName}) i.e. IndexManagerService in clouseau

Back to race condition scenario, I am thinking that it is possible to have close_lru, open, and rename index calling sequence. However, how much chance to have these calls against the same shared database.

theburge

+1

(Assuming the CI failure is spurious)

@jiangphcn and I agreed to try this approach and see how it performs in practice, rather than do a bigger rework to eliminate the remaining races.

theburge reviewed Aug 22, 2019

View reviewed changes

jiangphcn changed the title ~~[WIP] close LRU by database path~~ Close LRU by database path for deleted database/index Nov 1, 2019

jiangphcn mentioned this pull request Nov 1, 2019

Close LRU by database path for deleted database/index cloudant-labs/clouseau#23

Merged

4 tasks

jiangphcn force-pushed the close-lru branch from 14a7608 to 3e4c2c7 Compare November 1, 2019 07:56

theburge approved these changes Nov 4, 2019

View reviewed changes

jiangphcn force-pushed the close-lru branch 2 times, most recently from 5a3de26 to 29cad0a Compare November 4, 2019 23:14

close LRU by database path

367d17a

jiangphcn force-pushed the close-lru branch from 29cad0a to 367d17a Compare November 5, 2019 00:50

jiangphcn merged commit d60551d into master Nov 5, 2019

jiangphcn deleted the close-lru branch November 5, 2019 05:34

jiangphcn mentioned this pull request Dec 12, 2019

Unlock IndexWriter when IndexService process exits cloudant-labs/clouseau#27

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Close LRU by database path for deleted database/index #2130

Close LRU by database path for deleted database/index #2130

Uh oh!

jiangphcn commented Aug 22, 2019 •

edited

Loading

Uh oh!

theburge Aug 22, 2019

Uh oh!

theburge commented Aug 22, 2019

Uh oh!

jiangphcn commented Oct 25, 2019

Uh oh!

theburge left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Close LRU by database path for deleted database/index #2130

Close LRU by database path for deleted database/index #2130

Uh oh!

Conversation

jiangphcn commented Aug 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Testing recommendations

Related Issues or Pull Requests

Checklist

Uh oh!

theburge Aug 22, 2019

Choose a reason for hiding this comment

Uh oh!

theburge commented Aug 22, 2019

Uh oh!

jiangphcn commented Oct 25, 2019

Uh oh!

theburge left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jiangphcn commented Aug 22, 2019 •

edited

Loading