Skip to content

Conversation

@jiangphcn
Copy link
Contributor

@jiangphcn jiangphcn commented Aug 22, 2019

Overview

The problem was originally exposed when updating search index with the following error:

{error, "*** eval: {'EXIT',{{badmatch,{error,<<"Lock obtain timed out: NativeFSLock@/srv/search_index/shards/20000000-3fffffff/user1/db1.1509715496/ec8bc1dc690a15d795e65f7dce409e2c/write.lock">>}},\n                   [{erl_eval,expr,3,[]}]}}"}

Finally, it turns out that the soft-deletion path is missing a call to evict the shard from LRU (and therefore release the associated locks). This causes that lock file was left in working directory and search index failed to be updated.

To address this, the fix is to close the corresponding index in LRU before index is to be renamed/deleted.

Testing recommendations

@theburge made excellent troubleshooting and analysis, and also wrote below script to reproduce problem.

repro2.sh.txt

The problem doesn't occur after using the fix here and cloudant-labs/clouseau#23

Related Issues or Pull Requests

cloudant-labs/clouseau#23

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

"enable_database_recovery", false),
case RecoveryEnabled of
true ->
clouseau_rpc:close_lru(DbName),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an inconsistency here: clouseau:close_lru/1 is called in the DB event handler, while the rename message is processed by the dreyfus_index_manager gen_server (after casting to it). As I mentioned over in the partner clouseau PR, this opens up a race condition.

I think we ought to do one or the other. In general, I'd say offload work to the caller if you can, but in this case the caller is our couch_event listener, so that might not be smart (we could end up backlogging that). However, pushing more traffic through dreyfus_index_manager is also not ideal... :)

In terms of the race condition, it probably makes the most sense to move both clouseau RPCs into dreyfus_index_manager:handle_cast/3, so that the gen_server serialises the close/rename and other open (aka get_index) operations.

@theburge
Copy link
Contributor

@jiangphcn Thanks, I think this approach is more promising, but I think we need to address race windows on both sides: here by moving the two operations together, and in clouseau by adding some defensive logic to force the writer closed on an IOError.

@jiangphcn
Copy link
Contributor Author

@theburge thanks for your valuable comments on these two PRs and your patience. I already logic to force to close writer when getting IOException cloudant-labs/clouseau@ff6d496

Regarding to combine two RPC calls (close_lru and rename) into one, I think that it is good idea. But I am getting some difficulty to combine them because they are sent to different process reference (two services in clouseau side)

  • rpc({main, clouseau()}, {close_lru_by_path, DbName}). i.e. IndexCleanupService in clouseau
  • gen_server:cast({cleanup, clouseau()}, {rename, DbName}) i.e. IndexManagerService in clouseau

Back to race condition scenario, I am thinking that it is possible to have close_lru, open, and rename index calling sequence. However, how much chance to have these calls against the same shared database.

@jiangphcn jiangphcn changed the title [WIP] close LRU by database path Close LRU by database path for deleted database/index Nov 1, 2019
Copy link
Contributor

@theburge theburge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

(Assuming the CI failure is spurious)

@jiangphcn and I agreed to try this approach and see how it performs in practice, rather than do a bigger rework to eliminate the remaining races.

@jiangphcn jiangphcn force-pushed the close-lru branch 2 times, most recently from 5a3de26 to 29cad0a Compare November 4, 2019 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants