
Conversation

@jiangphcn commented Aug 22, 2019

Overview

The problem was originally exposed when updating a search index failed with the following error:

{error, "*** eval: {'EXIT',{{badmatch,{error,<<"Lock obtain timed out: NativeFSLock@/srv/search_index/shards/20000000-3fffffff/user1/db1.1509715496/ec8bc1dc690a15d795e65f7dce409e2c/write.lock">>}},\n                   [{erl_eval,expr,3,[]}]}}"}

It turns out that the soft-deletion path is missing a call to evict the shard from the LRU (and therefore release the associated locks). As a result, the lock file was left in the working directory and the search index could not be updated.

To address this, the fix is to close the corresponding index in the LRU before the index is renamed/deleted.
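
In other words, before renaming the directory, walk the LRU and close every open index whose path falls under the database path, so the Lucene write.lock is released first. Below is a minimal sketch of that eviction step; the LRU shape, `IndexHandle`, and `closeByDbPath` are illustrative names, not the actual clouseau structures:

```scala
import scala.collection.mutable

// Hypothetical handle for a running index; in practice closing it would
// close the underlying Lucene IndexWriter and release its write.lock.
trait IndexHandle {
  def close(): Unit
}

// Hypothetical LRU keyed by index path,
// e.g. "shards/20000000-3fffffff/user1/db1.1509715496/<signature>".
class IndexLRU {
  private val entries = mutable.LinkedHashMap[String, IndexHandle]()

  def put(path: String, handle: IndexHandle): Unit = entries.put(path, handle)

  // Close and drop every index that lives under the given database path,
  // so no lock file is left behind when the directory is renamed/deleted.
  def closeByDbPath(dbPath: String): Unit = {
    val prefix = if (dbPath.endsWith("/")) dbPath else dbPath + "/"
    val victims = entries.keys.filter(p => p == dbPath || p.startsWith(prefix)).toList
    victims.foreach { path =>
      entries.remove(path).foreach(_.close())
    }
  }
}
```

The important property is that the close happens before the rename is issued, so the NativeFSLock is already gone by the time the directory moves.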

Testing recommendations

@theburge did excellent troubleshooting and analysis, and also wrote the script below to reproduce the problem.

repro2.sh.txt

The problem no longer occurs with the fix here and apache/couchdb#2130.

Related Issues or Pull Requests

apache/couchdb#2130

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

@theburge

@jiangphcn Thanks, that looks much more suitable for closing an individual shard's indexes prior to renaming the shard's directory.

I'm a little concerned about the performance impact of walking the whole LRU, but in practice I've never seen it exceed 5000 items. I think this is fine for now.

I remain more concerned about the race condition that exists in dreyfus: close-then-rename as two separate operations is a potential problem, and the current implementation in dreyfus does not serialise attempts to open and calls to close_lru/1. (Note that close_lru/1 is called in the caller's process, while the rename message is handled by the dreyfus_index_manager gen_server, so an open operation could easily sneak in before the rename message is issued but after the close_lru/1 succeeds partially.)

I don't propose attempting to close that race condition, but I think the implementation here ought to include some defensive logic to clean up properly when closing, specifically in the code that logs this:

Jul 19 13:32:55 db1.testy013 local5: 2019-07-19 13:32:55,741 clouseau.cleanup [INFO] Renaming '/srv/search_index/shards/00000000-ffffffff/test-user/test-db.1563543159' to '/srv/search_index/shards/00000000-ffffffff/test-user/test-db.20190719.133255.deleted.1563543159'
Jul 19 13:32:55 db1.testy013 local5: 2019-07-19 13:32:55,741 clouseau [WARN] shards/00000000-ffffffff/test-user/test-db.1563543159/b7449da0be4816aa552bd089d7a3f51c Error while closing writer
Jul 19 13:32:55 db1.testy013 org.apache.lucene.store.NoSuchDirectoryException: directory '/srv/search_index/shards/00000000-ffffffff/test-user/test-db.1563543159/b7449da0be4816aa552bd089d7a3f51c' does not exist
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:219)
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:324)
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:360)
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2095)
Jul 19 13:32:55 db1.testy013     at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2033)
Jul 19 13:32:55 db1.testy013     at com.cloudant.clouseau.IndexService.exit(IndexService.scala:220)
Jul 19 13:32:55 db1.testy013     at scalang.Process.trapExit(Process.scala:58)
Jul 19 13:32:55 db1.testy013     at scalang.Process.handleExit(Process.scala:46)
Jul 19 13:32:55 db1.testy013     at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
Jul 19 13:32:55 db1.testy013     at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
Jul 19 13:32:55 db1.testy013     at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
Jul 19 13:32:55 db1.testy013     at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
Jul 19 13:32:55 db1.testy013     at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
Jul 19 13:32:55 db1.testy013     at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
Jul 19 13:32:55 db1.testy013     at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
Jul 19 13:32:55 db1.testy013     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jul 19 13:32:55 db1.testy013     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jul 19 13:32:55 db1.testy013     at java.lang.Thread.run(Thread.java:748)
Jul 19 13:32:55 db1.testy013     at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)

The writer would need to be explicitly closed/released here when the rollback fails to avoid the writer lock being held indefinitely.

(Naturally, that close/release code should not throw an additional exception under any circumstances: In general, it's a wise precaution to have clean-up code never fail and leak resources, even if it means suppressing an otherwise fault-indicating exception. Also, in the clouseau/dreyfus context, it's usually fine to throw away changes during a failure, since it's secondary data and ought to be rebuilt later.)
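
To make that concrete, here's a minimal sketch of the kind of defensive shutdown I have in mind, written against a Lucene 4.x-style IndexWriter; `shutdownWriterQuietly` and the logging callback are illustrative, not the existing clouseau code:

```scala
import org.apache.lucene.index.IndexWriter

object WriterCleanup {
  // Best-effort shutdown: try to roll back pending changes; if that fails
  // (e.g. because the directory was already renamed), still try to close the
  // writer so its write.lock is released. The cleanup path never rethrows.
  def shutdownWriterQuietly(writer: IndexWriter, logWarn: (String, Throwable) => Unit): Unit = {
    try {
      writer.rollback()
    } catch {
      case e: Exception =>
        logWarn("Error while rolling back writer", e)
        try {
          writer.close()
        } catch {
          case e2: Exception =>
            // Suppress: throwing here would leave the lock held and abort the
            // rest of the cleanup; the index can be rebuilt later anyway.
            logWarn("Error while closing writer after failed rollback", e2)
        }
    }
  }
}
```

The key point is just that the failure path still releases the writer and then stops escalating.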

What do you think?

@jiangphcn changed the title from "[WIP] close LRU by database name" to "Close LRU by database path for deleted database/index" on Nov 1, 2019

@theburge left a comment


+1

@jiangphcn

Thanks again to @theburge. As we discussed in apache/couchdb#2130 (review), let's try the current approach.
