Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more tests to ASSERT_STATUS_CHECKED (3), API change #7715

Closed
wants to merge 10 commits into from

Conversation

adamretter
Copy link
Collaborator

@adamretter adamretter commented Nov 26, 2020

Third batch of adding more tests to ASSERT_STATUS_CHECKED.

  • db_compaction_filter_test
  • db_compaction_test
  • db_dynamic_level_test
  • db_inplace_update_test
  • db_sst_test
  • db_tailing_iter_test
  • db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

@adamretter adamretter added the bug Confirmed RocksDB bugs label Nov 27, 2020
@adamretter adamretter force-pushed the assert-status-checked-3 branch 2 times, most recently from 30a0c2b to ade8011 Compare December 1, 2020 21:48
@adamretter adamretter force-pushed the assert-status-checked-3 branch 2 times, most recently from a49a313 to ecfa9e6 Compare December 7, 2020 20:55
@adamretter
Copy link
Collaborator Author

@pdillinger This one is ready. Can you take a look please?

db/db_compaction_filter_test.cc Outdated Show resolved Hide resolved
db/db_compaction_test.cc Outdated Show resolved Hide resolved
db/db_compaction_test.cc Outdated Show resolved Hide resolved
db/db_compaction_test.cc Outdated Show resolved Hide resolved
db/db_compaction_test.cc Outdated Show resolved Hide resolved
db/db_tailing_iter_test.cc Outdated Show resolved Hide resolved
db/db_test_util.cc Outdated Show resolved Hide resolved
db/event_helpers.cc Outdated Show resolved Hide resolved
db/forward_iterator.cc Outdated Show resolved Hide resolved
file/sst_file_manager_impl.cc Outdated Show resolved Hide resolved
@adamretter
Copy link
Collaborator Author

adamretter commented Dec 10, 2020

@mrambacher @pdillinger If I run db_sst_test over-and-over again, I get intermittent ASC failures which seems odd, as looking at db_sst_test.cc:960 it looks like it should either always succeed or fail:

Failed to check Status 0x7fb15880ad18
#0   rocksdb::port::PrintStack(int) (in db_sst_test) (stack_trace.cc:121)
#1   rocksdb::Status::~Status() (in db_sst_test) (status.h:42)
#2   rocksdb::Status::~Status() (in db_sst_test) (status.h:47)
#3   rocksdb::ErrorHandler::~ErrorHandler() (in db_sst_test) (error_handler.h:28)
#4   rocksdb::ErrorHandler::~ErrorHandler() (in db_sst_test) (error_handler.h:28)
#5   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:671)
#6   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:671)
#7   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:666)
#8   Failed to start process notifications for pid 99708 (19)	rocksdb::DBTestBase::Close() (in db_sst_test) (db_test_util.cc:0)
#9   Failed to start process notifications for pid 99708 (19)	rocksdb::DBTestBase::Destroy(rocksdb::Options const&, bool) (in db_sst_test) (db_test_util.cc:696)
#10  Failed to start process notifications for pid 99708 (19)	rocksdb::DBTestBase::DestroyAndReopen(rocksdb::Options const&) (in db_sst_test) (db_test_util.cc:683)
#11  Failed to start process notifications for pid 99708 (19)	rocksdb::DBSSTTest_DBWithMaxSpaceAllowedRandomized_Test::TestBody() (in db_sst_test) (db_sst_test.cc:960)

After a bit of digging into this, I think there may be a race-condition in the test itself... I will see if I can figure it out...

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the big set of fixes!

Suspected bug in DBTestBase::CountFiles. And please don't bundle API changes with big test refactoring/polishing. If this PR substantially depends on the API changes, please put them in another PR first, with HISTORY update.

db/db_compaction_test.cc Show resolved Hide resolved
db/db_impl/db_impl.cc Show resolved Hide resolved
db/db_test_util.cc Show resolved Hide resolved
db/db_test_util.cc Show resolved Hide resolved
db/forward_iterator.cc Outdated Show resolved Hide resolved
include/rocksdb/c.h Show resolved Hide resolved
include/rocksdb/db.h Show resolved Hide resolved
HISTORY.md Outdated Show resolved Hide resolved
HISTORY.md Outdated Show resolved Hide resolved
@pdillinger
Copy link
Contributor

After a bit of digging into this, I think there may be a race-condition in the test itself... I will see if I can figure it out...

Good catch. Seems very plausible that the flush callback sets a bg_error that is never checked. I don't know enough about background errors to say what the "right thing" is here.

@pdillinger pdillinger changed the title Add more tests to ASSERT_STATUS_CHECKED (3) Add more tests to ASSERT_STATUS_CHECKED (3), API change Dec 11, 2020
@mrambacher
Copy link
Contributor

After a bit of digging into this, I think there may be a race-condition in the test itself... I will see if I can figure it out...

Good catch. Seems very plausible that the flush callback sets a bg_error that is never checked. I don't know enough about background errors to say what the "right thing" is here.

If your digging shows a race condition, I would suggest doing the following:

  • Write a "TEST_GetErrorHandler" method in DBImpl (at least one other test could use this method)
  • Extract and clear the background error
  • Write a new PR/Issue with this problem and link to the test
  • Put a comment in the test regarding that PR

@siying
Copy link
Contributor

siying commented Dec 23, 2020

@adamretter you mentioned that there is a race condition. Can you give us a pointer?

Copy link
Contributor

@mrambacher mrambacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@adamretter
Copy link
Collaborator Author

@siying I added a bunch of debugging print statements that are happening under a mutex. When the test passes, I get a different set of print output compared to when it fails. I can still dig in further, but for your reference I attach the code I use for debugging - debug.patch.txt

@pdillinger
Copy link
Contributor

We want to merge a giant PR like this ASAP but not if it has a flaky test (race condition). Can't we cover over and defer that fix as Mark suggested? (I didn't see any evidence it had been covered over and noted.)

@adamretter
Copy link
Collaborator Author

adamretter commented Jan 2, 2021

@siying @pdillinger @mrambacher After further investigation I am no longer convinced this is a problem in the test.
I wrote some temporary code so that I could log all allocations of Status. With that when an ASSERT_STATUS_CHECKED fails I was able to print out the corresponding stack of where the un-checked Status was originally allocated.

In all instances I find that the unchecked Status is allocated with the following stack:

#1   Failed to start process notifications for pid 6740 (19)	rocksdb::Status::operator=(rocksdb::Status const&) (in db_sst_test) (status.h:543)
#2   Failed to start process notifications for pid 6740 (19)	rocksdb::ErrorHandler::SetBGError(rocksdb::Status const&, rocksdb::BackgroundErrorReason) (in db_sst_test) (error_handler.cc:313)
#3   Failed to start process notifications for pid 6740 (19)	rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (in db_sst_test) (db_impl_compaction_flush.cc:3105)
#4   Failed to start process notifications for pid 6740 (19)	rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (in db_sst_test) (db_impl_compaction_flush.cc:2591)
#5   Failed to start process notifications for pid 6740 (19)	rocksdb::DBImpl::BGWorkCompaction(void*) (in db_sst_test) (db_impl_compaction_flush.cc:2367)
#6   Failed to start process notifications for pid 6740 (19)	decltype(std::__1::forward<void (*&)(void*)>(fp)(std::__1::forward<void*&>(fp0))) std::__1::__invoke<void (*&)(void*), void*&>(void (*&)(void*), void*&) (in db_sst_test) (type_traits:3545)
#7   Failed to start process notifications for pid 6740 (19)	std::__1::__bind_return<void (*)(void*), std::__1::tuple<void*>, std::__1::tuple<>, __is_valid_bind_return<void (*)(void*), std::__1::tuple<void*>, std::__1::tuple<> >::value>::type std::__1::__apply_functor<void (*)(void*), std::__1::tuple<void*>, 0ul, 	td::__1::tuple<> >(void (*&)(void*), std::__1::tuple<void*>&, std::__1::__tuple_indices<0ul>, std::__1::tuple<>&&) (in db_sst_test) (functional:2845)
#8   Failed to start process notifications for pid 6740 (19)	std::__1::__bind_return<void (*)(void*), std::__1::tuple<void*>, std::__1::tuple<>, __is_valid_bind_return<void (*)(void*), std::__1::tuple<void*>, std::__1::tuple<> >::value>::type std::__1::__bind<void (*&)(void*), void*&>::operator()<>() (in db_sst_te	t) (functional:2878)
#9   Failed to start process notifications for pid 6740 (19)	decltype(std::__1::forward<std::__1::__bind<void (*&)(void*), void*&>&>(fp)()) std::__1::__invoke<std::__1::__bind<void (*&)(void*), void*&>&>(std::__1::__bind<void (*&)(void*), void*&>&) (in db_sst_test) (type_traits:3545)
#10  Failed to start process notifications for pid 6740 (19)	void std::__1::__invoke_void_return_wrapper<void>::__call<std::__1::__bind<void (*&)(void*), void*&>&>(std::__1::__bind<void (*&)(void*), void*&>&) (in db_sst_test) (__functional_base:349)
#11  Failed to start process notifications for pid 6740 (19)	std::__1::__function::__alloc_func<std::__1::__bind<void (*&)(void*), void*&>, std::__1::allocator<std::__1::__bind<void (*&)(void*), void*&> >, void ()>::operator()() (in db_sst_test) (functional:1546)
#12  Failed to start process notifications for pid 6740 (19)	std::__1::__function::__func<std::__1::__bind<void (*&)(void*), void*&>, std::__1::allocator<std::__1::__bind<void (*&)(void*), void*&> >, void ()>::operator()() (in db_sst_test) (functional:1720)
#13  Failed to start process notifications for pid 6740 (19)	std::__1::__function::__value_func<void ()>::operator()() const (in db_sst_test) (functional:1873)
#14  Failed to start process notifications for pid 6740 (19)	std::__1::function<void ()>::operator()() const (in db_sst_test) (functional:2548)
#15  Failed to start process notifications for pid 6740 (19)	rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (in db_sst_test) (threadpool_imp.cc:264)
#16  Failed to start process notifications for pid 6740 (19)	rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (in db_sst_test) (threadpool_imp.cc:307)

So, I am wondering who is responsible for clearing the status of recovery_error_ in ErrorHandler::SetBGError.
Is there perhaps a missing call to ClearBGError or RecoverFromBGError somewhere?

For reference, the ASSERT_STATUS_CHECKED fails (mostly) here:

Unchecked status (adam_id=1830053): IO error: Space limit reachedMax allowed space was reached
Failed to check Status 0x7fbb5800b568
#0   rocksdb::port::PrintStack(int) (in db_sst_test) (stack_trace.cc:121)
#1   rocksdb::Status::~Status() (in db_sst_test) (status.h:54)
#2   rocksdb::Status::~Status() (in db_sst_test) (status.h:70)
#3   rocksdb::ErrorHandler::~ErrorHandler() (in db_sst_test) (error_handler.h:28)
#4   rocksdb::ErrorHandler::~ErrorHandler() (in db_sst_test) (error_handler.h:28)
#5   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:671)
#6   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:671)
#7   rocksdb::DBImpl::~DBImpl() (in db_sst_test) (db_impl.cc:666)
#8   Failed to start process notifications for pid 8492 (19)	rocksdb::DBTestBase::Close() (in db_sst_test) (db_test_util.cc:0)
#9   Failed to start process notifications for pid 8492 (19)	rocksdb::DBTestBase::Destroy(rocksdb::Options const&, bool) (in db_sst_test) (db_test_util.cc:696)
#10  Failed to start process notifications for pid 8492 (19)	rocksdb::DBTestBase::DestroyAndReopen(rocksdb::Options const&) (in db_sst_test) (db_test_util.cc:683)
#11  Failed to start process notifications for pid 8492 (19)	rocksdb::DBSSTTest_DBWithMaxSpaceAllowedRandomized_Test::TestBody() (in db_sst_test) (db_sst_test.cc:1008)

I am not quite sure of the best way to proceed... I am looking forward to some advice ;-)

@mrambacher
Copy link
Contributor

mrambacher commented Jan 3, 2021

So I think I might understand the problem. I think there is a background job running that attempts to clear the recovery error. Sometimes that job completes earlier than others.

There are two places in error_handler.cc (around line 242 and 577) where recovery_in_prog_ is set to false. I believe these should also call recovery_error_.PermitUncheckedError() as well. I believe when recovery_in_prog_ is false, there can be no recovery_error_, and this might lead to cases where the error is not later checked.

I ran the test with a PermitUncheckedError at line 577 multiple times and did not see any failures.

Someone else @pdillinger @siying can correct me if my analysis is wrong...

@adamretter
Copy link
Collaborator Author

adamretter commented Jan 4, 2021

@mrambacher your suggested fix appears to work for me here on my Mac.
I just want to check with @pdillinger or @siying that is this the correct way to go...

@pdillinger
Copy link
Contributor

I'm not allowed to push to this branch. Please pull in https://github.com/pdillinger/rocksdb/tree/pr7715 so we can get this landed ASAP.

@pdillinger
Copy link
Contributor

pdillinger commented Jan 4, 2021

So I think I might understand the problem.

Sorry I missed this update (Chrome likes showing me old versions of web pages even when they appear to reload!)

But I don't think we should block this PR (perhaps #7831) on that answer. (I'm not nearly the expert in this area.)

@adamretter
Copy link
Collaborator Author

@pdillinger I think it is important that we get this correct. Perhaps someone else in the team who knows this area could take a look?

@ajkr
Copy link
Contributor

ajkr commented Jan 5, 2021

I think we can add PermitUncheckedError() liberally until we get everything covered. If Mark's suggestion works, that's great since it sounds more targeted than what I was thinking (call error_handler_.GetRecoveryError().PermitUncheckedError(); in DBImpl::CloseHelper()).

@pdillinger
Copy link
Contributor

I think we can add PermitUncheckedError() liberally until we get everything covered. If Mark's suggestion works, that's great

+1, though we are ending up with a good number of PermitUncheckedError(), which is why I think it's important to document the context/justification for the call when it's not obvious, especially in cases where the reverse-engineering approach of "remove it and see what breaks" might not show any results (e.g. because of race condition).

@ajkr
Copy link
Contributor

ajkr commented Jan 5, 2021

Got it. L577 and L242 make sense to me under the justification that the recovery work is thrown away anyways so doesn't matter if it was succeeding or not. I wouldn't think the existing one on L564 is needed though (

recovery_error_.PermitUncheckedError();
).

edit: @anand1976 is the expert on this area, so maybe he can confirm there's no risk of lost recovery errors some time, but I wouldn't block this PR in the meantime

@ajkr
Copy link
Contributor

ajkr commented Jan 5, 2021

L577 and L242 make sense to me under the justification that the recovery work is thrown away anyways so doesn't matter if it was succeeding or not.

I crossed this out because L577 also includes is_manual recoveries that didn't necessarily fail. We shouldn't need to permit unchecked errors in that case since ClearBGError() should be (and is) called to check for background errors during the recovery. Well, that holds as long as the recovery doesn't encounter a non-OK Status beforehand in the foreground. So, my updated guess is to just permit unchecked error when ClearBGError() is skipped - i.e., when this conditional body is not executed:

s = error_handler_.ClearBGError();

@adamretter
Copy link
Collaborator Author

@ajkr @mrambacher Thanks for the suggested fixes. I have now incorporated the fix suggested by @ajkr and all tests appear to be passing.

@pdillinger I think this is now ready for you to land. Thanks.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@anand1976
Copy link
Contributor

Sorry for jumping in late. If a background recovery is attempted, it should always check recovery_error_. In case of SpaceLimit, there is no background recovery. The recovery is by the user calling DB::Resume(), which calls FlushMemTable. The latter directly returns any new error that might have occurred, so there's no risk of lost recovery errors. The fix suggested by @ajkr is good.

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in 6e0f62f.

codingrhythm pushed a commit to SafetyCulture/rocksdb that referenced this pull request Mar 5, 2021
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: facebook#7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: facebook/rocksdb#7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: facebook/rocksdb#7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
Signed-off-by: Changlong Chen <levisonchen@live.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed RocksDB bugs CLA Signed Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants