Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Try to fix flaky temp-base-path-work test #13505

Merged
merged 9 commits into from
Mar 16, 2023
Merged

Conversation

bkchr
Copy link
Member

@bkchr bkchr commented Mar 1, 2023

The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.

The test is most of the time failing when checking if the database path was deleted. The assumption
is that it takes a little bit more time by the OS to actually clean up the temp path under high
load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it
also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.
@bkchr bkchr added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Mar 1, 2023
@bkchr
Copy link
Member Author

bkchr commented Mar 1, 2023

bot fmt

@command-bot
Copy link

command-bot bot commented Mar 1, 2023

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 12-878bccc0-7dd5-4440-a18d-4f04ff85920f to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot
Copy link

command-bot bot commented Mar 1, 2023

@bkchr Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144/artifacts/download.

@bkchr bkchr requested a review from a team March 2, 2023 10:17
altonen
altonen previously approved these changes Mar 2, 2023
@koute
Copy link
Contributor

koute commented Mar 2, 2023

Hmm... are we sure this actually fixes the issue?

It seems kind of weird to me that this is failing in the first place, because if I'm reading this right then the directory should be deleted synchronously before the process exits by substrate itself, and we do wait for the child to exit in this test.

The way this is deleted is that we hold a handle to TempDir that is supposed to be dropped on exit, which will then clean up the directory:

#[static_init::dynamic(drop, lazy)]
static mut BASE_PATH_TEMP: Option<TempDir> = None;

So this would have to not trigger for the test to fail.

@koute
Copy link
Contributor

koute commented Mar 2, 2023

Taking a quick look at the static_init crate I'm not really sure whether it's correct or not; there's a lot of code there, so it is possible that it could be buggy and just not call the destructor sometimes. Nevertheless, using it for only this feels like shooting at a fly with a cannon.

Maybe we could replace the current mechanism in substrate/client/service/src/config.rs and the static_init dependency with something significantly simpler? e.g. something like this should also work and maybe fix the problem? (But, again, hard to tell if I don't know why exactly it happens.)

static BASE_PATH_TEMP: Mutex<Option<TempDir>> = Mutex::new(None);

// ...

extern "C" fn on_exit() {
    BASE_PATH_TEMP.lock().take();
}

*BASE_PATH_TEMP.lock() = Some(temp_dir);
unsafe {
    libc::atexit(on_exit);
}

@michalkucharczyk
Copy link
Contributor

michalkucharczyk commented Mar 2, 2023

https://docs.rs/tempfile/latest/tempfile/struct.TempDir.html#resource-leaking says that SIGINT may leak the resource. And test uses SIGINT to terminate process.

@koute
Copy link
Contributor

koute commented Mar 2, 2023

https://docs.rs/tempfile/latest/tempfile/struct.TempDir.html#resource-leaking says that SIGINT may leak the resource. And test uses SIGINT to terminate process.

Yeah, but that is if the program doesn't catch it. We do. Or at least we're supposed to. So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

@michalkucharczyk
Copy link
Contributor

yes, good point. And the syscall is actually there at least at my machine :
unlinkat(AT_FDCWD, "/tmp/substrateH3HQ0Q", AT_REMOVEDIR) = 0

@michalkucharczyk
Copy link
Contributor

It sometimes happens that this syscall is not there:

$ strace -e trace=unlinkat ./substrate --dev
2023-03-02 12:13:46 Substrate Node    
2023-03-02 12:13:46 ✌️  version 3.0.0-dev-e581db6733f    
2023-03-02 12:13:46 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023    
2023-03-02 12:13:46 📋 Chain specification: Development    
2023-03-02 12:13:46 🏷  Node name: jumbled-credit-9268    
2023-03-02 12:13:46 👤 Role: AUTHORITY    
2023-03-02 12:13:46 💾 Database: RocksDb at /tmp/substrateDQCmJB/chains/dev/db/full    
2023-03-02 12:13:46 ⛓  Native runtime: node-268 (substrate-node-0.tx2.au10)    
2023-03-02 12:13:52 [0] 💸 generated 1 npos voters, 1 from validators and 0 nominators    
2023-03-02 12:13:52 [0] 💸 generated 1 npos targets    
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=3879417, si_uid=1000} ---
+++ killed by SIGINT +++

I see the 30s timeout in test. If the machine is actually over-loaded and initialization is laggy, then the exit/signal handler may not remove the directory.

@bkchr
Copy link
Member Author

bkchr commented Mar 2, 2023

Hmm... are we sure this actually fixes the issue?

No. I could not reproduce it locally. I have already seen it myself, but in my runs the last days I could not trigger it..

@michalkucharczyk
Copy link
Contributor

michalkucharczyk commented Mar 2, 2023

So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

Or 30s timeout :)
Which is problem here I believe...

@bkchr
Copy link
Member Author

bkchr commented Mar 2, 2023

So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

Or 30s timeout :) Which is problem here I believe...

Ahh good finding! But this would mean we don't make it past initializing the genesis block in 30 seconds and thus, the signal handler isn't installed.

@michalkucharczyk
Copy link
Contributor

So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

Or 30s timeout :) Which is problem here I believe...

Ahh good finding! But this would mean we don't make it past initializing the genesis block in 30 seconds and thus, the signal handler isn't installed.

It may happen if the load is high enough.

@bkchr
Copy link
Member Author

bkchr commented Mar 2, 2023

It may happen if the load is high enough.

Yeah! I first had thought of removing these timeouts and just putting there some big 10 minutes timeout for the entire test. I just don't have done this because the logs of the failing test did not indicate any particular error. However, your explanation @michalkucharczyk sounds correct and explains what we are seeing!

@michalkucharczyk
Copy link
Contributor

Is there any reason why the signal handler is late-registered?
My concern is that the node can still left its directory if terminated in proper timing.

@bkchr
Copy link
Member Author

bkchr commented Mar 2, 2023

Yeah valid point!

@bkchr
Copy link
Member Author

bkchr commented Mar 2, 2023

Maybe we could replace the current mechanism in substrate/client/service/src/config.rs and the static_init dependency with something significantly simpler? e.g. something like this should also work and maybe fix the problem? (But, again, hard to tell if I don't know why exactly it happens.)

https://stackoverflow.com/questions/9994150/on-exit-and-ctrlc should not work 🙈

So, we need to register the signal handler earlier as said by @michalkucharczyk!

@bkchr bkchr requested a review from a team March 7, 2023 15:10
@bkchr bkchr requested review from ggwpez and altonen March 7, 2023 15:10
@michalkucharczyk michalkucharczyk requested a review from a team March 8, 2023 07:56
@michalkucharczyk
Copy link
Contributor

nit: PR description could be updated.

async fn run_command_and_kill(signal: Signal) {
let base_path = tempdir().expect("could not create a temp dir");
let mut cmd = common::KillChildOnDrop(
common::run_with_timeout(Duration::from_secs(60 * 10), async move {
Copy link
Contributor

@michalkucharczyk michalkucharczyk Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 10 minutes is kind of master timeout, maybe this could be a const (or maybe some default within the function that could be overwritten by env?), just to avoid spreading this hard-coded value over the code.

@michalkucharczyk michalkucharczyk requested a review from a team March 8, 2023 08:03
Comment on lines 56 to 57
stderr.read_to_string(&mut data).unwrap();
let re = Regex::new(r"Database: .+ at (\S+)").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we perhaps extract this path before the node is killed (like it's done for the WS URL) and also verify that the database exists before the node is killed?

We could move this into find_ws_url_from_output, rename it to extract_info_from_output (or something like that) and get it to return a struct which would contain the WS URL and the path to the database.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@bkchr bkchr requested a review from koute March 15, 2023 10:38
Co-authored-by: Anton <anton.kalyaev@gmail.com>
@bkchr bkchr requested a review from melekes March 16, 2023 08:37
@bkchr bkchr merged commit 8a9f48b into master Mar 16, 2023
@bkchr bkchr deleted the bkchr-try-fix-flaky-test branch March 16, 2023 11:24
breathx pushed a commit to gear-tech/substrate that referenced this pull request Apr 22, 2023
* Try to fix flaky `temp-base-path-work` test

The test is most of the time failing when checking if the database path was deleted. The assumption
is that it takes a little bit more time by the OS to actually clean up the temp path under high
load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it
also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.

* ".git/.scripts/commands/fmt/fmt.sh"

* Capture signals earlier

* Rewrite tests to let them having one big timeout

* Remove unneeded dep

* Update bin/node/cli/tests/common.rs

Co-authored-by: Koute <koute@users.noreply.github.com>

* Review feedback

* Update bin/node/cli/tests/common.rs

Co-authored-by: Anton <anton.kalyaev@gmail.com>

---------

Co-authored-by: command-bot <>
Co-authored-by: Koute <koute@users.noreply.github.com>
Co-authored-by: Anton <anton.kalyaev@gmail.com>
nathanwhit pushed a commit to nathanwhit/substrate that referenced this pull request Jul 19, 2023
* Try to fix flaky `temp-base-path-work` test

The test is most of the time failing when checking if the database path was deleted. The assumption
is that it takes a little bit more time by the OS to actually clean up the temp path under high
load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it
also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.

* ".git/.scripts/commands/fmt/fmt.sh"

* Capture signals earlier

* Rewrite tests to let them having one big timeout

* Remove unneeded dep

* Update bin/node/cli/tests/common.rs

Co-authored-by: Koute <koute@users.noreply.github.com>

* Review feedback

* Update bin/node/cli/tests/common.rs

Co-authored-by: Anton <anton.kalyaev@gmail.com>

---------

Co-authored-by: command-bot <>
Co-authored-by: Koute <koute@users.noreply.github.com>
Co-authored-by: Anton <anton.kalyaev@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants