Try to fix flaky `temp-base-path-work` test #13505

bkchr · 2023-03-01T19:56:54Z

When the test fails, it is usually at the check that the database path was deleted. The assumption is that, under high load, the OS takes a little longer to actually clean up the temp path. This PR tries to fix that by checking multiple times whether the path was deleted. It also ensures that the tests requiring the benchmark feature don't fail when compiled without that feature.
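For illustration, the "check multiple times" idea could look roughly like the sketch below. It is not the code from this PR: `wait_until_removed` and its parameters are hypothetical names, and only the standard library is used.

```rust
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Polls until `path` no longer exists, checking up to `attempts` times and
/// sleeping `delay` between checks.
///
/// Returns `true` as soon as the path is gone and `false` if it still exists
/// after the last check (or if `attempts` is zero, in which case nothing is
/// checked at all).
fn wait_until_removed(path: &Path, attempts: u32, delay: Duration) -> bool {
    for attempt in 0..attempts {
        if !path.exists() {
            return true;
        }
        // Do not sleep after the final check.
        if attempt + 1 < attempts {
            thread::sleep(delay);
        }
    }
    false
}
```

A test would call this with a generous number of attempts and assert on the result. Note that retrying only helps if the cleanup is merely slow; it cannot help if the cleanup never runs.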

bkchr · 2023-03-01T22:06:54Z

bot fmt

command-bot · 2023-03-01T22:06:58Z

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 12-878bccc0-7dd5-4440-a18d-4f04ff85920f to cancel this command or bot cancel to cancel all commands in this pull request.

command-bot · 2023-03-01T22:08:09Z

@bkchr Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144/artifacts/download.

koute · 2023-03-02T10:30:24Z

Hmm... are we sure this actually fixes the issue?

It seems kind of weird to me that this is failing in the first place, because if I'm reading this right then the directory should be deleted synchronously before the process exits by substrate itself, and we do wait for the child to exit in this test.

The way this is deleted is that we hold a handle to TempDir that is supposed to be dropped on exit, which will then clean up the directory:

#[static_init::dynamic(drop, lazy)]
static mut BASE_PATH_TEMP: Option<TempDir> = None;

So this would have to not trigger for the test to fail.
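To make the mechanism concrete, here is a minimal std-only sketch of a handle that removes its directory when dropped. `DirGuard` is a hypothetical stand-in for `tempfile::TempDir`, not the type substrate uses.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `tempfile::TempDir`: owns a directory and
/// removes it when the value is dropped.
struct DirGuard {
    path: PathBuf,
}

impl DirGuard {
    /// Creates the directory (and any missing parents) and takes ownership of it.
    fn create(path: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    /// Returns the path of the owned directory.
    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for DirGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because `drop` cannot return them.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

The cleanup only happens if `drop` actually runs. If the process is killed by a signal before the destructor is reached, the directory stays on disk.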

koute · 2023-03-02T10:46:35Z

Taking a quick look at the static_init crate I'm not really sure whether it's correct or not; there's a lot of code there, so it is possible that it could be buggy and just not call the destructor sometimes. Nevertheless, using it for only this feels like shooting at a fly with a cannon.

Maybe we could replace the current mechanism in substrate/client/service/src/config.rs and the static_init dependency with something significantly simpler? e.g. something like this should also work and maybe fix the problem? (But, again, hard to tell if I don't know why exactly it happens.)

static BASE_PATH_TEMP: Mutex<Option<TempDir>> = Mutex::new(None);

// ...

extern "C" fn on_exit() {
    BASE_PATH_TEMP.lock().take();
}

*BASE_PATH_TEMP.lock() = Some(temp_dir);
unsafe {
    libc::atexit(on_exit);
}

michalkucharczyk · 2023-03-02T10:48:51Z

https://docs.rs/tempfile/latest/tempfile/struct.TempDir.html#resource-leaking says that SIGINT may leak the resource. And the test uses SIGINT to terminate the process.

koute · 2023-03-02T10:52:23Z

> https://docs.rs/tempfile/latest/tempfile/struct.TempDir.html#resource-leaking says that SIGINT may leak the resource. And test uses SIGINT to terminate process.

Yeah, but that is if the program doesn't catch it. We do. Or at least we're supposed to. So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

michalkucharczyk · 2023-03-02T11:11:23Z

Yes, good point. And the syscall is actually there, at least on my machine:
unlinkat(AT_FDCWD, "/tmp/substrateH3HQ0Q", AT_REMOVEDIR) = 0

michalkucharczyk · 2023-03-02T11:16:03Z

It sometimes happens that this syscall is not there:

$ strace -e trace=unlinkat ./substrate --dev
2023-03-02 12:13:46 Substrate Node    
2023-03-02 12:13:46 ✌️  version 3.0.0-dev-e581db6733f    
2023-03-02 12:13:46 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023    
2023-03-02 12:13:46 📋 Chain specification: Development    
2023-03-02 12:13:46 🏷  Node name: jumbled-credit-9268    
2023-03-02 12:13:46 👤 Role: AUTHORITY    
2023-03-02 12:13:46 💾 Database: RocksDb at /tmp/substrateDQCmJB/chains/dev/db/full    
2023-03-02 12:13:46 ⛓  Native runtime: node-268 (substrate-node-0.tx2.au10)    
2023-03-02 12:13:52 [0] 💸 generated 1 npos voters, 1 from validators and 0 nominators    
2023-03-02 12:13:52 [0] 💸 generated 1 npos targets    
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=3879417, si_uid=1000} ---
+++ killed by SIGINT +++

I see the 30s timeout in the test. If the machine is actually overloaded and initialization is slow, then the exit/signal handler may not remove the directory.

bkchr · 2023-03-02T11:17:00Z

> Hmm... are we sure this actually fixes the issue?

No. I could not reproduce it locally. I have seen it myself before, but in my runs over the last few days I could not trigger it.

michalkucharczyk · 2023-03-02T11:19:11Z

> So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.

Or the 30s timeout :)
Which is the problem here, I believe...

bkchr · 2023-03-02T11:21:13Z

> > So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.
>
> Or the 30s timeout :) Which is the problem here, I believe...

Ahh, good find! But this would mean we don't make it past initializing the genesis block within 30 seconds, and thus the signal handler isn't installed.

michalkucharczyk · 2023-03-02T11:22:49Z

> > > So the SIGINT would have to come before this is registered, which should never happen because the test waits for finalized blocks before trying to send a SIGINT, so the signal handler should always be registered.
> >
> > Or the 30s timeout :) Which is the problem here, I believe...
>
> Ahh, good find! But this would mean we don't make it past initializing the genesis block within 30 seconds, and thus the signal handler isn't installed.

It may happen if the load is high enough.

bkchr · 2023-03-02T11:26:52Z

> It may happen if the load is high enough.

Yeah! I had first thought of removing these timeouts and just putting one big 10-minute timeout on the entire test. I didn't do this only because the logs of the failing test did not indicate any particular error. However, your explanation, @michalkucharczyk, sounds correct and explains what we are seeing!
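As a rough illustration of "one big timeout for the entire test", a synchronous std-only sketch might look like this. `with_overall_timeout` is a hypothetical name; the helper used in the PR is async and its implementation is not shown in this thread.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a separate thread and waits at most `timeout` for its result.
///
/// Returns `None` if the work did not finish in time or panicked. On a
/// timeout the worker thread is left running; a real test would then fail
/// and let the process exit.
fn with_overall_timeout<T, F>(timeout: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).ok()
}
```

With a single outer deadline the individual steps inside the test no longer need their own short timeouts, so a slow machine delays the test instead of changing its behaviour.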

michalkucharczyk · 2023-03-02T11:31:22Z

Is there any reason why the signal handler is registered late?
My concern is that the node can still leave its directory behind if it is terminated at the wrong moment.

bkchr · 2023-03-02T13:18:31Z

Yeah valid point!

bkchr · 2023-03-02T13:38:16Z

> Maybe we could replace the current mechanism in substrate/client/service/src/config.rs and the static_init dependency with something significantly simpler? e.g. something like this should also work and maybe fix the problem? (But, again, hard to tell if I don't know why exactly it happens.)

According to https://stackoverflow.com/questions/9994150/on-exit-and-ctrlc, this should not work 🙈, because exit handlers are not run when the process is terminated by a signal.

So we need to register the signal handler earlier, as @michalkucharczyk said!

michalkucharczyk · 2023-03-08T07:58:23Z

nit: PR description could be updated.

michalkucharczyk · 2023-03-08T08:01:33Z

bin/node/cli/tests/running_the_node_and_interrupt.rs

-	async fn run_command_and_kill(signal: Signal) {
-		let base_path = tempdir().expect("could not create a temp dir");
-		let mut cmd = common::KillChildOnDrop(
+	common::run_with_timeout(Duration::from_secs(60 * 10), async move {


If 10 minutes is a kind of master timeout, maybe this could be a const (or a default within the function that could be overridden by an environment variable?), just to avoid spreading this hard-coded value across the code.
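A minimal sketch of that suggestion, assuming a made-up variable name (`NODE_TEST_TIMEOUT_SECS` is hypothetical and not defined anywhere in the repository):

```rust
use std::time::Duration;

/// Default overall test timeout: 10 minutes, matching the value in the diff.
const DEFAULT_TEST_TIMEOUT: Duration = Duration::from_secs(60 * 10);

/// Hypothetical environment variable name, used here only for illustration.
const TIMEOUT_ENV_VAR: &str = "NODE_TEST_TIMEOUT_SECS";

/// Parses an optional override given in whole seconds, falling back to the
/// default when the value is absent or not a valid number.
fn timeout_from(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_TEST_TIMEOUT)
}

/// Reads the override from the environment, if any.
fn test_timeout() -> Duration {
    timeout_from(std::env::var(TIMEOUT_ENV_VAR).ok().as_deref())
}
```

Each test would then pass `test_timeout()` instead of repeating `Duration::from_secs(60 * 10)`.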

koute · 2023-03-08T08:14:11Z

bin/node/cli/tests/temp_base_path_works.rs

+		stderr.read_to_string(&mut data).unwrap();
+		let re = Regex::new(r"Database: .+ at (\S+)").unwrap();


Could we perhaps extract this path before the node is killed (like it's done for the WS URL) and also verify that the database exists before the node is killed?

We could move this into find_ws_url_from_output, rename it to extract_info_from_output (or something like that) and get it to return a struct which would contain the WS URL and the path to the database.
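A sketch of the database-path half of that idea, roughly mirroring the regex `Database: .+ at (\S+)` with only the standard library. `extract_db_path` is a hypothetical name; the WS URL part is left out because its log format is not shown in this thread.

```rust
use std::path::PathBuf;

/// Finds the database path in node log output.
///
/// Looks for a line containing "Database: <backend> at <path>" and returns
/// the first whitespace-delimited token after the last " at ".
fn extract_db_path(output: &str) -> Option<PathBuf> {
    output.lines().find_map(|line| {
        // Text following the "Database: " marker, e.g. "RocksDb at /tmp/...".
        let (_, rest) = line.split_once("Database: ")?;
        // Require a non-empty backend name before " at ", like `.+` in the regex.
        let idx = rest.rfind(" at ").filter(|&i| i > 0)?;
        let path = rest[idx + " at ".len()..].split_whitespace().next()?;
        Some(PathBuf::from(path))
    })
}
```

Running this on the output collected before the node is killed would also let the test assert that the database directory exists while the node is still alive.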

* Try to fix flaky `temp-base-path-work` test

  The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.

* ".git/.scripts/commands/fmt/fmt.sh"
* Capture signals earlier
* Rewrite tests to let them having one big timeout
* Remove unneeded dep
* Update bin/node/cli/tests/common.rs

  Co-authored-by: Koute <koute@users.noreply.github.com>

* Review feedback
* Update bin/node/cli/tests/common.rs

  Co-authored-by: Anton <anton.kalyaev@gmail.com>

---------

Co-authored-by: command-bot <>
Co-authored-by: Koute <koute@users.noreply.github.com>
Co-authored-by: Anton <anton.kalyaev@gmail.com>

ggwpez mentioned this pull request Aug 24, 2023: ☂ Fix flaky tests paritytech/polkadot-sdk#48 (Open, 16 tasks)

ggwpez approved these changes Mar 1, 2023

".git/.scripts/commands/fmt/fmt.sh"

0798557

bkchr requested a review from a team March 2, 2023 10:17

altonen previously approved these changes Mar 2, 2023

ggwpez mentioned this pull request Mar 6, 2023: Deprecate Currency; introduce holds and freezing into fungible traits #12951 (Merged, 28 tasks)

bkchr added 2 commits March 7, 2023 11:26

Capture signals earlier

0fb10d2

Rewrite tests to let them having one big timeout

f2ed1fd

bkchr requested a review from a team March 7, 2023 15:10

bkchr requested review from ggwpez and altonen March 7, 2023 15:10

Remove unneeded dep

700c3b7

michalkucharczyk approved these changes Mar 8, 2023

michalkucharczyk requested a review from a team March 8, 2023 07:56

michalkucharczyk reviewed Mar 8, 2023

michalkucharczyk requested a review from a team March 8, 2023 08:03

koute reviewed Mar 8, 2023

davxy reviewed Mar 8, 2023

bkchr and others added 3 commits March 15, 2023 10:06

Update bin/node/cli/tests/common.rs

358fb1d

Co-authored-by: Koute <koute@users.noreply.github.com>

Review feedback

fe2da6b

Merge remote-tracking branch 'origin/master' into bkchr-try-fix-flaky-test

57e9e2b

bkchr requested a review from koute March 15, 2023 10:38

melekes reviewed Mar 16, 2023

Update bin/node/cli/tests/common.rs

187e247

Co-authored-by: Anton <anton.kalyaev@gmail.com>

bkchr requested a review from melekes March 16, 2023 08:37

davxy approved these changes Mar 16, 2023

bkchr merged commit 8a9f48b into master Mar 16, 2023

bkchr deleted the bkchr-try-fix-flaky-test branch March 16, 2023 11:24

kacperzuk-neti mentioned this pull request Jun 23, 2023: Polkadot v0.9.43 liberland/liberland_substrate#295 (Merged, 15 tasks)
