-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Try to fix flaky temp-base-path-work
test
#13505
Conversation
The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.
bot fmt |
@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2471144 was started for your command Comment |
@bkchr Command |
Hmm... are we sure this actually fixes the issue? It seems kind of weird to me that this is failing in the first place, because if I'm reading this right then the directory should be deleted synchronously before the process exits by substrate itself, and we do wait for the child to exit in this test. The way this is deleted is that we hold a handle to TempDir that is supposed to be dropped on exit, which will then clean up the directory: #[static_init::dynamic(drop, lazy)]
static mut BASE_PATH_TEMP: Option<TempDir> = None; So this would have to not trigger for the test to fail. |
Taking a quick look at the Maybe we could replace the current mechanism in static BASE_PATH_TEMP: Mutex<Option<TempDir>> = Mutex::new(None);
// ...
extern "C" fn on_exit() {
BASE_PATH_TEMP.lock().take();
}
*BASE_PATH_TEMP.lock() = Some(temp_dir);
unsafe {
libc::atexit(on_exit);
} |
https://docs.rs/tempfile/latest/tempfile/struct.TempDir.html#resource-leaking says that SIGINT may leak the resource. And test uses SIGINT to terminate process. |
Yeah, but that is if the program doesn't catch it. We do. Or at least we're supposed to. So the |
yes, good point. And the syscall is actually there at least at my machine : |
It sometimes happens that this syscall is not there:
I see the 30s timeout in test. If the machine is actually over-loaded and initialization is laggy, then the exit/signal handler may not remove the directory. |
No. I could not reproduce it locally. I have already seen it myself, but in my runs the last days I could not trigger it.. |
Or |
Ahh good finding! But this would mean we don't make it past initializing the genesis block in 30 seconds and thus, the signal handler isn't installed. |
It may happen if the load is high enough. |
Yeah! I first had thought of removing these timeouts and just putting there some big 10 minutes timeout for the entire test. I just don't have done this because the logs of the failing test did not indicate any particular error. However, your explanation @michalkucharczyk sounds correct and explains what we are seeing! |
Is there any reason why the signal handler is late-registered? |
Yeah valid point! |
https://stackoverflow.com/questions/9994150/on-exit-and-ctrlc should not work 🙈 So, we need to register the signal handler earlier as said by @michalkucharczyk! |
nit: PR description could be updated. |
async fn run_command_and_kill(signal: Signal) { | ||
let base_path = tempdir().expect("could not create a temp dir"); | ||
let mut cmd = common::KillChildOnDrop( | ||
common::run_with_timeout(Duration::from_secs(60 * 10), async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 10 minutes is kind of master timeout, maybe this could be a const (or maybe some default within the function that could be overwritten by env?), just to avoid spreading this hard-coded value over the code.
stderr.read_to_string(&mut data).unwrap(); | ||
let re = Regex::new(r"Database: .+ at (\S+)").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we perhaps extract this path before the node is killed (like it's done for the WS URL) and also verify that the database exists before the node is killed?
We could move this into find_ws_url_from_output
, rename it to extract_info_from_output
(or something like that) and get it to return a struct which would contain the WS URL and the path to the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Co-authored-by: Koute <koute@users.noreply.github.com>
Co-authored-by: Anton <anton.kalyaev@gmail.com>
* Try to fix flaky `temp-base-path-work` test The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature. * ".git/.scripts/commands/fmt/fmt.sh" * Capture signals earlier * Rewrite tests to let them having one big timeout * Remove unneeded dep * Update bin/node/cli/tests/common.rs Co-authored-by: Koute <koute@users.noreply.github.com> * Review feedback * Update bin/node/cli/tests/common.rs Co-authored-by: Anton <anton.kalyaev@gmail.com> --------- Co-authored-by: command-bot <> Co-authored-by: Koute <koute@users.noreply.github.com> Co-authored-by: Anton <anton.kalyaev@gmail.com>
* Try to fix flaky `temp-base-path-work` test The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature. * ".git/.scripts/commands/fmt/fmt.sh" * Capture signals earlier * Rewrite tests to let them having one big timeout * Remove unneeded dep * Update bin/node/cli/tests/common.rs Co-authored-by: Koute <koute@users.noreply.github.com> * Review feedback * Update bin/node/cli/tests/common.rs Co-authored-by: Anton <anton.kalyaev@gmail.com> --------- Co-authored-by: command-bot <> Co-authored-by: Koute <koute@users.noreply.github.com> Co-authored-by: Anton <anton.kalyaev@gmail.com>
The test is most of the time failing when checking if the database path was deleted. The assumption is that it takes a little bit more time by the OS to actually clean up the temp path under high load. The pr tries to fix this by checking multiple times if the path was deleted. Besides that it also ensures that the tests that require the benchmark feature don't fail when compiled without the feature.