
Investigate test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination crash #43499

Closed
F3n67u opened this issue Jun 20, 2022 · 4 comments · Fixed by #43533
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

F3n67u (Member) commented Jun 20, 2022

Test

test-worker-fshandles-open-close-on-termination
test-worker-fshandles-error-on-termination

Platform

test-worker-fshandles-open-close-on-termination on macOS
test-worker-fshandles-error-on-termination on Windows

Console output

test-worker-fshandles-open-close-on-termination:

out/Release/node[53255]: ../src/node_file-inl.h:162:virtual node::fs::FSReqPromise<node::AliasedBufferBase<double, v8::Float64Array, void>>::~FSReqPromise() [AliasedBufferT = node::AliasedBufferBase<double, v8::Float64Array, void>]: Assertion `finished_' failed.
 1: 0x10767eeb5 node::Abort() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 2: 0x10767ecd1 node::Assert(node::AssertionInfo const&) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 3: 0x10767a806 node::fs::FSReqPromise<node::AliasedBufferBase<double, v8::Float64Array, void> >::~FSReqPromise() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 4: 0x107683cc9 node::fs::FSReqAfterScope::~FSReqAfterScope() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 5: 0x1076841fd node::fs::AfterOpenFileHandle(uv_fs_s*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 6: 0x1081162f8 uv__work_done [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 7: 0x10811a6eb uv__async_io [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 8: 0x10812dd6e uv__io_poll [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 9: 0x10811abe1 uv_run [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
10: 0x1075aee23 node::SpinEventLoop(node::Environment*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
11: 0x1077268bf node::worker::Worker::Run() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
12: 0x10772a302 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
13: 0x7fff203e58fc _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
14: 0x7fff203e1443 thread_start [/usr/lib/system/libsystem_pthread.dylib]

test-worker-fshandles-error-on-termination:

Administrator: npm install[8184]: C:\workspace\node-compile-windows\node\src\node_file-inl.h:162: Assertion `finished_' failed.
 1: 00007FF610F69CCF node_api_throw_syntax_error+176223
 2: 00007FF610EF7B66 v8::internal::wasm::WasmCode::safepoint_table_offset+67398
 3: 00007FF610EF7F32 v8::internal::wasm::WasmCode::safepoint_table_offset+68370
 4: 00007FF610EE0A6D v8::base::CPU::has_fpu+39565
 5: 00007FF610EE2029 v8::base::CPU::has_fpu+45129
 6: 00007FF610FCBB27 uv_timer_stop+1207
 7: 00007FF610FC80CB uv_async_send+331
 8: 00007FF610FC785C uv_loop_init+1292
 9: 00007FF610FC79FA uv_run+202
10: 00007FF610F968A5 node::SpinEventLoop+309
11: 00007FF610E2CE40 v8::internal::wasm::SignatureMap::Freeze+35904
12: 00007FF610E28568 v8::internal::wasm::SignatureMap::Freeze+17256
13: 00007FF610FB80CD uv_poll_stop+557
14: 00007FF611F71600 v8::internal::compiler::ToString+145936
15: 00007FFB74404ED0 BaseThreadInitThunk+16
16: 00007FFB75FCE39B RtlUserThreadStart+43

Build links

test-worker-fshandles-open-close-on-termination:

https://ci.nodejs.org/job/node-test-commit-osx/nodes=osx11-x64/45661/testReport/junit/(root)/test/parallel_test_worker_fshandles_open_close_on_termination/

test-worker-fshandles-error-on-termination:

https://ci.nodejs.org/job/node-test-binary-windows-js-suites/15222/RUN_SUBSET=3,nodes=win10-COMPILED_BY-vs2019/testReport/junit/(root)/test/parallel_test_worker_fshandles_error_on_termination/

Additional information

Related PR: #42910

This PR landed on 2022-07-18; it modified the src/node_file-inl.h file and added test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination.
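
For context on the failing check: both aborts above come from a destructor-time assertion on a `finished_` flag in src/node_file-inl.h. A minimal sketch of that pattern, using simplified, made-up names rather than the actual Node.js source:

```cpp
#include <cassert>

// Illustrative sketch of a destructor-time completion check, the shape
// behind the "Assertion `finished_' failed" aborts in the logs above.
// (FSReqPromiseSketch is a made-up name; the real class is
// node::fs::FSReqPromise.)
class FSReqPromiseSketch {
 public:
  void Resolve() { finished_ = true; }  // promise settled successfully
  void Reject() { finished_ = true; }   // promise settled with an error
  ~FSReqPromiseSketch() {
    // Destroying a request that was never settled indicates a bug, so
    // abort loudly rather than silently leak a pending promise.
    assert(finished_);
  }

 private:
  bool finished_ = false;
};

int main() {
  FSReqPromiseSketch req;
  req.Resolve();  // without this call, the destructor would abort
  return 0;
}
```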

cc @santigimeno @aduh95 @RaisinTen

F3n67u added the flaky-test label Jun 20, 2022
F3n67u changed the title from "Investigate test-worker-fshandles-open-close-on-termination crash" to "Investigate test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination crash" Jun 20, 2022
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
We are deciding whether to end `fs` promises by checking
`can_call_into_js()`, whereas in the `FSReqPromise` destructor we're
using the `is_stopping()` check. Though this may look semantically
correct, it has issues: although both values are modified before
termination in `Environment::ExitEnv()` and both are atomic, they are
not synchronized with each other, so it may happen that, by the time
the destructor runs, `call_into_js` has already been set to `false`
while `is_stopping` remains `false`, causing the crash. Fix this by
using the same check everywhere.

Fixes: nodejs#43499
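
To make the interleaving described above concrete, here is a minimal, self-contained C++ sketch. The flag names and the write order follow the commit message's description, not the actual `Environment` code:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Two flags, each atomic on its own, written one after the other by a
// terminating thread. Nothing makes the pair visible as a single unit.
std::atomic<bool> can_call_into_js{true};
std::atomic<bool> is_stopping{false};

int main() {
  std::thread terminator([] {
    can_call_into_js.store(false);  // first write (in this sketch)
    is_stopping.store(true);        // second write
  });

  // A destructor running on another thread at just the wrong moment can
  // observe can_call_into_js == false while is_stopping is still false,
  // so a check based on is_stopping() fires even though the environment
  // is in fact shutting down.
  bool js = can_call_into_js.load();
  bool stopping = is_stopping.load();
  if (!js && !stopping)
    std::puts("observed the inconsistent window");

  terminator.join();
  return 0;
}
```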
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
We are deciding whether to end `fs` promises by checking
`can_call_into_js()`, whereas in the `FSReqPromise` destructor we're
using the `is_stopping()` check. Though this may look semantically
correct, it has issues: although both values are modified before
termination in `Environment::ExitEnv()` and both are atomic, they are
not synchronized with each other, so it may happen that, by the time
the destructor runs, `call_into_js` has already been set to `false`
while `is_stopping` remains `false`, causing the crash. Fix this by
also checking `can_call_into_js()` in the destructor.

Fixes: nodejs#43499
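
A hedged sketch of that fix's shape, with made-up names (`FSReqPromiseSketch`, `MaybeSettle`) rather than the actual diff that landed in #43533: both the settle decision and the destructor check consult the same flag, so the two can no longer disagree.

```cpp
#include <atomic>
#include <cassert>

std::atomic<bool> can_call_into_js{true};

struct FSReqPromiseSketch {
  bool finished_ = false;

  void MaybeSettle() {
    // Resolve/reject only while JS can still be called into.
    if (can_call_into_js.load()) finished_ = true;
  }

  ~FSReqPromiseSketch() {
    // Tolerate an unfinished request only when JS is unreachable,
    // judged by the *same* flag used in MaybeSettle().
    assert(finished_ || !can_call_into_js.load());
  }
};

int main() {
  {
    FSReqPromiseSketch req;
    req.MaybeSettle();
  }  // passes: the request was settled

  can_call_into_js.store(false);  // simulate worker termination
  {
    FSReqPromiseSketch req;
    req.MaybeSettle();  // skipped: JS can no longer run
  }  // still passes: the destructor consults the same flag
  return 0;
}
```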
F3n67u reopened this Jul 10, 2022
RaisinTen (Contributor) commented

There seems to be something wrong with the stack trace in https://ci.nodejs.org/job/node-test-binary-windows-js-suites/15541/RUN_SUBSET=3,nodes=win10-COMPILED_BY-vs2019/testReport/junit/(root)/test/parallel_test_worker_fshandles_open_close_on_termination:

01:38:28 not ok 775 parallel/test-worker-fshandles-open-close-on-termination
01:38:28   ---
01:38:28   duration_ms: 3.528
01:38:28   severity: fail
01:38:28   exitcode: 134
01:38:28   stack: |-
01:38:28     C:\Windows\SYSTEM32\cmd.exe[8708]: C:\workspace\node-compile-windows\node\src\node_file-inl.h:162: Assertion `finished_' failed.
01:38:28      1: 00007FF65494964F node_api_throw_syntax_error+176223
01:38:28      2: 00007FF6548D74E6 SSL_get_quiet_shutdown+67398
01:38:28      3: 00007FF6548D78B2 SSL_get_quiet_shutdown+68370
01:38:28      4: 00007FF6548C03ED v8::base::CPU::has_fpu+38813
01:38:28      5: 00007FF6548C19A9 v8::base::CPU::has_fpu+44377
01:38:28      6: 00007FF6549AB4A7 uv_timer_stop+1207
01:38:28      7: 00007FF6549A7A4B uv_async_send+331
01:38:28      8: 00007FF6549A71DC uv_loop_init+1292
01:38:28      9: 00007FF6549A737A uv_run+202
01:38:28     10: 00007FF654976225 node::SpinEventLoop+309
01:38:28     11: 00007FF65480CCC0 v8::internal::interpreter::BytecodeLabel::bind+35904
01:38:28     12: 00007FF6548083E8 v8::internal::interpreter::BytecodeLabel::bind+17256
01:38:28     13: 00007FF654997A4D uv_poll_stop+557
01:38:28     14: 00007FF655951D70 v8::internal::compiler::ToString+145936
01:38:28     15: 00007FFB74404ED0 BaseThreadInitThunk+16
01:38:28     16: 00007FFB75FCE39B RtlUserThreadStart+43
01:38:28   ...

Not sure how node_api_throw_syntax_error got in. The same thing happened with the stack trace in the Windows crash reported in #43084, which is about a very similar test (a crash during Worker termination).

targos pushed a commit that referenced this issue Jul 12, 2022
We are deciding whether to end `fs` promises by checking
`can_call_into_js()`, whereas in the `FSReqPromise` destructor we're
using the `is_stopping()` check. Though this may look semantically
correct, it has issues: although both values are modified before
termination in `Environment::ExitEnv()` and both are atomic, they are
not synchronized with each other, so it may happen that, by the time
the destructor runs, `call_into_js` has already been set to `false`
while `is_stopping` remains `false`, causing the crash. Fix this by
also checking `can_call_into_js()` in the destructor.

Fixes: #43499

PR-URL: #43533
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
targos pushed a commit that referenced this issue Jul 20, 2022
targos pushed a commit that referenced this issue Jul 31, 2022
kvakil added a commit to kvakil/node that referenced this issue Aug 2, 2022
These tests seem to time out quite often. I don't know why, but one
possible reason is that they are starting a lot of threads. It seems
that tests in `test/parallel` are assumed to only start one thread each,
so having 11 threads running at a time feels like a lot.

It also seems that these tests fail in a correlated fashion: take a look
at [this reliability report][]. The failures all occur on the same build
machines on the same PRs. This suggests to me some sort of CPU
contention.

[this reliability report]: nodejs/reliability#334

On my Linux machine, decreasing the parallelism & iterations here reduces
the `user` time from ~11.5 seconds to ~2 seconds, depending on the test.
I have seen these tests take 30-60 seconds on CI (Alpine in particular).

I went back to the diffs that introduced these changes
and verified that they failed at least 90% of the time with the reduced
iteration count, which feels sufficient.

Refs: nodejs#43499
Refs: nodejs#43084
nodejs-github-bot pushed a commit that referenced this issue Aug 5, 2022
These tests seem to time out quite often. I don't know why, but one
possible reason is that they are starting a lot of threads. It seems
that tests in `test/parallel` are assumed to only start one thread each,
so having 11 threads running at a time feels like a lot.

It also seems that these tests fail in a correlated fashion: take a look
at [this reliability report][]. The failures all occur on the same build
machines on the same PRs. This suggests to me some sort of CPU
contention.

[this reliability report]: nodejs/reliability#334

On my Linux machine, decreasing the parallelism & iterations here reduces
the `user` time from ~11.5 seconds to ~2 seconds, depending on the test.
I have seen these tests take 30-60 seconds on CI (Alpine in particular).

I went back to the diffs that introduced these changes
and verified that they failed at least 90% of the time with the reduced
iteration count, which feels sufficient.

Refs: #43499
Refs: #43084
PR-URL: #44090
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
danielleadams pushed a commit that referenced this issue Aug 16, 2022
ruyadorno pushed a commit that referenced this issue Aug 23, 2022
targos pushed a commit that referenced this issue Sep 5, 2022
aduh95 reopened this Sep 11, 2022
Fyko pushed a commit to Fyko/node that referenced this issue Sep 15, 2022
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
juanarbol pushed a commit that referenced this issue Oct 10, 2022
juanarbol pushed a commit that referenced this issue Oct 11, 2022
ywave620 (Contributor) commented Oct 14, 2022

@aduh95 I realized that you reopened this issue; did you find any recent failures? :)

guangwong pushed a commit to noslate-project/node that referenced this issue Jan 3, 2023
guangwong pushed a commit to noslate-project/node that referenced this issue Jan 3, 2023
lpinca (Member) commented Jun 29, 2023

Ping @aduh95

lpinca closed this as completed Sep 17, 2024