Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-async-context-frame is flaky #54808

Open
anonrig opened this issue Sep 6, 2024 · 15 comments
Open

test-async-context-frame is flaky #54808

anonrig opened this issue Sep 6, 2024 · 15 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. windows Issues and PRs related to the Windows platform.

Comments

@anonrig
Copy link
Member

anonrig commented Sep 6, 2024

Test

test-async-context-frame

Platform

Windows

Console output

duration_ms: 6718.227
exitcode: 1
severity: fail
stack: "\u25B6 AsyncContextFrame\n  \u2714 async-hooks\\test-async-local-storage-args.js\
  \ (4955.4808ms)\n  \u2714 async-hooks\\test-async-local-storage-async-await.js (4400.0139ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-async-functions.js (5106.203ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-dgram.js (4764.8517ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-enable-disable.js (5294.6636ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-enter-with.js (4803.8479ms)\n  \u2714 async-hooks\\\
  test-async-local-storage-errors.js (5115.4623ms)\n  \u2714 async-hooks\\test-async-local-storage-gcable.js\
  \ (4954.9835ms)\n  \u2714 async-hooks\\test-async-local-storage-http-agent.js (5161.0138ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-http.js (4577.1697ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-misc-stores.js (5124.7882ms)\n  \u2714 async-hooks\\\
  test-async-local-storage-nested.js (5031.4255ms)\n  \u2714 async-hooks\\test-async-local-storage-no-mix-contexts.js\
  \ (4960.6226ms)\n  \u2714 async-hooks\\test-async-local-storage-promises.js (5029.5982ms)\n\
  \  \u2716 async-hooks\\test-async-local-storage-socket.js (5107.2542ms)\n    AssertionError\
  \ [ERR_ASSERTION]: Test async-hooks\\test-async-local-storage-socket.js failed with\
  \ exit code 3221225477\n        at TestContext.<anonymous> (file:///C:/workspace/node-test-binary-windows-js-suites/node/test/parallel/test-async-context-frame.mjs:58:7)\n\
  \        at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n\
  \        at async Test.run (node:internal/test_runner/test:888:9)\n        at async\
  \ Promise.all (index 14)\n        at async Suite.run (node:internal/test_runner/test:1268:7)\n\
  \        at async startSubtestAfterBootstrap (node:internal/test_runner/harness:280:3)\
  \ {\n      generatedMessage: false,\n      code: 'ERR_ASSERTION',\n      actual:\
  \ 3221225477,\n      expected: 0,\n      operator: 'strictEqual'\n    }\n\n  \u2714\
  \ async-hooks\\test-async-local-storage-thenable.js (4896.5713ms)\n  \u2714 async-hooks\\\
  test-async-local-storage-tlssocket.js (5602.1337ms)\n  \u2714 parallel\\test-async-local-storage-bind.js\
  \ (4868.4376ms)\n  \u2714 parallel\\test-async-local-storage-contexts.js (4311.7402ms)\n\
  \  \u2714 parallel\\test-async-local-storage-deep-stack.js (4832.0565ms)\n  \u2714\
  \ parallel\\test-async-local-storage-exit-does-not-leak.js (4463.4821ms)\n  \u2714\
  \ parallel\\test-async-local-storage-http-multiclients.js (4983.6606ms)\n  \u2714\
  \ parallel\\test-async-local-storage-snapshot.js (4266.4503ms)\n\u25B6 AsyncContextFrame\
  \ (6170.0028ms)\n\u2139 tests 23\n\u2139 suites 1\n\u2139 pass 22\n\u2139 fail 1\n\
  \u2139 cancelled 0\n\u2139 skipped 0\n\u2139 todo 0\n\u2139 duration_ms 6186.8915\n\
  \n\u2716 failing tests:\n\ntest at test\\parallel\\test-async-context-frame.mjs:48:5\n\
  \u2716 async-hooks\\test-async-local-storage-socket.js (5107.2542ms)\n  AssertionError\
  \ [ERR_ASSERTION]: Test async-hooks\\test-async-local-storage-socket.js failed with\
  \ exit code 3221225477\n      at TestContext.<anonymous> (file:///C:/workspace/node-test-binary-windows-js-suites/node/test/parallel/test-async-context-frame.mjs:58:7)\n\
  \      at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n\
  \      at async Test.run (node:internal/test_runner/test:888:9)\n      at async\
  \ Promise.all (index 14)\n      at async Suite.run (node:internal/test_runner/test:1268:7)\n\
  \      at async startSubtestAfterBootstrap (node:internal/test_runner/harness:280:3)\
  \ {\n    generatedMessage: false,\n    code: 'ERR_ASSERTION',\n    actual: 3221225477,\n\
  \    expected: 0,\n    operator: 'strictEqual'\n  }"
...

Build links

Additional information

No response

@anonrig anonrig added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Sep 6, 2024
@github-actions github-actions bot added the windows Issues and PRs related to the Windows platform. label Sep 6, 2024
@jakecastelli
Copy link
Member

I have raised this one in openjs flaky ci slack channel and @Qard has responded positively 🙏 I think he would take a look soon.

@jasnell
Copy link
Member

jasnell commented Sep 6, 2024

Yeah, it appears to be a segfault (access violation) that appears to only happen on windows as far as I can determine. Will need a windows machine to investigate further. I was planning to dig in more next week on this one as I need to break out the Windows machine and update it ;-)

FWIW, this appears to be a test-runner issue, possibly with the interaction with the subtests? It seems to always happen with the local-storage-nested subtest but that test does not exercise the processTicksAndRejections path where the access violation appears to be happening. misread that... actually, I definitely think it's an issue with the test runner and the way this test is structured. Going to try an experiment that splits these out into individual tests

@cjihrig
Copy link
Contributor

cjihrig commented Sep 6, 2024

FWIW, this appears to be a test-runner issue

Are you referring to the Python runner or the JS one? I'm assuming you mean Python in this case, but just wanted to verify.

@jasnell
Copy link
Member

jasnell commented Sep 6, 2024

Not sure yet. I think maybe the python runner but can't be certain. I'm running an experiment in #54818 that removes both the node:test runner and the python test runner from the mix to see if we still see flakiness on these. If that looks good, I want to next try with the python runner but no node-test runner, following the pattern of the original test in spawning a bunch of these processes in parallel to see if we still have flakiness. Unfortunately I was not able to reproduce the failure locally at all on a quick test on my old windows machine so I'm going a bit by trial-an-error using CI.

@cjihrig
Copy link
Contributor

cjihrig commented Sep 6, 2024

16:17:18     test at test\parallel\test-async-context-frame.mjs:48:5
16:17:18     ✖ async-hooks\test-async-local-storage-socket.js (5107.2542ms)
16:17:18       AssertionError [ERR_ASSERTION]: Test async-hooks\test-async-local-storage-socket.js failed with exit code 3221225477
16:17:18           at TestContext.<anonymous> (file:///C:/workspace/node-test-binary-windows-js-suites/node/test/parallel/test-async-context-frame.mjs:58:7)
16:17:18           at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
16:17:18           at async Test.run (node:internal/test_runner/test:888:9)
16:17:18           at async Promise.all (index 14)
16:17:18           at async Suite.run (node:internal/test_runner/test:1268:7)
16:17:18           at async startSubtestAfterBootstrap (node:internal/test_runner/harness:280:3) {
16:17:18         generatedMessage: false,
16:17:18         code: 'ERR_ASSERTION',
16:17:18         actual: 3221225477,
16:17:18         expected: 0,
16:17:18         operator: 'strictEqual'
16:17:18       }

The line throwing that is:

strictEqual(code, 0, `Test ${test} failed with exit code ${code}`);

The code is an exit code of a child process running the Python runner. I'd be surprised if that was related to node:test.

@jasnell
Copy link
Member

jasnell commented Sep 6, 2024

I would be too! But just going through due diligence! It's also possible this isn't a problem with the tests at all and something wrong with the windows host running the tests! Just need to go through the steps to eliminate possible causes and narrow it down. node:test is my least likely candidate tho. Yeah, sorry I wasn't clear, when I said test runner I meant the python test runner along with possibly the spawning of multiple instances running in parallel.

@Qard
Copy link
Member

Qard commented Sep 6, 2024

I've been exploring all the results of the reported failures and they seem to fail and many different points in the run. It has a slight bias toward some particular tests but that could be a timing thing and not actually indicative of any failure of the code itself. I think I agree with @jasnell here that something seems off with running multiple python test runners in parallel. Might be worth just disabling (or reducing?) the parallelism and see how that goes?

@joyeecheung
Copy link
Member

joyeecheung commented Sep 18, 2024

IIUC this has been fixed by #54823 and the conclusion is..when using node:test, try not to stretch the concurrency that much?

@joyeecheung
Copy link
Member

Technically, when the test is in parallel, I think the concurrency would be better be just 1, because the python test runner already runs the tests concurrently based on the number of CPU cores out there. If it really needs to be concurrently run by node:test, maybe just place it in sequential, and limit the concurrency to os.cpus().length (I recall that the number this returns this is not fully accurate, but good enough for this use case).

@Qard
Copy link
Member

Qard commented Sep 18, 2024

It runs each individual test separately with the runner though rather than having one run on all matches. We could probably change it to just be a single run, but I went for breaking out the tests to have some separation.

But yeah, having a dozen or so parallel python runners going is going to eat a lot of memory.

I was thinking about if maybe we should have a more official way to run a set of tests with extra flags like that to add to the test suite matrix. Might be an idea to define something a bit better for that, though I don't know if many other cases need it. 🤔

In any case, I think we can close this now.

@Qard Qard closed this as completed Sep 18, 2024
@joyeecheung
Copy link
Member

I think the problem is more like, the Python test runner concurrency should not be mixed with node:test concurrency. Say there are 8 cores, and using Python test runner to run the tests under the parallel folder, if all the tests are written using node:test which then also runs the tests concurrently, you end up with 64 processes running concurrently in total on 8 cores. Either the test using node:test should be placed in sequential folder so there are 8 processes spawned by node:test, or the test should avoid using node:test when it’s under the parallel folder and there are just 8 processes spawned by the Python test runner.

@cjihrig
Copy link
Contributor

cjihrig commented Sep 19, 2024

node:test doesn't spawn child processes unless you're using the --test CLI. It just so happened that this particular test was spawning child processes while also using concurrency. I agree that we shouldn't do that though.

@Qard
Copy link
Member

Qard commented Sep 19, 2024

Yep, I understand that. Doing it sequentially like it's done now though means there's no N magnification, it's just the outer test runner plus a singular inner test at the moment whereas before it was the runner plus N sub-processes and on top of that all the additional parallelism of the python runner running the outer test suite.

If we had a better way to configure specific tests to run with multiple sets of flags this problem could be eliminated. Perhaps we could make the runner support parsing multiple sets of // Flags: ... comments? That might be a better way to handle this case. 🤔

@joyeecheung
Copy link
Member

joyeecheung commented Sep 19, 2024

Maybe an environment variable to specify max concurrency could help, which can be used to override concurrency coming from the JS API. The Python test runner can always set it to 1 for parallel tests.

(Although it may be a footgun if someone actually tries to test concurrency but forgot that it shouldn't be under parallel, though).

@lpinca
Copy link
Member

lpinca commented Sep 28, 2024

It is still flaky.

---
duration_ms: 33727.159
exitcode: 1
severity: flaky
stack: "\u25B6 AsyncContextFrame\n  \u2714 async-hooks\\test-async-local-storage-args.js\
  \ (2092.6293ms)\n  \u2714 async-hooks\\test-async-local-storage-async-await.js (1638.2592ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-async-functions.js (1760.2339ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-dgram.js (1574.7763ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-enable-disable.js (1615.9217ms)\n  \u2716\
  \ async-hooks\\test-async-local-storage-enter-with.js (1680.1985ms)\n    AssertionError\
  \ [ERR_ASSERTION]: Test async-hooks\\test-async-local-storage-enter-with.js failed\
  \ with exit code 3221225477\n        at TestContext.<anonymous> (file:///C:/workspace/node-test-binary-windows-js-suites/node/test/parallel/test-async-context-frame.mjs:59:7)\n\
  \        at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n\
  \        at async Test.run (node:internal/test_runner/test:931:9)\n        at async\
  \ Suite.processPendingSubtests (node:internal/test_runner/test:629:7) {\n      generatedMessage:\
  \ false,\n      code: 'ERR_ASSERTION',\n      actual: 3221225477,\n      expected:\
  \ 0,\n      operator: 'strictEqual'\n    }\n\n  \u2714 async-hooks\\test-async-local-storage-errors.js\
  \ (1725.7327ms)\n  \u2714 async-hooks\\test-async-local-storage-gcable.js (1393.2866ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-http-agent.js (1434.8587ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-http.js (1278.1881ms)\n  \u2714 async-hooks\\\
  test-async-local-storage-misc-stores.js (1380.7043ms)\n  \u2714 async-hooks\\test-async-local-storage-nested.js\
  \ (1385.9539ms)\n  \u2714 async-hooks\\test-async-local-storage-no-mix-contexts.js\
  \ (1912.6905ms)\n  \u2714 async-hooks\\test-async-local-storage-promises.js (1232.9455ms)\n\
  \  \u2714 async-hooks\\test-async-local-storage-socket.js (1209.152ms)\n  \u2714\
  \ async-hooks\\test-async-local-storage-thenable.js (1187.133ms)\n  \u2714 async-hooks\\\
  test-async-local-storage-tlssocket.js (1440.4509ms)\n  \u2714 parallel\\test-async-local-storage-bind.js\
  \ (1161.9364ms)\n  \u2714 parallel\\test-async-local-storage-contexts.js (1147.6641ms)\n\
  \  \u2714 parallel\\test-async-local-storage-deep-stack.js (1163.6765ms)\n  \u2714\
  \ parallel\\test-async-local-storage-exit-does-not-leak.js (1142.5342ms)\n  \u2714\
  \ parallel\\test-async-local-storage-http-multiclients.js (1480.5241ms)\n  \u2714\
  \ parallel\\test-async-local-storage-snapshot.js (1146.9874ms)\n\u2716 AsyncContextFrame\
  \ (33201.9919ms)\n\u2139 tests 23\n\u2139 suites 1\n\u2139 pass 22\n\u2139 fail\
  \ 1\n\u2139 cancelled 0\n\u2139 skipped 0\n\u2139 todo 0\n\u2139 duration_ms 33215.3085\n\
  \n\u2716 failing tests:\n\ntest at test\\parallel\\test-async-context-frame.mjs:49:5\n\
  \u2716 async-hooks\\test-async-local-storage-enter-with.js (1680.1985ms)\n  AssertionError\
  \ [ERR_ASSERTION]: Test async-hooks\\test-async-local-storage-enter-with.js failed\
  \ with exit code 3221225477\n      at TestContext.<anonymous> (file:///C:/workspace/node-test-binary-windows-js-suites/node/test/parallel/test-async-context-frame.mjs:59:7)\n\
  \      at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n\
  \      at async Test.run (node:internal/test_runner/test:931:9)\n      at async\
  \ Suite.processPendingSubtests (node:internal/test_runner/test:629:7) {\n    generatedMessage:\
  \ false,\n    code: 'ERR_ASSERTION',\n    actual: 3221225477,\n    expected: 0,\n\
  \    operator: 'strictEqual'\n  }"
...

@lpinca lpinca reopened this Sep 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

No branches or pull requests

7 participants