SIGSEGV for address: 0x0 in ConcurrentMarking::Run #37106

Closed
bolt-juri-gavshin opened this issue Jan 27, 2021 · 12 comments
Labels: memory, v8 engine

Comments

bolt-juri-gavshin commented Jan 27, 2021

  • Version: v14.15.4
  • Platform: Linux 3d0c8550bcf0 4.19.121-linuxkit #1 SMP Tue Dec 1 17:50:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What steps will reproduce the bug?

I wish I knew; it happens sporadically in the LIVE environment and started right after upgrading from Node 12.20.0 to 14.15.4.

How often does it reproduce? Is there a required condition?

The service has very spiky memory usage, and the crash appears to happen after responding with a larger response (17 MB). That condition alone is not sufficient, though; we could not reproduce it in the staging environment.

What is the expected behavior?

No crash.

What do you see instead?

The process crashes with a segfault.

Additional information

Got a stack trace using require("segfault-handler").registerHandler:

PID 1 received SIGSEGV for address: 0x0
/app_path_replaced_for_privacy_reasons/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x2d76)[0x7f2c1c0d8d76]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f2c1e6e2980]
nodejs(_ZN2v88internal17ConcurrentMarking3RunEiPNS1_9TaskStateE+0x554)[0xcff9c4]
nodejs(_ZThn32_N2v88internal14CancelableTask3RunEv+0x3b)[0xc6c9eb]
nodejs[0xa71405]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2c1e6d76db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2c1e40071f]
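
For reference, the handler was registered at process startup with something like this (the log file name is just illustrative, not our real path):

// registered once at startup, before the service starts handling traffic
const SegfaultHandler = require('segfault-handler');
// writes the native stack trace above to stderr (and to the given file) on SIGSEGV
SegfaultHandler.registerHandler('crash.log');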
schamberg97 (Contributor) commented Jan 28, 2021

@bolt-juri-gavshin could you perhaps share some code that can help to reproduce the issue, please?

bolt-juri-gavshin (Author) replied:

I would gladly do it if I could reproduce it myself. As I wrote in the report, the error happens on LIVE only, sporadically, probably triggered by a specific endpoint call (or calls), but the same call works just fine in staging. I can start this service with additional Node flags (or libraries, like "segfault-handler") to gather more data in the event of its untimely death; let me know which ones...

schamberg97 (Contributor) commented Jan 28, 2021

What kind of service is it, then? Is it some kind of HTTP(S) service? If so, do you use any web frameworks?

Also, do you use any native modules?

bolt-juri-gavshin (Author) commented Jan 28, 2021

This is a gateway service with HTTP (non-S) endpoints, making HTTP (non-S) requests.
The specific endpoint is pretty simple: make a single outgoing request (using "http") that returns a large response (> 15 MB), then clean up that "remote" response object and send a response (roughly like the sketch below).
The web framework is "express".
From the logs I can see that the response for the request inside the service is received and deserialized successfully, but then the service dies during either cleanup or sending.
To my knowledge the only native module that can be related to that endpoint is "unix-dgram" (for syslog).
In the background we use "deasync" (used by "node-etcd") and "worker_threads", but the timing suggests the segfaults are related to that super-simple but memory-hungry endpoint.
I suspect there is a bug in memory allocation or GC. I will continue trying to reproduce it in the staging environment (and maybe even produce some shareable JS code to reproduce it, but I don't have high hopes for that).

Update: "deasync" is not actually used - we use "node-etcd" without the synchronous option.

bolt-juri-gavshin (Author) commented:

I rechecked; this specific service requires only two native modules, deasync and unix-dgram. The deasync module is unused, so the only candidate native module that could interfere with V8 is unix-dgram.

schamberg97 (Contributor) commented Jan 28, 2021

You most probably need to rebuild these modules after your Node.js version upgrade. I have little experience with rebuilding modules, as I practice clean installs, but I believe you should look at npm rebuild. You can also try reinstalling the package (so it gets compiled again). Let me know if it helps.

bolt-juri-gavshin (Author) commented:

Our deployment is Docker based and yarn install is called for every new build (i.e. we also practice clean installs), so npm rebuild will not change anything. I will upgrade the "deasync" and "nan" versions, just in case.
Strange thing: hundreds of other services we have are working just fine, and only this one (super-simple) service is heavily affected (~4 restarts per hour during daytime). The only difference I see is that its CPU and RAM usage pattern is very spiky (memory usage jumps from a base of 220 MB up to 3x and back multiple times per hour).

schamberg97 (Contributor) commented:

@bolt-juri-gavshin I know; it's just that without the ability to replicate the issue, I am trying to troubleshoot other potential causes ;)

bolt-juri-gavshin (Author) commented:

@schamberg97 it looks like the only way forward (to gather more info) is to put a debug Node into the LIVE env. Is there an easy way to get a debug version of Node using apt, or is building from source the only way?

targos added the v8 engine and memory labels and removed the v8 engine label on Jan 31, 2021
bolt-juri-gavshin (Author) commented:

Good morning! There are some new developments:

  1. I compiled v14.15.4 from source, built a new Docker image, and deployed it to half of the instances on LIVE for the weekend. Processes still crashed, but the "ip" address now changes each time:
[Sun Jan 31 07:23:15 2021] traps: nodejs[4351] general protection fault ip:55821ad1fa0e sp:7f497d565a30 error:0 in node[55821a240000+4191000]
[Sun Jan 31 14:26:01 2021] traps: nodejs[10096] general protection fault ip:560ccb67aa0e sp:7fb4a90c7a30 error:0 in node[560ccab9b000+4191000]
[Sun Jan 31 15:24:44 2021] traps: nodejs[27917] general protection fault ip:55d25f74ca0e sp:7f66fe729a30 error:0 in node[55d25ec6d000+4191000]
[Sun Jan 31 18:21:30 2021] traps: nodejs[6464] general protection fault ip:5562fed0aa0e sp:7f493df3da30 error:0 in node[5562fe22b000+4191000]

The instruction pointer was quite stable before (with the official binary):

[Fri Jan 29 12:13:15 2021] traps: nodejs[20138] general protection fault ip:cff9c4 sp:7fb2faffbb30 error:0 in node[400000+3e4f000]
[Fri Jan 29 12:43:42 2021] traps: nodejs[24819] general protection fault ip:cff9c4 sp:7f64c7ffdb30 error:0 in node[400000+3e4f000]
[Fri Jan 29 22:34:33 2021] traps: nodejs[9912] general protection fault ip:cff9c4 sp:7f93d5db3b30 error:0 in node[400000+3e4f000]
  2. The big responses retrieved data from the beginning of the month; as of today (February 1, 10 a.m.) there have been no segfaults, which means that with exactly the same logic as before, but smaller datasets, the code is stable. I will try to create a small piece of code that hopefully reproduces the issue (along the lines of the sketch below), since everything should be quite stable for a week or two, and by the second half of the month that API will most probably be fixed to limit per-request volumes.
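
A rough idea of what that reproduction attempt would look like (sizes, counts and delays are guesses, only meant to force the same spiky allocation pattern while concurrent marking runs):

// build and drop large JSON payloads in a loop so heap usage repeatedly
// spikes from a small baseline to several times its size, as in production
function makePayload(rows) {
  const data = [];
  for (let i = 0; i < rows; i++) {
    data.push({ id: i, value: 'x'.repeat(200), ts: Date.now() });
  }
  return JSON.stringify(data);
}

async function run() {
  for (let i = 0; i < 1000; i++) {
    const raw = makePayload(100000);   // roughly a 15-25 MB string
    const parsed = JSON.parse(raw);    // deserialize, as the service does
    void parsed.length;                // touch the result, then let it become garbage
    await new Promise((resolve) => setTimeout(resolve, 50)); // give concurrent GC a chance to run
  }
}

run();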

@targos, I believe the "v8 engine" tag you added and removed is also correct here. My impression is that the V8 used in 14.15.4 has a bug in the GC (official Node 14.15.4 release):

0x0000000000cff9c4: v8::internal::ConcurrentMarking::Run(int, v8::internal::ConcurrentMarking::TaskState*) at ??:?

Maybe someone has access to the release build output and can map that instruction to a specific line in the code?

targos added the v8 engine label on Feb 1, 2021
bolt-juri-gavshin (Author) commented:

Hello!
I tried to run the service on 14.15.5 and the error is still there; however, it has changed a bit.
The "general protection fault" is the same on both versions:

general protection fault ip:cff9c4 sp:7f4fa12e2b30 error:0 in node[400000+3e4f000]

but segfault is now different:

14.15.4:

segfault at 35303830383d ip 0000000000cff9c4 sp 00007f8f4b7fcb30 error 4 in node[400000+3e4f000]
Code: 85 a9 33 00 00 48 39 f0 0f 84 a0 33 00 00 4d 8b 6f ff 80 bd 38 ee ff ff 00 0f 85 27 39 00 00 4c 8b bd 58 ee ff ff 4d 8d 75 0a <41> 0f b6 06 3c 47 0f 87 23 57 00 00 ff 24 c5 80 e6 19 02 4c 89 fe

14.15.5:

segfault at 6e61686f5d ip 0000000000d7bebd sp 00007ffe5ada4b70 error 4 in node[400000+3e4f000]
Code: 08 48 85 c9 0f 84 db 01 00 00 48 8d 41 ff 48 89 42 08 48 8b 44 ca 08 48 89 85 70 ff ff ff 4c 8b a5 70 ff ff ff 49 8b 44 24 ff <0f> b7 40 0b 66 2d a5 00 66 83 f8 01 0f 86 9a 01 00 00 4d 8b 74 24

Note the different instruction pointer d7bebd vs the "original" cff9c4.

addr2line -p -f -C -a cff9c4 -e /usr/bin/node
0x0000000000cff9c4: v8::internal::ConcurrentMarking::Run(int, v8::internal::ConcurrentMarking::TaskState*) at ??:?

addr2line -p -f -C -a d7bebd -e /usr/bin/node
0x0000000000d7bebd: unsigned long v8::internal::MarkCompactCollector::ProcessMarkingWorklist<(v8::internal::MarkCompactCollector::MarkingWorklistProcessingMode)0>(unsigned long) at ??:?

@schamberg97 @targos
I hope this information will give you more clues on where to look for the possible problem.

bolt-juri-gavshin (Author) commented:

Duplicate of #37553. No problems observed in 14.17.4.
