AIX: npm install gives IOT/Abort trap #4083
Comments
might be related |
gcc libs:
root@nimvie: /home/tremch/GIT/various_python_stuff # rpm -qa | grep libgcc
libgcc10-10.3.0-6.ppc
libgcc-10-2.ppc
root@nimvie: /home/tremch/GIT/various_python_stuff # rpm -qa | grep libstd
libstdc++-10-2.ppc
libstdc++-devel-10-2.ppc
libstdc++10-10.3.0-6.ppc
libstdc++10-devel-10.3.0-6.ppc |
dbx core file analysis suggests something is wrong with pthread:
root@nimvie: /opt/node-v18.13.0-aix-ppc64 # dbx bin/node core
Type 'help' for help.
warning: The core file is not a fullcore. Some info may
not be available.
[using memory image in core]
reading symbolic information ...internal error: assertion failed at line 6693 in file object.c
[the same "internal error: assertion failed at line 6693 in file object.c" line repeated 52 more times]
IOT/Abort trap in pthread_kill at 0x9000000006a92f8
0x9000000006a92f8 (pthread_kill+0x98) e8410028 ld r2,0x28(r1)
(dbx)
|
i tried it on an older aix level (7.1), even worse than on 7.3:
packagebuilder@aixbuildhostng: /home/packagebuilder/test/node-v18.13.0-aix-ppc64/bin # ./node -v
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
IOT/Abort trap
packagebuilder@aixbuildhostng: /home/packagebuilder/test/node-v18.13.0-aix-ppc64/bin # ./npm version
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
IOT/Abort trap
also points to a pthread problem.... |
I don't have a 7.1 machine, but I tried it on AIX 7.2 and I am not facing this issue there either. For me it is working fine on both AIX 7.2 and 7.3. I guess this might be an environment issue or a gcc/libgcc package installation issue. |
dnf won't let me do this...newest libgcc10 seems to be installed anyway
|
Yes, you are right. We just figured out that "dnf" has a dependency on libgcc, which is why it is not allowing you to uninstall it. Anyway, coming to the issue, I see you have given a dbx core file analysis. Can you try that with gdb instead of dbx? Are these freshly installed systems? I ask because I tried on a fresh system. Since it's working for me, I am guessing that this is more of an environment issue. Meanwhile, I will try this on AIX 7.1. Thanks |
gdb run, as suspected...aix libpthreads.a
|
meanwhile i built node from source...same error...i'm outta here...
|
@flynn1973 it's not quite clear to me which version of AIX you were trying to run the binaries on. For 18.x the community binaries only support AIX 7.2 or later - https://github.com/nodejs/node/blob/main/BUILDING.md#official-binary-platforms-and-toolchains It is also built with GCC 8 and needs the corresponding glibc installed. glibc versions |
i did all my testing, installing and building on aix 7.3...just did a quick run on 7.1 but this can be ignored. as seen above a build on 7.3 with gcc8 was successful but the binaries show the same problems as the prebuilt ones. |
@flynn1973 We tried to recreate the issue on our systems but were unable to do so. We even tried with the source code; it is working fine for us on both AIX 7.2 and 7.3. |
i skimmed through the truss output and it seems it tries to open "/root/package.json" which of course does not exist. i created an empty file then another error occurs later in the run.
|
@flynn1973 Can you please switch to frame number 5 and print the values of w,pc(pollset_ctl),pqry(pollset_query) from the gdb core dump file? |
could not get the function values...
|
@flynn1973 - looks like the values are optimised out. One option is to dump the address of https://github.com/libuv/libuv/blob/244df24bf411a396ceaf69f8a80a98e5629ee584/src/unix/aix.c#L135 we can potentially get the value of fd from the memory. And then we can validate if it is a valid |
seems there is no memory for pqry on the stack...
query for pc
|
Hello @flynn1973, Can you please run these steps: Go to frame number 5: |
here you go...
|
@flynn1973 - looks like $r31 is corrupt / core file is truncated. Is there any way to figure out if the core file is full? alternatively, is there any way you can pass the file for my inspection? |
@flynn1973 : So we cannot rely on its value. However, the truss output provides the address of
and at those locations we have:
this means:
this shows invalid |
@flynn1973 : We tried to recreate the issue with same AIX version |
basically yes, but the file is bigger than what github allows me to upload. |
the same issue was reported by another user on the aix open source forums |
I see the same error 'Assertion failed: __EX, file ../deps/uv/src/unix/aix.c, line 186' |
google drive, maybe? |
as the problematic binary is from the official distribution...do you really need this binary uploaded? |
thanks @vandysn , that was quick. will analyse this to see what this means and get back. |
tldr; we got some leads, but not any great revelations. will share detailed analysis later. @vandysn - I have uploaded a new revision of the instrument. @vandysn - could you pls try again with this one? https://gist.github.com/gireeshpunathil/82185356bdc5356a68f4e65bd7c3699e |
@gireeshpunathil - The new one works well. The issue is gone. Below are the debug messages around the same area in case you need them. LIBUV: before pollset_ctl 20 1 |
@vandysn - thanks. this is interesting! we haven't made anything to however, this brings up an important aspect of the problem - looks like the issue is time dependent. This can also explain why my systems (which have the same h/w, os config as the problematic ones) did not fail even once. from the result (instrument changing the socket behaviour), it is reasonable to believe that the timing window can be related to n/w settings, system load, cpu config, or something else - we don't know. pls stay tuned, I am going to try with a small C test case simulating the polling sequence. I will update you what I find. If something interesting comes up, will raise it up with @nodejs/libuv too. |
https://gist.github.com/gireeshpunathil/d3c4b15f5a9dcea39979295f520484e3
(depending on how truly this simulates node behaviour, this can be an iterative process, just want to let you know) |
i fired up the simulation.
|
@flynn1973 - thanks for checking! the output is definitely unexpected. An EPIPE would indicate the peer socket is not connected for a SOCK_STREAM type socket, but here we are writing only after a pollset_poll that checks for the readiness of the socket. (An ENOTCONN would have been a little comforting, even though that too does not make sense after the poll) Our intent is to see what happens at @vandysn - feel free to test both sim1 and sim2, curious to know whether you get EPIPE too! |
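For readers unfamiliar with the EPIPE case mentioned above, here is a minimal sketch in Python of how a write to a SOCK_STREAM socket fails once the peer has gone away. It is not the simulation from the gists and does not use the AIX pollset API; the function name `write_after_peer_close` is made up for illustration, and it uses a local socket pair rather than a real network connection.

```python
import errno
import socket


def write_after_peer_close():
    """Return the errno seen when writing to a stream socket whose peer has closed.

    Hypothetical helper for illustration only; returns 0 if the write succeeds.
    """
    # socketpair() gives two already-connected stream sockets on POSIX systems.
    a, b = socket.socketpair()
    try:
        b.close()  # the peer goes away
        try:
            a.send(b"ping")  # writing to a stream socket with no peer
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        a.close()
```

On a typical POSIX system the returned value should be `errno.EPIPE`. The puzzling part in the thread is that the write came right after a poll that had reported the socket as ready.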
@gireeshpunathil - Below is what I see
4587900: 24510759: pollset_create(32) = 6
4587924: 24576311: pollset_create(32) = 6 |
truss -f of sim2:
|
@vandysn - looks like your process is unable to establish connection with |
thanks @flynn1973. the output is similar to what we get - that is, pollset_query returns the fd that is already in the polling set. In the bad case, when pollset_ctl returns EINVAL, pollset_query was returning 0, and that was the root cause of the assertion. so in summary, this program does not seem to mimic the actual scenario! let us see what happens in @vandysn 's case. |
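The good and bad cases described in the comment above can be sketched as a toy model. This is only an illustration under stated assumptions: `FakePollset` and `add_fd` are hypothetical names, the class is an in-memory stand-in rather than the AIX pollset API, and the logic is a simplified reading of the thread, not a copy of the libuv source.

```python
import errno


class FakePollset:
    """Hypothetical in-memory stand-in for a pollset (not the AIX API)."""

    def __init__(self):
        self.members = set()  # fds currently in the set
        self.invalid = set()  # fds this fake kernel treats as no longer valid

    def ctl_add(self, fd):
        # Assumption: adding fails with EINVAL if the fd is already present
        # or is not a valid descriptor.
        if fd in self.members or fd in self.invalid:
            return errno.EINVAL
        self.members.add(fd)
        return 0

    def query(self, fd):
        # 1 if the fd is in the set, 0 if it is not (as described in the thread).
        return 1 if fd in self.members else 0


def add_fd(pollset, fd):
    """Classify the outcome of trying to add fd to the pollset."""
    if pollset.ctl_add(fd) == 0:
        return "added"
    if pollset.query(fd) == 1:
        return "already-present"  # good case: the add failed but the fd is there
    return "assertion"  # bad case: EINVAL and the query reports 0
```

In this model, adding the same fd twice ends in "already-present", while an fd that has silently become invalid ends in "assertion", which is the pattern reported for the failing systems.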
@gireeshpunathil - I did not think it was connecting to registry.npmjs.org. Ping to registry.npmjs.org does not return a reply. The AIX server is setup with proxy. |
ok - then how does |
Sorry, I am not an expert on proxies either. But it is setup so that 'npm install' works but ping does not. I do not have a way to bypass the proxy. So is it possible to just go with what flynn1973 provided. |
i am in full control of my aix machine...so i can do/test whatever is necessary |
I don't think it's necessarily related but since it only occurs with an npm proxy I'll mention nodejs/node#48969 which was also only seen when using a proxy. The main branch now has this fix and this is the backport - nodejs/node#49016 for 18.x. If @flynn1973 recreates without the proxy, then never mind since it should not be related. |
thanks @mhdawson for this, looks like a critical piece of info in this context. I will try to locally back port this fix and provide a binary for @flynn1973 and @vandysn to test. Meanwhile, I have also developed a third simulation that is a middle ground between the full blown node (fails) and the C socket connecting code (passes) that I would like to be tested in the failing systems. @flynn1973 and @vandysn - could you try this and let me know what you get? (with and without truss) https://gist.github.com/gireeshpunathil/d3c4b15f5a9dcea39979295f520484e3#file-foo-js |
without truss:
with truss:
|
thanks @flynn1973 for the run, unfortunately this took a totally different route:
not sure what is going on. I am going to upload the back-port of double TLS bug fix to see if that makes things better. stay tuned. |
@flynn1973 - could you please test your original test case (npm install) with this node binary and let me know how it goes? |
here you go...
|
@flynn1973 - looks like you picked up a wrong binary? could you do
|
hmm...i see..but there is nothing else for download... |
downloaded the whole container...seems this includes the correct file..
|
seems that its working now...
|
wow - thanks for sharing the news, @flynn1973 ! the only missing piece of the puzzle is how the double TLS bug manifested as an assertion failure in libuv. @mhdawson - do you have additional insights on the bug? either way, this is back ported to the 18.x line now, hopefully this will make it to the next v18.x release. |
@gireeshpunathil I don't know how it would have ended up with an assertion in libuv, my only guess might be corrupting some memory that then later resulted in a crash versus crashing earlier as we saw on other platforms. |
thanks @mhdawson . I agree, that would be a good explanation of what we saw in numerous debug iterations - an fd that was all good at the time when it was created, connected (with the remote npm server), polled for connection establishment, suddenly becomes invalid when polled for reading! With that, we conclude this issue. Thanks to everyone involved in the problem determination, and special thanks to @flynn1973 for being immensely patient and helpful throughout! Closing as resolved. |
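As a rough illustration of the symptom summarised above (an fd that polls fine and is later reported as invalid), here is a small Python sketch using the generic POSIX `poll` interface. It is not the AIX pollset API and does not reproduce the TLS bug; the function name `poll_closed_fd` is hypothetical, and the fd is invalidated deliberately by closing it.

```python
import os
import select


def poll_closed_fd():
    """Poll a pipe's read end before and after the fd is closed behind the poller's back."""
    r, w = os.pipe()
    poller = select.poll()
    poller.register(r, select.POLLIN)
    before = poller.poll(0)  # valid fd, nothing to read yet
    os.close(r)              # the fd becomes invalid while still registered
    after = poller.poll(0)   # the poller now reports the fd as invalid
    os.close(w)
    return before, after
```

Here `before` should be empty and `after` should carry the `select.POLLNVAL` flag for the closed fd. In the issue, the invalidation was not an explicit close but, per the guess in the thread, possibly a side effect of memory corruption.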
Details
installed from here -> https://nodejs.org/dist/v18.13.0/node-v18.13.0-aix-ppc64.tar.gz
npm install gives following error...exporting LIBPATH is of no help
Node.js version
Example code
No response
Operating system
Scope
runtime i guess
Module and version
Not applicable.