SIGSEGV while initializing tcnative #681

Closed
martin-g opened this issue Nov 22, 2021 · 25 comments · Fixed by netty/netty#11856

Comments

@martin-g

Hi,

Netty: 4.1.70
Netty tcnative: 2.0.46
Grpc: 1.42.1

I face a strange issue while trying to run a simple Grpc service on Ubuntu 20.04.3 ARM64:
The following stack trace leads to SIGSEGV:

Stack: [0x0000ffffa78b7000,0x0000ffffa7ab7000],  sp=0x0000ffffa7ab2fd0,  free space=2031k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000ffff7bc8f578
j  io.netty.internal.tcnative.Library.initialize(Ljava/lang/String;Ljava/lang/String;)Z+31
j  io.netty.handler.ssl.OpenSsl.initializeTcNative(Ljava/lang/String;)Z+3
j  io.netty.handler.ssl.OpenSsl.<clinit>()V+252
v  ~StubRoutines::call_stub
V  [libjvm.so+0x6162f4]
V  [libjvm.so+0x5b9798]
V  [libjvm.so+0x5b9c14]
V  [libjvm.so+0x5b9f00]
V  [libjvm.so+0x7a8cc8]
V  [libjvm.so+0x7a948c]
V  [libjvm.so+0x7a9538]
V  [libjvm.so+0x60ccbc]
j  io.grpc.netty.GrpcSslContexts.defaultSslProvider()Lio/netty/handler/ssl/SslProvider;+0
j  io.grpc.netty.GrpcSslContexts.configure(Lio/netty/handler/ssl/SslContextBuilder;)Lio/netty/handler/ssl/SslContextBuilder;+1
j  io.grpc.netty.GrpcSslContexts.forClient()Lio/netty/handler/ssl/SslContextBuilder;+3
j  io.grpc.netty.NettyChannelBuilder$DefaultProtocolNegotiator.newNegotiator()Lio/grpc/netty/ProtocolNegotiator;+19
j  io.grpc.netty.NettyChannelBuilder.buildTransportFactory()Lio/grpc/internal/ClientTransportFactory;+8
j  io.grpc.netty.NettyChannelBuilder$NettyChannelTransportFactoryBuilder.buildClientTransportFactory()Lio/grpc/internal/ClientTransportFactory;+4
j  io.grpc.internal.ManagedChannelImplBuilder.build()Lio/grpc/ManagedChannel;+13
j  io.grpc.internal.AbstractManagedChannelImplBuilder.build()Lio/grpc/ManagedChannel;+4
j  alluxio.hub.agent.process.AgentProcessMonitor.pingService(Ljava/net/InetSocketAddress;Lalluxio/retry/RetryPolicy;J)V+4
j  alluxio.hub.agent.process.AgentProcessMonitorTest.lambda$testProcessMonitorFailure$0()V+25
j  alluxio.hub.agent.process.AgentProcessMonitorTest$$Lambda$32.run()V+0
...

The application uses boringssl-static (see the attached mvn-dependency-tree.txt). According to https://netty.io/wiki/forked-tomcat-native.html when boringssl-static is being used then APR is not needed but the core dump says:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  io.netty.internal.tcnative.Library.aprMajorVersion()I+0
j  io.netty.internal.tcnative.Library.initialize(Ljava/lang/String;Ljava/lang/String;)Z+31
j  io.netty.handler.ssl.OpenSsl.initializeTcNative(Ljava/lang/String;)Z+3
j  io.netty.handler.ssl.OpenSsl.<clinit>()V+252

Apr is installed on the system:

$ apr-1-config --version
1.6.5

Please let me know if I can provide more information!

mvn-dependency-tree.txt
hs_err_pid148138.log

@martin-g
Author

Originally reported at Alluxio/alluxio#12704 (comment)

@normanmaurer
Member

Yeah it should use the apr version which is statically compiled into it. Can you check what happens when you remove apr from the system itself ?

@martin-g
Author

martin-g commented Nov 22, 2021

I've removed the following packages: libapr1-dev libaprutil1-dev libsctp-dev{u} libsctp1{u} libsvn-dev{a} libapr1 libaprutil1 libserf-1-1{a} libsvn1{a} libutf8proc2{u} subversion{a} but nothing changed :-/

@normanmaurer
Member

normanmaurer commented Nov 22, 2021 via email

@martin-g
Author

Do you have ARM64 hardware to run the Docker image ? Emulating with QEMU would be slow.

Otherwise I could give you access to my VM ? Please send me an email to mgrigorov @ apache org if you prefer the VM approach!

@normanmaurer
Member

I have an M1, so I guess that should do it?

@martin-g
Author

Hopefully!
Give me some time to prepare the image!

@martin-g
Author

# on the host
docker pull ghcr.io/martin-g/netty-sigsegv:latest
docker run -it --name debug-sigsegv ghcr.io/martin-g/netty-sigsegv

# in the Docker container
cd netty-tcnative/alluxio/hub/server/
mvn test -Dtest=AgentProcessMonitorTest#testProcessMonitorFailure

martin-g/netty-sigsegv:latest is a bit fat (~5GB) because it contains all the prereqs and Maven dependencies

@martin-g
Author

Without Docker the steps are:

  1. git clone --branch netty-sigsegv-debug https://github.com/martin-g/alluxio.git
  2. export JAVA_HOME=/path/to/jdk-8
  3. mvn clean install -DskipTests
  4. cd hub/server/
  5. mvnDebug test -Dtest=AgentProcessMonitorTest#testProcessMonitorFailure

@martin-g
Author

If I add -DargLine=-Dio.netty.handler.ssl.noOpenSsl=true -Dio.netty.handler.ssl.noOpenSsl=true to the mvn test command then the test passes!

@normanmaurer
Member

I will check most likely tomorrow..

@martin-g
Author

martin-g commented Nov 23, 2021

I think I've found the problem!
See the attached tests.log file.
It says:

java.lang.IllegalArgumentException: Failed to load any of the given libraries: [netty_tcnative_linux_aarch_64, netty_tcnative_linux_aarch_64_fedora, netty_tcnative_aarch_64, netty_tcnative]
        at io.netty.util.internal.NativeLibraryLoader.loadFirstAvailable(NativeLibraryLoader.java:104)

but down in the Caused by logs it never attempts netty_tcnative_linux_aarch_64 (which is provided by netty-tcnative-boringssl-static.jar). It tries with netty_tcnative_linux_aarch_64_fedora, netty_tcnative_aarch_64 and netty_tcnative

tests.log

@normanmaurer
Member

interesting... that said I think it should not segfault

@martin-g
Author

I am not sure why it segfaults but hs_err*.txt says the problem is at:

Stack: [0x0000ffffa78b7000,0x0000ffffa7ab7000],  sp=0x0000ffffa7ab2fd0,  free space=2031k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000ffff7bc8f578
j  io.netty.internal.tcnative.Library.initialize(Ljava/lang/String;Ljava/lang/String;)Z+31
j  io.netty.handler.ssl.OpenSsl.initializeTcNative(Ljava/lang/String;)Z+3
j  io.netty.handler.ssl.OpenSsl.<clinit>()V+252
v  ~StubRoutines::call_stub
V  [libjvm.so+0x6162f4]
V  [libjvm.so+0x5b9798]
V  [libjvm.so+0x5b9c14]
V  [libjvm.so+0x5b9f00]
V  [libjvm.so+0x7a8cc8]
V  [libjvm.so+0x7a948c]
V  [libjvm.so+0x7a9538]
V  [libjvm.so+0x60ccbc]
j  io.grpc.netty.GrpcSslContexts.defaultSslProvider()Lio/netty/handler/ssl/SslProvider;+0
j  io.grpc.netty.GrpcSslContexts.configure(Lio/netty/handler/ssl/SslContextBuilder;)Lio/netty/handler/ssl/SslContextBuilder;+1
j  io.grpc.netty.GrpcSslContexts.forClient()Lio/netty/handler/ssl/SslContextBuilder;+3
...

@martin-g
Author

I've just unzipped ~/.m2/repository/io/netty/netty-tcnative-boringssl-static/2.0.46.Final/netty-tcnative-boringssl-static-2.0.46.Final.jar

$ ll /tmp/META-INF/native/
total 12M
-rw-r--r-- 1 ubuntu ubuntu 2.1M Nov 17 11:12 libnetty_tcnative_linux_aarch_64.so
-rw-r--r-- 1 ubuntu ubuntu 2.6M Nov 17 11:12 libnetty_tcnative_linux_x86_64.so
-rw-r--r-- 1 ubuntu ubuntu 2.2M Nov 17 11:12 libnetty_tcnative_osx_aarch_64.jnilib
-rw-r--r-- 1 ubuntu ubuntu 2.7M Nov 17 11:12 libnetty_tcnative_osx_x86_64.jnilib
-rw-r--r-- 1 ubuntu ubuntu 2.6M Nov 17 11:12 netty_tcnative_windows_x86_64.dll

and exported LD_LIBRARY_PATH to /tmp/META-INF/native

The test passed ! Without _fedora !
I expected that I would need to copy libnetty_tcnative_linux_aarch_64.so to netty_tcnative_linux_aarch_64_fedora.so to make it work, but that wasn't needed.
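
The manual unzip-and-list step above can also be done without extracting the jar. The following is a minimal sketch using only the JDK; the class name `NativeEntryLister` and its method are hypothetical and are not part of Netty. It assumes the native libraries live under `META-INF/native/`, as the listing above shows for netty-tcnative-boringssl-static.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Netty): lists the files a jar ships under META-INF/native/.
final class NativeEntryLister {

    // Returns the names of all non-directory entries below META-INF/native/ in the given jar.
    public static List<String> listNativeEntries(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().startsWith("META-INF/native/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    // Usage: java NativeEntryLister first.jar second.jar ...
    public static void main(String[] args) throws IOException {
        for (String jarPath : args) {
            for (String name : listNativeEntries(jarPath)) {
                System.out.println(jarPath + ": " + name);
            }
        }
    }
}
```

Running this against every jar on the test classpath would show whether more than one dependency ships a file with the same native library name.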

@normanmaurer
Member

@martin-g so you say you didn't need to change anything? All you did was unzip the jar and set LD_LIBRARY_PATH, and after this it worked?

@martin-g
Author

Right!
It seems for some reason it does not try to load netty_tcnative_linux_aarch_64 from the jar. For some reason it is not in the suppressed exceptions and Caused Bys.

I am debugging the test locally on x86_64 and all seems fine with the logic.

@martin-g
Author

martin-g commented Nov 24, 2021

If you need to enable logging for the test you could add log4j.logger.io.netty.handler.ssl=debug to /netty-tcnative/alluxio/hub/server/src/test/resources/log4j.properties
The log file would be at /netty-tcnative/alluxio/hub/server/target/logs/tests.log

@martin-g
Author

martin-g commented Nov 24, 2021

I am trying to build Netty locally with some extra logging but it fails at:

[INFO] Netty/Handler ...................................... FAILURE [  1.851 s]
[INFO] Netty/Codec/HTTP ................................... SKIPPED
[INFO] Netty/Codec/HTTP2 .................................. SKIPPED
[INFO] Netty/Codec/Memcache ............................... SKIPPED
[INFO] Netty/Codec/MQTT ................................... SKIPPED
[INFO] Netty/Codec/Redis .................................. SKIPPED
[INFO] Netty/Codec/SMTP ................................... SKIPPED
[INFO] Netty/Codec/Socks .................................. SKIPPED
[INFO] Netty/Codec/Stomp .................................. SKIPPED
[INFO] Netty/Codec/XML .................................... SKIPPED
[INFO] Netty/Handler/Proxy ................................ SKIPPED
[INFO] Netty/Resolver/DNS ................................. SKIPPED
[INFO] Netty/Transport/RXTX ............................... SKIPPED
[INFO] Netty/Transport/SCTP ............................... SKIPPED
[INFO] Netty/Transport/UDT ................................ SKIPPED
[INFO] Netty/Transport/Native/Unix/Common ................. SKIPPED
[INFO] Netty/Transport/Classes/Epoll ...................... SKIPPED
[INFO] Netty/Transport/Classes/KQueue ..................... SKIPPED
[INFO] Netty/Resolver/DNS/Classes/MacOS ................... SKIPPED
[INFO] Netty/All-in-One ................................... SKIPPED
[INFO] Netty/Resolver/DNS/Native/MacOS .................... SKIPPED
[INFO] Netty/Transport/Native/Unix/Common/Tests ........... SKIPPED
[INFO] Netty/Testsuite .................................... SKIPPED
[INFO] Netty/Transport/Native/Epoll ....................... SKIPPED
[INFO] Netty/Transport/Native/KQueue ...................... SKIPPED
[INFO] Netty/Example ...................................... SKIPPED
[INFO] Netty/Testsuite/Autobahn ........................... SKIPPED
[INFO] Netty/Testsuite/Http2 .............................. SKIPPED
[INFO] Netty/Testsuite/OSGI ............................... SKIPPED
[INFO] Netty/Testsuite/Shading ............................ SKIPPED
[INFO] Netty/Testsuite/Native ............................. SKIPPED
[INFO] Netty/Testsuite/NativeImage ........................ SKIPPED
[INFO] Netty/Testsuite/NativeImage/Client ................. SKIPPED
[INFO] Netty/Testsuite/NativeImage/ClientRuntimeInit ...... SKIPPED
[INFO] Netty/Transport/BlockHound/Tests ................... SKIPPED
[INFO] Netty/Microbench ................................... SKIPPED
[INFO] Netty/BOM .......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.415 s
[INFO] Finished at: 2021-11-24T10:20:16Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project netty-handler: Could not resolve dependencies for project io.netty:netty-handler:jar:4.1.71.Final-SNAPSHOT: Could not find artifact io.netty:netty-tcnative:jar:linux-aarch_64:2.0.46.Final in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR] 

https://repo.maven.apache.org/maven2/io/netty/netty-tcnative/2.0.46.Final/ contains only the _fedora thingy ...

Update: I've hacked it by changing ubuntu to fedora in /etc/os-release.
This hack didn't help for the failing test though.

@normanmaurer
Member

normanmaurer commented Nov 24, 2021

@martin-g use ./mvnw clean package -Pboringssl

@martin-g
Author

I've found the issue!

2021-11-24 10:36:36,763 [main] INFO  internal.NativeLibraryLoader (NativeLibraryLoader.java:load) - ===== Class loader returned url jar:file:/home/ubuntu/.m2/repository/org/apache/ratis/ratis-thirdparty-misc/0.6.0/ratis-thirdparty-misc-0.6.0.jar!/META-INF/native/libnetty_tcnative_linux_aarch_64.so

It found some other library in a dependency jar.
Why it is not used - I still have no idea.

@normanmaurer Do you think using ClassLoader#getResources(String) at io.netty.util.internal.NativeLibraryLoader#load() would be a better way ? There you could filter out only the Netty jars or throw an exception if there are more than one results ?

@normanmaurer
Member

@martin-g not sure I follow what you propose.. Can you show via a PR ?

@martin-g
Author

I will prepare a PR!
The idea is to collect all resources (ClassLoader#getResources), not just the first one (ClassLoader#getResource).
If there is more than one result, then either throw an error or pick the right one (a jar whose path contains io/netty/...).
Currently by using ClassLoader#getResource() it finds the first resource with this name and in the current case it is something totally wrong.
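
The difference between the two lookups can be sketched roughly as follows. This is a standalone illustration using only the JDK, not the change that was later merged into Netty; the class `NativeResourceCheck` and its methods are hypothetical names, and throwing `IllegalStateException` is just one of the two options mentioned above.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch (not Netty code): detect when several classpath entries provide the same resource.
final class NativeResourceCheck {

    // ClassLoader#getResources returns every match, unlike getResource, which returns only the first.
    public static List<URL> findAll(ClassLoader loader, String resourceName) throws IOException {
        return Collections.list(loader.getResources(resourceName));
    }

    // Returns the single provider, null if there is none, and throws if the name is ambiguous.
    public static URL requireUnique(ClassLoader loader, String resourceName) throws IOException {
        List<URL> found = findAll(loader, resourceName);
        if (found.isEmpty()) {
            return null;
        }
        if (found.size() > 1) {
            throw new IllegalStateException(
                    "Multiple resources found for '" + resourceName + "': " + found);
        }
        return found.get(0);
    }

    // Usage: java -cp <classpath> NativeResourceCheck [resourceName]
    public static void main(String[] args) throws IOException {
        String name = args.length > 0
                ? args[0]
                : "META-INF/native/libnetty_tcnative_linux_aarch_64.so";
        ClassLoader loader = NativeResourceCheck.class.getClassLoader();
        for (URL url : findAll(loader, name)) {
            System.out.println(url);
        }
    }
}
```

With the classpath from this issue, `findAll` would be expected to report both the netty-tcnative-boringssl-static jar and the ratis-thirdparty-misc jar, which is exactly the ambiguity that `getResource` hides.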

@normanmaurer
Member

got it... I guess we might just throw if we find multiple as it is hard to know which one is correct if shading is used.

martin-g added a commit to martin-g/netty that referenced this issue Nov 24, 2021
Fixes netty/netty-tcnative#681
Throw an exception when there are multiple netty-tcnative-** libraries with the same path in the classpath
@martin-g
Author

I've reported the problem to Apache Ratis: https://issues.apache.org/jira/browse/RATIS-1443

normanmaurer pushed a commit to netty/netty that referenced this issue Nov 27, 2021
Throw an exception when there are multiple netty-tcnative-** libraries with the same path in the classpath

Motivation:

Currently Netty loads the first resource in the classpath with a given name.
It seems there are [libraries](netty/netty-tcnative#681 (comment)) which provide Netty's native libraries themselves.

Modification:

From now on Netty will look for all resources with the given name and throw an exception if there are more than one.
The user application needs to make sure that there is at most one provider of Netty's native libraries (netty-tcnative-**)

Result:

Fixes netty/netty-tcnative#681.
laosijikaichele pushed a commit to laosijikaichele/netty that referenced this issue Dec 16, 2021
…#11856)

Throw an exception when there are multiple netty-tcnative-** libraries with the same path in the classpath

Motivation:

Currently Netty loads the first resource in the classpath with a given name.
It seems there are [libraries](netty/netty-tcnative#681 (comment)) which provide Netty's native libraries themselves.

Modification:

From now on Netty will look for all resources with the given name and throw an exception if there are more than one.
The user application needs to make sure that there is at most one provider of Netty's native libraries (netty-tcnative-**)

Result:

Fixes netty/netty-tcnative#681.
raidyue pushed a commit to raidyue/netty that referenced this issue Jul 8, 2022
…#11856)

Throw an exception when there are multiple netty-tcnative-** libraries with the same path in the classpath

Motivation:

Currently Netty loads the first resource in the classpath with a given name.
It seems there are [libraries](netty/netty-tcnative#681 (comment)) which provide Netty's native libraries themselves.

Modification:

From now on Netty will look for all resources with the given name and throw an exception if there are more than one.
The user application needs to make sure that there is at most one provider of Netty's native libraries (netty-tcnative-**)

Result:

Fixes netty/netty-tcnative#681.