java.lang.UnsatisfiedLinkError when reading CSV from S3 by arrow's csv reader #46185

Open
@squalud

Description

Describe the bug, including details regarding any error messages, version, and platform.

I built Gluten from source using the following command, which also builds Arrow:
./dev/buildbundle-veloxbe.sh --enable_hdfs=ON --enable_s3=ON --enable_vcpkg=ON --spark_version=3.5

After a successful build, I ran PySpark using Arrow's S3 CSV reader to read a CSV file on S3, and got a java.lang.UnsatisfiedLinkError:

SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (xx.xx.xx.xx executor 1): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: java.lang.UnsatisfiedLinkError: /tmp/jnilib-17305424060615380389.tmp: /tmp/jnilib-17305424060615380389.tmp: undefined symbol: _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:388)
at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:232)
at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:174)
at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2394)
at java.base/java.lang.Runtime.load0(Runtime.java:755)
at java.base/java.lang.System.load(System.java:1970)
at org.apache.arrow.dataset.jni.JniLoader.load(JniLoader.java:92)
at org.apache.arrow.dataset.jni.JniLoader.loadRemaining(JniLoader.java:75)
at org.apache.arrow.dataset.jni.JniLoader.ensureLoaded(JniLoader.java:61)
at org.apache.arrow.dataset.jni.NativeMemoryPool.createListenable(NativeMemoryPool.java:44)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.<init>(ArrowNativeMemoryPool.java:34)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.createArrowNativeMemoryPool(ArrowNativeMemoryPool.java:47)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.lambda$arrowPool$0(ArrowNativeMemoryPool.java:42)
at org.apache.spark.task.TaskResourceRegistry.$anonfun$addResourceIfNotRegistered$1(Task...
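
For readability, the missing symbol in the trace above can be decoded by hand. The following is a minimal sketch, assuming the symbol follows the Itanium C++ ABI mangling used by GCC and Clang on Linux; `nested_name` is a hypothetical helper that handles only the simple length-prefixed form (no templates, operators, or substitutions), not a general demangler. It recovers the qualified function name; the trailing `RKNS0_5Model20CreateSessionRequestE` part encodes the parameter `Aws::S3::Model::CreateSessionRequest const&`.

```python
import re


def nested_name(mangled: str) -> str:
    """Decode the qualified name from a simple `_ZN[K]<len><id>...E` symbol."""
    if not mangled.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    # Skip CV-qualifiers of the member function (K = const, V = volatile, r = restrict).
    while mangled[pos] in "rVK":
        pos += 1
    parts = []
    # Each component is a decimal length followed by that many characters.
    while mangled[pos] != "E":
        m = re.match(r"\d+", mangled[pos:])
        if m is None:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(m.group())
        start = pos + m.end()
        parts.append(mangled[start:start + length])
        pos = start + length
    return "::".join(parts)


symbol = "_ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE"
print(nested_name(symbol))  # Aws::S3::S3Client::CreateSession
```

So the loader is failing to resolve the const member function `Aws::S3::S3Client::CreateSession` from the AWS SDK for C++.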

The following is my build log; every ARROW_S3 option in the build messages is set to ON:

......

+ pushd /workspace/incubator-gluten/dev/../ep/_ep/arrow_ep/cpp
/workspace/incubator-gluten/ep/_ep/arrow_ep/cpp /workspace/incubator-gluten/dev
+ cmake_install -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON

......

+ COMPILER_FLAGS='-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
+ cmake -Wno-dev -B_build -GNinja -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_CXX_STANDARD=17 '' '' '-DCMAKE_CXX_FLAGS=-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 ' -DBUILD_TESTING=OFF -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON

......

-- ---------------------------------------------------------------------
-- Arrow version:                                 15.0.0
--
-- Build configuration summary:
--   Generator: Ninja
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /usr/local
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON

......

--   ARROW_ACERO=OFF [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=OFF [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=OFF [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=OFF [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=OFF [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=OFF [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=OFF [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=ON [default=OFF]
--       Build the Arrow testing libraries

......

-- ---------------------------------------------------------------------
-- Arrow version:                                 15.0.0
--
-- Build configuration summary:
--   Generator: Unix Makefiles
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /workspace/incubator-gluten/ep/_ep/arrow_ep/java-dist
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON]
--       Build static libraries

......

-- Project component options:
--
--   ARROW_ACERO=ON [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=ON [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=ON [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=ON [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=ON [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=ON [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=ON [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=OFF [default=OFF]
--       Build the Arrow testing libraries
......
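
The log above contains two separate Arrow configuration summaries: one installing to /usr/local and one installing to the java-dist directory. Among the lines shown, both report ARROW_S3=ON, but they differ in ACERO, COMPUTE, CSV, DATASET, HDFS, JEMALLOC, SUBSTRAIT, and TESTING. A rough sketch for comparing such summaries mechanically is below; `parse_summary` and `diff_summaries` are hypothetical helper names, and the regular expression assumes the `--   ARROW_X=VALUE [default=...]` layout seen in this log.

```python
import re

# Matches summary lines such as "--   ARROW_S3=ON [default=OFF]".
OPTION_RE = re.compile(r"^--\s+(ARROW_\w+)=(\S+) \[default=")


def parse_summary(log_text: str) -> dict:
    """Collect ARROW_* option values from one CMake configuration summary."""
    options = {}
    for line in log_text.splitlines():
        m = OPTION_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
    return options


def diff_summaries(first: str, second: str) -> dict:
    """Return options present in both summaries whose values differ."""
    a, b = parse_summary(first), parse_summary(second)
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


# Shortened excerpts of the two summaries from the log above.
first_pass = """\
--   ARROW_CSV=OFF [default=OFF]
--   ARROW_DATASET=OFF [default=OFF]
--   ARROW_S3=ON [default=OFF]
"""
second_pass = """\
--   ARROW_CSV=ON [default=OFF]
--   ARROW_DATASET=ON [default=OFF]
--   ARROW_S3=ON [default=OFF]
"""
print(diff_summaries(first_pass, second_pass))
```

Since ARROW_S3 matches in both passes, the option summaries alone do not explain the unresolved symbol.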

When I log in to the Spark executor and check with the nm command, the symbol is listed as undefined (U):

$ nm -D /tmp/jnilib-17305424060615380389.tmp |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
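
To list every unresolved AWS symbol rather than grepping for one, the `nm -D` output can be filtered in a few lines. This is only a sketch: `undefined_symbols` is a hypothetical helper, and it assumes the usual GNU nm layout in which undefined entries have no address column.

```python
def undefined_symbols(nm_output: str, needle: str = "") -> list:
    """Return names of undefined (type 'U') symbols containing `needle`."""
    result = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries have two fields: the type letter and the name.
        # Defined entries have three: address, type letter, name.
        if len(fields) == 2 and fields[0] == "U" and needle in fields[1]:
            result.append(fields[1])
    return result


sample = (
    "                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE\n"
    "0000000000001000 T some_defined_function\n"
)
print(undefined_symbols(sample, "Aws"))
```

In practice the input would be the captured output of `nm -D` on the library; `some_defined_function` in the sample is made up for illustration.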

After extracting the Gluten jar, I found these libraries:

$ find ./ -name *.so
./linux/amd64/libvelox.so
./linux/amd64/libgluten.so
./x86_64/libarrow_cdata_jni.so
./x86_64/libarrow_dataset_jni.so
$ nm -D x86_64/libarrow_dataset_jni.so |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
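
The load failure can probably be reproduced outside the JVM, which makes it quicker to test a rebuilt library. A minimal sketch using Python's `ctypes` follows; `try_load` is a hypothetical helper, and `os.RTLD_NOW` is used on the assumption that forcing eager symbol resolution surfaces the same "undefined symbol" error the JVM reports.

```python
import ctypes
import os
from typing import Optional


def try_load(path: str) -> Optional[str]:
    """Try to dlopen a shared library; return the error text, or None on success."""
    try:
        # RTLD_NOW asks the dynamic linker to resolve all symbols immediately.
        ctypes.CDLL(path, mode=os.RTLD_NOW)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Path taken from the extracted jar listing above.
    print(try_load("x86_64/libarrow_dataset_jni.so"))
```

If the library loads cleanly this prints `None`; otherwise it prints the dynamic linker's message.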

It looks like the aws-cpp-sdk-s3 library is not statically linked in. Or do I need to install the AWS SDK libraries manually in my Dockerfile?

How can I work around this?

Component(s)

C++
