java.lang.UnsatisfiedLinkError when reading CSV from S3 with Arrow's CSV reader

### Describe the bug, including details regarding any error messages, version, and platform.

I built Gluten from source with the following command, which also builds Arrow:
`./dev/buildbundle-veloxbe.sh --enable_hdfs=ON --enable_s3=ON --enable_vcpkg=ON --spark_version=3.5`

After the build succeeded, I ran PySpark and used Arrow's S3 CSV reader to read a CSV file on S3, and got a `java.lang.UnsatisfiedLinkError`:

```
SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (xx.xx.xx.xx executor 1): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: java.lang.UnsatisfiedLinkError: /tmp/jnilib-17305424060615380389.tmp: /tmp/jnilib-17305424060615380389.tmp: undefined symbol: _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:388)
at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:232)
at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:174)
at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2394)
at java.base/java.lang.Runtime.load0(Runtime.java:755)
at java.base/java.lang.System.load(System.java:1970)
at org.apache.arrow.dataset.jni.JniLoader.load(JniLoader.java:92)
at org.apache.arrow.dataset.jni.JniLoader.loadRemaining(JniLoader.java:75)
at org.apache.arrow.dataset.jni.JniLoader.ensureLoaded(JniLoader.java:61)
at org.apache.arrow.dataset.jni.NativeMemoryPool.createListenable(NativeMemoryPool.java:44)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.<init>(ArrowNativeMemoryPool.java:34)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.createArrowNativeMemoryPool(ArrowNativeMemoryPool.java:47)
at org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.lambda$arrowPool$0(ArrowNativeMemoryPool.java:42)
at org.apache.spark.task.TaskResourceRegistry.$anonfun$addResourceIfNotRegistered$1(Task...
```

The following is my build log, in which every `ARROW_S3` option in the build messages is set to `ON`:
```
......

+ pushd /workspace/incubator-gluten/dev/../ep/_ep/arrow_ep/cpp
/workspace/incubator-gluten/ep/_ep/arrow_ep/cpp /workspace/incubator-gluten/dev
+ cmake_install -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON

......

+ COMPILER_FLAGS='-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
+ cmake -Wno-dev -B_build -GNinja -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_CXX_STANDARD=17 '' '' '-DCMAKE_CXX_FLAGS=-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 ' -DBUILD_TESTING=OFF -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON

......

-- ---------------------------------------------------------------------
-- Arrow version:                                 15.0.0
--
-- Build configuration summary:
--   Generator: Ninja
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /usr/local
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON

......

--   ARROW_ACERO=OFF [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=OFF [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=OFF [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=OFF [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=OFF [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=OFF [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=OFF [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=ON [default=OFF]
--       Build the Arrow testing libraries

......

-- ---------------------------------------------------------------------
-- Arrow version:                                 15.0.0
--
-- Build configuration summary:
--   Generator: Unix Makefiles
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /workspace/incubator-gluten/ep/_ep/arrow_ep/java-dist
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON]
--       Build static libraries

......

-- Project component options:
--
--   ARROW_ACERO=ON [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=ON [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=ON [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=ON [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=ON [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=ON [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=ON [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=OFF [default=OFF]
--       Build the Arrow testing libraries
......

```

When I log in to the Spark executor and check the extracted library with `nm`, the symbol is listed as undefined (`U`):

```
$ nm -D /tmp/jnilib-17305424060615380389.tmp |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE

```

After extracting the Gluten jar, I found these libraries:

```
$ find ./ -name *.so
./linux/amd64/libvelox.so
./linux/amd64/libgluten.so
./x86_64/libarrow_cdata_jni.so
./x86_64/libarrow_dataset_jni.so
$ nm -D x86_64/libarrow_dataset_jni.so |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
```
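To check for other missing AWS symbols beyond `CreateSession`, the `nm` check above could be scripted. This is only a minimal sketch: the helper names `undefined_symbols` and `undefined_symbols_in` are made up for illustration, and it assumes `nm` from binutils is available on the executor.

```python
import subprocess


def undefined_symbols(nm_output, needle=""):
    """Return undefined ('U') symbols from `nm -D` text output.

    Undefined entries have no address column, so they split into
    exactly two fields: the type letter and the symbol name.
    """
    found = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U" and needle in parts[1]:
            found.append(parts[1])
    return found


def undefined_symbols_in(path, needle=""):
    """Run `nm -D` on a shared library and filter its undefined symbols."""
    out = subprocess.run(
        ["nm", "-D", path], check=True, capture_output=True, text=True
    ).stdout
    return undefined_symbols(out, needle)
```

For example, `undefined_symbols_in("x86_64/libarrow_dataset_jni.so", "Aws")` would list every undefined symbol whose mangled name contains `Aws`. Note that a `U` entry is normal for symbols resolved from other shared libraries at load time; it only becomes a problem when nothing loaded provides the symbol.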

It looks like the aws-cpp-sdk-s3 library is not statically linked in. Or do I need to install the AWS SDK libraries manually in my Dockerfile?
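One way to narrow this down might be to look at the library's dynamic section. The sketch below parses `readelf -d` output for `NEEDED` entries; the helper names are hypothetical and it assumes `readelf` from binutils is installed. If no AWS SDK shared library appears in the list while the symbol is still undefined, the loader has nothing to resolve it from, which would be consistent with the static S3 archive having been left out at link time (this is my guess, not a confirmed diagnosis).

```python
import re
import subprocess

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[(.+?)\]")


def needed_libraries(readelf_output):
    """Return the shared libraries named in NEEDED entries of `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def needed_libraries_in(path):
    """Run `readelf -d` on a shared library and list its NEEDED entries."""
    out = subprocess.run(
        ["readelf", "-d", path], check=True, capture_output=True, text=True
    ).stdout
    return needed_libraries(out)
```

For example, `needed_libraries_in("x86_64/libarrow_dataset_jni.so")` would show whether the JNI library expects any `libaws-cpp-sdk-*.so` to be present at runtime.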

How can I work around this?

### Component(s)

C++
