
Support jvm version libhdfs in velox #9835

Open · wants to merge 4 commits into base: main

Conversation

JkSelf
Collaborator

@JkSelf JkSelf commented May 16, 2024

Currently, Gluten throws HDFS connection failures when executing queries if the user's HDFS deployment uses Kerberos authentication or ViewFs. This is because the existing libhdfs3 API supports neither, whereas the JVM version of libhdfs can invoke APIs that support both. If the HADOOP_HOME environment variable is set on the user's system, the JVM version of libhdfs is used when compiling Gluten; otherwise, the default libhdfs3 is used instead.
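A minimal sketch of how this HADOOP_HOME-based selection could look at configure time. This is illustrative only, assuming hypothetical library locations and the definition names used elsewhere in this PR; it is not the PR's actual CMake code:

```cmake
# Sketch (assumption): pick the JVM libhdfs when HADOOP_HOME is set,
# otherwise fall back to libhdfs3. Paths are typical defaults, not
# guaranteed on every system.
if(DEFINED ENV{HADOOP_HOME})
  set(HADOOP_HOME $ENV{HADOOP_HOME})
  find_library(
    LIBHDFS
    NAMES libhdfs.so libhdfs.dylib
    HINTS "${HADOOP_HOME}/lib/native" REQUIRED)
  add_definitions(-DVELOX_ENABLE_HDFS)
else()
  find_library(
    LIBHDFS3
    NAMES libhdfs3.so libhdfs3.dylib REQUIRED)
  add_definitions(-DVELOX_ENABLE_HDFS3)
endif()
```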

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2024

netlify bot commented May 16, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit db2f658
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67122fb37d074f00080ebc96

@JkSelf
Collaborator Author

JkSelf commented May 16, 2024

@mbasmanova @majetideepak Can you help to review? Thanks.

@mbasmanova
Contributor

@JkSelf CI is red. Please, take a look.

CMakeLists.txt Outdated
find_library(
LIBJVM
NAMES libjvm.so
HINTS "${JAVA_HOME}/jre/lib/amd64/server/" REQUIRED)
Contributor

Is the Java path the same on all platforms Velox supports? What about macOS?
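One way to address this concern is to make the hint paths platform-conditional. A sketch only; the listed directories are common JDK layouts (older Linux JDK 8 vs. newer JDKs and macOS), not guaranteed locations:

```cmake
# Sketch (assumption): try several typical libjvm locations per platform.
# "NAMES jvm" lets find_library resolve libjvm.so or libjvm.dylib.
if(APPLE)
  set(JVM_HINTS "${JAVA_HOME}/jre/lib/server" "${JAVA_HOME}/lib/server")
else()
  set(JVM_HINTS "${JAVA_HOME}/jre/lib/amd64/server" "${JAVA_HOME}/lib/server")
endif()
find_library(
  LIBJVM
  NAMES jvm
  HINTS ${JVM_HINTS} REQUIRED)
```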

arrow)
target_include_directories(velox_hdfs PRIVATE ${HADOOP_HOME}/include/)
if(${VELOX_BUILD_TESTING})
add_subdirectory(tests)
Contributor

Might we call add_subdirectory(tests) twice when both VELOX_ENABLE_HDFS and VELOX_ENABLE_HDFS3 are ON?

@@ -31,12 +35,17 @@ class HdfsFileSystem::Impl {
hdfsBuilderSetNameNode(builder, endpoint.host.c_str());
hdfsBuilderSetNameNodePort(builder, atoi(endpoint.port.data()));
hdfsClient_ = hdfsBuilderConnect(builder);
#ifdef VELOX_ENABLE_HDFS3
Contributor

Which definition has higher priority? Do we need to check whether these two definitions conflict?

#elif VELOX_ENABLE_HDFS

auto errMsg =
fmt::format("Unable to get file path info for file: {}", filePath_);
Contributor

Use this format VELOX_FAIL("Write past Buffer capacity() {}", capacity_);

@@ -33,19 +37,28 @@ struct HdfsFile {
void open(hdfsFS client, const std::string& path) {
client_ = client;
handle_ = hdfsOpenFile(client, path.data(), O_RDONLY, 0, 0, 0);
#ifdef VELOX_ENABLE_HDFS3
VELOX_CHECK_NOT_NULL(
handle_,
"Unable to open file {}. got error: {}",
path,
hdfsGetLastError());
Contributor

Can we just wrap hdfsGetLastError so that it returns "" (or a message like "cannot get error message without VELOX_ENABLE_HDFS3") when VELOX_ENABLE_HDFS3 is not defined? Then we wouldn't need to modify so much code.
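A sketch of what this suggestion could look like; `lastHdfsError` is a hypothetical helper name. The backend #ifdef lives in one place, so call sites such as the VELOX_CHECK_NOT_NULL above could always pass the helper's result without per-call conditional compilation:

```cpp
#include <string>

// Hypothetical helper: centralize the backend-specific error lookup.
inline std::string lastHdfsError() {
#ifdef VELOX_ENABLE_HDFS3
  return hdfsGetLastError();  // libhdfs3 exposes a last-error API
#else
  // The JVM libhdfs has no equivalent; return an empty message.
  return "";
#endif
}
```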

@@ -19,8 +19,18 @@ add_library(velox_hdfs RegisterHdfsFileSystem.cpp)
if(VELOX_ENABLE_HDFS)
target_sources(velox_hdfs PRIVATE HdfsFileSystem.cpp HdfsReadFile.cpp
HdfsWriteFile.cpp)
target_link_libraries(velox_hdfs Folly::folly ${LIBHDFS3} xsimd)
target_link_libraries(velox_hdfs Folly::folly ${LIBHDFS} ${LIBJVM} xsimd
arrow)
Contributor

Why do we need to link arrow?

@assignUser
Collaborator

The C++ libhdfs3 is unmaintained (no commits in 3+ years, and hawq is moving into the attic, the ASF 'archive'), so I think this should replace it entirely! @pedroerp @kgpai we talked about this before. I think this is what Arrow does as well, but I haven't had a look yet.

@kgpai
Contributor

kgpai commented May 22, 2024

cc: @majetideepak

@majetideepak
Collaborator

majetideepak commented May 23, 2024

It seems Arrow's HDFS filesystem depends on an hdfs dynamic library being present on the system and dynamically links against it. I see this PR does something similar as well.
https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114

get_potential_libhdfs_paths()
get_potential_libjvm_paths()

@majetideepak
Collaborator

We do dynamically link against LIBHDFS3 (hawq) as well. It should not hurt to support it?
We should avoid the multiple ifdefs in this PR.

@assignUser
Collaborator

It should not hurt to support it?

Well it is unmaintained so it shouldn't be the default and we should probably mark it deprecated and remove it in the future.

@majetideepak
Collaborator

Well it is unmaintained so it shouldn't be the default and we should probably mark it deprecated and remove it in the future.

Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.

@JkSelf
Collaborator Author

JkSelf commented May 27, 2024

It seems Arrow's HDFS filesystem depends on some hdfs dynamic library to be present in the system and dynamically links against it. I see this PR does something similar as well. https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114

get_potential_libhdfs_paths()
get_potential_libjvm_paths()

@majetideepak Yes, that makes sense to me. If we proceed this way, we can avoid the need to ensure that the user's system has HDFS and JVM installed during the build phase.

@JkSelf
Collaborator Author

JkSelf commented May 27, 2024

Well it is unmaintained so it shouldn't be the default and we should probably mark it deprecated and remove it in the future.

Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.

  • In terms of performance, the hawq implementation is generally considered superior to JVM libhdfs because it is designed for performance and avoids the extra overhead of JNI. However, functionality is equally important to consider: the hawq implementation currently does not meet user needs in certain areas, such as support for Kerberos authentication and ViewFs.

  • If there is a preference to retain the hawq implementation for its performance, we could offer two compilation options, VELOX_ENABLE_HDFS and VELOX_ENABLE_HDFS3, to support JVM libhdfs and the hawq implementation respectively. However, maintaining both options might require multiple ifdef statements to determine which HDFS class to load.
    @majetideepak @assignUser What do you think? Thanks.

@assignUser
Collaborator

In terms of performance, hawq implementation is generally considered to be superior t

While better performance is of course great, in this case it comes with the considerable drawback that the library in question has been unmaintained for several years and the parent project is currently in the process of being 'archived' as well. My vote would be to remove it entirely but I am not a user so I'll let others decide :)

@JkSelf
Collaborator Author

JkSelf commented May 29, 2024

@assignUser Agree to remove the hawq implementation entirely. @majetideepak What do you think?

@majetideepak
Collaborator

Agree to remove the hawq implementation entirely.

I am not a user as well :). Let's open a discussion on this and get input from the wider community.

@JkSelf
Collaborator Author

JkSelf commented May 29, 2024

@assignUser @majetideepak Thank you for your response. I have initiated a poll in the community at #9969. Your participation and vote would be greatly appreciated.

CMakeLists.txt Outdated
find_library(
LIBHDFS3
NAMES libhdfs3.so libhdfs3.dylib
HINTS "${CMAKE_SOURCE_DIR}/hawq/depends/libhdfs3/_build/src/" REQUIRED)
add_definitions(-DVELOX_ENABLE_HDFS3)
Contributor

use add_definitions(-DVELOX_ENABLE_HDFS)?

@FelixYBW
Contributor

@JkSelf can you resolve the conflict?

@wForget

wForget commented Sep 2, 2024

@JkSelf Thank you for your work. Is there any progress? Replacing libhdfs3 with libhdfs would be a really nice feature for me, since it would let us support more file systems based on Hadoop FileSystem.

@JkSelf JkSelf force-pushed the libhdfs branch 4 times, most recently from e7e3f29 to a77234d Compare September 25, 2024 07:09
@JkSelf
Collaborator Author

JkSelf commented Sep 25, 2024

@majetideepak I resolved all your comments. Can you help to review again? Thanks.

Collaborator

@majetideepak majetideepak left a comment

@JkSelf some comments.

@JkSelf
Collaborator Author

JkSelf commented Oct 10, 2024

@majetideepak Thanks for your review. I have resolved all your comments. Can you help to review again? Thanks.

Collaborator

@majetideepak majetideepak left a comment

@JkSelf two more comments. Thanks for the changes. The hdfs headers are cleaner now.

@JkSelf
Collaborator Author

JkSelf commented Oct 11, 2024

@majetideepak I have resolved all your comments. Can you help to review again? Thanks.

Collaborator

@majetideepak majetideepak left a comment

@JkSelf one last comment.

// This prevents a scenario where the first pread call in the
// checkReadErrorMessages method might succeed, while a subsequent call fails,
// leading to a mismatch in readFailErrorMessage.
sleep(10);
Collaborator

@JkSelf Why do we need to add this sleep now, when it was previously not present?
Instead of sleeping, can we query whether the miniCluster is down?
sleep is usually a recipe for non-deterministic failures and wasted CI time.
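A common alternative to a fixed sleep is polling the condition with a deadline. A minimal, generic sketch (the helper name and parameters are illustrative, not part of this PR); the condition passed in could be something like "is the miniCluster down?":

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical polling helper: retry a condition until it holds or a
// deadline passes, instead of sleeping for a fixed 10 seconds.
bool waitFor(
    const std::function<bool()>& cond,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(50)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!cond()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // timed out before the condition held
    }
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

This returns as soon as the condition holds, so the happy path costs milliseconds rather than a guaranteed 10 seconds of CI time, and the timeout bound makes failures deterministic.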

Collaborator Author

@majetideepak Yes. Updated. Can you help to review again? Thanks.

Collaborator

@majetideepak majetideepak left a comment

@JkSelf Some comments and Linux adapters CI is failing. Can you check? Thanks.

@JkSelf JkSelf force-pushed the libhdfs branch 3 times, most recently from efbe37b to 3f87095 Compare October 15, 2024 05:31
@JkSelf
Collaborator Author

JkSelf commented Oct 15, 2024

@majetideepak Can you help to review again? Thanks.

Collaborator

@majetideepak majetideepak left a comment

Thanks @JkSelf! I appreciate you addressing all the feedback.

@majetideepak
Collaborator

I will add the merge label once CI is passing.

Comment on lines +31 to +32
CMAKE_ARGS -DCMAKE_USE_OPENSSL=ON -DOPENSSL_ROOT_DIR=/usr/local/opt/openssl
-DOPENSSL_LIBRARIES=/usr/local/opt/openssl/lib)
Collaborator

@assignUser assignUser Oct 18, 2024

No hard-coding of paths, please. Does curl's CMake not find OpenSSL on its own?
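For reference, a sketch of the portable approach: let CMake discover OpenSSL rather than assuming a Homebrew prefix, then forward the discovered location. The exact way the result is passed to the external project is an assumption, not this PR's code:

```cmake
# Sketch (assumption): discover OpenSSL instead of hard coding
# /usr/local/opt/openssl, then forward the discovered root.
find_package(OpenSSL REQUIRED)
get_filename_component(OPENSSL_ROOT "${OPENSSL_INCLUDE_DIR}" DIRECTORY)
# e.g. CMAKE_ARGS -DCMAKE_USE_OPENSSL=ON -DOPENSSL_ROOT_DIR=${OPENSSL_ROOT}
```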

Comment on lines +42 to +49
velox_hdfs_insert_test
velox_file
velox_hdfs
velox_core
velox_exec_test_lib
velox_hive_connector
velox_dwio_common_exception
velox_exec
Collaborator

@assignUser assignUser Oct 18, 2024

nit: I only see folly, hdfs, gtest, and exec being used. Please link only the libraries that are used.
