Conversation

@yahoNanJing
Contributor

Which issue does this PR close?

Closes #1060. It's a refactored version of PR #1062.

Rationale for this change

Currently, we can only read Parquet files from the local file system. It would be nice to add support for reading Parquet files that reside on HDFS.

What changes are included in this PR?

Introduced hdfs as one of the object stores. With the hdfs feature enabled, DataFusion depends on the fs-hdfs crate, which provides a Rust client for HDFS. We can then run some SQL-based unit tests against data on HDFS.

Are there any user-facing changes?

An example of querying data on HDFS can be seen in the hdfs.run_with_register_alltypes_parquet method in tests/sql.rs. First create a HadoopFileSystem, then register it for the hdfs scheme.
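For a feel of the flow, here is a minimal sketch (not taken verbatim from this PR) of registering the HDFS store and then querying a Parquet file that lives on HDFS. The HadoopFileSystem constructor, the import path, the hdfs:// URI, and the exact registration and query signatures are assumptions and may differ from the code in this branch.

```rust
// A minimal sketch, assuming the HadoopFileSystem type added by this PR and
// an object-store registration hook on ExecutionContext; names, paths, and
// signatures below are illustrative and may not match the branch exactly.
// use datafusion::datasource::object_store::hdfs::HadoopFileSystem; // import path assumed
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::*;

async fn query_parquet_on_hdfs() -> Result<()> {
    let mut ctx = ExecutionContext::new();

    // Route hdfs:// URIs to the Hadoop-backed object store.
    ctx.register_object_store("hdfs", Arc::new(HadoopFileSystem::new()));

    // Register a Parquet file that resides on HDFS and query it with SQL.
    ctx.register_parquet("alltypes", "hdfs://localhost:9000/data/alltypes_plain.parquet")
        .await?;
    ctx.sql("SELECT COUNT(*) FROM alltypes").await?.show().await?;
    Ok(())
}
```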

@alamb
Contributor

alamb commented Nov 4, 2021

Thank you very much @yahoNanJing .

I personally would like to see such integration / connector code live in other repos (rather than a feature flag on datafusion). My primary rationale is to keep DataFusion and Ballista development easier.

DataFusion and Ballista already carry fairly substantial overhead for anyone working on this codebase, so adding another integration will make testing / coding changes in this repo take longer.

For example, I don't know how to run / test hdfs locally. I tried on this branch:

-*- mode: compilation; default-directory: "~/Software/arrow-datafusion/" -*-
cargo test --features=hdfs  -p datafusion --test sql
   Compiling fs-hdfs v0.1.3
   Compiling parquet v6.0.0
error: failed to run custom build command for `fs-hdfs v0.1.3`

Caused by:
  process didn't exit successfully: `/Volumes/RAMDisk/df-target/debug/build/fs-hdfs-e451fc9b245fa518/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'JAVA_HOME shell environment must be set: environment variable not found', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/fs-hdfs-0.1.3/build.rs:132:13
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...

I am sure I could go figure out how to set up the HDFS dependency, but I don't have the time right now.

Putting the HDFS code in this repo will help ensure we keep DataFusion's API reasonably stable, as you mentioned on #1062 (comment), and effectively make the maintenance of the hdfs connector easier.

However, on balance, I selfishly would like to optimize for the maintenance of DataFusion as it currently exists rather than for the hdfs connector.

Thoughts @houqp @Dandandan @jimexist @rdettai ?

BTW, if there is consensus for putting this code into the arrow-datafusion repo, I would first like to see:

  1. CI coverage (aka testing that this works as part of the github checks -- I may have missed it but I don't think this is covered now)
  2. A developer's guide explaining how to configure / run the tests

@houqp
Member

houqp commented Nov 5, 2021

I also prefer maintaining these third-party-dependency-heavy extensions in separate repos. At the very least it should be managed as a separate crate. As @alamb already mentioned, the datafusion python binding and ballista crates we have right now are already slowing us down. We can link to these extensions in our readme and docs for better discovery, so it's still easy for users to add them. From the end user's point of view, there is not much difference between adding a new feature flag vs. adding a new dependency in Cargo.toml.

@yahoNanJing
Contributor Author

Hi @alamb and @houqp, I agree that the core of DataFusion should be kept as simple as possible. However, if we position it to be used in production environments with large-scale clusters, it's better to support remote shared storage as part of a storage/compute separation architecture.

If we don't include the connectors in DataFusion, it would still be better to provide an official repository for them. Otherwise, it might fragment the community.

By the way, regarding the register_store interface in ObjectStoreRegistry: there may be several ObjectStore instances with different endpoints for the same scheme, for example hdfs://localhost:60000 and hdfs://localhost:60001.
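To illustrate that point, here is a hypothetical sketch (not code from this PR or from DataFusion): a registry keyed only on the scheme can hold a single store per scheme, so two HDFS endpoints would collide unless the key also includes the authority.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for DataFusion's ObjectStore trait.
trait ObjectStore: Send + Sync {}

// Keyed only by scheme ("hdfs"): stores for hdfs://localhost:60000 and
// hdfs://localhost:60001 would overwrite each other.
struct SchemeOnlyRegistry {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

// Keyed by (scheme, authority): one store per endpoint.
struct SchemeAndAuthorityRegistry {
    stores: HashMap<(String, String), Arc<dyn ObjectStore>>,
}

impl SchemeAndAuthorityRegistry {
    // Returns the previously registered store for this endpoint, if any.
    fn register_store(
        &mut self,
        scheme: &str,
        authority: &str,
        store: Arc<dyn ObjectStore>,
    ) -> Option<Arc<dyn ObjectStore>> {
        self.stores
            .insert((scheme.to_string(), authority.to_string()), store)
    }
}
```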

@yjshen
Member

yjshen commented Nov 6, 2021

I also prefer to have source/sink connectors such as S3 and HDFS in a separate repo or even a separate org for fast iteration. Adding links to these connectors in DataFusion's README is sufficient for those interested in them.

Maybe it's time for us to create the org @houqp mentioned in #907 (comment), or the arrow-experiment-* org @andygrove mentioned in #907 (comment)?

@yjshen
Member

yjshen commented Nov 6, 2021

FYI, Apache Pulsar has a separate repo for connectors (https://github.com/apache/pulsar-connectors) and a hub hosting all ecosystem repos (https://hub.streamnative.io/).

@alamb
Contributor

alamb commented Nov 6, 2021

I also prefer to have source/sink connectors such as S3 and HDFS in a separate repo or even a separate org for fast iteration.

I think the separate org for fast iterations is an excellent point. Shall we make one?

@houqp
Member

houqp commented Nov 7, 2021

@yahoNanJing I agree with you that pooling these extensions into a single namespace would be better than splitting them across personal namespaces. I started a discussion thread on the dev mailing list to see if it's something we could do: https://mail-archives.apache.org/mod_mbox/arrow-dev/202111.mbox/browser.

@houqp
Member

houqp commented Nov 14, 2021

@yahoNanJing I have invited you to https://github.com/datafusion-contrib, let me know if you have access to create new repos there. If not, I can help create one for you.

@jimexist
Member

I see that the support for hdfs has been moved to the contrib org now, so this pull request can be closed?

@houqp
Member

houqp commented Nov 17, 2021

@yahoNanJing's version is based on the JVM HDFS implementation, I believe, so it's different from https://github.com/datafusion-contrib/datafusion-hdfs-native. I think there is still value in porting this PR into a self-contained extension repo in datafusion-contrib if @yahoNanJing is still interested in maintaining this code base as an alternative implementation.

@houqp closed this Nov 17, 2021