Conversation

@yahoNanJing
Contributor

Which issue does this PR close?

Closes #1060. It's a refactored version of PR #1062.

Rationale for this change

Currently, we can only read Parquet files from the local file system. It would be nice to add support for reading Parquet files that reside on HDFS.

What changes are included in this PR?

Introduced hdfs as one of the object stores. With the hdfs feature enabled, DataFusion depends on the fs-hdfs crate, which provides a Rust client for HDFS. We can then run some SQL-based unit tests against data on HDFS.

Are there any user-facing changes?

An example of querying data on HDFS can be seen in the hdfs.run_with_register_alltypes_parquet method in tests/sql.rs. First create a HadoopFileSystem, then register it for the hdfs scheme.
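For a feel of the flow, here is a minimal sketch (not taken verbatim from this PR) of registering the HDFS store and then querying a Parquet file that lives on HDFS. The HadoopFileSystem constructor, the import path, the hdfs:// URI, and the exact registration and query signatures are assumptions and may differ from the code in this branch.

```rust
// A minimal sketch, assuming the HadoopFileSystem type added by this PR and
// an object-store registration hook on ExecutionContext; names, paths, and
// signatures below are illustrative and may not match the branch exactly.
// use datafusion::datasource::object_store::hdfs::HadoopFileSystem; // import path assumed
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::*;

async fn query_parquet_on_hdfs() -> Result<()> {
    let mut ctx = ExecutionContext::new();

    // Route hdfs:// URIs to the Hadoop-backed object store.
    ctx.register_object_store("hdfs", Arc::new(HadoopFileSystem::new()));

    // Register a Parquet file that resides on HDFS and query it with SQL.
    ctx.register_parquet("alltypes", "hdfs://localhost:9000/data/alltypes_plain.parquet")
        .await?;
    ctx.sql("SELECT COUNT(*) FROM alltypes").await?.show().await?;
    Ok(())
}
```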

@alamb
Contributor

alamb commented Nov 4, 2021

Thank you very much @yahoNanJing .

I personally would like to see such integration / connector code live in other repos (rather than a feature flag on datafusion). My primary rationale is to keep DataFusion and Ballista development easier.

DataFusion and Ballista already carry fairly substantial overhead for anyone working on this codebase, so adding another integration will make testing / coding changes in this repo take longer.

For example, I don't know how to run / test hdfs locally. I tried on this branch:

-*- mode: compilation; default-directory: "~/Software/arrow-datafusion/" -*-
cargo test --features=hdfs  -p datafusion --test sql
   Compiling fs-hdfs v0.1.3
   Compiling parquet v6.0.0
error: failed to run custom build command for `fs-hdfs v0.1.3`

Caused by:
  process didn't exit successfully: `/Volumes/RAMDisk/df-target/debug/build/fs-hdfs-e451fc9b245fa518/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'JAVA_HOME shell environment must be set: environment variable not found', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/fs-hdfs-0.1.3/build.rs:132:13
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...

I am sure I could go figure out how to set up the HDFS dependency, but I don't have the time right now.

Putting the HDFS code in this repo will help ensure we keep DataFusion's API reasonably stable, as you mentioned on #1062 (comment), and effectively make the maintenance of the hdfs connector easier.

However, on balance, I selfishly would like to optimize for the maintenance of DataFusion as it currently exists rather than for the hdfs connector.

Thoughts @houqp @Dandandan @jimexist @rdettai ?

BTW, if there is consensus for putting this code into the arrow-datafusion repo, I would first like to see:

  1. CI coverage (aka testing that this works as part of the github checks -- I may have missed it but I don't think this is covered now)
  2. A developer's guide explaining how to configure / run the tests

@houqp
Member

houqp commented Nov 5, 2021

I also prefer maintaining these third-party-dependency-heavy extensions in separate repos. At the very least it should be managed as a separate crate. As @alamb already mentioned, the datafusion python binding and ballista crates we have right now are already slowing us down. We can link to these extensions in our readme and docs for better discovery, so it's still easy for users to add them. From the end user's point of view, there is not much difference between adding a new feature flag vs. adding a new dependency in Cargo.toml.

@yahoNanJing
Contributor Author

Hi @alamb and @houqp, I agree that the core of DataFusion should be kept as simple as possible. However, if we position it to be used in production environments with large-scale clusters, it's better to support remote shared storage as part of a storage/compute separation architecture.

If we don't include the connectors in DataFusion, it would still be better to provide an official repository for them. Otherwise, it might fragment the community.

By the way, regarding the register_store interface in ObjectStoreRegistry: there may be several ObjectStore instances with different endpoints for the same scheme, for example hdfs://localhost:60000 and hdfs://localhost:60001.
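To illustrate that point, here is a hypothetical sketch (not code from this PR or from DataFusion): a registry keyed only on the scheme can hold a single store per scheme, so two HDFS endpoints would collide unless the key also includes the authority.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for DataFusion's ObjectStore trait.
trait ObjectStore: Send + Sync {}

// Keyed only by scheme ("hdfs"): stores for hdfs://localhost:60000 and
// hdfs://localhost:60001 would overwrite each other.
struct SchemeOnlyRegistry {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

// Keyed by (scheme, authority): one store per endpoint.
struct SchemeAndAuthorityRegistry {
    stores: HashMap<(String, String), Arc<dyn ObjectStore>>,
}

impl SchemeAndAuthorityRegistry {
    // Returns the previously registered store for this endpoint, if any.
    fn register_store(
        &mut self,
        scheme: &str,
        authority: &str,
        store: Arc<dyn ObjectStore>,
    ) -> Option<Arc<dyn ObjectStore>> {
        self.stores
            .insert((scheme.to_string(), authority.to_string()), store)
    }
}
```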

@yjshen
Member

yjshen commented Nov 6, 2021

I also prefer to have source/sink connectors such as S3 and HDFS in a separate repo or even a separate org for fast iteration. Adding links to these connectors in DataFusion's README is sufficient for those interested in them.

Maybe it's time for us to create the org @houqp mentioned in #907 (comment), or the arrow-experiment-* org @andygrove mentioned in #907 (comment)?

@yjshen
Member

yjshen commented Nov 6, 2021

FYI, Apache Pulsar has a separate repo for connectors (https://github.com/apache/pulsar-connectors) and a hub hosting all ecosystem repos (https://hub.streamnative.io/).

@alamb
Contributor

alamb commented Nov 6, 2021

I also prefer to have source/sink connectors such as S3 and HDFS in a separate repo or even a separate org for fast iteration.

I think the separate org for fast iterations is an excellent point. Shall we make one?

@houqp
Member

houqp commented Nov 7, 2021

@yahoNanJing I agree with you that pooling these extensions into a single namespace would be better than splitting them across personal namespaces. I started a discussion thread on the dev mailing list to see if it's something we could do: https://mail-archives.apache.org/mod_mbox/arrow-dev/202111.mbox/browser.

@houqp
Member

houqp commented Nov 14, 2021

@yahoNanJing I have invited you to https://github.com/datafusion-contrib, let me know if you have access to create new repos there. If not, I can help create one for you.

@jimexist
Member

I see that the support for hdfs has been moved to the contrib org now, so this pull request can be closed?

@houqp
Member

houqp commented Nov 17, 2021

@yahoNanJing's version is based on the JVM HDFS implementation, I believe, so it's different from https://github.com/datafusion-contrib/datafusion-hdfs-native. I think there is still value in porting this PR into a self-contained extension repo in datafusion-contrib if @yahoNanJing is still interested in maintaining this code base as an alternative implementation.

@houqp closed this Nov 17, 2021