Add support of HDFS as remote object store #1223
Conversation
Thank you very much @yahoNanJing. I personally would like to see such integration / connector code live in other repos (rather than behind a feature flag on datafusion). My primary rationale is to keep DataFusion and Ballista development easier. DataFusion and Ballista already have fairly substantial overhead to working on this codebase, so adding another integration will make testing / coding changes in this repo take longer. For example, I don't know how to run / test hdfs locally. I tried on this branch, and I am sure I could go figure out how to set up the HDFS dependency, but I don't have the time right now.

Putting the HDFS code in this repo would help ensure we keep DataFusion's API reasonably stable, as you mentioned in #1062 (comment), and effectively make the maintenance of the hdfs connector easier. However, on balance, I selfishly would like to optimize for the maintenance of DataFusion as it currently exists rather than the hdfs connector. Thoughts @houqp @Dandandan @jimexist @rdettai? BTW, if there is consensus for putting this code into the arrow-datafusion repo, I would first like to see:
I also prefer us maintaining these 3rd-party dependency heavy extensions in separate repos. At the very least it should be managed as a separate crate. As @alamb already mentioned, the datafusion python binding and ballista crates we have right now are already slowing us down. We can link to these extensions in our readme and docs for better discovery, so it's still easy for users to add them. From the end user's point of view, there is not that much of a difference between adding a new feature flag vs. adding a new dependency in Cargo.toml.
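For illustration, the end-user difference might look roughly like this in Cargo.toml; the feature name, crate name, and version numbers below are hypothetical, not an actual published interface:

```toml
# Option A (hypothetical): hdfs support behind a feature flag of the datafusion crate
[dependencies]
datafusion = { version = "6.0", features = ["hdfs"] }

# Option B (hypothetical): hdfs support as a separately maintained extension crate
# datafusion = "6.0"
# datafusion-objectstore-hdfs = "0.1"
```

Either way it is a one or two line change in the user's manifest; the difference is mainly in who maintains and tests the hdfs code.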
Hi @alamb and @houqp, I agree that the core of DataFusion should be kept as simple as possible. However, if we position it to be used in production environments with large-scale clusters, it's better to support some remote shared storage as part of a storage and compute separation architecture. If we don't include the connectors in DataFusion, it's still better to provide an official repository for them. Otherwise, it might not be good for community unity. By the way, regarding the register_store interface in ObjectStoreRegistry, there may be several ObjectStore instances with different endpoints for the same scheme, for example hdfs://localhost:60000 and hdfs://localhost:60001.
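To make that concern concrete, here is a minimal self-contained sketch (not DataFusion's actual ObjectStoreRegistry; all type and method names below are made up for illustration) of a registry keyed by (scheme, authority) rather than by scheme alone, so that two HDFS namenodes with the same scheme do not collide:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for DataFusion's ObjectStore trait.
trait ObjectStore: Send + Sync {
    fn name(&self) -> &str;
}

struct HdfsStore {
    endpoint: String, // e.g. "localhost:60000"
}

impl ObjectStore for HdfsStore {
    fn name(&self) -> &str {
        &self.endpoint
    }
}

#[derive(Default)]
struct Registry {
    // Key by (scheme, authority), e.g. ("hdfs", "localhost:60000"),
    // instead of scheme alone, so multiple endpoints per scheme can coexist.
    stores: RwLock<HashMap<(String, String), Arc<dyn ObjectStore>>>,
}

impl Registry {
    fn register(&self, scheme: &str, authority: &str, store: Arc<dyn ObjectStore>) {
        self.stores
            .write()
            .unwrap()
            .insert((scheme.to_string(), authority.to_string()), store);
    }

    fn get_by_uri(&self, uri: &str) -> Option<Arc<dyn ObjectStore>> {
        // Tiny URI split: "hdfs://localhost:60000/path" -> ("hdfs", "localhost:60000")
        let (scheme, rest) = uri.split_once("://")?;
        let authority = rest.split('/').next().unwrap_or("");
        self.stores
            .read()
            .unwrap()
            .get(&(scheme.to_string(), authority.to_string()))
            .cloned()
    }
}

fn main() {
    let registry = Registry::default();
    registry.register("hdfs", "localhost:60000", Arc::new(HdfsStore { endpoint: "localhost:60000".into() }));
    registry.register("hdfs", "localhost:60001", Arc::new(HdfsStore { endpoint: "localhost:60001".into() }));

    let store = registry
        .get_by_uri("hdfs://localhost:60001/data/alltypes.parquet")
        .unwrap();
    println!("resolved store: {}", store.name()); // prints "localhost:60001"
}
```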
I also prefer to have source/sink connectors such as S3 and HDFS in a separate repo, or even in a separate org, for fast iteration. Adding links for these connectors in DataFusion's README is sufficient for those interested in them. Maybe it's time for us to create the org @houqp mentioned in #907 (comment), or the arrow-experiment-* repos @andygrove mentioned in #907 (comment)?
FYI. Apache Pulsar has a separate repo for connectors https://github.com/apache/pulsar-connectors and a hub hosting all ecosystem repos https://hub.streamnative.io/. |
I think the separate org for fast iterations is an excellent point. Shall we make one? |
@yahoNanJing I agree with you that pooling these extensions into a single namespace would be better than splitting into personal namespaces. I started a discussion thread in the dev mailing list to see if it's something we could do: https://mail-archives.apache.org/mod_mbox/arrow-dev/202111.mbox/browser. |
@yahoNanJing I have invited you to https://github.com/datafusion-contrib, let me know if you have access to create new repos there. If not, I can help create one for you. |
I see that the support for hdfs has been moved to the contrib org now; can this pull request be closed?
@yahoNanJing's version is based on the JVM hdfs implementation, I believe, so it's different from https://github.com/datafusion-contrib/datafusion-hdfs-native. I think there is still value in porting this PR into a self-contained extension repo under datafusion-contrib if @yahoNanJing is still interested in maintaining this code base as an alternative implementation.
Which issue does this PR close?
Closes #1060. It's a refactored version of PR #1062.
Rationale for this change
Currently, we can only read parquet files from the local file system. It would be nice to add support for reading parquet files that reside on HDFS.
What changes are included in this PR?
Introduced HDFS as one of the object stores. With the hdfs feature enabled, DataFusion depends on the fs-hdfs crate, which provides a Rust client for HDFS. We can then run some SQL-based unit tests against data stored on HDFS.
Are there any user-facing changes?
One example of querying data on HDFS can be seen in the hdfs.run_with_register_alltypes_parquet method in tests/sql.rs: first create a HadoopFileSystem, then register it with the hdfs scheme.
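As a rough illustration of that flow (the exact constructor, import path, and registration method signatures below are assumptions; the authoritative example is the test mentioned above):

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
// Hypothetical import path for the HDFS object store added by this PR.
use datafusion::datasource::object_store::hdfs::HadoopFileSystem;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();

    // 1. Create the HDFS object store and register it under the "hdfs" scheme.
    let hdfs = Arc::new(HadoopFileSystem::new());
    ctx.register_object_store("hdfs", hdfs);

    // 2. Register a parquet table whose files live on HDFS (path is illustrative).
    ctx.register_parquet("alltypes", "hdfs://localhost:9000/user/test/alltypes_plain.parquet")
        .await?;

    // 3. Query it like any other table.
    let df = ctx.sql("SELECT COUNT(*) FROM alltypes").await?;
    df.show().await?;
    Ok(())
}
```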