[Subtask] Add Python Arrow FileSystem implementation for fileset. #2059
Comments
@jerryshao I just created this subtask so we can discuss it in more detail here. I'm wondering whether we should support a Python client, since in the Python Arrow FileSystem we have to load the fileset and get its physical storage location. |
Yeah, I think so. We should have a Python client beforehand. |
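A minimal sketch of the "load the fileset and get the physical storage location" step mentioned above. The virtual path layout (`fileset/<catalog>/<schema>/<fileset>/<sub path>`), the function names, and the HDFS address are all assumptions made up for illustration; nothing here is the project's actual API. The `load_location` callable stands in for whatever client call ends up loading the fileset metadata.

```python
from typing import Callable

# Hypothetical virtual-path layout (an assumption, not taken from this thread):
#   fileset/<catalog>/<schema>/<fileset>/<sub/path>
VIRTUAL_PREFIX = "fileset/"


def to_physical_path(
    virtual_path: str,
    load_location: Callable[[str, str, str], str],
) -> str:
    """Map a virtual fileset path to a path under its physical storage location."""
    if not virtual_path.startswith(VIRTUAL_PREFIX):
        raise ValueError(f"not a fileset path: {virtual_path!r}")
    # Split into catalog, schema, fileset and an optional remainder.
    parts = virtual_path[len(VIRTUAL_PREFIX):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected catalog/schema/fileset in {virtual_path!r}")
    catalog, schema, fileset = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    # load_location stands in for the client call that loads fileset metadata.
    location = load_location(catalog, schema, fileset).rstrip("/")
    return f"{location}/{sub_path}" if sub_path else location


def fake_loader(catalog: str, schema: str, fileset: str) -> str:
    # Stand-in for a real client; returns a made-up HDFS location.
    return f"hdfs://namenode:8020/warehouse/{catalog}/{schema}/{fileset}/"
```

With a real client, `fake_loader` would be replaced by a call to the server; the path arithmetic stays the same regardless of which storage backend the location points to.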
@jerryshao I found the client module dependent on I think Ray engine(#1355) also needs a Python library. Are we already working on this? |
We planned to build the Python library, but we haven't kicked it off yet. My rough thinking: because we use a REST protocol to communicate with the server, we don't have to use py4j to bridge the Java code. One possible way is to write a Python counterpart. Also, since we are using a REST protocol, maybe we can build a model and use code generation tools to generate SDKs for different languages. |
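To illustrate the "pure Python counterpart over REST" idea, here is a rough sketch that uses only the standard library, with no py4j involved. The endpoint layout, the `fileset.storageLocation` response fields, and the server address are assumptions for illustration and should be checked against the server's actual REST specification.

```python
import json
import urllib.request
from urllib.parse import quote


def fileset_url(server: str, metalake: str, catalog: str, schema: str, fileset: str) -> str:
    """Build the URL for loading one fileset (endpoint layout is an assumption)."""
    parts = [
        "api", "metalakes", metalake,
        "catalogs", catalog,
        "schemas", schema,
        "filesets", fileset,
    ]
    return server.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


def parse_storage_location(body: bytes) -> str:
    """Extract the storage location from a JSON response (field names assumed)."""
    payload = json.loads(body)
    return payload["fileset"]["storageLocation"]


def load_storage_location(server: str, metalake: str, catalog: str, schema: str, fileset: str) -> str:
    """Fetch a fileset over plain HTTP and return its physical location."""
    request = urllib.request.Request(
        fileset_url(server, metalake, catalog, schema, fileset),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_storage_location(response.read())
```

Keeping URL building and response parsing separate from the network call makes both easy to unit test, and a generated SDK could later replace these helpers without changing callers.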
@jerryshao @xunliu @shaofengshi Hi, I'm coming to talk about how to implement Python gvfs. After talking with @xloya offline, I think we have two ways.
By the way, I saw the issue about the tensorflow-connector and pytorch-connector. Will we provide some advanced APIs for greater ease of use? Can you share your thoughts? Thanks. |
Some test code for solution 2; it has been preliminarily verified and works:
|
Thanks @coolderli @xloya for your input. As for how to support Arrow/Ray, I need to investigate further; please give me some time. Much appreciated. |
Hi @jerryshao @coolderli @xloya, I would love to help with this. Is there anything I can do? I have already tried Solution 2 above and it does work. |
I feel like solution 2 is a Python wrapper for the Hadoop client, while solution 1 is a pure Python solution, right? Solution 1 seems more generic to me; what do you think? |
According to my understanding and research, solution 2 reuses the capabilities of |
@xloya Yes. In solution 1, object storage such as S3 will use the |
I also think Solution 1 is better: the Hadoop configuration is redundant if we only need other storage like S3, and it depends on the Hadoop Native Libraries (a.k.a. |
@noidname01 Hi, we had a brief discussion on this issue with @jerryshao yesterday, and we agreed on solution 1. With this solution we can connect directly to storage SDKs such as S3 / OSS, but I think HDFS also needs to be supported. So I will investigate the feasibility of solution 1 first. If you are interested, please participate in the subsequent design and development. Thanks! |
@xloya Sounds great, I'm in👍 |
@jerryshao @noidname01 @coolderli Hi, I have opened a draft PR (#3528; the code is not complete yet) for this. I will implement GVFS based on the fsspec interfaces; several popular cloud storage providers and companies have also chosen this approach. I will support only HDFS at first, and extend it to more storage systems and auth types in follow-up sub-tasks (@noidname01, you could participate in these sub-tasks). Do you have any additional feedback on this? |
@xloya Can you please explain more about why you chose |
For example, Ray supports the PyArrow filesystem; I'm not sure it supports fsspec as well🤔 |
I considered it for the following reasons:
|
I think it is theoretically possible: judging from the fsspec and PyArrow documentation, the two file systems are compatible with each other. In addition, fsspec is natively supported by pandas, so I think Ray can analyze and process data through pandas. |
TensorFlow and PyTorch Lightning natively support fsspec-compatible file systems: |
@xloya do you have some documents about fsspec? Maybe we should discuss the pros and cons of fsspec, the PyArrow filesystem, and others in more detail. |
Yeah, I will post a doc tomorrow. |
@jerryshao @noidname01 @coolderli Hi, I have opened a document on the implementation choices. Please take a look and comment if you have any questions. |
…em in Python (apache#3528)

### What changes were proposed in this pull request?

Support Gravitino Virtual File System in Python so that we can read and write Fileset storage data. The first PR only supports HDFS. After research, the following popular cloud storages or companies have implemented their own FileSystem based on fsspec (https://filesystem-spec.readthedocs.io/en/latest/index.html):

1. S3 (https://github.com/fsspec/s3fs)
2. Azure (https://github.com/fsspec/adlfs)
3. Gcs (https://github.com/fsspec/gcsfs)
4. OSS (https://github.com/fsspec/ossfs)
5. Databricks (https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py)
6. Snowflake (https://github.com/snowflakedb/snowflake-ml-python)

So this PR will implement GVFS based on the fsspec interface.

### Why are the changes needed?

Fix: apache#2059

### How was this patch tested?

Add some UTs and ITs.

---------

Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
Describe the subtask
Add Python Arrow FileSystem implementation for fileset.
Parent issue
#1241