Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#2059] feat(client-python): Support Gravitino Virtual FileSystem in Python #3528

Merged
merged 21 commits into from
Jun 18, 2024

Conversation

xloya
Copy link
Contributor

@xloya xloya commented May 23, 2024

What changes were proposed in this pull request?

Support Gravitino Virtual File System in Python so that we can read and write Fileset storage data. The first PR only supports HDFS.

After research, the following popular cloud storages or companies have implemented their own FileSystem based on fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html):

  1. S3(https://github.com/fsspec/s3fs)
  2. Azure(https://github.com/fsspec/adlfs)
  3. Gcs(https://github.com/fsspec/gcsfs)
  4. OSS(https://github.com/fsspec/ossfs)
  5. Databricks(https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py)
  6. Snowflake(https://github.com/snowflakedb/snowflake-ml-python),

So this PR will implement GVFS based on the fsspec interface.

Why are the changes needed?

Fix: #2059

How was this patch tested?

Add some UTs and ITs.

@xloya xloya force-pushed the support-python-virtual-filesystem branch from 2420ab8 to 18dcf58 Compare May 23, 2024 09:05
@xloya xloya force-pushed the support-python-virtual-filesystem branch 4 times, most recently from 52b038a to 230b4d4 Compare May 24, 2024 07:55
@xloya xloya force-pushed the support-python-virtual-filesystem branch 2 times, most recently from 08f8477 to c25726c Compare May 27, 2024 07:30
@noidname01
Copy link
Collaborator

Hi @xloya I just found that fsspec have their wrapper for PyArrow and HadoopFileSystem, maybe we can develop our HDFS solution based on theirs?

@xloya
Copy link
Contributor Author

xloya commented May 28, 2024

Hi @xloya I just found that fsspec have their wrapper for PyArrow and HadoopFileSystem, maybe we can develop our HDFS solution based on theirs?

Yeah, actually current implementation is customized based on the ArrowWrapper provided by fsspec and HadoopFileSystem provided by PyArrow.

@noidname01
Copy link
Collaborator

Hi @xloya I just found that fsspec have their wrapper for PyArrow and HadoopFileSystem, maybe we can develop our HDFS solution based on theirs?

Yeah, actually current implementation is customized based on the ArrowWrapper provided by fsspec and HadoopFileSystem provided by PyArrow.

Yeah, but why not just use their implementation and add our features (ex. lock, custom handlers) on them?

@xloya
Copy link
Contributor Author

xloya commented May 28, 2024

Hi @xloya I just found that fsspec have their wrapper for PyArrow and HadoopFileSystem, maybe we can develop our HDFS solution based on theirs?

Yeah, actually current implementation is customized based on the ArrowWrapper provided by fsspec and HadoopFileSystem provided by PyArrow.

Yeah, but why not just use their implementation and add our features (ex. lock, custom handlers) on them?

ArrowWrapper some logics will call the self methods, but our GVFS need check the path in the args if valid (such as start with gvfs://), this may cause a path exception, so we need customize the code.

@xloya
Copy link
Contributor Author

xloya commented May 28, 2024

Btw, the code is not complete yet, I need to do more tests about the interface compatibility.

@noidname01
Copy link
Collaborator

Btw, the code is not complete yet, I need to do more tests about the interface compatibility.

Sure, would love to help if needed.

@jerryshao
Copy link
Contributor

Does it support read/write now? I would like to do a local test to see the current status.

@xloya
Copy link
Contributor Author

xloya commented May 28, 2024

Does it support read/write now? I would like to do a local test to see the current status.

Currently I have tested the following interfaces and they can all be executed normally with the fileset mounts on HDFS:
Configure these in the env:

export HADOOP_HOME={HADOOP_HOME}/pack
export HADOOP_CONF_DIR={HADOOP_HOME}/conf
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

The python script:

from gravitino import gvfs

if __name__ == "__main__":
    server_uri = "http://localhost:8090"
    metalake_name = "test_metalake"
    fs = gvfs.GravitinoVirtualFileSystem(server_uri=server_uri, metalake_name=metalake_name)
    print(fs.ls(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset"))
    print(fs.info(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset"))
    print(fs.exists(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset"))
    with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test3.txt", mode="wb") as output:
        print(output.writable())
        output.write(b"hello world")

    with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test3.txt", mode="rb") as input:
        print(input.readable())
        print(input.read(1024))

@xloya xloya force-pushed the support-python-virtual-filesystem branch 4 times, most recently from a9abe92 to 8156fe6 Compare June 4, 2024 09:29
@xloya xloya marked this pull request as ready for review June 4, 2024 11:13
@xloya
Copy link
Contributor Author

xloya commented Jun 4, 2024

Currently, Python integration testings lacks the support for some Docker environments, so I am considering not adding HDFS integration testings to the current PR, but supporting it separately in another PR (#3760) to reduce the difficulty of code review.

@xloya
Copy link
Contributor Author

xloya commented Jun 4, 2024

@jerryshao @xunliu @shaofengshi @noidname01 This PR is ready for review, could you take a look when you have time? Thanks.

@jerryshao
Copy link
Contributor

I will finish the review before this weekend, thanks a lot @xloya for your work.

@jerryshao
Copy link
Contributor

@noidname01 can you please also help to review?

@noidname01
Copy link
Collaborator

@noidname01 can you please also help to review?

Sure, I will review on this weekend.

@xloya xloya force-pushed the support-python-virtual-filesystem branch from 3881a31 to 7c87f33 Compare June 12, 2024 10:23
"test_pandas", f"{_fileset_dir}/test_pandas"
),
)
def test_pandas(self, mock_method1, mock_method2, mock_method3, mock_method4):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some unit tests for engines' apis like pandas, pyarrow, llama_index.

@xloya
Copy link
Contributor Author

xloya commented Jun 13, 2024

@jerryshao @noidname01 Have addressed the comments, could you take a look again? Thanks.

@noidname01
Copy link
Collaborator

LGTM, @jerryshao WDYT?

@jerryshao
Copy link
Contributor

Sorry for late response, I need to take another round of review, will finish this before Tuesday.

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
licenses/cachetools.txt Outdated Show resolved Hide resolved
clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
readerwriterlock==1.0.9
fsspec==2024.3.1
pyarrow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to specify the version of pyarrow here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pyarrow version is already specified in requirement-dev.txt, which I think is no longer needed here.

@@ -6,3 +6,7 @@ pylint==3.2.2
black==24.4.2
twine==5.1.0
coverage==7.5.1
pandas==2.0.3
pyarrow==15.0.2
llama-index==0.10.40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need package like pandas, llama-index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because unit tests and integration tests will include tests with these third-party libs' APIs. In order to ensure the stability of unit tests and integration tests, the corresponding feasible versions are specified here. I'm not sure if we need to add tests for integration with third-party libs.

LICENSE Outdated Show resolved Hide resolved
xiaojiebao added 2 commits June 18, 2024 15:20
Copy link
Contributor

@jerryshao jerryshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @xloya for your contribution.

@jerryshao jerryshao merged commit cddef88 into apache:main Jun 18, 2024
13 checks passed
shaofengshi pushed a commit to shaofengshi/gravitino that referenced this pull request Jun 24, 2024
…em in Python (apache#3528)

### What changes were proposed in this pull request?

Support Gravitino Virtual File System in Python so that we can read and
write Fileset storage data. The first PR only supports HDFS.

After research, the following popular cloud storages or companies have
implemented their own FileSystem based on
fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html):
1. S3(https://github.com/fsspec/s3fs)
2. Azure(https://github.com/fsspec/adlfs)
3. Gcs(https://github.com/fsspec/gcsfs)
4. OSS(https://github.com/fsspec/ossfs)
5.
Databricks(https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py)
6. Snowflake(https://github.com/snowflakedb/snowflake-ml-python), 

So this PR will implement GVFS based on the fsspec interface.

### Why are the changes needed?

Fix: apache#2059 

### How was this patch tested?

Add some UTs and ITs.

---------

Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
jerryshao pushed a commit that referenced this pull request Jul 5, 2024
…tion tests (#3876)

### What changes were proposed in this pull request?

Add Hive Docker env for client-python, and add integration tests for
PyGVFS + HDFS. Depends on #3528.

### Why are the changes needed?

Fix: #3760 

### How was this patch tested?

Add some ITs.

---------

Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
coolderli pushed a commit that referenced this pull request Dec 17, 2024
…Python (#3528)

### What changes were proposed in this pull request?

Support Gravitino Virtual File System in Python so that we can read and
write Fileset storage data. The first PR only supports HDFS.

After research, the following popular cloud storages or companies have
implemented their own FileSystem based on
fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html):
1. S3(https://github.com/fsspec/s3fs)
2. Azure(https://github.com/fsspec/adlfs)
3. Gcs(https://github.com/fsspec/gcsfs)
4. OSS(https://github.com/fsspec/ossfs)
5.
Databricks(https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py)
6. Snowflake(https://github.com/snowflakedb/snowflake-ml-python),

So this PR will implement GVFS based on the fsspec interface.

### Why are the changes needed?

Fix: #2059

### How was this patch tested?

Add some UTs and ITs.

---------

Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
(cherry picked from commit cddef88)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] Add Python Arrow FileSystem implementation for fileset.
4 participants