
[FEATURE] Add Gravitino Hadoop File System #1616

Closed
Tracked by #1241
xloya opened this issue Jan 19, 2024 · 7 comments · Fixed by #1700

xloya commented Jan 19, 2024

Describe the feature

Once Gravitino supports managing filesets, we can provide a Gravitino File System for storage systems that can be read and written through the Hadoop File System interface. This file system would access the Gravitino server to obtain the actual storage location and act as a proxy file system in front of those storage systems. That way, engines such as Spark can easily access and use filesets.
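For illustration only, here is a rough sketch of what such a proxy could look like on top of Hadoop's FilterFileSystem, which already delegates every I/O call to a wrapped file system. The Gravitino lookup is stubbed with a hard-coded location, and the class name and hdfs:// path are hypothetical, not the eventual design.

```java
// Rough sketch only: a proxy file system built on Hadoop's FilterFileSystem so
// that every I/O call is delegated to the storage that really holds the data.
// The Gravitino lookup is stubbed with a hard-coded location; the class name
// and the hdfs:// path are hypothetical.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;

public class GravitinoProxyFileSystemSketch extends FilterFileSystem {

  @Override
  public String getScheme() {
    // The virtual scheme that engines such as Spark would use in paths.
    return "gvfs";
  }

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    // In the real feature this would call the Gravitino server to resolve the
    // fileset's actual storage location; here it is just a placeholder value.
    URI actualLocation = URI.create("hdfs://namenode:8020/warehouse/my_fileset");

    // Delegate all FileSystem operations (open, create, listStatus, ...) to the
    // underlying storage file system.
    this.fs = FileSystem.get(actualLocation, conf);
    super.initialize(name, conf);
  }
}
```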

Motivation

No response

Describe the solution

No response

Additional context

No response

xloya commented Jan 19, 2024

@jerryshao What do you think about this issue? Would it be possible to implement a File System framework first, and then link it with the server later?

@jerryshao

I think we can have a prototype/skeleton first, but I'm afraid it cannot work until the fileset REST API is ready.

jerryshao commented Jan 19, 2024

You can put together a basic design (so we can discuss it in more detail) if you want to take a shot 😄.

xloya commented Jan 19, 2024

Sure, I can open a prototype patch next week; please take a look then.

jerryshao added this to the Gravitino 0.5.0 milestone Feb 4, 2024
jerryshao pushed a commit that referenced this issue Mar 22, 2024
### What changes were proposed in this pull request?

This PR proposes to add the code skeleton for the Gravitino Hadoop file system, which proxies the HDFS file system.

### Why are the changes needed?

Fix: #1616 

### How was this patch tested?

Add unit tests to cover the main interface methods.

---------

Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
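As a hedged illustration of how a Hadoop-compatible file system like this is typically picked up by clients: Hadoop's generic fs.<scheme>.impl property binds a scheme to an implementation class, after which the standard FileSystem API works against virtual paths. The class name and the catalog/schema/fileset path layout below are hypothetical.

```java
// Illustration of the generic Hadoop wiring for a custom scheme: fs.gvfs.impl
// binds "gvfs" to an implementation class, and the standard FileSystem API is
// then used with virtual paths. Class name and path layout are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GvfsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bind the gvfs scheme to the proxy implementation (hypothetical class name).
    conf.set("fs.gvfs.impl", "com.example.gravitino.GravitinoProxyFileSystemSketch");

    // A virtual path; the proxy resolves it to the fileset's physical location.
    Path virtualPath = new Path("gvfs://fileset/my_catalog/my_schema/my_fileset/");

    FileSystem gvfs = virtualPath.getFileSystem(conf);
    for (FileStatus status : gvfs.listStatus(virtualPath)) {
      System.out.println(status.getPath());
    }
  }
}
```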
coolderli pushed a commit to coolderli/gravitino that referenced this issue Apr 2, 2024 (apache#1700)
zuston commented Apr 30, 2024

From the doc, I don't find real cases showing how this gvfs is used. Is this a virtual unified namespace for different remote filesystems? And if using a managed fileset (is that a local file on the Gravitino server?), how can Spark on YARN access it?

@jerryshao

@zuston the usage of gvfs is exactly the same as using HDFS; the major difference is that the path is a virtual path, not a physical path. You can use it from Spark: it is a Hadoop-compatible filesystem, just like hdfs, s3, oss, and others.

Here's the doc; it has more details on how to use it with Spark: https://datastrato.ai/docs/0.5.0/how-to-use-gvfs.
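To make "use it from Spark like hdfs" concrete, here is a minimal sketch, assuming the Gravitino file system jar and its Hadoop configuration (server address, gvfs scheme binding, and so on) are already set up as described in that doc; the gvfs path layout shown is illustrative, not authoritative.

```java
// Minimal sketch of reading a fileset through gvfs from Spark, assuming the
// Gravitino file system jar and its Hadoop configuration are already in place.
// The gvfs path layout below is illustrative only.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFilesetWithSpark {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("gvfs-read-example")
        .getOrCreate();

    // The virtual path is used exactly where an hdfs:// or s3a:// path would go.
    Dataset<Row> df = spark.read()
        .parquet("gvfs://fileset/my_catalog/my_schema/my_fileset/");

    df.show();
    spark.stop();
  }
}
```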

zuston commented Apr 30, 2024

Yes, I have seen the doc and understand the HCFS interface that gvfs uses. I want to know the design motivation behind gvfs because I'm evaluating whether it could be used as a unified namespace. From the current doc, this feature only provides a one-to-one mapping to a remote filesystem.
