[#4278] feat(filesystem): Refactor the `getFileLocation` logics in hadoop GVFS #4320

xloya · 2024-07-31T08:39:18Z

What changes were proposed in this pull request?

Refactor the getFileLocation logic in Hadoop GVFS so that when sending a request to the server to obtain the file location, the current data operation and operation path information are reported.

Why are the changes needed?

Fix: #4278

How was this patch tested?

Added and refactored some UTs; the existing ITs continue to pass.

xloya · 2024-09-13T07:01:30Z

@jerryshao Please take a look at this PR when you have time, thanks. Since some logic in GVFS needs to be moved into the server-side Hadoop catalog, I did not split it into another PR, so that the logical changes stay clear.

xloya · 2024-09-20T03:18:45Z

Gentle ping @jerryshao.

jerryshao · 2024-09-19T14:20:56Z

...atalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java

+      return locationPath.getFileSystem(hadoopConf).getFileStatus(locationPath).isFile();
+    } catch (FileNotFoundException e) {
+      // We should always return false here, same with the logic in `FileSystem.isFile(Path f)`.
+      return false;


I'm not sure about the behavior of returning false here, can you explain why it should be false?

This follows the implementation of the `isFile()` method in Hadoop's FileSystem. Because that method is deprecated in Hadoop 3, its implementation is copied here directly.
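To illustrate the "missing path means not a file" behavior discussed in this thread, here is a minimal sketch. It is an analogue built on the JDK's `java.nio.file` API rather than Hadoop's `FileSystem`, and the class name `IsFileSketch` is hypothetical; it is not the code from this PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper; uses java.nio instead of Hadoop's FileSystem.
final class IsFileSketch {
  private IsFileSketch() {}

  /** Returns true only if the path exists and is a regular file. */
  static boolean isFile(Path path) throws IOException {
    try {
      return Files.readAttributes(path, BasicFileAttributes.class).isRegularFile();
    } catch (NoSuchFileException e) {
      // A path that does not exist cannot be a file, so answer false
      // instead of propagating the exception to the caller.
      return false;
    }
  }
}
```

With this shape, callers only have to handle genuine I/O failures; "does not exist" and "is a directory" both collapse to `false`.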

jerryshao · 2024-09-20T08:12:46Z

...hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java

@@ -65,8 +69,10 @@ public class GravitinoVirtualFileSystem extends FileSystem {
  private URI uri;
  private GravitinoClient client;
  private String metalakeName;
-  private Cache<NameIdentifier, Pair<Fileset, FileSystem>> filesetCache;
-  private ScheduledThreadPoolExecutor scheduler;
+  private Cache<NameIdentifier, FilesetCatalog> catalogCache;


Why do we need this cache?

Currently, to obtain the actual file location, we need to loadCatalog first, and then call asFilesetCatalog().getFileLocation(). If the Catalog is not cached here, two RPCs are required for each file operation. Considering that changes to the Catalog are not very frequent, these requests may be unnecessary.
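As a rough illustration of the saving described above, the sketch below caches the result of a load call per key so that repeated lookups skip the first RPC. It is a simplified stand-in: it uses a plain `ConcurrentHashMap` with no expiry instead of the `Cache` type in the diff, and `LoadOnceCache` and its `loadCount` counter are hypothetical names added only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a catalog cache; no eviction or expiry.
final class LoadOnceCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> loader;
  private final AtomicInteger loadCount = new AtomicInteger();

  LoadOnceCache(Function<K, V> loader) {
    this.loader = loader;
  }

  /** Returns the cached value, invoking the loader only on the first lookup of a key. */
  V get(K key) {
    return cache.computeIfAbsent(
        key,
        k -> {
          loadCount.incrementAndGet(); // counts simulated "loadCatalog" calls
          return loader.apply(k);
        });
  }

  /** Number of times the loader actually ran. */
  int loadCount() {
    return loadCount.get();
  }
}
```

The trade-off raised later in this review still applies: a cached entry can outlive a permission change on the server, so a real cache would need some expiry policy.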

jerryshao · 2024-09-20T08:13:12Z

...hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java

-  private ScheduledThreadPoolExecutor scheduler;
+  private Cache<NameIdentifier, FilesetCatalog> catalogCache;
+  private ScheduledThreadPoolExecutor catalogCleanScheduler;
+  private Cache<String, FileSystem> internalFileSystemCache;


What is the usage of this cache?

The cache here ensures that the FileSystem instances GVFS creates are cleaned up when GVFS is closed; otherwise we would need to rely on Hadoop's FileSystem mechanism to close them. In addition, GVFS creates new instances for accessing the underlying storage FileSystem, see: https://github.com/apache/gravitino/blob/main/clients/filesystem-hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java#L407. We do not use Hadoop's FileSystem cache because, in a multi-tenant scenario, an unauthorized user could obtain the authenticated FileSystem through FileSystem.get() even though that user may not be authorized for the corresponding storage.
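The ownership idea in this reply can be sketched as follows: the owner keeps its own per-scheme map of resources and closes all of them in its own `close()`. This is only an illustration with generic `Closeable` resources; `OwnedResourceCache` is a hypothetical name, and the real code holds Hadoop `FileSystem` instances in a `Cache<String, FileSystem>`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical owner of per-scheme resources; closes everything it created.
final class OwnedResourceCache<R extends Closeable> implements Closeable {
  private final Map<String, R> resources = new ConcurrentHashMap<>();
  private final Function<String, R> factory;

  OwnedResourceCache(Function<String, R> factory) {
    this.factory = factory;
  }

  /** Returns the resource for a scheme, creating it on first use. */
  R get(String scheme) {
    return resources.computeIfAbsent(scheme, factory);
  }

  /** Closes every cached resource; the first failure is rethrown after all were attempted. */
  @Override
  public void close() throws IOException {
    IOException first = null;
    for (R resource : resources.values()) {
      try {
        resource.close();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    resources.clear();
    if (first != null) {
      throw first;
    }
  }
}
```

Because the map is private to the owner, nothing outside it can look up an instance, which mirrors the multi-tenant concern about a shared process-wide cache.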

jerryshao · 2024-09-20T08:26:56Z

...hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java

+    NameIdentifier catalogIdent = NameIdentifier.of(metalakeName, identifier.namespace().level(1));
+    FilesetCatalog filesetCatalog =
+        catalogCache.get(
+            catalogIdent, ident -> client.loadCatalog(catalogIdent.name()).asFilesetCatalog());


Will there be an authorization issue if we cache the catalog locally?

If the permissions on the catalog are changed frequently, I think it is possible. However, in our scenario, we may not change the read permissions of the catalog frequently, because only the read permission is needed here, and we basically grant the read permission of this catalog to all users. I think we can remove the cache here, but the premise is that our client can support direct calls to getFileLocation(), instead of loading Catalog and then calling getFileLocation() every time.

jerryshao · 2024-09-20T08:29:18Z

...hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java

+
+    URI uri = new Path(actualFileLocation).toUri();
+    // we cache the fs for the same scheme, so we can reuse it
+    FileSystem fs =


I think the Hadoop client also caches this; do we need to do it on our side?

I think it is necessary to maintain our own FileSystem cache rather than use Hadoop's, because with Hadoop's FileSystem cache a user could call FileSystem.get() directly and obtain an authenticated FileSystem for storage resources that do not belong to them. In addition, we use FileSystem.newInstance(storageUri, getConf()) here, which creates a new FileSystem instance each time, so the user cannot obtain the authenticated instance through FileSystem.get().

jerryshao · 2024-09-21T08:42:50Z

...atalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java

+    } catch (IOException e) {
+      throw new GravitinoRuntimeException(
+          "Exception occurs when checking whether fileset: %s mounts a single file, exception: %s",
+          fileset.name(), e.getCause());


I think we can change this to something like:

throw new GravitinoRuntimeException(e, "xxxxx", xx, xx)
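A small sketch of why passing the exception as the cause helps: the original `IOException` stays attached with its stack trace, whereas formatting `e.getCause()` into the message only prints the nested cause, which is `null` for a plain `IOException`. The class `FormattedRuntimeException` below is hypothetical and merely mirrors the `(cause, format, args...)` shape suggested here; it is not the actual `GravitinoRuntimeException` source.

```java
// Hypothetical exception type mirroring the suggested (cause, format, args...) shape.
final class FormattedRuntimeException extends RuntimeException {
  private static final long serialVersionUID = 1L;

  FormattedRuntimeException(Throwable cause, String format, Object... args) {
    // The formatted text becomes the message; the cause is chained, not stringified.
    super(String.format(format, args), cause);
  }
}
```

A caller would then write something like `throw new FormattedRuntimeException(e, "Exception occurs when checking whether fileset: %s mounts a single file", name)` and rely on `getCause()` for the underlying failure.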

jerryshao · 2024-09-21T08:57:10Z

...hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java

+    // we cache the fs for the same scheme, so we can reuse it
+    FileSystem fs =
+        internalFileSystemCache.get(
+            uri.getScheme(),


Are we sure that getScheme will always return a string, not null? As I remember, the URI spec doesn't require a scheme; it is optional, and in that case getScheme returns null.

Yes, logically it will not be null here, because the URI is composed of the fileset's storage location and the sub path on the server side. When a fileset is created, its storage location is formalized so that it always has a scheme, see https://github.com/apache/gravitino/blob/main/catalogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java#L233. But I think we can add a null check here to tell users that the actual path is wrong.
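The null check mentioned above could look roughly like the sketch below. It uses `java.net.URI` directly, where `getScheme()` returns `null` for a scheme-less URI such as `/tmp/fileset/a.txt`. The helper name `requireScheme` and the error message are hypothetical, not taken from this PR.

```java
import java.net.URI;

// Hypothetical guard for a cache keyed by URI scheme.
final class SchemeCheck {
  private SchemeCheck() {}

  /** Returns the scheme of the URI, or fails fast if the URI has none. */
  static String requireScheme(URI uri) {
    String scheme = uri.getScheme();
    if (scheme == null) {
      // A relative or scheme-less location cannot select a storage file system.
      throw new IllegalArgumentException(
          "Scheme of the actual file location cannot be null: " + uri);
    }
    return scheme;
  }
}
```

Failing before the cache lookup keeps a `null` key out of the cache and gives the user a message that points at the bad location.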

jerryshao · 2024-09-21T09:03:16Z

Hi @xloya just some small comments. Seems like currently we only have Java implementation for client and gvfs, we should also have Python counterpart, am I right?

xloya · 2024-09-23T01:34:12Z

> Hi @xloya just some small comments. Seems like currently we only have Java implementation for client and gvfs, we should also have Python counterpart, am I right?

Yes, I will finish the `getFileLocation` implementation and the Python GVFS refactoring in follow-up PRs.

jerryshao

LGTM, thanks @xloya for your work.

… skeleton in Python Client (#4373) ### What changes were proposed in this pull request? Added an interface code skeleton in Python Client for obtaining the file location so that the client can report some necessary information for the server to audit and simplify some check logics in Python GVFS later. Depend on #4320. ### Why are the changes needed? Fix: #4279 ### How was this patch tested? Add UTs and ITs.

xloya force-pushed the refactor-hadoop-gvfs-logics branch 4 times, most recently from ca1c7cf to ca37ec5 Compare August 5, 2024 10:31

xloya mentioned this pull request Aug 5, 2024

[#4279] feat(client-python): Add the getFileLocation interface code skeleton in Python Client #4373

Merged

xloya force-pushed the refactor-hadoop-gvfs-logics branch from ca37ec5 to 2608f5a Compare August 5, 2024 11:34

xloya changed the title ~~[#4278] feat(filesystem): Refactor the getFilesetContext logics in hadoop GVFS~~ [#4278] feat(filesystem): Refactor the getFileLocation logics in hadoop GVFS Aug 9, 2024

xloya added 4 commits September 12, 2024 09:55

add its

a4b4168

fix

93a57a3

fix decode

ee2f475

move its into hadoop catalog it

ec410f0

xloya force-pushed the refactor-hadoop-gvfs-logics branch from 2608f5a to 99bd742 Compare September 12, 2024 11:09

xloya closed this Sep 12, 2024

xloya reopened this Sep 12, 2024

xloya force-pushed the refactor-hadoop-gvfs-logics branch from 99bd742 to 12f66ad Compare September 12, 2024 11:41

refactor

9bb958d

xloya force-pushed the refactor-hadoop-gvfs-logics branch from 12f66ad to 9bb958d Compare September 12, 2024 11:41

xloya self-assigned this Sep 12, 2024

xloya added 3 commits September 12, 2024 20:23

Merge branch 'add-its-for-get-file-location' into refactor-hadoop-gvf…

348856d

…s-logics

Merge branch 'main' into refactor-hadoop-gvfs-logics

9cee44c

add tests

2c0e79e

xloya requested a review from jerryshao September 13, 2024 06:58

jerryshao reviewed Sep 20, 2024

xloya added 3 commits September 20, 2024 18:02

Merge branch 'main' into refactor-hadoop-gvfs-logics

e5b6d71

fix comments

3365d2e

fix tests

730d8c4

jerryshao reviewed Sep 21, 2024

fix comments

c2b2b49

jerryshao approved these changes Sep 23, 2024

jerryshao merged commit be7a5e6 into apache:main Sep 23, 2024
26 checks passed
