Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#4278] feat(filesystem): Refactor the getFileLocation logics in hadoop GVFS #4320

Merged
merged 12 commits into from
Sep 23, 2024

Conversation

xloya
Copy link
Collaborator

@xloya xloya commented Jul 31, 2024

What changes were proposed in this pull request?

Refactor the getFileLocation logic in Hadoop GVFS so that when sending a request to the server to obtain the file location, the current data operation and operation path information are reported.

Why are the changes needed?

Fix: #4278

How was this patch tested?

Add and refactor some UTs, and existing ITs maintain normally.

@xloya xloya force-pushed the refactor-hadoop-gvfs-logics branch 4 times, most recently from ca1c7cf to ca37ec5 Compare August 5, 2024 10:31
@xloya xloya force-pushed the refactor-hadoop-gvfs-logics branch from ca37ec5 to 2608f5a Compare August 5, 2024 11:34
@xloya xloya changed the title [#4278] feat(filesystem): Refactor the getFilesetContext logics in hadoop GVFS [#4278] feat(filesystem): Refactor the getFileLocation logics in hadoop GVFS Aug 9, 2024
@xloya xloya closed this Sep 12, 2024
@xloya xloya reopened this Sep 12, 2024
@xloya xloya self-assigned this Sep 12, 2024
@xloya
Copy link
Collaborator Author

xloya commented Sep 13, 2024

@jerryshao Please take a look of this PR when you have time, thanks. Since some logic in GVFS needs to be refactored into Server Catalog Hadoop, I did not split it to another PR to make the logical changes clearer.

@xloya
Copy link
Collaborator Author

xloya commented Sep 20, 2024

Gentle pin @jerryshao.

return locationPath.getFileSystem(hadoopConf).getFileStatus(locationPath).isFile();
} catch (FileNotFoundException e) {
// We should always return false here, same with the logic in `FileSystem.isFile(Path f)`.
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the behavior of returning false here, can you explain why it should be false?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of the isFile() method of Hadoop FileSystem is referenced here. Since this method has been deprecated in Hadoop 3, its implementation is directly copied here.
image

@@ -65,8 +69,10 @@ public class GravitinoVirtualFileSystem extends FileSystem {
private URI uri;
private GravitinoClient client;
private String metalakeName;
private Cache<NameIdentifier, Pair<Fileset, FileSystem>> filesetCache;
private ScheduledThreadPoolExecutor scheduler;
private Cache<NameIdentifier, FilesetCatalog> catalogCache;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this cache?

Copy link
Collaborator Author

@xloya xloya Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, to obtain the actual file location, we need to loadCatalog first, and then call asFilesetCatalog().getFileLocation(). If the Catalog is not cached here, two RPCs are required for each file operation. Considering that changes to the Catalog are not very frequent, these requests may be unnecessary.

private ScheduledThreadPoolExecutor scheduler;
private Cache<NameIdentifier, FilesetCatalog> catalogCache;
private ScheduledThreadPoolExecutor catalogCleanScheduler;
private Cache<String, FileSystem> internalFileSystemCache;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usage of this cache?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache here ensures that the created FileSystem can be cleaned up at the same time when GVFS is closed, otherwise we need to rely on the Hadoop FileSystem mechanism to close it. In addition, we use the method of creating new instances for accessing the underlying storage FileSystem in GVFS, see: https://github.com/apache/gravitino/blob/main/clients/filesystem-hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java#L407. The reason for not using Hadoop's FileSystem cache is that in a multi-tenant scenario, an unauthorized user may obtain the authenticated FileSystem through FileSystem.get(), and this user may not have authorization for the corresponding storage.

NameIdentifier catalogIdent = NameIdentifier.of(metalakeName, identifier.namespace().level(1));
FilesetCatalog filesetCatalog =
catalogCache.get(
catalogIdent, ident -> client.loadCatalog(catalogIdent.name()).asFilesetCatalog());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be an authorization issue if we cache the catalog in local?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the permissions on the catalog are changed frequently, I think it is possible. However, in our scenario, we may not change the read permissions of the catalog frequently, because only the read permission is needed here, and we basically grant the read permission of this catalog to all users. I think we can remove the cache here, but the premise is that our client can support direct calls to getFileLocation(), instead of loading Catalog and then calling getFileLocation() every time.


URI uri = new Path(actualFileLocation).toUri();
// we cache the fs for the same scheme, so we can reuse it
FileSystem fs =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hadoop client will also do the cache, do we need to do this in our side?

Copy link
Collaborator Author

@xloya xloya Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is necessary not to use Hadoop's FileSystem cache but to maintain our own FileSystem cache, because if Hadoop's FileSystem cache is used, the user may directly obtain the authenticated FileSystem operation of FileSystem.get(), which does not belong to his storage resources. In addition, here we use the FileSystem.newInstance(storageUri, getConf()) method, which ensures that each time a new FileSystem instance is created, the user cannot directly obtain the authenticated FileSystem instance through FileSystem.get().

} catch (IOException e) {
throw new GravitinoRuntimeException(
"Exception occurs when checking whether fileset: %s mounts a single file, exception: %s",
fileset.name(), e.getCause());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change to like:

throw new GravitinoRuntimeException(e, "xxxxx", xx, xx)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// we cache the fs for the same scheme, so we can reuse it
FileSystem fs =
internalFileSystemCache.get(
uri.getScheme(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we make sure that getScheme will always return a string, not null? As I remembered, the URI spec doesn't require a scheme, it is optional, and the call of getScheme will return null.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, logically, there will be not null here, because the URI here is composed of the storage location of the fileset (when creating a fileset, the storage location will be formalized, which will make the storage location always having the scheme, see https://github.com/apache/gravitino/blob/main/catalogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java#L233) and the sub path on the server side. But I think we can add a null value check here to remind users that they are using the wrong actual path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jerryshao
Copy link
Contributor

Hi @xloya just some small comments. Seems like currently we only have Java implementation for client and gvfs, we should also have Python counterpart, am I right?

@xloya
Copy link
Collaborator Author

xloya commented Sep 23, 2024

Hi @xloya just some small comments. Seems like currently we only have Java implementation for client and gvfs, we should also have Python counterpart, am I right?

Yeah, I will finish the implementations of getFileLocation and refactors of python gvfs in the following PRs.

Copy link
Contributor

@jerryshao jerryshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @xloya for your work.

@jerryshao jerryshao merged commit be7a5e6 into apache:main Sep 23, 2024
26 checks passed
jerryshao pushed a commit that referenced this pull request Sep 24, 2024
… skeleton in Python Client (#4373)

### What changes were proposed in this pull request?

Added an interface code skeleton in Python Client for obtaining the file
location so that the client can report some necessary information for
the server to audit and simplify some check logics in Python GVFS later.
Depend on #4320.

### Why are the changes needed?

Fix: #4279 

### How was this patch tested?

Add UTs and ITs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] Refactor the getFileLocation logic in the Hadoop GVFS
2 participants