[#4278] feat(filesystem): Refactor the getFileLocation logics in hadoop GVFS #4320
Conversation
@jerryshao Please take a look at this PR when you have time, thanks. Since some logic in GVFS needs to be refactored into the server-side Hadoop catalog, I did not split it into another PR, to keep the logical changes clearer.
Gentle ping @jerryshao.
      return locationPath.getFileSystem(hadoopConf).getFileStatus(locationPath).isFile();
    } catch (FileNotFoundException e) {
      // We should always return false here, same with the logic in `FileSystem.isFile(Path f)`.
      return false;
I'm not sure about the behavior of returning false here; can you explain why it should be false?
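The convention being referenced can be illustrated with a stdlib analogue. This is a hypothetical stand-in, not the Hadoop code under review: `java.nio.file.Files.isRegularFile` treats a missing path as "not a file" rather than propagating a not-found error, which is the same contract `FileSystem.isFile(Path)` documents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IsFileCheck {
  // Returns true only when the path exists and is a regular file; a missing
  // path yields false instead of an exception, mirroring FileSystem.isFile.
  static boolean isSingleFile(Path location) {
    return Files.isRegularFile(location);
  }

  public static void main(String[] args) throws IOException {
    Path missing = Paths.get("definitely-missing-file-xyz");
    System.out.println(isSingleFile(missing)); // prints "false"

    Path tmp = Files.createTempFile("fileset", ".dat");
    System.out.println(isSingleFile(tmp)); // prints "true"
    Files.delete(tmp);
  }
}
```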
@@ -65,8 +69,10 @@ public class GravitinoVirtualFileSystem extends FileSystem {
  private URI uri;
  private GravitinoClient client;
  private String metalakeName;
  private Cache<NameIdentifier, Pair<Fileset, FileSystem>> filesetCache;
  private ScheduledThreadPoolExecutor scheduler;
  private Cache<NameIdentifier, FilesetCatalog> catalogCache;
Why do we need this cache?
Currently, to obtain the actual file location, we need to call loadCatalog first and then asFilesetCatalog().getFileLocation(). If the catalog is not cached here, two RPCs are required for each file operation. Considering that changes to the catalog are not very frequent, these extra requests may be unnecessary.
  private ScheduledThreadPoolExecutor scheduler;
  private Cache<NameIdentifier, FilesetCatalog> catalogCache;
  private ScheduledThreadPoolExecutor catalogCleanScheduler;
  private Cache<String, FileSystem> internalFileSystemCache;
What is the usage of this cache?
The cache here ensures that the created FileSystem instances can be cleaned up when GVFS is closed; otherwise we would have to rely on Hadoop's FileSystem mechanism to close them. In addition, GVFS creates new instances for accessing the underlying storage FileSystem, see: https://github.com/apache/gravitino/blob/main/clients/filesystem-hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystem.java#L407. The reason for not using Hadoop's FileSystem cache is that, in a multi-tenant scenario, an unauthorized user could obtain an already-authenticated FileSystem through FileSystem.get(), even though that user may not be authorized for the corresponding storage.
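The ownership argument above can be sketched without Hadoop at all. Assuming a toy cache of `Closeable` instances (a hypothetical stand-in for `internalFileSystemCache`), closing the virtual file system closes every instance it created, which is exactly what a process-wide `FileSystem.get()` cache cannot guarantee per tenant:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClosingCacheSketch implements Closeable {
  private final Map<String, Closeable> internalFileSystemCache = new ConcurrentHashMap<>();

  // Analogous to FileSystem.newInstance(...): each scheme gets its own instance,
  // owned by this cache rather than shared via a process-wide cache.
  Closeable getOrCreate(String scheme, java.util.function.Function<String, Closeable> factory) {
    return internalFileSystemCache.computeIfAbsent(scheme, factory);
  }

  @Override
  public void close() throws IOException {
    // Closing the virtual file system closes every cached instance it created.
    for (Closeable fs : internalFileSystemCache.values()) {
      fs.close();
    }
    internalFileSystemCache.clear();
  }

  public static void main(String[] args) throws IOException {
    ClosingCacheSketch gvfs = new ClosingCacheSketch();
    boolean[] closed = {false};
    gvfs.getOrCreate("hdfs", s -> () -> closed[0] = true);
    gvfs.close();
    System.out.println(closed[0]); // prints "true": the cached instance was closed with GVFS
  }
}
```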
  NameIdentifier catalogIdent = NameIdentifier.of(metalakeName, identifier.namespace().level(1));
  FilesetCatalog filesetCatalog =
      catalogCache.get(
          catalogIdent, ident -> client.loadCatalog(catalogIdent.name()).asFilesetCatalog());
Will there be an authorization issue if we cache the catalog locally?
If the permissions on the catalog change frequently, I think it is possible. However, in our scenario we rarely change the read permission of the catalog, because only read permission is needed here and we basically grant read permission on this catalog to all users. I think we could remove the cache here, but the premise is that our client can call getFileLocation() directly, instead of loading the catalog and then calling getFileLocation() every time.
  URI uri = new Path(actualFileLocation).toUri();
  // we cache the fs for the same scheme, so we can reuse it
  FileSystem fs =
I think the Hadoop client will also do the caching; do we need to do this on our side?
I think it is necessary to maintain our own FileSystem cache rather than use Hadoop's, because with Hadoop's FileSystem cache a user could directly obtain, via FileSystem.get(), an authenticated FileSystem for storage resources that do not belong to them. In addition, we use the FileSystem.newInstance(storageUri, getConf()) method here, which ensures that a new FileSystem instance is created each time, so the user cannot obtain the authenticated instance through FileSystem.get().
  } catch (IOException e) {
    throw new GravitinoRuntimeException(
        "Exception occurs when checking whether fileset: %s mounts a single file, exception: %s",
        fileset.name(), e.getCause());
I think we can change it to something like:
throw new GravitinoRuntimeException(e, "xxxxx", xx, xx)
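The point of the suggestion is to pass the exception as the cause instead of formatting e.getCause() into the message, so the full stack trace survives. A minimal sketch, using a hypothetical exception class that mirrors a (cause, format, args) constructor (not the real GravitinoRuntimeException):

```java
public class CauseDemo {
  // Hypothetical mirror of a (cause, format, args...) exception constructor.
  static class DemoRuntimeException extends RuntimeException {
    DemoRuntimeException(Throwable cause, String fmt, Object... args) {
      super(String.format(fmt, args), cause);
    }
  }

  public static void main(String[] args) {
    Exception io = new java.io.IOException("connection reset");
    DemoRuntimeException e =
        new DemoRuntimeException(io, "Exception occurs when checking fileset: %s", "fs1");
    System.out.println(e.getMessage());     // prints the formatted message
    System.out.println(e.getCause() == io); // prints "true": the cause is kept intact
  }
}
```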
Done
  // we cache the fs for the same scheme, so we can reuse it
  FileSystem fs =
      internalFileSystemCache.get(
          uri.getScheme(),
Do we make sure that getScheme will always return a string, not null? As I remember, the URI spec doesn't require a scheme; it is optional, and the call to getScheme will return null.
Yes, logically it will not be null here, because the URI is composed of the fileset's storage location (when creating a fileset, the storage location is formalized so that it always has a scheme, see https://github.com/apache/gravitino/blob/main/catalogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java#L233) and the sub-path on the server side. But I think we can add a null check here to remind users when they are using a wrong actual path.
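The nullability in question is easy to see with java.net.URI alone; this sketch (paths are made up) contrasts a bare path with a formalized storage location and shows the kind of null check discussed above:

```java
import java.net.URI;

public class SchemeCheck {
  public static void main(String[] args) {
    URI bare = URI.create("/user/hive/warehouse/f1");
    URI formalized = URI.create("hdfs://namenode:8020/user/hive/warehouse/f1");

    // Schemes are optional in the URI spec, so a bare path has none.
    System.out.println(bare.getScheme());       // prints "null"
    System.out.println(formalized.getScheme()); // prints "hdfs"

    // The suggested guard before using the scheme as a cache key:
    if (bare.getScheme() == null) {
      System.out.println("invalid actual path: missing scheme");
    }
  }
}
```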
Done
Hi @xloya, just some small comments. It seems that currently we only have a Java implementation for the client and GVFS; we should also have a Python counterpart, am I right?
Yeah, I will finish the implementations of
LGTM, thanks @xloya for your work.
… skeleton in Python Client (#4373)
What changes were proposed in this pull request?
Added an interface code skeleton in Python Client for obtaining the file location so that the client can report some necessary information for the server to audit, and simplify some check logics in Python GVFS later. Depends on #4320.
Why are the changes needed?
Fix: #4279
How was this patch tested?
Add UTs and ITs.
What changes were proposed in this pull request?
Refactor the getFileLocation logic in Hadoop GVFS so that, when sending a request to the server to obtain the file location, the current data operation and operation path information are reported.
Why are the changes needed?
Fix: #4278
How was this patch tested?
Added and refactored some UTs; existing ITs continue to pass.