-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#1244] feat(API, meta): Add basic API and metadata design for file catalog #1487
Conversation
} | ||
|
||
/** @return Name of the file object. */ | ||
String name(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name
of the File entity is it's full path or just the name of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the name.
The virtual path is derived from the name identifier, and the physical path is a must-set property. About the physical path, I'm still thinking about using a property or a must-set field, I'm not so sure about which way is a good way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use a must-set field, just like the warehouse location. If we use a property to keep the physical path, where should the file storage if we forget to set it?
I think we can corporate with OpenDAL community after we implemented a file catalog. https://github.com/apache/incubator-opendal If we want to implement Rust client. It may be useful, too. |
In general, we have three points to ensure.
|
File is a virtual concept from metadata perspective, it will refer to both physical file or directory, I don't want additional directory API, which seems not necessary.
This is not the scope of this PR. And from my random thinking, I think truncate is not necessary as object store doesn't support it natively. Also for soft-link. We don't have to implement all the posix FS API once for all.
Upload is not an API, it is just a |
* @param ident A file identifier. | ||
* @return true If the file is dropped, false the file did not exist. | ||
*/ | ||
boolean dropFile(NameIdentifier ident); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the api purgeFile
? It may be a little weird to use the name purge
. You should get the point of mine. It means that we can delete the file of external system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in /metalake/catalog/schema/my_file/date=20201010
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I don't plan to add purgeFile API, this seems not so necessary for now. We can add it when we really have this requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in
/metalake/catalog/schema/my_file/date=20201010
Yes, we can support it using FS API as I mentioned in the design doc.
* @param newName The new name of the file. | ||
* @return The file change. | ||
*/ | ||
static FileChange rename(String newName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this api will call underlying filesystem api rename
? Will it modify the physical path property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will not. It will only manage the file metadata, not the physical file path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an API to modify the physical path? We may want to change the cloud storage to another one but keep the virtual path unchanged. Then our user job will not need to recompile anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it a bit. Yearh, it is useful, but we can add it later, this PR is mainly for adding the skeleton, we can always add it later on.
message File { | ||
enum Format { | ||
CSV = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNKNOWN = 0
may be better. You can see https://blog.51cto.com/u_16162321/6496255
@coolderli Can you please take a review when you have time? Thanks. |
} | ||
|
||
/** @return Name of the file object. */ | ||
String name(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use a must-set field, just like the warehouse location. If we use a property to keep the physical path, where should the file storage if we forget to set it?
* of a file object. A catalog implementation with {@link FileCatalog} should implement this | ||
* interface. | ||
*/ | ||
public interface File extends Auditable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very clear about the File
. Is the File
like a Volume
on Databricks?
How should I understand a file in a schema
? It only one real file or one directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the File
is ambiguous. There may be some misunderstandings during communication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The File
metadata concept is similar to a Volume
in UC, it can be referred to a physical file or directory.
Yes, it is ambiguous, but I cannot figure out a better name, do you have a better suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using fileset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think the fileset
is much better than file
* @param ident A file identifier. | ||
* @return true If the file is dropped, false the file did not exist. | ||
*/ | ||
boolean dropFile(NameIdentifier ident); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in /metalake/catalog/schema/my_file/date=20201010
* @param newName The new name of the file. | ||
* @return The file change. | ||
*/ | ||
static FileChange rename(String newName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an API to modify the physical path? We may want to change the cloud storage to another one but keep the virtual path unchanged. Then our user job will not need to recompile anymore.
} | ||
|
||
/** @return the format of the underlying filesets that is managed by this file object. */ | ||
Format format(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set the format for the File
? If we did, it will assume all files in a File
must be the same format.
is there possible different file format on different directory?
For example, we use orc as our current file format, but one day we want to change it to parquet. What should I do with the history files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I delete the format
, seems it is not so useful and brings inconvenience for users.
* | ||
* @return The storage location of the fileset object. | ||
*/ | ||
String storageLocation(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does need a virtual path property here? I saw that there is a virtual path in the volume of Databricks, and resources can be accessed through this virtual path:
After thinking about it again, if the conversion from virtual path to metalake.catalog.schema.fileset can be realized in Proxy FS, it seems that it is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Virtual path is generated from NameIdentifier, which is "metalake.catalog.schema.fileset", so the virtual path will be "/metadata/catalog/schema/fileset".
Another discussion. Do we want to have some another data optimization in the file catalog? For example, we can transfer some cold data from the HDFS to S3. we can use more effective compression codec. We can clean some temporary files. |
This seems out of the scope of current file catalog. But I think it is doable with file catalog. What we need is to have several tools around file catalog. |
@qqqttt123 @yuqi1129 @xloya @coolderli Can you please review again? |
Overall looks nice to me, can not wait to see next pr :) |
LGTM! |
Maybe later we can provide some services to govern tables and files on gravitino, for example, manage compaction、expire snapshots, etc. |
optional string comment = 3; | ||
Type type = 4; | ||
optional string storage_location = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we provide a default storage_location if storage_location is absent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will use the catalog/schema's location to figure out a location for fileset if it is not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. thanks @jerryshao @coolderli @xloya
Thanks all for your comments, greatly appreciated. |
What changes were proposed in this pull request?
This PR proposes to add the
Fileset
metadata support in Gravitino. Include the API, metadata design and implementation.Why are the changes needed?
This is the first work of support file-based non-tabular data management in Gravitino.
Fix: #1244
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Add some UTs to test metadata, API test will be added when the implementation is achieved.