Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1244] feat(API, meta): Add basic API and metadata design for file catalog #1487

Merged
merged 6 commits into from
Jan 16, 2024

Conversation

jerryshao
Copy link
Contributor

@jerryshao jerryshao commented Jan 15, 2024

What changes were proposed in this pull request?

This PR proposes to add the Fileset metadata support in Gravitino. Include the API, metadata design and implementation.

Why are the changes needed?

This is the first work of support file-based non-tabular data management in Gravitino.

Fix: #1244

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Add some UTs to test metadata, API test will be added when the implementation is achieved.

}

/** @return Name of the file object. */
String name();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the File entity is it's full path or just the name of it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the name.

The virtual path is derived from the name identifier, and the physical path is a must-set property. About the physical path, I'm still thinking about using a property or a must-set field, I'm not so sure about which way is a good way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use a must-set field, just like the warehouse location. If we use a property to keep the physical path, where should the file storage if we forget to set it?

@qqqttt123
Copy link
Contributor

qqqttt123 commented Jan 15, 2024

I think we can corporate with OpenDAL community after we implemented a file catalog. https://github.com/apache/incubator-opendal If we want to implement Rust client. It may be useful, too.

@qqqttt123
Copy link
Contributor

In general, we have three points to ensure.

  1. Do we need the concept of directory? Do you need the api of directories?
  2. For Hadoop compatible file apis, do we need the truncate api? Do we need to support soft-link? Do we need read/write api? It may be useful to record audit information for writing.
  3. Do we need managed file system? If so, maybe we should support upload api. Databricks implement upload api. https://docs.databricks.com/api/workspace/dbfs/put

@jerryshao
Copy link
Contributor Author

In general, we have three points to ensure.

  1. Do we need the concept of directory? Do you need the api of directories?

File is a virtual concept from metadata perspective, it will refer to both physical file or directory, I don't want additional directory API, which seems not necessary.

  1. For Hadoop compatible file apis, do we need the truncate api? Do we need to support soft-link? Do we need read/write api? It may be useful to record audit information for writing.

This is not the scope of this PR. And from my random thinking, I think truncate is not necessary as object store doesn't support it natively. Also for soft-link. We don't have to implement all the posix FS API once for all.

  1. Do we need managed file system? If so, maybe we should support upload api. Databricks implement upload api. https://docs.databricks.com/api/workspace/dbfs/put

Upload is not an API, it is just a read and write, we can do it but it will be supported naturally when HCFS interface is implemented.

* @param ident A file identifier.
* @return true If the file is dropped, false the file did not exist.
*/
boolean dropFile(NameIdentifier ident);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the api purgeFile? It may be a little weird to use the name purge. You should get the point of mine. It means that we can delete the file of external system.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in /metalake/catalog/schema/my_file/date=20201010

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I don't plan to add purgeFile API, this seems not so necessary for now. We can add it when we really have this requirement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in /metalake/catalog/schema/my_file/date=20201010

Yes, we can support it using FS API as I mentioned in the design doc.

* @param newName The new name of the file.
* @return The file change.
*/
static FileChange rename(String newName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this api will call underlying filesystem api rename? Will it modify the physical path property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will not. It will only manage the file metadata, not the physical file path.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an API to modify the physical path? We may want to change the cloud storage to another one but keep the virtual path unchanged. Then our user job will not need to recompile anymore.

Copy link
Contributor Author

@jerryshao jerryshao Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it a bit. Yearh, it is useful, but we can add it later, this PR is mainly for adding the skeleton, we can always add it later on.

message File {
enum Format {
CSV = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNKNOWN = 0 may be better. You can see https://blog.51cto.com/u_16162321/6496255

@jerryshao
Copy link
Contributor Author

@coolderli Can you please take a review when you have time? Thanks.

}

/** @return Name of the file object. */
String name();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use a must-set field, just like the warehouse location. If we use a property to keep the physical path, where should the file storage if we forget to set it?

* of a file object. A catalog implementation with {@link FileCatalog} should implement this
* interface.
*/
public interface File extends Auditable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very clear about the File. Is the File like a Volume on Databricks?
How should I understand a file in a schema? It only one real file or one directory

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the File is ambiguous. There may be some misunderstandings during communication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The File metadata concept is similar to a Volume in UC, it can be referred to a physical file or directory.

Yes, it is ambiguous, but I cannot figure out a better name, do you have a better suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using fileset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think the fileset is much better than file

* @param ident A file identifier.
* @return true If the file is dropped, false the file did not exist.
*/
boolean dropFile(NameIdentifier ident);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API will drop all files underlying the path, right? Do we support the subdirectory? If not, how can we deal with the TTL? for example I want to drop the files in /metalake/catalog/schema/my_file/date=20201010

* @param newName The new name of the file.
* @return The file change.
*/
static FileChange rename(String newName) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an API to modify the physical path? We may want to change the cloud storage to another one but keep the virtual path unchanged. Then our user job will not need to recompile anymore.

}

/** @return the format of the underlying filesets that is managed by this file object. */
Format format();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set the format for the File? If we did, it will assume all files in a File must be the same format.
is there possible different file format on different directory?

For example, we use orc as our current file format, but one day we want to change it to parquet. What should I do with the history files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I delete the format, seems it is not so useful and brings inconvenience for users.

*
* @return The storage location of the fileset object.
*/
String storageLocation();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does need a virtual path property here? I saw that there is a virtual path in the volume of Databricks, and resources can be accessed through this virtual path:
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does need a virtual path property here? I saw that there is a virtual path in the volume of Databricks, and resources can be accessed through this virtual path: image

After thinking about it again, if the conversion from virtual path to metalake.catalog.schema.fileset can be realized in Proxy FS, it seems that it is not necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Virtual path is generated from NameIdentifier, which is "metalake.catalog.schema.fileset", so the virtual path will be "/metadata/catalog/schema/fileset".

@qqqttt123
Copy link
Contributor

Another discussion. Do we want to have some another data optimization in the file catalog? For example, we can transfer some cold data from the HDFS to S3. we can use more effective compression codec. We can clean some temporary files.

@jerryshao
Copy link
Contributor Author

Another discussion. Do we want to have some another data optimization in the file catalog? For example, we can transfer some cold data from the HDFS to S3. we can use more effective compression codec. We can clean some temporary files.

This seems out of the scope of current file catalog. But I think it is doable with file catalog. What we need is to have several tools around file catalog.

@jerryshao
Copy link
Contributor Author

@qqqttt123 @yuqi1129 @xloya @coolderli Can you please review again?

@xloya
Copy link
Collaborator

xloya commented Jan 16, 2024

Overall looks nice to me, can not wait to see next pr :)

@coolderli
Copy link
Collaborator

LGTM!

@coolderli
Copy link
Collaborator

coolderli commented Jan 16, 2024

Another discussion. Do we want to have some another data optimization in the file catalog? For example, we can transfer some cold data from the HDFS to S3. we can use more effective compression codec. We can clean some temporary files.

This seems out of the scope of current file catalog. But I think it is doable with file catalog. What we need is to have several tools around file catalog.

Maybe later we can provide some services to govern tables and files on gravitino, for example, manage compaction、expire snapshots, etc.

optional string comment = 3;
Type type = 4;
optional string storage_location = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide a default storage_location if storage_location is absent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will use the catalog/schema's location to figure out a location for fileset if it is not set.

Copy link
Contributor

@qqqttt123 qqqttt123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryshao jerryshao merged commit f0129d4 into apache:main Jan 16, 2024
12 checks passed
@jerryshao
Copy link
Contributor Author

Thanks all for your comments, greatly appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask]Complete the metadata and interface design for file type catalog
5 participants