[#5220] improvement(hadoop-catalog): Optimize the property keys for Hadoop catalog. #5244

Merged
174 commits merged into apache:main on Oct 30, 2024

Conversation

yuqi1129 (Contributor)

What changes were proposed in this pull request?

Replace the property keys that use the gravitino.bypass prefix with more user-friendly ones.

Why are the changes needed?

For better user experience.

Fix: #5220

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Existing UTs and ITs.

@@ -95,10 +96,8 @@ protected String defaultBaseLocation() {

protected void createCatalog() {
Contributor

Can we move these ITs to each bundle?

yuqi1129 (Contributor Author), Oct 29, 2024

Yes, I can try it, but not in this PR: that would make this PR large and make the changes it introduces hard to test. I created #5325 to follow up.

jerryshao (Contributor) left a comment

I think you should consider this: if a user implements a custom FS with an FS provider, can they integrate it into the Gravitino server's Hadoop catalog and the Java and Python gvfs clients without any code change in Gravitino? Take this as a principle; if not, please make it work.


private static final String OSS_FILESYSTEM_IMPL = "fs.oss.impl";

public static final Map<String, String> GRAVITINO_KEY_TO_OSS_HADOOP_KEY =
Contributor

Why do we want to make this public?

Contributor

Again, why do we turn it back to public?

Contributor Author

This value will be used in the GravitinoVirtualFileSystemOSSIT test; the other constants are similar.

Contributor

This is because the IT and the code are not placed together, so you have to make it public. Can you please add a @VisibleForTesting annotation to it?
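For reference, a minimal sketch of the suggestion (assuming Guava's @VisibleForTesting is on the classpath; the example mapping is an assumption, not the exact entries in the PR):

```java
import com.google.common.annotations.VisibleForTesting;
import com.google.common.collect.ImmutableMap;
import java.util.Map;

// Public only so the out-of-module IT can reach it; the annotation documents that.
@VisibleForTesting
public static final Map<String, String> GRAVITINO_KEY_TO_OSS_HADOOP_KEY =
    ImmutableMap.of("oss-endpoint", "fs.oss.endpoint"); // illustrative mapping
```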

bundles/gcp-bundle/build.gradle.kts (outdated review thread, resolved)
@@ -32,6 +32,8 @@
*/
public interface FileSystemProvider {

String GRAVITINO_BYPASS = "gravitino.bypass.";
Contributor

Can you please add some comments to this variable, so users know how to leverage it?

Contributor Author

added
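A sketch of the kind of documentation that fits here (the wording is illustrative, not necessarily the exact Javadoc merged in this PR):

```java
/**
 * Prefix for pass-through configuration. Keys that start with
 * "gravitino.bypass." are forwarded to the underlying Hadoop FileSystem
 * with the prefix stripped; for example, "gravitino.bypass.fs.s3a.endpoint"
 * becomes "fs.s3a.endpoint".
 */
String GRAVITINO_BYPASS = "gravitino.bypass.";
```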

@@ -107,5 +112,46 @@ public class GravitinoVirtualFileSystemConfiguration {
public static final long FS_GRAVITINO_FILESET_CACHE_EVICTION_MILLS_AFTER_ACCESS_DEFAULT =
1000L * 60 * 60;

private static final Map<String, String> GVFS_S3_KEY_TO_HADOOP_KEY =
Contributor

If we add a new FS provider, do we need to update the GVFS code to make it work? If so, I think it is not a good design; we should also hide the client-side configuration inside each provider.

Contributor Author

In the long run, we could provide a method like requiredPropertyKey in FileSystemProvider and load providers via the Java service loader. That way, these properties could be registered with the client dynamically. Currently, we cannot avoid this hard-coding.
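A hypothetical sketch of that idea (nothing like this exists in this PR; the method name and shape are illustrative only):

```java
import java.util.Collections;
import java.util.Set;

public interface FileSystemProvider {
  // Hypothetical future addition: the provider-specific property keys a
  // client must supply, discoverable via ServiceLoader instead of being
  // hard-coded in gvfs.
  default Set<String> requiredPropertyKeys() {
    return Collections.emptySet();
  }
}
```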

Contributor

I still suggest moving this to each provider; we don't have to expose them here.

Contributor Author

What hinders me is that this would make the hadoop3-filesystem module explicitly depend on the bundle jars at runtime. If that is not a big problem, I would like to do it.

Contributor

Why would there be a dependency? If the user sets the configurations via the Hadoop Configuration in gvfs, then when gvfs creates a provider it can pass the configs from gvfs' Configuration into the provider's config map. Each provider can use this config map to check the configurations and initialize the FS.
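A minimal sketch of that flow, assuming the FileSystemProvider#getFileSystem(Path, Map) shape discussed in this PR (not verbatim project code):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProviderConfigFlow {
  static FileSystem openThroughProvider(
      FileSystemProvider provider, Path path, Configuration hadoopConf) throws IOException {
    // Flatten gvfs' Hadoop Configuration into a plain map...
    Map<String, String> providerConfig = new HashMap<>();
    hadoopConf.forEach(entry -> providerConfig.put(entry.getKey(), entry.getValue()));
    // ...and let the provider interpret the keys and initialize the FS itself,
    // so gvfs needs no compile-time knowledge of provider-specific keys.
    return provider.getFileSystem(path, providerConfig);
  }
}
```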

@yuqi1129 (Contributor Author)

> I think you should consider this: if a user implements a custom FS with an FS provider, can they integrate it into the Gravitino server's Hadoop catalog and the Java and Python gvfs clients without any code change in Gravitino? Take this as a principle; if not, please make it work.

In most cases, the current implementation can satisfy users' needs. They can use the gravitino.bypass mechanism to pass keys carrying the gravitino.bypass prefix to the file system providers, which then strip the prefix to obtain the real keys; or they can pass the properties directly.

For example, if we use the following key/value pairs as the properties when creating a Hadoop catalog,

k1=value1
gravitino.bypass.k2=value2
...

Then we pass these configurations on to the file system provider as:

k1=value1
k2=value2

It's quite flexible now: what a property key stands for is up to the user who implements the custom file system provider. Key k1 may stand for a core configuration or be entirely unused.

The following three keys are equivalent: users can pass endpoint information to the file system provider under any of them, and it will be assigned to the key fs.s3a.endpoint:

fs.s3a.endpoint=xxxx
s3-endpoint=xxxx
gravitino.bypass.fs.s3a.endpoint=xxxx

I'm not sure whether this kind of property conversion is acceptable to you.

If we need to specify each required property explicitly, as the current S3 implementation does, users must use s3-endpoint in the Gravitino server and s3.endpoint in the GVFS Java client to pass endpoint information. In that case, I believe we would need a new interface method like requireConfigurationKey() to tell users which properties they are required to pass. I'm hesitant to do so, as it seems to make things more complex without being extensible enough.
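To make the equivalence claim above concrete (a hedged sketch: the priority shown assumes the intent of the getConfigMap logic quoted below, where predefined short keys win over plain keys, which win over bypass keys):

```java
import java.util.Map;

public class EndpointKeyEquivalence {
  // All three spellings target the same Hadoop key, "fs.s3a.endpoint".
  static final Map<String, String> USER_PROPS = Map.of(
      "s3-endpoint", "http://endpoint-a",                        // predefined Gravitino key
      "fs.s3a.endpoint", "http://endpoint-b",                    // plain Hadoop key
      "gravitino.bypass.fs.s3a.endpoint", "http://endpoint-c");  // bypass-prefixed key
  // After conversion the provider would see fs.s3a.endpoint=http://endpoint-a,
  // because the predefined short key has the highest priority.
}
```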

@yuqi1129 dismissed a stale review via 27393e5 on October 29, 2024 at 10:18
Comment on lines 120 to 142
config.forEach(
(k, v) -> {
if (predefinedKeys.containsKey(k)) {
String key = predefinedKeys.get(k);
highestPriorityKeys.add(key);
result.put(key, v);
}

if (!k.startsWith(GRAVITINO_BYPASS)) {
// If the key does not start with 'gravitino.bypass' and is not in the highest priority
// keys, set the value to the configuration.
if (!highestPriorityKeys.contains(k)) {
result.put(k, v);
}
} else {
// If the key starts with 'gravitino.bypass', remove the prefix and set the value to the
// configuration if the key does not exist in the configuration.
String key = k.replace(GRAVITINO_BYPASS, "");
if (!result.containsKey(key)) {
result.put(key, v);
}
}
});
Contributor

Can we simplify this code? It's a bit hard to understand. I think it is OK to iterate several times; just make it easy to read.
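One possible multi-pass rewrite in that spirit (a sketch only, assuming the intended priority is predefined keys over plain keys over bypass keys; this is not the code that was merged):

```java
Map<String, String> result = new HashMap<>();

// Pass 1, lowest priority: 'gravitino.bypass.'-prefixed keys, prefix stripped.
config.forEach(
    (k, v) -> {
      if (k.startsWith(GRAVITINO_BYPASS)) {
        result.put(k.substring(GRAVITINO_BYPASS.length()), v);
      }
    });

// Pass 2: plain keys override values derived from bypass keys.
config.forEach(
    (k, v) -> {
      if (!k.startsWith(GRAVITINO_BYPASS) && !predefinedKeys.containsKey(k)) {
        result.put(k, v);
      }
    });

// Pass 3, highest priority: predefined Gravitino keys, mapped to Hadoop keys.
config.forEach(
    (k, v) -> {
      if (predefinedKeys.containsKey(k)) {
        result.put(predefinedKeys.get(k), v);
      }
    });
```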

Contributor

Can you add a UT for this?


@@ -107,4 +112,64 @@ protected void createCatalog() {
protected String generateLocation(String filesetName) {
return String.format("%s/%s", defaultBaseLocation, filesetName);
}

@Test
public void testCreateSchemaAndFilesetWithSpecialLocation() {
Contributor Author

This is to fix the bug introduced by #5296; we need to consider other Hadoop catalogs like S3, OSS, and GCS.


FileSystemProvider provider = fileSystemProvidersMap.get(scheme);
if (provider == null) {
throw new GravitinoRuntimeException(
"Unsupported file system scheme: %s for %s.",
scheme, GravitinoVirtualFileSystemConfiguration.GVFS_SCHEME);
}

Map<String, String> maps = getConfigMap(getConf());
jerryshao (Contributor), Oct 29, 2024

What does the gvfs configuration look like now? Is it aligned with the server-side conf, or does it still use "."? I cannot find the code anymore.

Contributor Author

They are the same as those on the Gravitino server side.
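For illustration, a gvfs client would then use the same short keys as the catalog properties (a hedged sketch; treat the exact key names, values, and the fs.gvfs.impl class name as assumptions based on this thread rather than verbatim project code):

```java
import org.apache.hadoop.conf.Configuration;

public class GvfsConfExample {
  static Configuration gvfsConf() {
    Configuration conf = new Configuration();
    conf.set("fs.gvfs.impl",
        "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
    // Same short keys as the server-side catalog properties (values illustrative).
    conf.set("s3-endpoint", "http://localhost:9000");
    conf.set("s3-access-key-id", "accessKey");
    conf.set("s3-secret-access-key", "secretKey");
    return conf;
  }
}
```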

Contributor

Have you verified manually using gvfs?

Contributor Author

Yes, GravitinoVirtualFileSystemS3IT runs in the CI and covers the changes.

@jerryshao (Contributor)

Overall LGTM, please fix the CI issue.

@jerryshao merged commit b5a3723 into apache:main Oct 30, 2024
26 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 30, 2024
… for Hadoop catalog. (#5244)

jerqi pushed a commit to qqqttt123/gravitino that referenced this pull request Oct 30, 2024
…s keys for Hadoop catalog. (apache#5244)
Labels
branch-0.7 Automatically cherry-pick commit to branch-0.7

Successfully merging this pull request may close these issues.

[Improvement] Optimize cloud storage property keys when creating Hadoop catalog
2 participants