Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve(filesystem-hadoop): support path without scheme for gvfs api #2779

Merged
merged 6 commits into from
Apr 15, 2024

Conversation

coolderli
Copy link
Collaborator

@coolderli coolderli commented Apr 2, 2024

What changes were proposed in this pull request?

It will use the path without scheme in tensorflow. This MR will support the path without gvfs scheme.
https://github.com/tensorflow/io/blob/master/tensorflow_io/core/filesystems/hdfs/hadoop_filesystem.cc#L618
https://github.com/tensorflow/io/blob/master/tensorflow_io/core/filesystems/hdfs/hadoop_filesystem.cc#L116

Why are the changes needed?

Does this PR introduce any user-facing change?

  • no

How was this patch tested?

  • UTs pass

@coolderli coolderli changed the title improve(client-filesystem): support path without scheme for hadoop api improve(filesystem-hadoop): support path without scheme for hadoop api Apr 2, 2024
@coolderli coolderli closed this Apr 3, 2024
@coolderli coolderli reopened this Apr 3, 2024
@coolderli coolderli force-pushed the support-fileset-withoutscheme branch 2 times, most recently from 254debc to e66a1d2 Compare April 8, 2024 08:37
@coolderli coolderli changed the title improve(filesystem-hadoop): support path without scheme for hadoop api improve(filesystem-hadoop): support path without scheme for gvfs api Apr 8, 2024
@coolderli coolderli closed this Apr 8, 2024
@coolderli coolderli reopened this Apr 8, 2024
@coolderli
Copy link
Collaborator Author

@jerryshao @xloya Please help review this MR when you have time. Thanks.

@@ -51,6 +53,9 @@ public class GravitinoVirtualFileSystem extends FileSystem {
private Cache<NameIdentifier, Pair<Fileset, FileSystem>> filesetCache;
private ScheduledThreadPoolExecutor scheduler;

private static final Pattern IDENTIFIER_PATTERN =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a description for this pattern which string will be matched?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@xloya
Copy link
Collaborator

xloya commented Apr 9, 2024

Overall LGTM, left a minor suggestion.

@jerryshao
Copy link
Contributor

Is it the limitation of tensorflow? If so, how does tensorflow distinguish different filesystems?

String virtualPath = virtualUri.toString();
if (StringUtils.isBlank(virtualPath)) {
throw new InvalidPathException(
virtualPath, "Uri which need be extracted cannot be null or empty.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Preconditions for simplicity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jerryshao
Copy link
Contributor

jerryshao commented Apr 9, 2024

Seems you don't create a related issue, can you please create one and associate with this PR?

@coolderli coolderli force-pushed the support-fileset-withoutscheme branch from 55efc01 to 99a8c66 Compare April 11, 2024 01:33
@coolderli
Copy link
Collaborator Author

Seems you don't create a related issue, can you please create one and associate with this PR?

create issue #2860

@coolderli
Copy link
Collaborator Author

Is it the limitation of tensorflow? If so, how does tensorflow distinguish different filesystems?

Yes. I added a description of the issue.

Tensorflow will parseHadoopPath after creating the HadoopFileSystem(GravitinoVirtualFIleSystem). After that, it will use the path that has no scheme prefix to invoke API.

void ParseHadoopPath(const std::string& fname, std::string* scheme,
                     std::string* namenode, std::string* path) {
  size_t scheme_end = fname.find("://") + 2;
  // We don't want `://` in scheme.
  *scheme = fname.substr(0, scheme_end - 2);
  size_t nn_end = fname.find("/", scheme_end + 1);
  if (nn_end == std::string::npos) {
    *namenode = fname.substr(scheme_end + 1);
    *path = "";
    return;
  }
  *namenode = fname.substr(scheme_end + 1, nn_end - scheme_end - 1);
  // We keep `/` in path.
  *path = fname.substr(nn_end);  // here, truncate the scheme prefix
}
 ParseHadoopPath(path, &scheme, &namenode, &hdfs_path);

  auto handle = libhdfs->hdfsOpenFile(fs, hdfs_path.c_str(), O_WRONLY, 0, 0, 0);   // here. the hdfs_path has no gvfs:// prefix

@coolderli
Copy link
Collaborator Author

@jerryshao @xloya I have fixed. Please take a look. Thanks.

@coolderli coolderli closed this Apr 11, 2024
@coolderli coolderli reopened this Apr 11, 2024
@coolderli
Copy link
Collaborator Author

@jerryshao Github CI is passed. Please help review this when you have time. Thanks.

@@ -51,6 +52,13 @@ public class GravitinoVirtualFileSystem extends FileSystem {
private Cache<NameIdentifier, Pair<Fileset, FileSystem>> filesetCache;
private ScheduledThreadPoolExecutor scheduler;

// The pattern is used to match gvfs path. The scheme prefix (gvfs://) is optional.
Copy link
Contributor

@jerryshao jerryshao Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheme prefix should be "gvfs://fileset", not "gvfs://", right? If so, can you please clarify it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryshao You are right. I have fixed it. Please take a look. Thanks.

@jerryshao
Copy link
Contributor

Just one minor issue, @coolderli can you please clarify it?

Copy link
Contributor

@jerryshao jerryshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jerryshao jerryshao merged commit 802e539 into apache:main Apr 15, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants