Skip to content

Initial commit of hadoop file system (credit vnvo2409) #1197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 18, 2020

Conversation

yongtang
Copy link
Member

This PR is an initial commit of Hadoop modular file system that was moved from tensorflow to here. Credit goes to author @vnvo2409.

This PR consists of mostly mechanic changes, though there are some small change to create a wrapper so that the loading of libhdfs.so will only be done lazily (not loaded when file system is initialized but loaded at the first usage). This avoids the case of the following error in case user does not even want to use hdfs:

Library (libhdfs.dylib) not found: dlopen(libhdfs.dylib, 6): image not found']

It is a little challenging to setup Hadoop for tests so this PR defers. Will create a follow up PR to set up the tests.

Signed-off-by: Yong Tang yong.tang.github@outlook.com

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

typedef struct HadoopFileSystem {
absl::Mutex mu;
std::unique_ptr<HadoopFileSystemImplementation> ptr ABSL_GUARDED_BY(mu);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test failed because you forgot to add #include<memory>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnvo2409 Thanks. The PR has been updated.

Comment on lines +86 to +87
static void* plugin_memory_allocate(size_t size) { return calloc(1, size); }
static void plugin_memory_free(void* ptr) { free(ptr); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions are unused. Likely because I copied from posix. Let just merge this PR and I will make a follow up PR ( mostly cleanup and better memory management ).

Comment on lines 44 to 47
#if defined(_MSC_VER)
TF_SetStatus(status, TF_UNIMPLEMENTED, "LoadSharedLibrary not implemented");
return nullptr;
#else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to use TF_LoadSharedLibrary as we did in experiemental plugin. And does it mean that we drop support for hdfs on Windows ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnvo2409 Inside tensorflow while the functions are defined, they somehow are not part of the pip package so we cannot include them directly. I tried to modify the tensorflow repo but had some trouble due to the big circular dependency. Give we only need those two functions and they are small, I think it makes sense to just include them here.

Also, I have added the Windows counterpart as well so now hdfs should be supported on Windows.

Comment on lines +204 to +207
#if defined(_WIN32)
if (home.back() != '\\') home.push_back('\\');
return home + "lib\\native\\" + lib;
#else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we drop support for hdfs on Windows, I think this isn't necessary anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnvo2409 See my other comments. The Windows part has been added.

@vnghia
Copy link
Contributor

vnghia commented Nov 17, 2020

Thank you for this PR @yongtang !

In addition, I am thinking of a loading dynamically mechanics for the plugins. Maybe we could change
https://github.com/tensorflow/tensorflow/blob/fa6c1bb0fab42d153fa375ad31713fbda99aba0c/tensorflow/core/platform/env.cc#L112-L120

Status Env::GetFileSystemForFile(const std::string& fname,
                                 FileSystem** result) {
  StringPiece scheme, host, path;
  io::ParseURI(fname, &scheme, &host, &path);
  FileSystem* file_system = file_system_registry_->Lookup(std::string(scheme));
  if (!file_system) {
    if (scheme.empty()) {
      scheme = "[local]";
    }

    return errors::Unimplemented("File system scheme '", scheme,
                                 "' not implemented (file: '", fname, "')");
  }
  *result = file_system;
  return Status::OK();
}

Lookup to LookupOrLoad so we don't have to load all the plugins into TF in the first place.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Copy link
Member Author

Thanks @vnvo2409 for the review. The PR has been updated now to address the review feedback.

@vnghia
Copy link
Contributor

vnghia commented Nov 18, 2020

LGTM. Thank you 😄

@yongtang yongtang merged commit 593032e into tensorflow:master Nov 18, 2020
@yongtang yongtang deleted the hdfs-plugin branch November 18, 2020 00:27
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Initial commit of hadoop file system

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Update namespace to avoid collision/conflict

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add a wrapper to HadoopFileSystem so that it will only be lazily loaded.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Address review comments

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants