gfile: add support for fsspec filesystems #5248
Conversation
@_translate_errors
def join(self, path, *paths):
    """Join paths with a slash."""
    return os.path.join(path, *paths)
Seems like fsspec doesn't have a join function, so this might be okay. But using the OS's path joiner seems a little risky to me... I remember Windows has some weirdness with backslashes.
Maybe you can build a simple one using the pathsep of the filesystem class in fsspec? That won't cover relative -> absolute path resolution, but it should theoretically be more compatible with remote filesystems on a variety of OSs.
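The Windows hazard can be demonstrated with the stdlib path modules (ntpath is what os.path resolves to on Windows), independent of this PR's code:

```python
import ntpath     # what os.path is on Windows
import posixpath  # what os.path is on POSIX

# On Windows, os.path.join would inject a backslash into a remote URL:
print(ntpath.join("s3://bucket/dir", "key"))     # s3://bucket/dir\key
print(posixpath.join("s3://bucket/dir", "key"))  # s3://bucket/dir/key
```

This is why joining with the target filesystem's own separator is safer than os.path.join for URL-style paths.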
changed to use url_to_fs and fs.sep
with fs.open(path, mode, encoding=encoding) as f:
    if offset is not None:
        f.seek(offset)
Is it ever possible for this to throw a 'not implemented' error? You might want to check seekable() before you attempt to seek, in case of user error when they request an offset but the fs doesn't support it.
I do check for this a few lines up, though using seekable is probably the better approach here.
done
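The guarded seek being discussed can be sketched as follows; read_at is a hypothetical helper (not the PR's actual code), shown with a stdlib stream:

```python
import io

def read_at(f, offset=None):
    """Read from f, seeking first only when the stream supports it."""
    if offset is not None:
        if not f.seekable():
            raise ValueError("offset requested but stream is not seekable")
        f.seek(offset)
    return f.read()

# BytesIO is seekable, so the offset is honored.
print(read_at(io.BytesIO(b"hello world"), offset=6))  # b'world'
```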
if isinstance(filename, bytes):
    filename = filename.decode("utf-8")
fs, _, _ = fsspec.get_fs_token_paths(
    filename,
)
Quick question: are you repeating the logic from _fs_path here to avoid running the validation at the end of that function?
yeah, get_fs_token_paths expands the filename globs and returns the files. I should probably change this to just grab the filesystem and not do any of the filepath resolution
changed to url_to_fs
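The difference can be sketched like this, assuming fsspec is installed (memory:// is just for illustration): url_to_fs resolves only the filesystem and the protocol-stripped path, with none of get_fs_token_paths's glob expansion.

```python
import fsspec

# url_to_fs returns (filesystem, path-with-protocol-stripped) and performs
# no glob expansion, unlike get_fs_token_paths.
fs, path = fsspec.core.url_to_fs("memory://dir/file.txt")
print(fs.protocol)  # memory
print(path)         # .../dir/file.txt (leading-slash form varies by fsspec version)
```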
SEPARATOR = "://"

prefixed_files = []
if SEPARATOR in filename:
    for file in files:
        if SEPARATOR not in file:
            file = fs.protocol + SEPARATOR + file
        prefixed_files.append(file)
return prefixed_files
I'm not certain this will work in all cases... there are chained systems (ex: dask::s3://bucket/key) which fsspec supports, local systems where paths don't have a protocol (ex: C:\Users\reubend\file.txt), and a few other edge cases.
Have you tested this with those? My initial guess is that putting the protocol and then the separator might break those cases.
updated the diff to handle chained file systems and added test cases for them
updated after feedback + now supports chained file systems such as simplecache::zip://foo.txt::s3://bucket/some/path.
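A chained URL like the one above is just protocol segments joined by ::; a stdlib sketch of pulling one apart (the constant name mirrors the one used in the PR):

```python
CHAIN_SEPARATOR = "::"

url = "simplecache::zip://foo.txt::s3://bucket/some/path"
# The first segment is the outermost layer (a cache here) and the last
# segment is the underlying store.
segments = url.split(CHAIN_SEPARATOR)
print(segments)  # ['simplecache', 'zip://foo.txt', 's3://bucket/some/path']
```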
added file system chaining support to join() -- chained file systems is pretty tricky to get working right
made filesystem existence checking much cheaper
One thing I want to point out that doesn't matter for this PR but will matter in the future: because TB currently implements walk generically on top of the other filesystem primitives, we can't take advantage of more efficient walk implementations that individual filesystems provide.

As an example, the generic implementation we use now lists a dir's contents and then for each one it checks if that's a dir or not. But an efficient implementation (like the ones that fsspec has) might get info on which ones are directories from the ls call directly, since some filesystems support that. In the case where there are N items inside a folder, that would mean the difference between 1 API call and N+1 API calls (one for the list call, and then N isdir calls).

So in the future, we should refactor this to allow custom walk implementations, and add a walk function for fsspec. Then if/when we replace gfile with fsspec entirely, we can just use that walk function instead.
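The 1-vs-N+1 point can be sketched as follows; walk_with_ls and its ls callback are hypothetical stand-ins, not TensorBoard or fsspec APIs:

```python
def walk_with_ls(ls, top):
    """Walk a tree with one ls() call per directory.

    `ls` is assumed to return [(name, is_dir), ...], the way many object
    stores can report entry types directly in a listing. A generic walk
    that must call isdir() per entry would instead make N extra calls for
    a directory with N entries.
    """
    entries = ls(top)  # a single API call covers all N entries
    dirs = [name for name, is_dir in entries if is_dir]
    files = [name for name, is_dir in entries if not is_dir]
    yield top, dirs, files
    for d in dirs:
        yield from walk_with_ls(ls, d)

tree = {
    "/logs": [("/logs/run1", True), ("/logs/events.txt", False)],
    "/logs/run1": [("/logs/run1/events.txt", False)],
}
print(list(walk_with_ls(tree.get, "/logs")))
```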
Thank you @d4l3k for contributing this and @Reubend for the review so far! Apologies for the wait. I've tested it out a bit with local files and GCS and it looks like it's working great; just a few questions plus some minor stuff to fix the presubmit checks etc.
So in the future, we should refactor this to allow custom walk implementations, and add a walk function for fsspec. Then if/when we replace gfile with fsspec entirely, we can just use that walk function instead.
This is a great point (although FWIW, most regular users of TensorBoard who aren't doing anything special should now be going through our data server for the fast load path, which is in Rust and has its own filesystem abstraction separate from these changes). I've made note of it as part of a larger issue I filed: #5286.
CHAIN_SEPARATOR = "::"

def __init__(self):
    pass
nit: this is equivalent to omitting __init__ entirely?
done
    % (subdir, expected_listing, gotten_listing),
)

def testNonExistantFilesystem(self):
typo: Existent with an e - also below on 392, 394, 396
protocol = segment.split(FSSpecFileSystem.SEPARATOR)[0]
if fsspec.get_filesystem_class(protocol):
    return _FSSPEC_FILESYSTEM
except (ImportError, ValueError):
Can you give some indication to what would likely trigger the ImportError and ValueError? This seems very broad so it would be nice to narrow this, since a blanket swallow of ImportError/ValueError could mask issues with fsspec that we aren't anticipating and makes debugging difficult.
Also, if the ImportError one would be triggered in the case where fsspec requires a dependency in order to load a given filesystem, it seems better if we actually surface that to the user. Right now it looks like we just swallow any message from fsspec?
yeah, I was looking at https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py#L208-L213 and trying to match the value semantics of the existing code but I think bubbling up the actual ValueError/ImportError is best here
updated
@@ -401,6 +411,222 @@ def stat(self, filename):
        raise


class FSSpecFileSystem(object):
    """Provides filesystem access via fsspec."""
Could you add a reference in this docstring to #5286 that mentions the caveat that the existing filesystem abstraction limits our ability to take full advantage of fsspec? E.g. since it doesn't allow actually maintaining file objects, repeated reads or writes to a single file require re-opening the file each time, and also the point @Reubend made about walk().
done
@@ -21,6 +21,8 @@ pandas~=1.0
# For gfile S3 test
boto3==1.9.86
moto==1.3.7
# For gfile fsspec test
fsspec=2021.7.0
Needs an extra =
done
tensorboard/summary/writer/BUILD
    ],
)

Could you remove the extra newline to satisfy buildifier?
done
f.write(data)
with gfile.GFile(ckpt_path, "r") as f:
    f.buff_chunk_size = 4  # Test buffering by reducing chunk size
    with self.assertRaises(errors.InvalidArgumentError):
Why is it expected that this would raise InvalidArgumentError?
the fsspec implementation here doesn't support seekable streams for text, though digging deeper into it I'm actually not quite sure why. It uses io.TextIOWrapper, which does support them; need to take a closer look at this
fsspec/filesystem_spec#743 seems to be a bug with fsspec local filesystem in text mode.
Ok, thanks for filing the upstream bug and fix! Mind adding a comment here with a reference to the fix PR so we know to correct this test once there's a newer release?
Also, it would be good to have a test for the success case (i.e. if we don't force buffering by setting the small chunk size). We could either omit the buffered case entirely (with the comment), or if we keep it, mind making it assertRaisesRegex(errors.InvalidArgumentError, "not seekable") just to ensure we're targeting narrowly?
I just removed that check for now. We can add it once that's fixed. Don't want to add a check for the bad behavior only for it to fail on dep update.
updated after feedback
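The narrowing being asked for looks like this in stdlib unittest; ValueError stands in for errors.InvalidArgumentError in this sketch:

```python
import unittest

class SeekableTest(unittest.TestCase):
    def test_offset_on_unseekable_stream(self):
        # assertRaisesRegex checks the exception type AND the message, so an
        # unrelated error of the same type still fails the test.
        with self.assertRaisesRegex(ValueError, "not seekable"):
            raise ValueError("stream is not seekable")

result = unittest.TestResult()
SeekableTest("test_offset_on_unseekable_stream").run(result)
print(result.wasSuccessful())  # True
```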
@nfelt can you approve running workflows? want to make sure CI is happy but it says "First-time contributors need a maintainer to approve running workflows."
Force-pushed from 9d8b3b9 to ee1db05.
buildifier + flake8 should be happy now
def testExistance(self):
    self.assertIsInstance(
        gfile.get_filesystem("simplecache::nonexistent::s3://blah/blah"),
Looks like now that the ImportError isn't swallowed, these tests are failing because having s3:// in the path is prompting fsspec to require s3fs. Maybe we can just use file:// instead?
done
compatify = compat.as_bytes if "b" in mode else compat.as_text
f.write(compatify(file_content))

def _apply_prefix(self, filename, files):
It should be possible to simplify this to something like the following, which could perhaps be inlined:

prefix, chain_sep, base_url = filename.rpartition(self.CHAIN_SEPARATOR)
prefix += chain_sep
protocol, path = fsspec.core.split_protocol(base_url)
if protocol:
    prefix += protocol + self.SEPARATOR
return [prefix + file for file in files]
done
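A runnable approximation of the suggested logic, with str.partition standing in for fsspec.core.split_protocol and module-level constants in place of the class attributes:

```python
CHAIN_SEPARATOR = "::"
SEPARATOR = "://"

def apply_prefix(filename, files):
    """Re-attach filename's chain/protocol prefix to bare result paths."""
    prefix, chain_sep, base_url = filename.rpartition(CHAIN_SEPARATOR)
    prefix += chain_sep
    protocol, proto_sep, _ = base_url.partition(SEPARATOR)
    if proto_sep:  # base_url actually carried a protocol
        prefix += protocol + SEPARATOR
    return [prefix + file for file in files]

print(apply_prefix("simplecache::s3://bucket/dir/*", ["bucket/dir/a.txt"]))
# ['simplecache::s3://bucket/dir/a.txt']
```

For a plain local path with no protocol, the prefix is empty and the files come back unchanged.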
return fs.exists(path)

@_translate_errors
def join(self, path, *paths):
It should be possible to simplify this to something like:

_, _, base_url = path.rpartition(self.CHAIN_SEPARATOR)
protocol, path = fsspec.core.split_protocol(base_url)
fs = fsspec.get_filesystem_class(protocol)
return fs.sep.join((path,) + paths)

Also, it'd be better if the join step would work like os.path.join() and collapse adjacent separators (and handle absolute paths), though I see the existing S3 implementation doesn't do that. I think a helper like this would suffice:

def _join(sep, parts):
    result = []
    for part in parts:
        if part.startswith(sep):
            result = []
        if result and result[-1] and not result[-1].endswith(sep):
            result.append(sep)
        result.append(part)
    return "".join(result)
done
I realize it's an edge case, but mind adding back the and result[-1] part of the condition? That was deliberate; without it we get different behavior from os.path.join():

>>> _join("/", ("foo", "", "bar"))
'foo/bar'
>>> _join_without_extra_check("/", ("foo", "", "bar"))
'foo//bar'
>>> os.path.join("foo", "", "bar")
'foo/bar'
added and updated unit test to check for this case
updated to simplify path logic + other feedback
Thank you for making the changes! Everything LGTM except a few last things, and aside from the one mentioned issue I've confirmed that we can land this internally as well.
I'll be away on vacation starting today through the end of next week, but I'll leave some instructions with my teammates re: trying to get this merged assuming it's ready before I'm back.
return [
    file
    if (self.SEPARATOR in file or self.CHAIN_SEPARATOR in file)
    else self.join(filename, file)
This is not quite right - reusing join() for this only works when the underlying fsspec filesystem's glob() can be relied on to return absolute paths that start with fs.sep, so that they truncate filename to just the chaining/protocol prefix and drop the glob characters that shouldn't be preserved.

Our tests don't catch this, since they use absolute local paths throughout. But the GCS filesystem doesn't:

>>> fsspec.open("gcs://").fs.glob("gs://tensorboard-bench-logs/*")
['tensorboard-bench-logs/README', 'tensorboard-bench-logs/edge_cgan', 'tensorboard-bench-logs/mnist', 'tensorboard-bench-logs/nndiv_100m', 'tensorboard-bench-logs/nndiv_100m_composite']

TensorBoard's read path does use globbing, so this breaks it for GCS, since it ends up producing incorrect globs like gs://tensorboard-bench-logs/mnist/*/tensorboard-bench-logs/mnist/*/*.

As for fixing this, glob() doesn't really specify what form the returned paths should take. But the local filesystem glob implementation seems to always normalize glob patterns to their absolute version before globbing, which means it always emits absolute paths. So it seems reasonable to assume the returned paths are meant to be "logically absolute" in that you can prepend the filesystem protocol to the path and use that to open the file. If that's the case, that's probably what we should do (i.e. explicitly add the chain prefix + protocol) rather than attempting to reuse join().
ahh yeah you're right about that, the underspecified glob output is causing problems. I was wondering about the absolute vs relative case.
I think this is now correct and I added a test case with the in memory file system which returns relative paths for glob similar to gcs
filename = filename.decode("utf-8")
self._validate_path(filename)

fs, _, paths = fsspec.get_fs_token_paths(
It turns out that internally we're on an older version of fsspec (0.7.4) that doesn't have fsspec/filesystem_spec@8f96ee5, which fixes get_fs_token_paths() to handle unchaining itself.
Could we instead use fsspec.core.url_to_fs() here? That is present in 0.7.4 onwards and does handle unchaining, and it actually returns both fs and path, so it even simplifies this code to just return fsspec.core.url_to_fs(filename). With that change I can get the tests to pass internally.
done
addressed feedback + expanded unit tests to cover those cases
I did file an issue on the upstream on the pain points when handling these paths so hopefully we can get a first party maintained solution for this. fsspec/filesystem_spec#747
Hello! Nick is now on long vacation so I will be taking over the review and sync. After CI passes, I will take some time to test sync again and merge.
I don't mean to redo the review from the top, but I have spotted minor things like typos which you may want to address. Also, the failure in CI is real:
line 531, in testGlobNonAbsolute
self.assertCountEqual(
AssertionError: Element counts were not equal:
First has 1, Second has 0: 'memory:///dir/bar.txt'
First has 1, Second has 0: 'memory:///dir/foo.txt'
First has 0, Second has 1: 'memory://dir/foo.txt'
First has 0, Second has 1: 'memory://dir/bar.txt'
with self.assertRaises(ValueError):
    gfile.get_filesystem("nonexistent::blah://filesystem")

def testExistance(self):
typo: Existance -> Existence
wow I just cannot spell existence correctly
        gfile.FSSpecFileSystem,
    ),

def testJoin(self):
For more completeness, would you mind adding test cases for these?

# Trailing slash on the last element
>>> os.path.join("hello", "world/")
'hello/world/'
# empty path at the end
>>> os.path.join("hello", "world", "")
'hello/world/'
# absolute path in the middle
>>> os.path.join("hello", "", "world", "", "/wow")
'/wow'
done
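The requested cases can be sketched as a self-contained unittest, with posixpath.join standing in for the gfile join under test (the real test lives in the PR's test file):

```python
import posixpath
import unittest

class JoinEdgeCaseTest(unittest.TestCase):
    def test_join_edge_cases(self):
        # Trailing slash on the last element is preserved.
        self.assertEqual(posixpath.join("hello", "world/"), "hello/world/")
        # An empty path at the end yields a trailing separator.
        self.assertEqual(posixpath.join("hello", "world", ""), "hello/world/")
        # An absolute path discards everything before it.
        self.assertEqual(
            posixpath.join("hello", "", "world", "", "/wow"), "/wow"
        )

result = unittest.TestResult()
JoinEdgeCaseTest("test_join_edge_cases").run(result)
print(result.wasSuccessful())  # True
```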
@@ -0,0 +1,541 @@
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
nit: 2018 -> 2021.
done
if not FSSPEC_ENABLED:
    return None

segment = filename.split(FSSpecFileSystem.CHAIN_SEPARATOR)[0]
Perhaps pass a second argument to split, e.g. segment = filename.split(FSSpecFileSystem.CHAIN_SEPARATOR, 1)[0]. Here and below.
switched to use partition
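The difference is small, but partition stops at the first separator instead of building every segment:

```python
CHAIN_SEPARATOR = "::"
url = "simplecache::zip://a.txt::s3://bucket/key"

# split() without a maxsplit builds all three segments even though only
# the first is needed; partition() stops at the first separator.
print(url.split(CHAIN_SEPARATOR)[0])      # simplecache
print(url.split(CHAIN_SEPARATOR, 1)[0])   # simplecache
print(url.partition(CHAIN_SEPARATOR)[0])  # simplecache
```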
@@ -21,6 +21,8 @@ pandas~=1.0
# For gfile S3 test
boto3==1.9.86
moto==1.3.7
# For gfile fsspec test
fsspec==2021.8.1
Would you mind changing this to 0.7.4 for now? As Nick indicated, we are internally using an older version of fsspec and don't want to inadvertently use newer APIs.
done
The test failure with memory paths seems to be an fsspec difference: memory uses absolute paths in 2021.8.1 but not 0.7.4. CI should be fixed with the version change.
Nick is OOO and conditionally OK-ed merging on some changes that are addressed.
Thank you for the contribution!
This adds support for fsspec filesystem backends to the gfile compat library. This allows TensorBoard to be used with a wider range of file system backends.

See: #5165

This adds a new GFile filesystem backend that's used if no other registered filesystems are present. In that case, fsspec is checked to see if it has an available one, and then a wrapped version is returned.

fsspec provides similar semantics to normal usage; however, it doesn't support seeking for files opened in text mode, which can be problematic for longer text files. In practice this seems to work regardless, since most of TensorBoard's larger files are in a binary format.

I've copied the existing local filesystem unit tests and changed the paths to be prefixed with file:// so they go through the fsspec backend.