Tweak file opening #213
Conversation
```python
        if name.startswith(protocol):
            return name
        return protocol + "://" + name
    else:
```
@martindurant - could you explain this `else` block? Under what circumstances would `protocol` not be a string?
It can be a tuple of strings, like `("gcs", "gs")`.
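To illustrate the point: `str.startswith` accepts a tuple, so a tuple-valued protocol needs a separate branch only for the prepending case. This is a hypothetical sketch (the function name and fallback choice are illustrative, not the actual pangeo-forge code), assuming the canonical alias is the first element of the tuple:

```python
# Hypothetical sketch of handling a protocol that may be a single string
# or a tuple of aliases, as with fsspec's GCS filesystem ("gcs", "gs").
def prepend_protocol(name: str, protocol) -> str:
    if isinstance(protocol, str):
        if name.startswith(protocol):
            return name
        return protocol + "://" + name
    else:
        # str.startswith accepts a tuple, so any alias matches
        if name.startswith(tuple(p + "://" for p in protocol)):
            return name
        # fall back to the first (canonical) alias
        return protocol[0] + "://" + name

demo_cached = prepend_protocol("gs://bucket/file.nc", ("gcs", "gs"))
demo_bare = prepend_protocol("bucket/file.nc", ("gcs", "gs"))
demo_str = prepend_protocol("bucket/file.nc", "s3")
```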
Is there any possibility that this alone would help with #177? We only see that hanging when opening from cache...
```python
of = self.fs.open(full_path, **kwargs)
logger.debug(f"FSSpecTarget.open yielding {of}")
yield of
logger.debug("FSSpecTarget.open yielded")
of.close()
```
This is a subtle but important change. We used to do

```python
with self.fs.open(full_path, **kwargs) as f:
    yield f  # -> _io.BufferedReader
```

Now we basically do

```python
yield self.fs.open(full_path, **kwargs)  # -> fsspec.core.OpenFile
```

This passes all our tests, but this is a sensitive area of the code.
@martindurant, can you think of anything that could go wrong here?
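The two variants can be compared side by side. This is a simplified, runnable sketch of the pattern under discussion, using a fake filesystem (`io.BytesIO`) in place of fsspec so it runs standalone; `FakeFS`, `open_old`, and `open_new` are illustrative names, not the real code:

```python
import io
from contextlib import contextmanager

class FakeFS:
    """Stand-in for an fsspec filesystem; open() returns a file-like object."""
    def open(self, path, **kwargs):
        return io.BytesIO(b"contents of " + path.encode())

@contextmanager
def open_old(fs, full_path, **kwargs):
    # old behavior: enter the file's own context and yield the inner object
    with fs.open(full_path, **kwargs) as f:
        yield f

@contextmanager
def open_new(fs, full_path, **kwargs):
    # new behavior: yield whatever fs.open returns, then close it explicitly
    # (note: unlike the `with` form, this skips close() if the body raises)
    of = fs.open(full_path, **kwargs)
    yield of
    of.close()

fs = FakeFS()
with open_old(fs, "a.nc") as f:
    old_data = f.read()
with open_new(fs, "a.nc") as f:
    new_data = f.read()
```

With a simple file-like object both variants behave the same; the difference only shows up with wrapper types whose `__enter__` returns a different inner object.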
Correction: `self.fs.open` returns a file-like object (an `io.IOBase` subclass; `AbstractBufferedFile` for s3, gcs, etc.).

For an `AbstractBufferedFile`, entering the context is a no-op yielding the same object, and exiting calls `.close()`. For other file types, it's something else, with the expectation that exiting the context closes the file. For example, `LocalFileSystem.open` -> `LocalFileOpener`; `LocalFileOpener.__enter__` -> `_io.BufferedReader` (and `__exit__` calls the actual file's `__exit__`, calling `close()`). You could use `with` on the file-like object.

For `fsspec.open`, you get `OpenFile` objects, which can contain more than one file-like layer, for compression and text mode. You must use this with a context, to get the outermost file-like object back. Leaving the context calls `close` on the file-like objects in reverse order. If you instead call `.open()` on the `OpenFile`, you get the same file-like object, but its `.close` method has been patched to execute the same chain of close calls, since in this case the object might outlive the `OpenFile` instance that made it.
Yes, it's possible. I wish there were an easy way to answer this.
I have an idea for how to investigate this through the reproducer work I'm doing on #177. Will check back here if/when I find anything of note.
pangeo_forge_recipes/storage.py (Outdated)

```python
# A note about something frustrating about this context manager:
# Sometimes it yields an fsspec.OpenFile object
# Other times it yields something lower level, like an _io.BufferedReader
# This can lead to unpredictable behavior
```
This comment is outdated and needs to be removed.
Suggested change:

```diff
-# A note about something frustrating about this context manager:
-# Sometimes it yields an fsspec.OpenFile object
-# Other times it yields something lower level, like an _io.BufferedReader
-# This can lead to unpredictable behavior
```
pangeo_forge_recipes/storage.py
Outdated
@@ -116,14 +115,15 @@ def size(self, path: str) -> int: | |||
return self.fs.size(self._full_path(path)) | |||
|
|||
@contextmanager | |||
def open(self, path: str, **kwargs) -> Iterator[None]: | |||
def open(self, path: str, **kwargs) -> Iterator[fsspec.core.OpenFile]: |
Suggested change:

```diff
-    def open(self, path: str, **kwargs) -> Iterator[fsspec.core.OpenFile]:
+    def open(self, path: str, **kwargs) -> Iterator[io.IOBase]:
```
@martindurant, do I understand your comment here correctly to mean that the return type of `self.fs.open` is not always an `AbstractBufferedFile`, and therefore this annotation should be changed as proposed here, to be compatible with reading from certain implementations (maybe `LocalFileSystem`, e.g.)?

According to python/mypy#2984 (comment), I believe mypy should accept subclasses of `BaseClass` for an `Iterator[BaseClass]` annotation.

(If so, we'll need to import `io` in this module, of course.)
> the return type of `self.fs.open` is not always an `AbstractBufferedFile`

Correct. It is a subclass of `AbstractBufferedFile` for s3, gcsfs, abfs, ftp, http and others, but can also be `LocalFileOpener` (`io.IOBase`), `SMBFileOpener` (`object`), `SFTPFile` (`paramiko.BufferedFile`), ...
The following assertion fails, so I do see your point:

```python
with file_opener('/etc/hosts') as fp:
    assert isinstance(fp, fsspec.core.OpenFile)
```

So my question is this: is there some other type besides `fsspec.core.OpenFile` we can use which captures the fact that this is an `fsspec`-related object?

I added this test, basically asserting that whatever comes out of `file_opener` has an `fs` attribute (pangeo-forge-recipes/tests/test_storage.py, line 101 in c001e71):

```python
assert hasattr(fp, "fs")  # should be true for fsspec.OpenFile objects
```

This is required by some of the reference-maker code, e.g. here:

```python
protocol = getattr(getattr(fp, "fs", None), "protocol", None)  # make mypy happy
```
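The duck-typing check described here can be sketched without fsspec installed; `FakeOpenFile` below is a hypothetical stand-in for an fsspec file-like that carries the filesystem in `.fs` (real `OpenFile` objects do expose the filesystem there):

```python
import io
from types import SimpleNamespace

class FakeOpenFile(io.BytesIO):
    """Hypothetical stand-in for an fsspec file-like carrying .fs."""
    def __init__(self, data: bytes, fs):
        super().__init__(data)
        self.fs = fs  # fsspec OpenFile objects expose the filesystem here

fake_fs = SimpleNamespace(protocol="file")
fp = FakeOpenFile(b"data", fs=fake_fs)
has_fs = hasattr(fp, "fs")
# the nested getattr tolerates objects without .fs, keeping mypy happy
protocol = getattr(getattr(fp, "fs", None), "protocol", None)
```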
Maybe better:

```python
of = file_opener(..)
assert isinstance(of, fsspec.core.OpenFile)
with of as fp:
    ...
```
But that's not how we use `file_opener` in Pangeo Forge. We always invoke it as a context manager.
There is no foolproof method to tell whether the output of the context was created by fsspec. You could check for the few concrete classes that we expect, like `AbstractBufferedFile` or `LocalFileOpener`.
@rabernat, I wonder if we should rename …
This PR is in preparation for making `XarrayZarrRecipe` use fsspec-references for opening NetCDF files.

I have moved some of the reference stuff into its own module, `reference.py`, and call this from `reference_hdf_zarr.py`. I have also changed `reference_hdf_zarr` to use the `storage.file_opener` function. This revealed some inconsistencies with the return types of that function. Specifically, `file_opener` was previously yielding an `_io.BufferedReader` object in some cases and an `fsspec.OpenFile` object in others (specifically, when opening from cache). Now it will always yield `fsspec.OpenFile`s.