Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak file opening #213

Merged
merged 7 commits into from
Oct 2, 2021
Merged

Conversation

rabernat
Copy link
Contributor

This PR is in preparation for making XarrayZarrRecipe use fsspec-references for opening NetCDF files.

I have moved some of the reference stuff into its own module reference.py and call this from reference_hdf_zarr.py.

I have also changed reference_hdf_zarr to use the storage.file_opener function. This revealed some inconsistencies with return types of that function. Specifically, file_opener was previously yielding and _io.BufferedReader object in some cases and an fsspec.OpenFile object in others (specifically, when opening from cache). Now it will always yield fsspec.OpenFiles.

if name.startswith(protocol):
return name
return protocol + "://" + name
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant - could you explain this else block? Under what circumstances would protocol not be a string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be a tuple of strings, like ("gcs", "gs")

@cisaacstern
Copy link
Member

Now it will always yield fsspec.OpenFiles.

Is there any possibility that this alone would help with #177? We only see that hanging when opening from cache...

Comment on lines +122 to +126
of = self.fs.open(full_path, **kwargs)
logger.debug(f"FSSpecTarget.open yielding {of}")
yield of
logger.debug("FSSpecTarget.open yielded")
of.close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subtle but important change. We used to do

with self.fs.open(full_path, **kwargs) as f:
    yield f  # -> _io.BufferedReader

Now we basically do

yield self.fs.open(full_path, **kwargs)  # -> fsspec.core.OpenFile

This passes all our tests, but this is a sensitive area of the code.

@martindurant, can you think of anything that could go wrong here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: self.fs.open returns a file-like object (io.IOBase subclass, AbstractBufferedFile for s3, gcs, etc.)

For an AbstractBufferedFile, entering the context is a no-op yielding the same object, and exiting calls .close(). For other file types, it's something else, with the expectation that exiting the context closes. For example, LocalFileSystem.open -> LocalFileOpener ; LocalFileSystem.enter -> _io.BufferedReader (and exit calls the actual file's exit, calling close()). You could do with on the file-like.

For fsspec.open, you get OpenFile objects, which can contain more than one file-like layer, for compression and text mode. You must use this with a context, to get the outermost file-like object back. Leaving the context calls close on the file-like objects in reverse order. If you, instead, call .open of the OpenFile, you get the same file-like object, but its .close method has been patched to execute the same chain of close calls, since this object might in this case outlive the OpenFile instance that made it.

@rabernat
Copy link
Contributor Author

Is there any possibility that this alone would help with #177?

Yes it's possible. I wish there were an easy way to answer this.

@cisaacstern
Copy link
Member

I wish there were an easy way to answer this.

I have an idea for how to investigate this through the reproducer work I'm doing on #177. Will check back here if/when I find anything of note.

Comment on lines 184 to 187
# A note about something frustrating about this context manager:
# Sometimes it yields an fsspec.OpenFile object
# Other times it yields something lower level, like an _io.BufferedReader
# This can lead to unpredictable behavior
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is outdated and needs to be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# A note about something frustrating about this context manager:
# Sometimes it yields an fsspec.OpenFile object
# Other times it yields something lower level, like an _io.BufferedReader
# This can lead to unpredictable behavior

Comment on lines 184 to 187
# A note about something frustrating about this context manager:
# Sometimes it yields an fsspec.OpenFile object
# Other times it yields something lower level, like an _io.BufferedReader
# This can lead to unpredictable behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# A note about something frustrating about this context manager:
# Sometimes it yields an fsspec.OpenFile object
# Other times it yields something lower level, like an _io.BufferedReader
# This can lead to unpredictable behavior

@@ -116,14 +115,15 @@ def size(self, path: str) -> int:
return self.fs.size(self._full_path(path))

@contextmanager
def open(self, path: str, **kwargs) -> Iterator[None]:
def open(self, path: str, **kwargs) -> Iterator[fsspec.core.OpenFile]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def open(self, path: str, **kwargs) -> Iterator[fsspec.core.OpenFile]:
def open(self, path: str, **kwargs) -> Iterator[io.IOBase]:

@martindurant, do I understand your comment here correctly to mean that the return type of self.fs.open is not always an AbstractBufferedFile, and therefore this annotation should be changed as proposed here, to be compatible with reading from certain implementations (maybe LocalFileSystem, e.g.)?

According to python/mypy#2984 (comment), I believe mypy should accept subclasses of BaseClass for an Iterator[BaseClass] annotation.

(If so, we'll need to import io in this module, of course.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return type of self.fs.open is not always an AbstractBufferedFile

Correct. It is a subclass of AbstractBufferedFile for s3, gcsfs, abfs, ftp, http and others, but can also be LocalFileOpener (io.IOBase), SMBFileOpener (object), SFTPFile (paramiko.BufferedFile), ...

Copy link
Contributor Author

@rabernat rabernat Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following assertion fails, so I do see your point

 with file_opener('/etc/hosts') as fp:
     assert isinstance(fp, fsspec.core.OpenFile)

So my question is this: is there some other type besides fsspec.core.OpenFile we can use which captures the fact that this is an ffspec-related object.

I added this test, basically asserting that whatever comes out of file_opener has an fs attribute:

assert hasattr(fp, "fs") # should be true for fsspec.OpenFile objects

This is required by some of the reference-maker code, e.g. here

protocol = getattr(getattr(fp, "fs", None), "protocol", None) # make mypy happy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better

of = file_opener(..)
assert isinstance(of, fsspec.core.OpenFile)
with of as fp:
   ...

Copy link
Contributor Author

@rabernat rabernat Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's not how we use file_opener in Pangeo Forge. We also always invoke it as a context manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no fool-proof method to tell whether the output of the context was created by fsspec. You could check for the few concrete classes that we expect, like AbstractBufferedFile, LocalFileOpener.

@cisaacstern
Copy link
Member

@rabernat, I wonder if we should rename _get_opener to _get_source_opener or similar? Reviewing this PR made me realize that _get_opener is perhaps too generic a name for a function which doesn't do all opening in the module. (Your tweaked FSSpecTarget.open handles target opening.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants