Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os.path equivalent for fsspec #747

Open
d4l3k opened this issue Sep 9, 2021 · 17 comments
Open

os.path equivalent for fsspec #747

d4l3k opened this issue Sep 9, 2021 · 17 comments

Comments

@d4l3k
Copy link
Contributor

d4l3k commented Sep 9, 2021

One of the things that have come up when trying to integrate fsspec into tensorboard (tensorflow/tensorboard#5248) is that there aren't any standard path operations as part of fsspec as far as I can tell. When dealing with more complex chained filesystems the rules get pretty complex for users to implement.

Ideally fsspec would provide the common operations such as:

  • join
  • basename
  • dirname

I.e.

>>> path = "simplecache::zip://foo::s3://bucket/path.zip"
>>> fsspec.path.join(path, "bar") 
"simplecache::zip://foo/bar::s3://bucket/path.zip"
>>> fsspec.path.dirname(path)
"foo"
>>> fsspec.path.basename(path)
"simplecache::zip://::s3://bucket/path.zip"

With the chained filesystems this gets really complex and I'm not 100% sure how to implement this in all cases/filesystems so feedback would be appreciated here

An option here might be to try and implement https://docs.python.org/3/library/pathlib.html#pathlib.PurePath

@martindurant
Copy link
Member

On the pathlib option, please see the https://github.com/Quansight/universal_pathlib project, a layer on top of fsspec.

Otherwise, the situation is indeed complex and I'm not certain we can easily define the expected behaviour of os.path.* for fsspec-compatible paths. Indeed, you can apply those builtin functions right now, and get something reasonable back (which are not what you were after, though!).

>>> os.path.dirname("simplecache::zip://foo/bar::s3://bucket/path.zip")
'simplecache::zip://foo/bar::s3://bucket'
>>> os.path.basename("simplecache::zip://foo/bar::s3://bucket/path.zip")
'path.zip'

(I would probably argue that fsspec.path.basename(path) -> "foo")

@d4l3k
Copy link
Contributor Author

d4l3k commented Sep 14, 2021

Just took a look at universal_pathlib. It seems to throw exceptions for most chained file systems. For most filesystems posixpath works but since it defines the sep as a top level field seems like there should be an implementation that respects that. With chained paths, it's certainly more complex.

Potentially each FS could define the path operations and the top level fsspec join handles filesystem specific implementation? That would allow defining a "standard" posixpath based solution as well as allowing for overrides for specific filesystems with non posix style paths. Though there may be file systems without any concept of directories which could be painful to support here

@martindurant
Copy link
Member

So what we would need is not fsspec.path., but fs.path. (i.e., accessed though the filesystem instance), because different file systems will follow different patterns. It sounds doable. The universal_pathlib can call those things to complete the circle.

@martindurant
Copy link
Member

cc @andrewfulton9 @brl0

@andrewfulton9
Copy link

I'd definitely be interested in integrating something like that into universal_pathlib. Right now, I am defaulting on using pathlib._PosixFlavour to handle a lot of path manipulations. Adding one or more flavours is currently an open issue that I haven't gotten around to addressing yet. Having something on the filesystem to help with some of those operations would make it a lot easier to implement though.

@d4l3k
Copy link
Contributor Author

d4l3k commented Dec 7, 2021

One thing that would help a lot would be to expose _unstrip_protocol as part of core.

def _unstrip_protocol(name, fs):
if isinstance(fs.protocol, str):
if name.startswith(fs.protocol):
return name
return fs.protocol + "://" + name
else:
if name.startswith(tuple(fs.protocol)):
return name
return fs.protocol[0] + "://" + name

A lot of operations need you to call fsspec.core.url_to_fs and having the inverse would be a big help since not all fs/operations return full paths w/ protocols.

def url_to_fs(url, **kwargs):

Would be great to be able to do:

full_path = "memory://bar"
fs, path = fsspec.core.url_to_fs(full_path)
for file in fs.ls(path):
    print(fsspec.core.fs_to_url(fs, file))

That would help cut down a lot of duplicate code/boilerplate like

https://github.com/tensorflow/tensorboard/blob/4568e2bc948311b78ee7f2e775e5bcb589ecacf8/tensorboard/compat/tensorflow_stub/io/gfile.py#L487-L491

@martindurant
Copy link
Member

In the current latest, _unstrip_protocol is available on the class as a method (because different implementations might have various opinions, particularly http).

Generally, ls/find/glob operations return URLs as seen by the implementation in question, so not including the protocol. There's probably no changing that. The new generics module in #828 will convert these into complete URLs in every case.

@efiop
Copy link
Member

efiop commented Dec 8, 2021

My 2c: we've also introduced fs.path for convenience methods like name/parts/parent/parents/join/etc in dvc https://github.com/iterative/dvc/blob/master/dvc/fs/path.py Compared to using path-like objects as we did before, it has better performance when dealing with large number of objects. It would also be convenient for anyone writing a new fs implementation.

@martindurant
Copy link
Member

Ah yes, you did mention this sometime before, @efiop . You might consider upstreaming that module, maybe.

@d4l3k
Copy link
Contributor Author

d4l3k commented Dec 8, 2021

@martindurant I see that it's provided via "full_name" on the file class but often I want to be able to ls files without the overhead of opening them. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1394-L1395

Providing it on the Filesystem class would work for me though

I'm not seeing anything special on http. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/http.py

Good to know about the generics change! Would be nice if the _unstrip_protocol method was public (no _) in those changes

@martindurant
Copy link
Member

(ah sorry, the method is only in the same PR I mentioned above, #828 - but it will come!)

@martindurant
Copy link
Member

Would be nice if the _unstrip_protocol method was public (no _)

That's reasonable

@efiop
Copy link
Member

efiop commented Oct 30, 2022

After using our fs.path for almost a year and making attempts (private) to create a patch (and finding a lot of fs.path already used and also some existing path-manipulation methods in some filesystems), I think the most natural way is to avoid fs.path layer and just have fs.join/basename/etc directly in fs. I found myself mistyping fs.join instead of fs.path.join so often that I think it proves the point 😄

Working on submitting a PR.

@logan-markewich
Copy link

Is this still being worked on? Would love to this feature implemented

@efiop
Copy link
Member

efiop commented May 29, 2023

Other stuff got in the way, so I didn't manage to contribute it yet 🙁

@agrinh
Copy link

agrinh commented Nov 18, 2023

@efiop Also very interested in getting this in, any chance some time might open up / is there anything anyone else can do to help out here?

@efiop
Copy link
Member

efiop commented Dec 15, 2023

@agrinh Finally getting around to moving fs.path to plain fs methods in dvc, so hoping to get around to contributing it to fsspec around new years 🙂 (said that before, but still).

efiop added a commit to efiop/dvc-objects that referenced this issue Dec 15, 2023
After such a long time using it, it is clear that it wasn't the right
decision and path manipulation methods should be right in the class.

This also makes it possible to use most of methods as classmethods without
having to initialize the filesystem at all.

Per fsspec/filesystem_spec#747 (comment) and also a pre-requisite for it.
efiop added a commit to efiop/dvc-objects that referenced this issue Dec 15, 2023
After such a long time using it, it is clear that it wasn't the right
decision and path manipulation methods should be right in the class.

This also makes it possible to use most of methods as classmethods without
having to initialize the filesystem at all.

Per fsspec/filesystem_spec#747 (comment) and also a pre-requisite for it.
efiop added a commit to iterative/dvc-objects that referenced this issue Dec 15, 2023
After such a long time using it, it is clear that it wasn't the right
decision and path manipulation methods should be right in the class.

This also makes it possible to use most of methods as classmethods without
having to initialize the filesystem at all.

Per fsspec/filesystem_spec#747 (comment) and also a pre-requisite for it.
efiop added a commit to efiop/filesystem_spec that referenced this issue Jan 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants