First check
I used the GitHub search to find a similar issue and didn't find it.
I searched the Prefect documentation for this issue.
I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
Currently it does not seem possible to use RemoteFileSystem with WebHDFS as the underlying implementation.
There are 2 problems afaict.
Assume you define your filesystem block as follows: myfs = RemoteFileSystem(basepath="webhdfs://home/user/project", settings={"host": "example.com"}).
Calling write_path fails due to an improperly formatted url.
myfs.write_path("filename", b"content") calls myfs.filesystem.makedirs("webhdfs://home/user/project"), but the underlying implementation doesn't do any preprocessing and essentially appends path to the base url, producing something like https://example.com/webhdfs/v1webhdfs%3A//home/user/project?op=MKDIRS.
Calling the above url fails.
Had this url been generated properly, it would look like this: https://example.com/webhdfs/v1/home/user/project?op=MKDIRS.
That is, the path param that gets passed to WebHDFS._call should begin with a slash and contain no scheme.
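To make the mangling concrete, here's a standalone sketch (not actual fsspec code; the base url and default quoting are assumptions that happen to reproduce the observed urls):

```python
from urllib.parse import quote

base = "https://example.com/webhdfs/v1"

# What RemoteFileSystem effectively hands to makedirs today: a path that still carries the scheme.
bad_path = "webhdfs://home/user/project"
print(base + quote(bad_path) + "?op=MKDIRS")
# https://example.com/webhdfs/v1webhdfs%3A//home/user/project?op=MKDIRS

# What WebHDFS._call needs: a rooted path with no scheme.
good_path = "/home/user/project"
print(base + quote(good_path) + "?op=MKDIRS")
# https://example.com/webhdfs/v1/home/user/project?op=MKDIRS
```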
This doesn't seem to be a problem for the other RemoteFileSystem methods, since all of them call fs.filesystem.open (directly or indirectly), which (in the case of WebHDFS) calls fsspec.utils.infer_storage_options, stripping the scheme. However, infer_storage_options causes another problem.
The first segment of fs.basepath gets stripped, leading to incorrect remote paths being accessed.
fs.filesystem.open calls fs.filesystem._strip_protocol (link).
Filesystem implementations commonly override _strip_protocol.
The WebHDFS implementation of _strip_protocol calls fsspec.utils.infer_storage_options.
As far as I can tell, infer_storage_options expects its input either to have no scheme (in which case the whole path is returned) or to have a netloc following the scheme (in which case the netloc is stripped away along with the scheme).
As a result, /user/project gets accessed instead of /home/user/project.
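A quick demonstration (a sketch based on my reading of fsspec; the s3 line previews the special-casing mentioned below):

```python
from fsspec.utils import infer_storage_options

# No scheme: the whole input survives as the path.
print(infer_storage_options("/home/user/project")["path"])
# /home/user/project

# With a scheme: the first segment is parsed as a netloc and stripped from the path.
opts = infer_storage_options("webhdfs://home/user/project")
print(opts["host"], opts["path"])
# home /user/project

# s3/gcs are special-cased: the netloc (bucket) is glued back onto the path.
print(infer_storage_options("s3://home/user/project")["path"])
# home/user/project
```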
One can work around this by prepending an extra segment to basepath (e.g. basepath="webhdfs://fakehost/home/user/project"), but that requires knowing how a particular implementation behaves (and is ugly to boot).
Of note here: infer_storage_options treats s3/gcs schemes as special cases (it doesn't strip the first segment), so the above workaround can't be applied blindly.
I'd like to mention that current docs have usage examples only for cloud storage providers, which are seemingly immune to this issue.
As an aside, it's not clear why an implementation that takes hostname/port as parameters expects path to contain a netloc at all.
The previous section got me wondering: is WebHDFS unique, or are there other implementations with the same problem?
So I picked some implementations and wrote a script to compare what _strip_protocol outputs for the same input path.
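The script itself isn't reproduced here; a rough reconstruction of the idea (each protocol's optional dependency must be installed for the class lookup to succeed):

```python
from fsspec.registry import get_filesystem_class

# Feed the same shape of path to each implementation's _strip_protocol and compare.
for protocol in ["s3", "gcs", "webhdfs", "sftp", "ftp", "smb"]:
    cls = get_filesystem_class(protocol)  # a class, not an instance
    print(protocol, "->", cls._strip_protocol(f"{protocol}://home/user/project"))
```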
Going by this, a few other filesystems might have something similar going on.
SFTP seems reasonably easy to test.
docker run --name sftp -p 2222:22 -d atmoz/sftp foo:pass:::upload
```python
from prefect.filesystems import RemoteFileSystem

sftp_settings = {"host": "localhost", "port": 2222, "username": "foo", "password": "pass"}
unprefixed_sftp = RemoteFileSystem(basepath="sftp://upload/", settings=sftp_settings)
prefixed_sftp = RemoteFileSystem(basepath="sftp://fakehost/upload/", settings=sftp_settings)

# We can't use write_path since SFTPFileSystem.makedirs errors out for the same reason WebHDFS.makedirs does.
# So let's write a file manually and read it back.
# It doesn't matter whether we use prefixed_sftp or unprefixed_sftp for this.
with prefixed_sftp.filesystem.open("upload/filename", "wb") as file:
    file.write(b'Hi!')

for fs in [unprefixed_sftp, prefixed_sftp]:
    try:
        print(fs.basepath, end="\n\t")
        print(fs.read_path("filename"))
    except Exception as e:
        print(e)
# sftp://upload/
#     [Errno 2] No such file
# sftp://fakehost/upload/
#     b'Hi!'

print(unprefixed_sftp.filesystem.ls('.'))
# ['./upload']
print(unprefixed_sftp.filesystem.ls('./upload'))
# ['./upload/filename']
```
docker exec -it sftp ls /home/foo/upload
# filename
WebHDFS doesn't seem to be the only problematic implementation.
Looking at the docstring of fsspec.implementations.smb.SMBFileSystem (link), I noticed it talks about using the class via fsspec.core.open(URI), in which case URI must contain a netloc. fsspec.core.open calls fsspec.core.open_files, which calls fsspec.core.get_fs_token_paths.
get_fs_token_paths does roughly the following (link):
1. Takes a full url as input (e.g. sftp://foo:pass@localhost:2222/upload/filename).
2. Extracts the scheme and uses it to look up a filesystem implementation (a class, not an instance), e.g. cls = fsspec.implementations.sftp.SFTPFileSystem.
3. Calls cls._get_kwargs_from_urls to extract instantiation params from the url (params not contained in the url are to be passed to get_fs_token_paths as storage_options).
4. Uses the extracted params and storage_options to instantiate a filesystem.
5. Calls cls._strip_protocol on its input, producing a valid filepath.
6. Returns the instantiated filesystem, the filepath, and a cache-related token.
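For example (a sketch of the flow, assuming the SFTP container from the snippet above is running):

```python
import fsspec.core

fs, token, paths = fsspec.core.get_fs_token_paths(
    "sftp://foo:pass@localhost:2222/upload/filename"
)
print(type(fs).__name__)  # SFTPFileSystem, instantiated from params extracted from the url
print(paths)              # ['/upload/filename'] -- scheme and netloc stripped
```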
Contrast the example below with the one in the OP.
docker run --name sftp -p 2222:22 -d atmoz/sftp foo:pass:::upload
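The snippet that followed appears to have been cut off; this is my reconstruction of the contrast (hypothetical, but it follows directly from the mechanics above): with a full uri, _strip_protocol strips exactly the scheme and netloc, so the remote path comes out right without any fakehost trickery.

```python
import fsspec

uri = "sftp://foo:pass@localhost:2222/upload/filename"

# Write and read back through fsspec.open; host and credentials come from the uri itself.
with fsspec.open(uri, "wb") as file:
    file.write(b"Hi!")

with fsspec.open(uri, "rb") as file:
    print(file.read())
# b'Hi!'
```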
I reckon fsspec implementations fall into two groups:
- Implementations in the first group do their own (seemingly idempotent) preprocessing in makedirs and treat the first non-scheme segment of _strip_protocol's input as part of the filepath.
- Implementations in the second group expect external preprocessing to be done before makedirs is called and treat the first non-scheme segment of _strip_protocol's input as a netloc.
prefect.filesystems.RemoteFileSystem assumes that all implementations fall into the first group, which leads to the problems described above.
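One possible direction for a fix (purely my sketch, not existing Prefect code): normalize basepath to a rooted, scheme-less path before handing it to "group two" implementations.

```python
from urllib.parse import urlsplit

def to_rooted_path(basepath: str) -> str:
    # Treat the would-be netloc as the first path segment and drop the scheme.
    split = urlsplit(basepath)
    return "/" + (split.netloc + split.path).lstrip("/")

print(to_rooted_path("webhdfs://home/user/project"))
# /home/user/project
```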
Reproduction
Error
No response
Versions
Version: 2.4.2
API version: 0.8.0
Python version: 3.10.4
Git commit: 65807e8
Built: Fri, Sep 23, 2022 10:43 AM
OS/Arch: win32/AMD64
Profile: default
Server type: ephemeral
Server:
Database: sqlite
SQLite version: 3.37.2
Additional context
No response