I need to implement the FUSE getattr (stat) callback. I.e., I need to get at least the file type and size, and possibly name for a given path.
I am failing to do this with the HTTP filesystem implementation because:
info(path) always returns the file information for the HTML file, i.e., the reported type is always "file", even for a directory listing. This is inconsistent with all other fsspec implementations. The same goes for isfile, which always returns True.
isdir(path) hangs. Judging from my local HTTP server log, or from my network bandwidth when testing against an external server, this call downloads the whole file. This means that an ls -la currently downloads all files in the given folder...
Test to reproduce:
import pprint
import time
import fsspec
prefix="https://ash-speed.hetzner.com/"
def timedCall(f, *args):
    t0 = time.time()
    result = f(*args)
    t1 = time.time()
    print(f"{f} took {t1 - t0:.3f} s")
    pprint.pprint(result)
    print()
f = fsspec.open(prefix)
print(f"# Testing {prefix}\n")
timedCall(f.fs.exists, prefix)
timedCall(f.fs.listdir, prefix)
timedCall(f.fs.info, prefix)
timedCall(f.fs.isfile, prefix)
timedCall(f.fs.isdir, prefix)
path = prefix + "100MB.bin"
print(f"# Testing {path}\n")
timedCall(f.fs.exists, path)
timedCall(f.fs.info, path)
timedCall(f.fs.isfile, path)
timedCall(f.fs.isdir, path)
Output:
# Testing https://ash-speed.hetzner.com/
<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.362 s
True
<bound method AbstractFileSystem.listdir of <fsspec.implementations.http.HTTPFileSystem object at 0x7fb0beaca680>> took 0.110 s
[{'name': 'https://ash-speed.hetzner.com/10GB.bin',
'size': None,
'type': 'file'},
{'name': 'https://ash-speed.hetzner.com/100MB.bin',
'size': None,
'type': 'file'},
{'name': 'https://ash-speed.hetzner.com/1GB.bin',
'size': None,
'type': 'file'}]
<function HTTPFileSystem._info at 0x7fb0be244550> took 0.763 s
{'ETag': '"60f52d50-143"',
'mimetype': 'text/html',
'name': 'https://ash-speed.hetzner.com/',
'size': 323,
'type': 'file',
'url': 'https://ash-speed.hetzner.com/'}
<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.108 s
True
<function HTTPFileSystem._isdir at 0x7fb0be244670> took 0.108 s
True
# Testing https://ash-speed.hetzner.com/100MB.bin
<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.216 s
True
<function HTTPFileSystem._info at 0x7fb0be244550> took 1.098 s
{'ETag': '"60c9b8bd-6400000"',
'mimetype': 'application/octet-stream',
'name': 'https://ash-speed.hetzner.com/100MB.bin',
'size': 104857600,
'type': 'file',
'url': 'https://ash-speed.hetzner.com/100MB.bin'}
<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.428 s
True
<function HTTPFileSystem._isdir at 0x7fb0be244670> took 38.450 s
False
IMHO, isdir should be implemented via a listdir on the parent if there is no other way. I am also wondering what it actually checks. Is it simply a mimetype check for HTML? If so, the first 1000 or so bytes would suffice. But then, wouldn't it wrongly detect arbitrary HTML files inside a given "folder" as folders?
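The parent-listing idea could look roughly like the sketch below. It assumes that listdir on the parent returns entries with 'name' and 'type' keys as in the output above, and that sub-folders show up with type 'directory' (the listing above contains only files, so this is untested against that server). isdir_via_parent is a hypothetical name, not an fsspec function, and treating the server root as a directory is also an assumption.

```python
from urllib.parse import urlsplit


def isdir_via_parent(fs, path):
    """Hypothetical helper (not part of fsspec): look the path up in its
    parent's listing instead of fetching the path itself."""
    stripped = path.rstrip("/")
    if urlsplit(stripped).path == "":
        # The server root has no parent to list; assumed to be a directory.
        return True
    parent = stripped.rsplit("/", 1)[0] + "/"
    for entry in fs.listdir(parent):
        if entry["name"].rstrip("/") == stripped:
            return entry["type"] == "directory"
    # Not mentioned in the parent listing at all.
    return False
```

This costs one request for the (usually small) parent page instead of a download of the target, but it only works when the parent actually serves a link listing.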
My current workaround is to call info first and only call isdir if mimetype is text/html. This logic could also be implemented in HTTPFileSystem if there is no better way.
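A minimal sketch of that workaround, assuming only that the filesystem object exposes info and isdir as in the test above. The helper name cheap_isdir is made up for illustration and is not part of fsspec. The prefix match on the mimetype is an extra precaution in case a charset suffix is appended.

```python
def cheap_isdir(fs, path):
    """Hypothetical helper (not part of fsspec): skip the expensive isdir
    call unless info reports the path as HTML."""
    info = fs.info(path)
    mimetype = info.get("mimetype") or ""
    # Only an HTML response can be a directory listing; anything else
    # (e.g. application/octet-stream) is treated as a regular file.
    if not mimetype.startswith("text/html"):
        return False
    return fs.isdir(path)
```

With the filesystem from the test script, cheap_isdir(f.fs, prefix + "100MB.bin") would return False right after the info call, since info reports application/octet-stream for that path, so the slow isdir call never runs.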