HTTPFileSystem isdir downloads the whole file #1707

Open
@mxmlnkn

I need to implement the FUSE getattr (stat) callback. That is, I need at least the file type and size, and possibly the name, for a given path.

I am failing to do this with the HTTP filesystem implementation because:

  • info(path) always returns the file information for the HTML page itself, so the reported type is always "file", even for a directory listing. This is inconsistent with all other fsspec implementations. The same goes for isfile, which always returns True.
  • isdir(path) hangs. My local HTTP server log, and my network bandwidth when testing against an external server, show that this call downloads the whole file. As a result, an ls -la currently downloads every file in the given folder.

Test to reproduce:

import pprint
import time
import fsspec

prefix="https://ash-speed.hetzner.com/"

def timedCall(f, *args):
    t0 = time.time()
    result = f(*args)
    t1 = time.time()
    print(f"{f} took {t1 - t0:.3f} s")
    pprint.pprint(result)
    print()

f = fsspec.open(prefix)

print(f"# Testing {prefix}\n")
timedCall(f.fs.exists, prefix)
timedCall(f.fs.listdir, prefix)
timedCall(f.fs.info, prefix)
timedCall(f.fs.isfile, prefix)
timedCall(f.fs.isdir, prefix)

path = prefix + "100MB.bin"
print(f"# Testing {path}\n")
timedCall(f.fs.exists, path)
timedCall(f.fs.info, path)
timedCall(f.fs.isfile, path)
timedCall(f.fs.isdir, path)

Output:

# Testing https://ash-speed.hetzner.com/

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.362 s
True

<bound method AbstractFileSystem.listdir of <fsspec.implementations.http.HTTPFileSystem object at 0x7fb0beaca680>> took 0.110 s
[{'name': 'https://ash-speed.hetzner.com/10GB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/100MB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/1GB.bin',
  'size': None,
  'type': 'file'}]

<function HTTPFileSystem._info at 0x7fb0be244550> took 0.763 s
{'ETag': '"60f52d50-143"',
 'mimetype': 'text/html',
 'name': 'https://ash-speed.hetzner.com/',
 'size': 323,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.108 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 0.108 s
True

# Testing https://ash-speed.hetzner.com/100MB.bin

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.216 s
True

<function HTTPFileSystem._info at 0x7fb0be244550> took 1.098 s
{'ETag': '"60c9b8bd-6400000"',
 'mimetype': 'application/octet-stream',
 'name': 'https://ash-speed.hetzner.com/100MB.bin',
 'size': 104857600,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/100MB.bin'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.428 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 38.450 s
False

In my opinion, isdir should be implemented via a listdir on the parent if there is no other way. I am also wondering what it actually checks. Is it simply a mimetype check for HTML? If so, the first 1000 or so bytes would suffice. But then, wouldn't it wrongly classify arbitrary HTML files inside a given "folder" as folders?
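
To make the parent-listing idea concrete, here is a minimal sketch. The helper name isdir_via_parent is hypothetical and not part of fsspec. It assumes that fs.ls(parent, detail=True) returns entries with "name" and "type" keys, as in the listdir output above, and that directory entries are reported with type "directory". Treating the server root as a directory is also an assumption.

```python
from urllib.parse import urlsplit


def isdir_via_parent(fs, url):
    """Guess whether `url` is a directory by listing its parent.

    Hypothetical helper, not part of fsspec. It never requests `url`
    itself, so it cannot trigger a download of the file body.
    """
    target = url.rstrip("/")
    if not urlsplit(target).path:
        # The server root has no parent to list.
        # Assumption: treat it as a directory.
        return True
    parent = target.rsplit("/", 1)[0] + "/"
    for entry in fs.ls(parent, detail=True):
        # Listings may or may not keep the trailing slash on names.
        if entry["name"].rstrip("/") == target:
            return entry["type"] == "directory"
    # Not linked from the parent page: report "not a directory".
    return False
```

This only works for paths that are actually linked from the parent page, and it costs one listing request per call unless the listing is cached.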

My current workaround is to call info first and only call isdir if mimetype is text/html. This logic could also be implemented in HTTPFileSystem if there is no better way.
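
The workaround looks roughly like the sketch below. The function names guarded_isdir and stat_like are hypothetical. The code assumes that info returns a dict with "mimetype" and "size" keys, as in the output above, and that fs is any object with info and isdir methods, such as an HTTPFileSystem instance.

```python
def guarded_isdir(fs, url):
    """Call fs.isdir only when info reports an HTML mimetype.

    Hypothetical helper, not part of fsspec. For non-HTML resources it
    returns False without calling isdir, which avoids the full download.
    """
    mimetype = fs.info(url).get("mimetype") or ""
    if not mimetype.startswith("text/html"):
        return False
    return fs.isdir(url)


def stat_like(fs, url):
    """Return the type and size a FUSE getattr callback would need."""
    info = fs.info(url)
    mimetype = info.get("mimetype") or ""
    is_dir = mimetype.startswith("text/html") and fs.isdir(url)
    return {
        "type": "directory" if is_dir else "file",
        # Assumption: directories are reported with size 0.
        "size": 0 if is_dir else info.get("size"),
    }
```

This still misclassifies a plain HTML file whenever isdir itself does, so it only avoids the expensive download for non-HTML files.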
