Description
I'm using `fsspec` to access a number of filesystems including `file`, `fs`, `hdfs`, `webhdfs` and `dbfs`. Of these, only `dbfs` is returning an empty list in some conditions where I request the contents of a Parquet dataset directory. Here's a repro case:
from __future__ import annotations

import os

import fsspec

if __name__ == "__main__":
    fs = fsspec.filesystem(
        "dbfs",
        instance=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
    )
    for path in [
        "/mkoistinen@host.com/parquet_files",  # PATH A
        "/mkoistinen@host.com/parquet_files/adult/",  # PATH B
    ]:
        # NOTE: the value for `detail` makes no difference
        listing = fs.ls(path=path, detail=False)
        if listing and isinstance(listing[0], dict):
            # detail=True returns dicts; keep only the names
            names = [f.get("name") for f in listing]
        else:
            # detail=False returns plain path strings
            names = listing
        # Joining outside the f-string avoids a backslash inside an
        # f-string expression, which is a SyntaxError before Python 3.12.
        print(f"ls({path})", *names, sep="\n", end="\n\n")
When run with fsspec 2025.5.1 (and earlier versions; I tried every 2025 release), the output is:
ls(/mkoistinen@host.com/parquet_files)
/mkoistinen@host.com/parquet_files/adult
/mkoistinen@host.com/parquet_files/adult.parquet
/mkoistinen@host.com/parquet_files/bank
/mkoistinen@host.com/parquet_files/bank.parquet
/mkoistinen@host.com/parquet_files/iris
/mkoistinen@host.com/parquet_files/iris.parquet
/mkoistinen@host.com/parquet_files/tmp
ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult
If I disable PATH A in the list, I get:
ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult/part_18aacbf9-6844-4038-8929-c45028ef870c.parquet
/mkoistinen@host.com/parquet_files/adult/part_4950eed3-56d3-4f28-962e-46c9f728f5c4.parquet
/mkoistinen@host.com/parquet_files/adult/part_8e7e0a04-bc42-4275-bbb8-b95b5ef39ab7.parquet
/mkoistinen@host.com/parquet_files/adult/part_d6d5e107-77a3-41c2-96c4-18af726d5ef6.parquet
/mkoistinen@host.com/parquet_files/adult/part_dc302b0b-6f18-4093-be95-271554c4ca21.parquet
The correct output when both paths are enabled should be:
ls(/mkoistinen@host.com/parquet_files)
/mkoistinen@host.com/parquet_files/adult
/mkoistinen@host.com/parquet_files/adult.parquet
/mkoistinen@host.com/parquet_files/bank
/mkoistinen@host.com/parquet_files/bank.parquet
/mkoistinen@host.com/parquet_files/iris
/mkoistinen@host.com/parquet_files/iris.parquet
/mkoistinen@host.com/parquet_files/tmp
ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult/part_18aacbf9-6844-4038-8929-c45028ef870c.parquet
/mkoistinen@host.com/parquet_files/adult/part_4950eed3-56d3-4f28-962e-46c9f728f5c4.parquet
/mkoistinen@host.com/parquet_files/adult/part_8e7e0a04-bc42-4275-bbb8-b95b5ef39ab7.parquet
/mkoistinen@host.com/parquet_files/adult/part_d6d5e107-77a3-41c2-96c4-18af726d5ef6.parquet
/mkoistinen@host.com/parquet_files/adult/part_dc302b0b-6f18-4093-be95-271554c4ca21.parquet
I've traced this to the cache lookup `self._ls_from_cache`, used in the `ls()` method: in this case it returns an entry in which `.../adult` is treated as a plain file, even though the subsequent call to `ls()` is requesting the contents of that path.
(Note that this behavior is unaffected by the presence of a trailing slash in the path to denote a directory.)
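The effect can be illustrated without a Databricks workspace using a toy model of a listings cache. This is a simplified sketch under stated assumptions, not fsspec's actual code: `dircache`, `ls_from_cache`, `ROOT`, and the sample entries are hypothetical stand-ins chosen for illustration.

```python
from __future__ import annotations

# Hypothetical stand-in for a listings cache: maps a directory path to the
# entries that were returned when that directory was listed.
dircache: dict[str, list[dict]] = {}

ROOT = "/user/parquet_files"  # made-up path for illustration


def ls_from_cache(path: str) -> list[dict] | None:
    """Toy lookup: exact hit first, then fall back to the parent's listing."""
    path = path.rstrip("/")
    if path in dircache:
        return dircache[path]
    parent = path.rsplit("/", 1)[0]
    if parent in dircache:
        # This returns the entry *describing* `path`,
        # not the contents *of* `path`.
        return [e for e in dircache[parent] if e["name"] == path]
    return None  # nothing cached; the caller must query the backend


# Listing the parent (PATH A) populates the cache.
dircache[ROOT] = [
    {"name": f"{ROOT}/adult", "type": "directory"},
    {"name": f"{ROOT}/adult.parquet", "type": "file"},
]

# A later lookup for the child directory (PATH B) is answered from the
# parent's listing, so the directory's own entry comes back.
hit = ls_from_cache(f"{ROOT}/adult/")
print([e["name"] for e in hit])  # ['/user/parquet_files/adult']
```

If `ls()` treats any non-empty cache result as the final answer, the second call yields the directory itself rather than its part files, which matches the output above.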
I'm able to get the correct behavior by inserting one line:
src: fsspec/implementations/dbfs.py:60-75
def ls(self, path, detail=True, **kwargs):
    """
    List the contents of the given path.
    ...
    """
    out = self._ls_from_cache(path)
    out = [o for o in out if o.get("name") == path and o.get("type") == "file"] if out else []  # <--- SINGLE LINE FIX
    if not out:
        ...
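To show what the added line does in isolation, here is the same filter applied to sample cache results. The helper name `keep_exact_file_hits` and the sample entries are hypothetical; only the list comprehension is taken from the fix above.

```python
from __future__ import annotations


def keep_exact_file_hits(out: list[dict] | None, path: str) -> list[dict]:
    """Hypothetical helper mirroring the one-line filter from the fix."""
    if not out:
        return []
    return [o for o in out if o.get("name") == path and o.get("type") == "file"]


base = "/user/parquet_files"  # made-up path for illustration
dir_entry = [{"name": f"{base}/adult", "type": "directory"}]
file_entry = [{"name": f"{base}/adult.parquet", "type": "file"}]

# A directory entry found via the parent's listing is discarded, so the
# `if not out:` branch runs and the backend is queried for the contents.
dir_result = keep_exact_file_hits(dir_entry, f"{base}/adult")
print(dir_result)  # []

# A cached entry for an exact file path is kept and served from the cache.
file_result = keep_exact_file_hits(file_entry, f"{base}/adult.parquet")
print(file_result)

# No cache result at all also yields an empty list.
none_result = keep_exact_file_hits(None, f"{base}/adult")
print(none_result)  # []
```

One consequence worth checking with the maintainers: a cached listing of a directory's children would also be filtered out, since none of the child names equals `path`, so directory listings would always be fetched again rather than served from the cache.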
I can submit this as a PR, but I thought I'd check with the maintainers first. Please advise.