
DatabricksFileSystem.ls() returns empty list for directories in some scenarios. #1865

Open
@mkoistinen


I'm using fsspec to access several filesystems, including file, fs, hdfs, webhdfs and dbfs. Of these, only dbfs returns an empty list under some conditions when I request the contents of a Parquet dataset directory. Here's a repro case:

from __future__ import annotations

import os

import fsspec

if __name__ == "__main__":

    fs = fsspec.filesystem(
        "dbfs",
        instance=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
    )

    for path in [
        "/mkoistinen@host.com/parquet_files",          # PATH A
        "/mkoistinen@host.com/parquet_files/adult/",   # PATH B
    ]:
        # NOTE: the value for `detail` makes no difference
        listing = fs.ls(path=path, detail=False)
        # With detail=True each entry is a dict; reduce it to its name.
        names = [f["name"] if isinstance(f, dict) else f for f in listing]
        # Avoids backslashes inside f-string expressions (a SyntaxError before Python 3.12).
        print(f"ls({path})", *names, sep="\n", end="\n\n")

When run with fsspec 2025.5.1 (I also tried every earlier 2025 release), the output is:

ls(/mkoistinen@host.com/parquet_files)
/mkoistinen@host.com/parquet_files/adult
/mkoistinen@host.com/parquet_files/adult.parquet
/mkoistinen@host.com/parquet_files/bank
/mkoistinen@host.com/parquet_files/bank.parquet
/mkoistinen@host.com/parquet_files/iris
/mkoistinen@host.com/parquet_files/iris.parquet
/mkoistinen@host.com/parquet_files/tmp

ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult

If I disable PATH A in the list, I get:

ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult/part_18aacbf9-6844-4038-8929-c45028ef870c.parquet
/mkoistinen@host.com/parquet_files/adult/part_4950eed3-56d3-4f28-962e-46c9f728f5c4.parquet
/mkoistinen@host.com/parquet_files/adult/part_8e7e0a04-bc42-4275-bbb8-b95b5ef39ab7.parquet
/mkoistinen@host.com/parquet_files/adult/part_d6d5e107-77a3-41c2-96c4-18af726d5ef6.parquet
/mkoistinen@host.com/parquet_files/adult/part_dc302b0b-6f18-4093-be95-271554c4ca21.parquet

The correct output when both paths are enabled should be:

ls(/mkoistinen@host.com/parquet_files)
/mkoistinen@host.com/parquet_files/adult
/mkoistinen@host.com/parquet_files/adult.parquet
/mkoistinen@host.com/parquet_files/bank
/mkoistinen@host.com/parquet_files/bank.parquet
/mkoistinen@host.com/parquet_files/iris
/mkoistinen@host.com/parquet_files/iris.parquet
/mkoistinen@host.com/parquet_files/tmp

ls(/mkoistinen@host.com/parquet_files/adult)
/mkoistinen@host.com/parquet_files/adult/part_18aacbf9-6844-4038-8929-c45028ef870c.parquet
/mkoistinen@host.com/parquet_files/adult/part_4950eed3-56d3-4f28-962e-46c9f728f5c4.parquet
/mkoistinen@host.com/parquet_files/adult/part_8e7e0a04-bc42-4275-bbb8-b95b5ef39ab7.parquet
/mkoistinen@host.com/parquet_files/adult/part_d6d5e107-77a3-41c2-96c4-18af726d5ef6.parquet
/mkoistinen@host.com/parquet_files/adult/part_dc302b0b-6f18-4093-be95-271554c4ca21.parquet

I've traced this to the cache lookup in ls(): self._ls_from_cache returns an entry for .../adult itself, which is then handled as if it were a plain file, even though the subsequent call to ls() is asking for the contents of that path.

(Note that this behavior is unaffected by the presence of a trailing slash in the path to denote a directory.)
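
To illustrate the failure mode without a Databricks workspace, here is a minimal sketch using a toy, dict-based listing cache. It is not fsspec's actual code: ToyFS, _remote and _lookup_cache are hypothetical names, and the parent-directory fallback in _lookup_cache is my assumption about how the cached entry for .../adult ends up being returned.

```python
from __future__ import annotations


class ToyFS:
    """Toy model of a listing cache; NOT fsspec's real implementation."""

    def __init__(self, remote: dict[str, list[dict]]):
        self._remote = remote  # hypothetical stand-in for the DBFS REST API
        self.dircache: dict[str, list[dict]] = {}

    def _lookup_cache(self, path: str) -> list[dict] | None:
        path = path.rstrip("/")
        if path in self.dircache:
            return self.dircache[path]
        # Assumed fallback: look for `path` inside its parent's cached listing.
        parent = path.rsplit("/", 1)[0]
        if parent in self.dircache:
            hits = [e for e in self.dircache[parent] if e["name"] == path]
            return hits or None
        return None

    def ls(self, path: str) -> list[str]:
        out = self._lookup_cache(path)
        if not out:  # any non-empty cache hit is trusted, even a directory entry
            out = self._remote[path.rstrip("/")]
            self.dircache[path.rstrip("/")] = out
        return [e["name"] for e in out]


remote = {
    "/data": [{"name": "/data/adult", "type": "directory"}],
    "/data/adult": [{"name": "/data/adult/part_0.parquet", "type": "file"}],
}

# Listing the subdirectory on a cold cache works as expected.
fresh = ToyFS(remote)
fresh_listing = fresh.ls("/data/adult")

# Listing the parent first primes the cache; the subdirectory then
# comes back as a single entry naming itself.
primed = ToyFS(remote)
primed.ls("/data")
stale = primed.ls("/data/adult")

print(fresh_listing)
print(stale)
```

In this toy model, the second listing depends entirely on whether the parent was listed first, which matches the PATH A / PATH B behavior in the repro.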

I'm able to get the correct behavior by inserting one line:

src: fsspec/implementations/dbfs.py:60-75

    def ls(self, path, detail=True, **kwargs):
        """
        List the contents of the given path.
        ...
        """
        out = self._ls_from_cache(path)
        out = [o for o in out if o.get("name") == path and o.get("type") == "file"] if out else []   # <--- SINGLE LINE FIX
        if not out:
            ...
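
As a quick sanity check of what that one-line filter does, the sketch below applies the same expression to a few hand-written cache results. The helper name keep_only_cached_file and the sample entries are hypothetical and only serve the illustration.

```python
def keep_only_cached_file(out, path):
    """Same expression as the one-line fix: keep a cache hit only if it is
    the file `path` itself; anything else forces a fresh listing."""
    return [o for o in out if o.get("name") == path and o.get("type") == "file"] if out else []


dir_hit = [{"name": "/data/adult", "type": "directory"}]
file_hit = [{"name": "/data/adult.parquet", "type": "file"}]
children = [{"name": "/data/adult/part_0.parquet", "type": "file"}]

dropped_dir = keep_only_cached_file(dir_hit, "/data/adult")            # directory entry is discarded
kept_file = keep_only_cached_file(file_hit, "/data/adult.parquet")     # plain file is kept
dropped_children = keep_only_cached_file(children, "/data/adult")      # cached children are discarded too
cache_miss = keep_only_cached_file(None, "/data/adult")                # a miss stays a miss
```

One consequence worth noting: because a directory's cached children never have a name equal to the requested path, this filter appears to bypass the cache for every directory listing, not just the stale case. That may be an acceptable trade-off, but it is something to weigh when reviewing the fix.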

I can submit this as a PR, but I thought I'd check with the maintainers first. Please advise.
