Prevent returning cached entry if the entry is degenerate #1873

Open: 11 commits to merge into base: master

@mkoistinen mkoistinen commented Jun 20, 2025

When DatabricksFileSystem.ls is called with a directory, it will query the API and build a directory listing, then cache it.

If `ls` is subsequently called on a directory that is a child of the first directory, the returned result describes the child directory itself, as though it were just a file. The correct behavior is to return the contents of the child directory.

This PR ignores such responses so that ls will call the API again with the child directory to get its contents. These results are then cached as usual and available for future repeated ls calls with the child directory.
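
To make the problem concrete, here is a toy model of the degenerate cache hit. It is only a sketch: `dircache`, `naive_ls_from_cache`, and the paths are hypothetical stand-ins and are not fsspec's actual implementation.

```python
# Toy directory-listing cache (hypothetical; not fsspec's real dircache).
# Only "/data" has been listed so far.
dircache = {
    "/data": [
        {"name": "/data/file.csv", "type": "file", "size": 10},
        {"name": "/data/sub", "type": "directory", "size": 0},
    ]
}


def naive_ls_from_cache(path):
    """Return the cached listing for path, falling back to the parent's entry."""
    if path in dircache:
        return dircache[path]
    parent = path.rsplit("/", 1)[0] or "/"
    if parent in dircache:
        hits = [entry for entry in dircache[parent] if entry["name"] == path]
        if hits:
            # Fine for a file, but degenerate for a directory: this entry
            # describes the directory itself, not its contents.
            return hits
    return None


result = naive_ls_from_cache("/data/sub")
print(result)
```

In this model, listing `/data/sub` returns the single entry for `/data/sub` rather than the files inside it, which is the behavior described above.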

This addresses issue #1865 by overriding `AbstractFileSystem._ls_from_cache` with a new implementation that uses the parent's cache entry only if it indicates that `path` is not a "directory". If it is a "directory", nothing is returned, which allows `DatabricksFileSystem.ls` to make a new API call to DBFS for the contents of that path.

Also, if `path` is not found in the cached listing of its parent (when such a listing exists), this strongly suggests that `path` was created after the parent was cached. In that case, `DatabricksFileSystem.ls` invalidates the parent's cache entry (by deleting it) before continuing.
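
The two rules in this description can be sketched in a self-contained form. This is an illustration under stated assumptions, not the code in this PR: the cache is a plain dict, and `parent_of`, `ls_from_cache`, `ls`, and the `fetch` callback (standing in for the DBFS API call) are all hypothetical names.

```python
def parent_of(path):
    """Return the parent directory of path ("/" for top-level paths)."""
    return path.rstrip("/").rsplit("/", 1)[0] or "/"


def ls_from_cache(dircache, path):
    """Return a usable cached listing for path, or None on a cache miss."""
    path = path.rstrip("/") or "/"
    if path in dircache:
        return dircache[path]
    parent = parent_of(path)
    for entry in dircache.get(parent, []):
        if entry["name"] == path:
            if entry["type"] == "directory":
                # Degenerate hit: the entry describes the directory itself,
                # so report a miss and let the caller query the API.
                return None
            return [entry]
    return None


def ls(dircache, path, fetch):
    """List path, using the cache when possible; fetch(path) is the API call."""
    path = path.rstrip("/") or "/"
    cached = ls_from_cache(dircache, path)
    if cached is not None:
        return cached
    parent = parent_of(path)
    if parent in dircache and all(e["name"] != path for e in dircache[parent]):
        # path is absent from the parent's cached listing, so it was probably
        # created after the parent was cached: drop the stale parent entry.
        del dircache[parent]
    listing = fetch(path)
    dircache[path] = listing
    return listing
```

With this shape, a child directory costs one extra API call the first time it is listed and is served from the cache afterwards, while files are still answered from the parent's entry.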

@mkoistinen
Author

I think I've fixed the 3.9 test issues (annotation issues). Not sure if there's anything I can do about the 3.12 ones (rate-limiting issues).

@martindurant
Member

OK, let's give this a try and see how it goes for others

@martindurant
Member

The dbfs tests only run on py3.9 currently, and you can see they are now failing. This might just mean having to re-record the vcr data, but I am not sure.

@mkoistinen
Author

Sorry for the delay, my annual leave happened =) The VCR cassettes are re-recorded and the tests should pass now (they do locally).
