-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
glob performance regression #641
Comments
I ran into the issue too. The crux of he matter seems to be a confluence of a couple of different things:
Lines 826 to 836 in d7952a9
Performance under this algorithm was pretty good. Now it's REALLY based on the total number of objects under the glob - which can be really bad. There are a couple of potential fixes:
|
The following monkey patch is significantly faster:
|
Thanks for looking into this and performing the benchmark. I have a suspicion - which ought to be tested - that you would find the opposite preference for the case that you have very deep directory structure with very few files. At the extreme, this be a single file like "bucket/dir1/dir2/dir3/dir4/dir4/file", where a glob would need to walk as many directories deep as given by the glob pattern (in the |
When using
GCSFileSystem.glob
with a pattern like"bucket-name/prefix*suffix"
, version 2023.9.0 introduced a performance regression. Previously, thisglob
would be resolved with an efficient API call whose performance was proportional to the number of matching objects. Since 2023.9.0, the performance seems to scale with the number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.Debug output from 2023.6.0:
Debug output from 2023.9.0 (and more recent versions like 2024.6.0):
Perhaps the
prefix
argument is no longer being specified to the GCS backend (e.g. inGCSFileSystem._list_objects
). I've been studying the differences between 2023.6.0 and 2023.9.0 in both this repo andfilesystem_spec
, but I haven't seen evidence of this change being explicit or intentional. The unit testing ofglob
seems to be functional, so it wouldn't catch a performance regression.The text was updated successfully, but these errors were encountered: