glob performance regression #641

Open
mhfrantz opened this issue Sep 25, 2024 · 0 comments

When calling GCSFileSystem.glob with a pattern like "bucket-name/prefix*suffix", I see a performance regression that was introduced in version 2023.9.0. Previously, this glob was resolved with an efficient API call whose cost was proportional to the number of matching objects. Since 2023.9.0, the cost appears to scale with the total number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.
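To make the expected behavior concrete, here is a minimal sketch of how the literal part of such a pattern could be split off so that a listing request can be restricted to it, with the rest of the pattern applied client-side. The helper names `literal_prefix` and `filter_matches` are hypothetical and are not part of gcsfs or fsspec. The matching uses `fnmatch`, where `*` also matches `/`, which is only an approximation of real glob semantics but is adequate for a flat bucket.

```python
import fnmatch
from typing import Iterable, List, Tuple

# Characters that start a wildcard in a simple glob pattern.
GLOB_MAGIC = "*?["


def literal_prefix(pattern: str) -> Tuple[str, str]:
    """Split 'bucket/key-pattern' into (bucket, longest literal key prefix).

    Hypothetical helper: everything before the first wildcard character
    could be sent to the backend as a listing prefix.
    """
    bucket, _, key_pattern = pattern.partition("/")
    cut = len(key_pattern)
    for ch in GLOB_MAGIC:
        idx = key_pattern.find(ch)
        if idx != -1:
            cut = min(cut, idx)
    return bucket, key_pattern[:cut]


def filter_matches(names: Iterable[str], pattern: str) -> List[str]:
    """Apply the full pattern client-side to an already prefix-limited listing."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


bucket, prefix = literal_prefix("bucket-name/prefix*suffix")
listing = [
    "bucket-name/prefix-1-suffix",
    "bucket-name/prefix-2",
    "bucket-name/other",
]
matches = filter_matches(listing, "bucket-name/prefix*suffix")
```

With a prefix-limited listing, only objects whose names start with "prefix" would need to be transferred and filtered, which is what makes the cost proportional to the number of candidates rather than to the size of the bucket.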

Debug output from 2023.6.0:

DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None

Debug output from 2023.9.0 (and more recent versions like 2024.6.0):

DEBUG:asyncio:Using selector: EpollSelector
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
[repeated 100+ times]

Perhaps the prefix argument is no longer being passed to the GCS backend (e.g. in GCSFileSystem._list_objects). I have compared 2023.6.0 and 2023.9.0 in both this repo and filesystem_spec, but I have not found evidence that this change was explicit or intentional. The unit tests for glob appear to be purely functional, so they would not catch a performance regression.
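One way a test could catch this kind of regression is to count listing requests instead of timing them. The sketch below counts debug records that start with "GET: b/{}/o" on the "gcsfs" logger. Both the logger name and the message format are taken from the debug output above and are assumptions about what a given gcsfs version emits. `ListCallCounter` and `count_list_calls` are hypothetical names, and the log records here are simulated rather than produced by a real bucket.

```python
import logging
from contextlib import contextmanager


class ListCallCounter(logging.Handler):
    """Count log records that look like object-listing requests."""

    def __init__(self, marker: str = "GET: b/{}/o"):
        super().__init__(level=logging.DEBUG)
        self.marker = marker
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        # Compare against the raw message; the format is an assumption
        # based on the debug output quoted in this issue.
        if str(record.msg).startswith(self.marker):
            self.count += 1


@contextmanager
def count_list_calls(logger_name: str = "gcsfs"):
    """Temporarily attach a counter to the given logger at DEBUG level."""
    logger = logging.getLogger(logger_name)
    handler = ListCallCounter()
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        yield handler
    finally:
        logger.removeHandler(handler)
        logger.setLevel(old_level)


# Simulated records standing in for what a glob call would log.
with count_list_calls() as counter:
    log = logging.getLogger("gcsfs")
    log.debug("GET: b/{}/o, ('bucket-name',), None")
    log.debug("GET: b/{}/o, ('bucket-name',), None")
    logging.getLogger("gcsfs.credentials").debug("GCS refresh")
```

In a real test, the body of the `with` block would call GCSFileSystem.glob against a bucket or a mocked backend and then assert that `counter.count` stays below a small bound, independent of how many objects the bucket holds.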
