glob performance regression #641

Open

mhfrantz opened this issue Sep 25, 2024 · 3 comments

@mhfrantz

When using GCSFileSystem.glob with a pattern like "bucket-name/prefix*suffix", version 2023.9.0 introduced a performance regression. Previously, this glob would be resolved with an efficient API call whose performance was proportional to the number of matching objects. Since 2023.9.0, the performance seems to scale with the number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.

Debug output from 2023.6.0:

DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None

Debug output from 2023.9.0 (and more recent versions like 2024.6.0):

DEBUG:asyncio:Using selector: EpollSelector
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
[repeated 100+ times]

Perhaps the prefix argument is no longer being passed to the GCS backend (e.g. in GCSFileSystem._list_objects). I've been studying the differences between 2023.6.0 and 2023.9.0 in both this repo and filesystem_spec, but I haven't seen evidence that this change was explicit or intentional. The unit tests for glob are functional tests, so they wouldn't catch a performance regression.
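For comparison, the efficient pre-2023.9.0 behaviour corresponds to a single prefix-filtered objects.list request. A rough equivalent with the official google-cloud-storage client (my illustration, not gcsfs internals; bucket and prefix are placeholders):

from google.cloud import storage

# Efficient:  GET .../b/bucket-name/o?prefix=prefix  -> cost ~ matching objects
# Regressed:  GET .../b/bucket-name/o (paged)        -> cost ~ all objects
client = storage.Client()
matching = list(client.list_blobs("bucket-name", prefix="prefix"))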

@discretizer

I ran into this issue too. The crux of the matter seems to be a confluence of a few different things:

  1. The current code uses the default glob call in the underlying AbstractFileSystem object, which calls find with a maxdepth derived from the glob structure.
  2. The current implementation of find() doesn't REALLY respect the maxdepth argument. It returns ALL objects under the path with a _list_objects call and then filters them by maxdepth. This might be handled better by building a "fake" glob from the maxdepth parameter and feeding that into the list-objects call (see the sketch after the list of fixes below).
  3. Previously there was a glob implementation that called the base glob implementation, which then used the base find (which uses a walk-based algorithm to recurse only down to the appropriate depth):

gcsfs/gcsfs/core.py

Lines 826 to 836 in d7952a9

async def _glob(self, path, prefix="", **kwargs):
if not prefix:
# Identify pattern prefixes. Ripped from fsspec.spec.AbstractFileSystem.glob and matches
# the glob.has_magic patterns.
indstar = path.find("*") if path.find("*") >= 0 else len(path)
indques = path.find("?") if path.find("?") >= 0 else len(path)
indbrace = path.find("[") if path.find("[") >= 0 else len(path)
ind = min(indstar, indques, indbrace)
prefix = path[:ind].split("/")[-1]
return await super()._glob(path, prefix=prefix, **kwargs)

Performance under this algorithm was pretty good. Now it REALLY scales with the total number of objects under the glob path, which can be really bad.
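To see what the prefix extraction above computes, take a pattern like the one from the issue (the path is illustrative):

path = "bucket-name/part-*-suffix"
indstar = path.find("*")                 # 17, position of the first wildcard
prefix = path[:indstar].split("/")[-1]   # "part-"
# "part-" is then passed to GCS as the object-name prefix, so the list
# call only returns objects that can possibly match the pattern.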

There are a couple of potential fixes:

  • Revert glob
  • Optimize find: build a fake glob from the maxdepth parameter and feed it (with a delimiter) into GCS using matchGlob
  • Reimplement glob using matchGlob (see the sketch below)
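To make the latter two options concrete, here is a rough sketch (my illustration, not gcsfs code; the helper name and the bucket/pattern are made up). GCS's objects.list supports a matchGlob parameter, exposed as match_glob in the official google-cloud-storage client, which filters server-side:

from google.cloud import storage

def depth_limited_patterns(path, maxdepth):
    # Hypothetical helper for the "fake glob from maxdepth" idea:
    # one pattern per level, so a server-side glob match never returns
    # objects deeper than maxdepth, e.g.
    # depth_limited_patterns("data", 2) -> ["data/*", "data/*/*"]
    base = path.rstrip("/")
    return [base + "/*" * d for d in range(1, maxdepth + 1)]

client = storage.Client()

# "Reimplement glob using matchGlob": one server-side-filtered list call
# per pattern, instead of paging through every object in the bucket.
for pattern in depth_limited_patterns("prefix", 2):
    for blob in client.list_blobs("bucket-name", match_glob=pattern):
        print(blob.name)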

@discretizer commented Nov 1, 2024

The following monkey patch is significantly faster:

from gcsfs import GCSFileSystem
from fsspec.asyn import AsyncFileSystem

base_glob = AsyncFileSystem._glob
base_find = AsyncFileSystem._find


async def glob_patch(self, path, prefix="", **kwargs):
    # Restore the pre-2023.9.0 behaviour: derive a prefix from the glob
    # pattern so the GCS backend can filter objects server-side.
    if not prefix:
        # Identify pattern prefixes. Ripped from fsspec.spec.AbstractFileSystem.glob
        # and matches the glob.has_magic patterns.
        indstar = path.find("*") if path.find("*") >= 0 else len(path)
        indques = path.find("?") if path.find("?") >= 0 else len(path)
        indbrace = path.find("[") if path.find("[") >= 0 else len(path)
        ind = min(indstar, indques, indbrace)
        prefix = path[:ind].split("/")[-1]
    return await base_glob(self, path, prefix=prefix, **kwargs)


async def find_patch(self, path, maxdepth=None, withdirs=False, detail=False, **kwargs):
    # Bypass GCSFileSystem._find in favour of the base walk-based find,
    # which only recurses down to the requested depth.
    return await base_find(self, path, maxdepth=maxdepth, withdirs=withdirs, detail=detail, **kwargs)


GCSFileSystem._glob = glob_patch
GCSFileSystem._find = find_patch
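With the patch applied (assuming it runs before any glob calls; the bucket and pattern are placeholders), usage is unchanged:

import gcsfs

fs = gcsfs.GCSFileSystem()
# Issues a prefix-filtered list call again instead of paging
# through every object in the bucket.
matches = fs.glob("bucket-name/prefix*suffix")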

@martindurant (Member)

Thanks for looking into this and performing the benchmark. I have a suspicion - which ought to be tested - that you would find the opposite preference in the case of a very deep directory structure with very few files. At the extreme, this would be a single file like "bucket/dir1/dir2/dir3/dir4/dir5/file", where a glob would need to walk as many directories deep as the glob pattern specifies (in the walk implementation), or only need one call with one returned item (in the find implementation).
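A back-of-the-envelope illustration of that trade-off (my sketch, not from the thread): count list calls made by the walk-based strategy versus items returned by the flat listing.

def walk_list_calls(depth, dirs_per_level):
    # The walk-based glob issues roughly one list call per directory it
    # visits, down to the depth implied by the pattern.
    return sum(dirs_per_level ** d for d in range(depth))

def flat_listing_items(total_objects):
    # The flat strategy makes one call (plus paging), but the response
    # scales with every object under the prefix.
    return total_objects

# Deep and sparse: flat listing wins (6 walk calls vs 1 returned item).
print(walk_list_calls(6, 1), flat_listing_items(1))
# Flat and wide (the issue's case): walk wins (1 call vs 1e5+ items).
print(walk_list_calls(1, 1), flat_listing_items(100_000))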
