Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cache mapper that is basename plus fixed number of parent directories #1318

Merged
merged 6 commits into from
Aug 17, 2023

Conversation

ianthomas23
Copy link
Collaborator

This adds support for a cache mapper that is a fixed number of parent directories as well as the basename. This will be useful when using multiple directories containing files with the same names but you still want the cached filenames to be clearly identifiable rather than using hashes.

I have chosen to add a new directory_levels attribute to BasenameCacheMapper rather than create a new class.

Example use:

from fsspec import filesystem
from fsspec.implementations.cache_mapper import BasenameCacheMapper

fs = filesystem("filecache", target_protocol='file', cache_mapper=BasenameCacheMapper(2))

I have made some design decisions for this that are up for discussion:

  1. I've kept the same_names kwarg to CachingFileSystem.__init__ for backward compatibility as well as adding the new cache_mapper kwarg. You cannot specify both. This seemed a better approach than the alternative of allowing values for same_names other than boolean and allows us to easily support other CacheMapper classes in the future.
  2. For the cached filenames I am not using usual path separators as I wanted to use the same separators on all OSes and avoid any problems in inferring that slashes are indeed directory separators and looking in the wrong place. I am using a somewhat arbitrary string of _^_ so far but not tied to this.

@ianthomas23
Copy link
Collaborator Author

I've added type annotations to the lines I have changed so now mypy is interested in the cached.py file and complaining about the

self._strip_protocol = _strip_protocol

line.

I've added a # ignore: type[method-assign] to get rid of the error. I am not sure if we have a policy on this yet, I could always put the ignore in the setup.cfg instead.

@martindurant
Copy link
Member

I am not sure if we have a policy on this yet

There is no policy, typing is very limited in fsspec.

a somewhat arbitrary string of _^_ so far

So long as it isn't a glob re special, I don't mind. I think re only uses it within [..].

fsspec/implementations/cache_mapper.py Outdated Show resolved Hide resolved
fsspec/implementations/cache_mapper.py Show resolved Hide resolved
fsspec/implementations/cached.py Outdated Show resolved Hide resolved
@ianthomas23
Copy link
Collaborator Author

a somewhat arbitrary string of _^_ so far

So long as it isn't a glob re special, I don't mind. I think re only uses it within [..].

I've changed it to _@_ just in case there is any re confusion over use of ^

@ianthomas23
Copy link
Collaborator Author

To get this working on Windows I've used an explicit make_path_posix call. We can't just rely on os.sep as sometimes forward slashes are OK on Windows (e.g. MemoryFileSystem) so this seems the easiest solution.

@martindurant
Copy link
Member

Everything is resolved here, except I leave it up to you whether to remove the eq/hash stuff now or not.

@ianthomas23
Copy link
Collaborator Author

Everything is resolved here, except I leave it up to you whether to remove the eq/hash stuff now or not.

OK, let's hold off the eq/hash changes to a separate PR as I need to think about it a bit more.

@martindurant martindurant merged commit 2fbe8de into fsspec:master Aug 17, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants