Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic working FsspecStore #1785

Merged
merged 36 commits into from
Jun 11, 2024
Merged

Basic working FsspecStore #1785

merged 36 commits into from
Jun 11, 2024

Conversation

martindurant
Copy link
Member

This works.

  • We don't have any list methods, do we need them?
  • We don't have a bulk delete
  • No exception handling is done here, which has been a thorny issue. Shall we do the same as in v2 with expected exceptions -> KeyError? I don't see in Store's API what the expectation is.

Fixes #1757

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@martindurant
Copy link
Member Author

@jhamman @d-v-b , for discussion

@pep8speaks
Copy link

pep8speaks commented Apr 11, 2024

Hello @martindurant! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-04-14 01:40:25 UTC

@d-v-b
Copy link
Contributor

d-v-b commented Apr 11, 2024

thanks for this @martindurant, I will see if I have time to play with this locally.

No exception handling is done here, which has been a thorny issue. Shall we do the same as in v2 with expected exceptions -> KeyError? I don't see in Store's API what the expectation is.

I don't think we have expectations at this point. I will try to get a distillation from the v2 issues around this topic. What do you think we should do here?

@martindurant
Copy link
Member Author

I see we have no storage tests, so don't know how to push this any further.

@d-v-b
Copy link
Contributor

d-v-b commented Apr 14, 2024

I will get some tests for you shortly

@jhamman jhamman added the V3 Affects the v3 branch label Apr 22, 2024
@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 22, 2024
@jhamman
Copy link
Member

jhamman commented May 29, 2024

@martindurant - the test suite is in a better place now. Are you up for picking this up?

@martindurant
Copy link
Member Author

For testing ... I could set up a mock s3 or gcs with some pain and CI overhead. Also, I could wrap the fsspec memoryFS in async stuff, but that would not really be testing the async-ness and depends no me doing that correctly. None of the other stores are actually async, right?

@jhamman
Copy link
Member

jhamman commented May 29, 2024

@martindurant - which fsspec implementations have an async-api available?

For now, a mocked s3 backend seems like the way to go.

@martindurant
Copy link
Member Author

which fsspec implementations have an async-api

I think this is a complete list

  • s3
  • gcs
  • azure (both blob and datalake2)
  • http
  • sshfs (not the builtin ssh/sftp)
  • anaconda (released version is still sync)

The following can pass through async, if they wrap an async FS:

  • generic
  • reference
  • dir/prefix

@jhamman
Copy link
Member

jhamman commented May 30, 2024

Got it. Makes sense. For now, let's target a mocked test against s3. We can add a few others as we approach a full release.

@martindurant
Copy link
Member Author

a mocked test against s3

Well, I would use the moto server as s3fs does, which is a real S3 implementation with most of the functionality of the real one (minus permissions, object lifetimes and other unimportant things).

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant -- this is coming together. I took it for a spin today and ended up needing to make a few changes (suggestions below). But it works!

src/zarr/store/remote.py Outdated Show resolved Hide resolved
src/zarr/store/remote.py Outdated Show resolved Hide resolved
src/zarr/store/remote.py Show resolved Hide resolved
src/zarr/store/remote.py Show resolved Hide resolved
@jhamman jhamman mentioned this pull request Jun 1, 2024
@martindurant
Copy link
Member Author

I merged from v3 and the suggestions above. I cannot get mypy to play.

By the way, I would suggest that memoryview() and bytes() really ought to work as expected for a Buffer.

src/zarr/store/remote.py Outdated Show resolved Hide resolved
@jhamman
Copy link
Member

jhamman commented Jun 4, 2024

@martindurant -- thanks for pushing this forward...

I merged from v3 and the suggestions above. I cannot get mypy to play.

I'm hoping @dstansby can take a look... we can find a way!

By the way, I would suggest that memoryview() and bytes() really ought to work as expected for a Buffer.

cc @madsbk, @akshaysubr, @normanrz

@martindurant
Copy link
Member Author

OK, so the problem with the tests, is that they call get/set in blocking sync code, but the store implementation works in async mode. The test instance is created in sync mode, and gets put on a dedicated fsspec IO thread; but then the async store on the main thread calls the same instance in async mode. Note that moto3 is also running on a thread, to complicate things. I am looking at it.

@d-v-b
Copy link
Contributor

d-v-b commented Jun 10, 2024

OK, so the problem with the tests, is that they call get/set in blocking sync code, but the store implementation works in async mode. The test instance is created in sync mode, and gets put on a dedicated fsspec IO thread; but then the async store on the main thread calls the same instance in async mode. Note that moto3 is also running on a thread, to complicate things. I am looking at it.

That's great insight, thank you. Please let me know if there's anything about the StoreTests design that we should change to make this process simpler.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see ✅!

Thanks @martindurant for working out the kinks here! (and @d-v-b and @dstansby for chipping in)

@martindurant
Copy link
Member Author

Highlighting this line, which works around tests passing offset-length for data which is b"". Python allows you to slice bytes like that (b""[1:2] == b""), but remote stores do not allow this.

src/zarr/store/remote.py Outdated Show resolved Hide resolved
if byte_range
else fs._cat_file(path)
if byte_range:
# fsspec uses start/end, not start/length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I count this as additional evidence that we should switch to start/end semantics for the rest of the stores

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kerchunk is the exception, storing start/length, mostly because length is generally smaller for chunks in big files.

except self.exceptions:
return None
except OSError as e:
if "not satisfiable" in str(e):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flagging this as s3 abstraction leakage that we might want to address later on by making an s3-specific storage class

Comment on lines +161 to +162
# TODO: expectations for exceptions or missing keys?
res = await self._fs._cat_ranges(list(paths), starts, stops, on_error="return")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think returning the exceptions is the right thing here

@@ -399,7 +399,9 @@ def test_hierarchy(self):
assert [] == store.listdir(self.root + "c/x/y")
assert [] == store.listdir(self.root + "c/d/y")
assert [] == store.listdir(self.root + "c/d/y/z")
assert [] == store.listdir(self.root + "c/e/f")
# the following is listdir(filepath), for which fsspec gives [filepath]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the advantage of going with POSIX semantics here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec tries to adhere to posix as much as possible. If we want to exclude the [file] case, we'd have to code that special case into our store.



@pytest.fixture(autouse=True, scope="function")
def s3(s3_base):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant could you explain what's happening in this test fixture? e.g., why do we need to manipulate the cache, why do we need to create an instance of S3FileSystem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest-asyncio creates a new event loop for each async test. When an async-mode s3fs instance is made from async, it will be assigned to the loop from which it is made. That means that if you use s3fs again from a subsequent test, you will have the same identical instance, but be running on a different loop - which fails.

For the rest: it's very convenient to clean up the state of the store between tests, make sure we start off blank each time.

Copy link
Contributor

@d-v-b d-v-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thank you @martindurant et al

@jhamman jhamman merged commit 7ded5d6 into zarr-developers:v3 Jun 11, 2024
18 checks passed
AdamWill added a commit to AdamWill/zarr-python that referenced this pull request Jun 17, 2024
…r-developers#1679)

This is adapted from the fixes that were rolled into
zarr-developers#1785 for the
v3 branch.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
dcherian added a commit to dcherian/zarr-python that referenced this pull request Jun 25, 2024
* v3: (22 commits)
  [v3] `Buffer` ensure correct subclass based on the `BufferPrototype` argument (zarr-developers#1974)
  Fix doc build (zarr-developers#1987)
  Fix doc build warnings (zarr-developers#1985)
  Automatically generate API reference docs (zarr-developers#1918)
  Update `RemoteStore.__str__` and add UPath tests (zarr-developers#1964)
  [v3] Elevate codec pipeline (zarr-developers#1932)
  0 dim arrays: indexing (zarr-developers#1980)
  `parse_shapelike` allows 0 (zarr-developers#1979)
  Clean up typing and docs for indexing (zarr-developers#1961)
  add json indentation to config (zarr-developers#1952)
  chore: update pre-commit hooks (zarr-developers#1973)
  Bump pypa/gh-action-pypi-publish in the actions group (zarr-developers#1969)
  chore: update pre-commit hooks (zarr-developers#1957)
  Update release.rst (zarr-developers#1960)
  doc: update release notes for 3.0.0.alpha (zarr-developers#1959)
  Basic working FsspecStore (zarr-developers#1785)
  Feature: Top level V3 API (zarr-developers#1884)
  Buffer Prototype Argument (zarr-developers#1910)
  Create issue-metrics.yml
  fixes bug in transpose (zarr-developers#1949)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Affects the v3 branch
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[v3] remote store support (s3, gcs, azure, http)
6 participants