Try to converge on internal types consistency. #574
Conversation
While it is convenient for users of the high-level API to be able to pass around many types (arrays, bytes, or even dictionaries, lists, strings), it can make the internals relatively hard to follow if the types are not consistent. In particular, this can complicate weaving the v3 spec into the core of Zarr, and it adds a number of conditionals when we are not sure of the type we get.

This PR tries to be a little more consistent in what the implementation _emits_. For now it tries to make sure the core converges toward bytes, so that store users can expect to always get bytes back from Zarr-Python stores, and as much as possible give bytes back. It mostly affects the ConsolidatedMetadataStore, which used to – unlike other stores – return objects instead of bytes, as well as the tests that exercise the stores directly.

We of course can't be too strict, as some existing stores may have stored some values as not-bytes (pickle), and we still want to support this for the time being. But with these changes we can add `assert isinstance(x, bytes_like)` in many places and still have the test suite pass, in particular in all the metadata decode/encode functions and the Group class.

One extra change in the tests is the removal of an old conditional import of h5py in the test suite, which has been unnecessary since the introduction of pytest's `importorskip` for the h5py tests.
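As a rough illustration of the convergence this aims for, a store-facing helper could coerce anything buffer-like to bytes before handing it back. This is a minimal sketch of the idea; the helper name `ensure_bytes_like` and the exact checks are illustrative, not code from this PR:

```python
def ensure_bytes_like(v):
    """Hypothetical helper: normalize a store value to bytes."""
    if isinstance(v, bytes):
        return v
    if isinstance(v, (bytearray, memoryview)):
        return bytes(v)
    try:
        # Anything exposing the buffer protocol can be copied out to bytes.
        return memoryview(v).tobytes()
    except TypeError:
        raise TypeError(f"expected a bytes-like value, got {type(v).__name__}")
```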
Extra note: returning objects from the consolidated metadata store means that you actually get references to mutable objects, with the risk of modifying the internal state of the consolidated store if you don't pay attention. Would it be OK, in a separate PR, to add an opt-in "strict" mode that would trigger a bunch of …
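One possible shape for that opt-in strict mode (a hypothetical sketch; `StrictStore` and its API are illustrative, not part of this PR) is a thin wrapper that asserts on every read:

```python
class StrictStore:
    """Hypothetical opt-in wrapper asserting that a store emits bytes."""

    def __init__(self, store):
        self._store = store

    def __getitem__(self, key):
        value = self._store[key]
        assert isinstance(value, (bytes, bytearray, memoryview)), \
            f"store returned {type(value).__name__} for key {key!r}"
        return value
```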
Sorry for not raising this yesterday. I'm a bit unsure about this, as we have use cases where we would like to handle other non-…
I think it would be OK to ensure only bytes-like in Python. My main problem is having to poke at a value to figure out whether it's already decoded or not.
Hi @Carreau, in general I agree with the spirit of trying to achieve type consistency, but a couple of points here are worth some discussion.
```diff
@@ -1489,7 +1489,7 @@ def _set_basic_selection_zd(self, selection, value, fields=None):
         chunk[selection] = value

         # encode and store
-        cdata = self._encode_chunk(chunk)
+        cdata = ensure_bytes(self._encode_chunk(chunk))
```
N.B., ensuring bytes may introduce a memory copy, which may be unnecessary. Memory copies can noticeably affect performance. Ensuring we have something that supports the buffer interface would be better.
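To illustrate the reviewer's point, a copy-free alternative could wrap the value in a memoryview, which validates buffer-protocol support without copying the data. A sketch under that assumption (not the PR's actual code):

```python
def ensure_buffer(v):
    # memoryview(v) raises TypeError if v does not support the buffer
    # protocol, and it does not copy the underlying data, unlike bytes(v).
    return v if isinstance(v, memoryview) else memoryview(v)
```

Here `bytes(...)` would copy every encoded chunk, whereas the memoryview only adds a small wrapper object around the existing buffer.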
```python
self.meta_store = {}
for k, v in meta["metadata"].items():
    self.meta_store[k] = json_dumps(v)
```
FWIW this was originally how this worked, but it introduces an extra, unnecessary JSON encode/decode round trip into the processing of consolidated metadata. In practice that may not be a performance issue, but it seemed awkward to have this extra JSON encode/decode when it wasn't necessary. Hence I introduced a special case which avoided it; see the parse_metadata function in the original PR.
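The special case referred to looks roughly like the following sketch (the real parse_metadata may differ in detail): if the store already hands back a parsed mapping, skip the JSON decode entirely.

```python
import json
from collections.abc import Mapping


def parse_metadata(s):
    # Consolidated metadata stores may return an already-parsed mapping;
    # anything else is assumed to be JSON text or bytes and is decoded.
    if isinstance(s, Mapping):
        return s
    if isinstance(s, (bytes, bytearray)):
        s = s.decode("utf-8")
    return json.loads(s)
```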