Refactor and use more buffer view coercion #121
Conversation
While `buffer` is not exactly equivalent to `memoryview` on Python 3, and Python 2 already has `memoryview`, support for the `buffer` type is much better on Python 2, just as support for `memoryview` is much better on Python 3. Plus this is what many Python 2/3 compatibility guides suggest when moving to Python 3. So this seems like a reasonable alias to add.
Provides a compatibility function for converting an object into a buffer or memoryview, making sure the Python 2 and Python 3 buffers behave roughly the same. Use the `buffer` type on Python 2, as more things support it and it can easily be coerced to a `memoryview`. On Python 3 use a `memoryview`, as there is no `buffer` type and `memoryview`s work well on Python 3. Ensure the Python 3 `memoryview` is flattened and of `unsigned char` type, which is in line with the `memoryview`-coerced Python 2 buffer. More importantly, we generally want to work with flattened byte types anyway, so this provides us a consistent path to getting there while still providing a view onto the underlying data (without copying on either Python version).
Effectively `to_buffer` is a refactoring of the functionality in `buffer_tobytes`: it handles taking a view of the underlying data and ensuring that view is flattened, contiguous, and cast to `unsigned char`, though it doesn't actually perform the copy yet. Hence `buffer_tobytes` leverages `to_buffer` to ensure the data is ready to copy. All that is then needed is coercion to a `memoryview`. Once coerced, the `tobytes` method can be called to return the `bytes` needed for the encoding or decoding operation. This causes a copy, but that is unavoidable at this stage.
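The relationship described above can be illustrated with a rough, hypothetical sketch (Python 3 only, not the actual numcodecs implementation): `to_buffer` yields a flat, contiguous, `unsigned char` view, and `buffer_tobytes` defers the copy to the very end.

```python
import numpy as np


def to_buffer(v):
    # Sketch: coerce ``v`` into a flat, contiguous, unsigned char
    # memoryview, viewing rather than copying whenever the input allows.
    m = memoryview(v)
    if m.ndim != 1 or m.format != 'B':
        if not m.c_contiguous:
            # casting requires C-contiguity; copy only as a last resort
            m = memoryview(bytes(m))
        m = m.cast('B')  # flatten and reinterpret as unsigned char
    return m


def buffer_tobytes(v):
    # the copy now only happens at this final step
    return to_buffer(v).tobytes()
```

For contiguous inputs this never copies until `tobytes` is called; only non-contiguous inputs force an earlier copy.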
Make use of `to_buffer` in BZ2's `encode` function to avoid a copy to `bytes` due to non-contiguous data. This ensures the view of the data is a flattened one. Also removes the need for functions like `handle_datetime`, which is dropped here.
Ensures that we are working with a consistent view onto the underlying data with both Python 2 and Python 3. The `buffer` or `memoryview` returned on Python 2 or Python 3 respectively can be comfortably used with `BytesIO` and `GzipFile`'s `write` with the minimal necessary overhead. Also removes the need for functions like `handle_datetime`, as `to_buffer` already ensures an unsigned character view of the data.
Make use of `to_buffer` in LZMA's `encode` function to avoid a copy to `bytes` due to non-contiguous data. This ensures the view of the data is a flattened one. Also drops the need for `handle_datetime`, as `to_buffer` already takes an unsigned character view of the data.
Ensures that we are working with a consistent view onto the underlying data with both Python 2 and Python 3. The `buffer` or `memoryview` returned on Python 2 or Python 3 respectively can be comfortably used with `zlib`'s `compress` and `decompress` functions with the minimal necessary overhead. Also drops the need for `handle_datetime` as `to_buffer` already takes an unsigned character view of the data.
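As a small illustration of the point above, `zlib`'s functions accept any bytes-like object, so a flat `unsigned char` view of an array can be handed to them directly without an intermediate `tobytes` copy:

```python
import zlib

import numpy as np

# A 2-D array cast to a flat uint8 memoryview: a view, not a copy.
a = np.arange(12, dtype=np.uint32).reshape(3, 4)
view = memoryview(a).cast('B')

# zlib accepts the view directly via the buffer protocol.
compressed = zlib.compress(view)
restored = np.frombuffer(zlib.decompress(compressed),
                         dtype=a.dtype).reshape(a.shape)
assert np.array_equal(restored, a)
```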
The `to_buffer` function already tries to coerce `ndarray`s into views of `np.uint8` data, so there is no need to handle this through this function any more.
This is great and should definitely avoid some memory copies on Python 2! Two thoughts...

The code to raise a value error if an object array is passed could reasonably live within the

I think we should only view as a different dtype when we have to; otherwise leave the dtype of an ndarray and/or memoryview/buffer as-is. It is possible that some compressors might want to access this information, which is accessible via the buffer interface. E.g., currently the blosc compressor will make use of the buffer's itemsize to determine how to apply the shuffle filter. The blosc compressor doesn't make use of the to_buffer() function, so this is not directly relevant, but it is an illustration of why a compressor might want dtype information to be preserved where possible. Hence we should probably only make a view for datetime/timedelta arrays, and when we do we should preserve the itemsize.
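The itemsize point can be seen directly: the buffer interface carries the element size (and format) of the underlying dtype, which a cast to unsigned char discards.

```python
import numpy as np

a = np.zeros(4, dtype=np.float64)
m = memoryview(a)
# The buffer interface exposes the element size of the dtype, which
# a filter like blosc's shuffle could use to group bytes per element...
assert m.itemsize == 8
# ...while a cast to unsigned char flattens that information away.
assert m.cast('B').itemsize == 1
```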
Thanks for the feedback. Yeah, hadn't meant to optimize Python 2 necessarily, but hopefully the code simplification is appreciated nonetheless. Would expect some performance bump on Python 3 as well. There are some other benefits outlined below.

So here's the really cool thing about object arrays. Basically a `memoryview` cast to unsigned char type and flattened has the same layout as calling `tobytes` on the object array, but provides a view instead. Example below on Python 3. On Python 2, casting and flattening are not options with `memoryview`s in Python. However the `buffer` type has the same effect (additionally marking the data as read-only, which we could extend to Python 3).

```python
In [1]: import numpy as np

In [2]: a = np.array(["22", 2+0j, 1, 1.3], dtype=object)

In [3]: np.asarray(memoryview(a.tobytes()))
Out[3]:
array([160,  64, 228,   6,   1,   0,   0,   0, 240, 112,  83,  19,   1,
         0,   0,   0,  80,  72, 143,   5,   1,   0,   0,   0,  32, 237,
        28,  19,   1,   0,   0,   0], dtype=uint8)

In [4]: np.asarray(memoryview(a).cast('B'))
Out[4]:
array([160,  64, 228,   6,   1,   0,   0,   0, 240, 112,  83,  19,   1,
         0,   0,   0,  80,  72, 143,   5,   1,   0,   0,   0,  32, 237,
        28,  19,   1,   0,   0,   0], dtype=uint8)
```

The advantage here is that we are able to get a view equivalent to calling `tobytes` on the object array, but without copying to a `bytes` instance yet. This of course extends to all other types as well. If the copy happens anyway, we simply make sure not to cause it any earlier. However there is the potential to exploit this view for higher performance.

That's a good point regarding typing. If we think `to_buffer` might be useful in other cases, would be happy to soften this unsigned char constraint. At least for these CPython-included compressors (going off of the Python 3 implementations), they honor the buffer protocol for getting input data (a nice benefit of skipping a `tobytes` call 😉). However once they get the `Py_buffer` object they just act on the data as `char*` and seem to ignore the `itemsize` and `format` attributes. So it shouldn't matter for the cases changed here, but am happy to look more closely if you think it will be useful for other use cases.
For object arrays, if you take a buffer or convert to bytes, IIUC what you are getting is access to the Python object memory addresses, i.e., pointers. If you are casting to unsigned char you're just breaking up those pointers. Generally there's nothing good or useful you can do with those pointers, which is why errors are raised when compression codecs are passed an object array, and why zarr tries to ensure one of the codecs that specifically serialises object arrays (like JSON, msgpack or VlenUTF8) is called first.

Yes, you're right that most compressors will just cast to unsigned char and process the memory as bytes. But some compressors and filters will want to know itemsize where possible. So I think we should view as a different dtype only where we know we have to to make things work, and otherwise propagate unchanged as much information as we can via the buffer interface.
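The pointer point above can be made concrete. This demonstration assumes CPython, where `id()` returns an object's memory address: the raw buffer of an object array is exactly those addresses, so compressing it is meaningless.

```python
import numpy as np

# An object array stores PyObject* pointers. In CPython, id() reports
# the same addresses, so the "bytes" of the array are just pointers.
a = np.array(['22', 2 + 0j, 1, 1.3], dtype=object)
addrs = np.frombuffer(a.tobytes(), dtype=np.uintp)
assert list(addrs) == [id(x) for x in a]
```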
Limits explicit casting to `unsigned char` to datetime and timedelta arrays only. Preserves the types of the underlying data in all other cases.
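The datetime/timedelta special case exists because those dtypes have no buffer-protocol format code; a `uint8` view lets the data pass through the buffer machinery while still sharing memory. A small sketch:

```python
import numpy as np

# datetime64 cannot be exported via the buffer protocol directly, so a
# uint8 view is taken; it shares the underlying bytes rather than copying.
a = np.array([0, 1, 2], dtype='datetime64[s]')
b = a.view(np.uint8)
assert b.base is a           # a view, not a copy
assert b.nbytes == a.nbytes  # same underlying bytes
```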
Have relaxed the casting in the latest commit. Please let me know what you think.
As to the object array case, think you are right that this is what is happening under the hood. That said, I must still be missing something. The same codecs handling various `object` arrays (e.g. Categorize, JSON, MsgPack, Pickle, Vlen*) are calling `buffer_tobytes`, which has been using `tobytes` on those NumPy arrays (and we are reproducing identical results with `to_buffer`). IOW do these codecs have bugs? At least a few of them appear to be using some sort of legacy implementation, in which case we know they were broken (hence deprecated and being replaced), but have not verified they are all like this. In any event, if we raise for `object` type arrays in `to_buffer`, we just cause some tests for these codecs to fail.
IIRC those codecs only use `buffer_tobytes` on decode, i.e., to get from the serialised data back to an object array. Going in the other direction (encode) they do something different.
Thanks for the clarification. Went through that code just to make sure I understood it correctly. Agree that doesn't seem to be the cause. Looking more closely at the failures, it appears that all of them are coming from tests named

Edit: Yeah, it appears some of the fixtures have
As encoding `object` arrays to `bytes` does not really make sense (one just gets the addresses of the underlying Python objects, i.e. pointers), raise if any data provided is of `object` type. This can be checked easily with an `ndarray`. In other cases, it can be checked through a `memoryview` of the data, though there are some edge cases on Python 2 which need special handling.
When running the `check_backwards_compatibility` test, make sure not to call `buffer_tobytes` on `object` arrays from fixture data. Casting these to bytes doesn't really make sense, as it will return a representation of the pointers in the underlying buffer. Not only are the pointers something we do not want to be comparing, they are also an internal detail of NumPy and Python that isn't reliable either. So make sure to pass the `object` arrays through as-is.
Includes a simple check involving an `object` array and a `memoryview` of the `object` array to ensure that these cases are being caught and raised by `buffer_tobytes`.
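A minimal sketch of the kind of guard these commits describe (the function name here is hypothetical, not the numcodecs API): check the dtype for `ndarray`s, and the buffer format otherwise.

```python
import numpy as np


def check_not_object(buf):
    # Hypothetical guard: reject object-typed data before encoding,
    # since its buffer contents are just pointers.
    if isinstance(buf, np.ndarray):
        if buf.dtype == object:
            raise ValueError('cannot encode object arrays')
    elif memoryview(buf).format == 'O':
        raise ValueError('cannot encode object buffers')
    return buf
```

Plain byte buffers and numeric arrays pass through untouched; only `object`-typed data raises.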
Some minor comments, generally awesome how this is simplifying the codec implementations!
As all of these functions coerce their buffers with `to_buffer` and `to_buffer` now checks for `object` type and raises, there is no need for an additional check before encoding data in these codecs. So go ahead and drop this code.
This preserves the itemsize of the underlying type, which could be handy information for compression algorithms to take advantage of (if they are able to).
This reverts commit 4425833.
When attempting to coerce an object to a `memoryview` to determine the buffer's type on Python 2, just use a `try...except` to ignore all cases that fail. These are generally things like `array` or `mmap`, which cannot hold object types anyway, so they can be presumed safe and ignored. We are only interested in checking things like `memoryview`s of other objects that already contain `object`s, or some third-party `ndarray`-like class that holds `object`s. If any of these show up, raise a `ValueError` per usual.
Simply use a small anonymous memory segment from `mmap` to test `buffer_tobytes`. This should be good for exercising a case similar to the memory-mapped files that Zarr users may want. Should add that CPython automatically cleans this up for us once all reference counts go to zero. ref: https://stackoverflow.com/a/50460828
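A minimal version of such a test: an anonymous map (fd `-1`) stands in for a memory-mapped file, and its contents are read back through a view.

```python
import mmap

# Anonymous memory map standing in for a memory-mapped file.
m = mmap.mmap(-1, 16)
m.write(b'\x01' * 16)

view = memoryview(m)
data = bytes(view)  # the copy a buffer_tobytes-style helper would make
assert data == b'\x01' * 16

view.release()  # exported views must be released before the map closes
m.close()       # CPython frees the segment once nothing references it
```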
Make sure the unicode `array`s are being cast to unsigned integers of some size.
Provides a Python 2/3 `getbuffer` function that tries to mimic NumPy's `getbuffer` functionality on Python 3 and simply uses NumPy's implementation on Python 2.
Works around some pain points with unicode `array` types on Windows specifically.
Perform both `memoryview` casts for unicode `array`s one after the other. This ensures that the `memoryview` has the expected replacement type before taking the `ndarray` view. The behavior for an end user appears to be essentially identical; however, it's arguably nicer to ensure the underlying `memoryview` has the intended type and shape. Also updates the `ndarray` casting block at the end to be Python 2 only, as the unicode `array` case (the only one in need of casting on Python 3) is handled before constructing the `ndarray`, so there is no need for additional casting of `ndarray`s on Python 3.
If coercion to a `memoryview` does not work on Python 3, re-raise the original `TypeError` exception.
If an object is provided that does not implement either buffer protocol, a `TypeError` should be raised. This adds a test to ensure that `TypeError`s are being raised for non-buffer-supporting types.
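The underlying behavior is simply that of `memoryview` itself: objects that export no buffer raise `TypeError`, which such a test can check.

```python
# Objects implementing no buffer protocol cannot be viewed, so a
# TypeError propagates, which is the behavior the test checks for.
raised = False
try:
    memoryview(1.5)  # floats export no buffer
except TypeError:
    raised = True
assert raised
```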
Any thoughts on this @zarr-developers/core-devs?
Hi @jakirkham, just taking a look at this. I very much appreciate the work that has gone in here, and there are technical improvements that we definitely want to get into the codebase. But I have to say I am finding the logic hard to follow, and I'm a bit concerned that others will also find it hard to follow if we need to revisit this in future, which we will do at least when we drop PY2 support. Don't take this as a final opinion, as I'm still working on understanding everything that's going on here, but FWIW there are two general points regarding ease of comprehension that I think it would be really valuable to find a solution for.

The first is that it's not immediately clear what logic is specific to PY2 compatibility. So when we come to drop PY2 support, it will not be a question of just deleting all

The second is that there is some sense of circularity, at least in how the functions are named. E.g., the function

I don't have an immediate proposal on how to solve either of these, but wanted to raise it for discussion. I'll give it some further thought.
This is no longer used as we have switched to `np.getbuffer` on Python 2 to ensure that read-write buffers are used where possible like with `memoryview` on Python 3.
This was only being used as a convenience function for one test. As it results in a bit of indirection and can be easily inlined for simplicity, just inline it and drop the `getbuffer` compatibility function.
Thanks @alimanfoo. Appreciate the feedback.

Have to say I'm sorry this PR has become harder to understand. This is my fault, as it started out addressing a small point but began to sprawl as we saw further simplifications could be made. Not to mention that as work went into this it became clear there was a better direction we could go. Think the most important change here is

However there are a number of situations that, based on our discussion, we wanted to gather into one place. For example, we wanted to check for

As to the Python 2/3 point, I think the confusion of utility functions is muddling this point. In actuality the compat code has shrunk quite significantly. There only is compat code in the

Hope that helps provide some bearings on this PR. Sorry again that this PR has become overgrown. Would just like to stress this PR is trying to do two things: consolidate handling of buffers, Python 2/3 differences, etc. by turning them all into
As the buffer is coerced by NumPy into an array without copying, NumPy needs to be able to handle the data type. However on Windows the unicode array holds UCS2, whereas NumPy only supports UCS4. To fix this issue, just cast the `memoryview` to `uint8`. We don't particularly care about the type for this test, only that data is shared. So this should work around the issue and allow the check to occur.
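A sketch of the workaround described above: casting the unicode `array`'s view to bytes sidesteps the UCS2/UCS4 mismatch, since only memory sharing matters for the test. (Note the `'u'` typecode is deprecated in newer Pythons but still works.)

```python
import array

import numpy as np

# A 'u' array holds UCS2 code units on Windows but UCS4 elsewhere;
# NumPy only understands UCS4, so take a byte-level view instead.
a = array.array('u', 'abc')
mv = memoryview(a).cast('B')
arr = np.asarray(mv)  # shares memory with the array
assert arr.dtype == np.uint8
assert arr.nbytes == len(a) * a.itemsize  # itemsize is 2 or 4 by platform
```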
Thanks @jakirkham, no problem at all. This is a really hard one to get right, there are so many moving parts - PY2 quirks, numpy quirks, compressor quirks. Very difficult to find a clean solution that accommodates all of these in an easy-to-digest way. FWIW I think the key thing is that a codec should be able to express what type of input it requires, in a completely clear and obvious way. E.g., if a codec needs an object exporting C-contiguous memory via the new-style buffer interface, there should be an appropriately-named function for handling that. On the other hand, if a codec just needs a numpy array but doesn't care if it's C-contiguous, there should be a function for that.
Yeah, that makes sense. The other tricky part in my mind is finding the right number of utility functions. Certainly the naming in this PR needs work. FWIW I wound up rolling a small library called
FWIW re naming, in #128 I've tentatively gone with

Re cybuffer, it looks fierce and I'm sure it will be an invaluable resource, but for numcodecs I think we are doing OK with the approach we have in here / #128?
Right, the naming point was more lamenting the poor naming choices in this PR. Agree that what has been added in PR ( #128 ) is fine. After the release it may be worth revisiting whether we want to pull content from
Thanks @jakirkham, happy to revisit this after release if there's a better way.
Also just wanted to add that, thanks primarily to work in this PR, I think numcodecs is going to be in a much better place than it was before. I think it will also be a much better foundation for others wanting to contribute more codecs. Beers all round 🍻.
This is a reworking of PR ( #119 ). The main addition is the `to_buffer` function, which tries as best as possible to coerce input data into flat, contiguous, `unsigned char` views of the original data (without copying). This tries to follow the intent of the original `buffer_tobytes` implementation as much as possible. As a result `buffer_tobytes` is rewritten to be a thin wrapper around `to_buffer`.

Also tries to play to the strengths of Python 2 and Python 3 without too much special casing in other parts of the code. This amounts to returning `buffer` on Python 2 and `memoryview` on Python 3. As `memoryview` support is rather weak in Python 2, using `buffer` there ends up being a good choice. Besides, creating a `memoryview` from a `buffer` in Python 2 is trivial, so this option is always available when needed (the reverse is not true, however).

Updates the `BZ2`, `GZip`, `LZMA` and `Zlib` codecs to utilize `to_buffer` directly. This avoids the need for `tobytes` calls and `buffer_tobytes` in a few places. Dropping these copies should aid performance in these codecs.

Further this supplants the need for `handle_datetime`, as views of data are always made to be `unsigned char` in all cases. Hence `handle_datetime` is dropped throughout. Also various checks for `object` type arrays are dropped throughout, as this is also checked and handled in `to_buffer`. Generally consolidating the special casing in `to_buffer` allows the codecs to succinctly and clearly describe the main steps. Hopefully this makes it easier for others to get started adding their own codecs (not that it was difficult before).

Adds tests of `to_buffer` alongside tests of `buffer_tobytes`. Includes some test cases for expected errors to ensure they are raised.

TODO:

- [ ] `tox -e py37` passes locally
- [ ] `tox -e py27` passes locally
- [ ] `tox -e docs` passes locally