
Refactor and use more buffer view coercion #121

Closed
wants to merge 67 commits into from

Conversation

@jakirkham (Member) commented Nov 9, 2018

This is a reworking of PR ( #119 ). The main addition is the to_buffer function, which tries, as far as possible, to coerce input data into a flat, contiguous, unsigned character view of the original data (without copying). This follows the intent of the original buffer_tobytes implementation as much as possible. As a result, buffer_tobytes is rewritten to be a thin wrapper around to_buffer.

Also tries to play to the strengths of Python 2 and Python 3 without too much special casing in other parts of the code. This amounts to returning buffer on Python 2 and memoryview on Python 3. As memoryview support is rather weak in Python 2, using buffer there ends up being a good choice. Besides, creating a memoryview from a buffer in Python 2 is trivial, so this option is always available when needed (the reverse is not true, however).

Updates the BZ2, GZip, LZMA and Zlib codecs to utilize to_buffer directly. This avoids the need for tobytes calls and buffer_tobytes in a few places. Dropping these copies should aid performance in these codecs.

Further, this supplants the need for handle_datetime, as views of the data are always made unsigned char. Hence handle_datetime is dropped throughout. Various checks for object type arrays are also dropped throughout, as this is likewise checked and handled in to_buffer. Generally this consolidates the special casing in to_buffer, allowing the codecs to describe their main steps succinctly and clearly. Hopefully this makes it easier for others to get started adding their own codecs (not that it was difficult before).

Adds tests of to_buffer alongside tests of buffer_tobytes. Includes some test cases for expected errors to ensure they are raised.
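For illustration, a minimal sketch of the kind of view to_buffer aims to produce on Python 3 (the actual implementation handles many more corner cases):

```python
import numpy as np

a = np.arange(6, dtype='i4').reshape(2, 3)

# a flat, contiguous, unsigned char view of the original data -- no copy made
v = memoryview(a).cast('B')

assert v.ndim == 1 and v.format == 'B'
assert bytes(v) == a.tobytes()  # same bytes as a copy would produce
```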

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py37 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jakirkham mentioned this pull request Nov 9, 2018
While `buffer` is not exactly equivalent to `memoryview` on Python 3,
and Python 2 already has `memoryview`, support for the `buffer` type is
much better on Python 2, just as support for `memoryview` is much
better on Python 3. Plus this is what a lot of Python 2/3 compatibility
guides suggest when moving to Python 3. So this seems like a reasonable
alias to add.
Provides a compatibility function for converting an object into a buffer
or memoryview. Makes sure the Python 2 and Python 3 buffers behave
roughly the same. Use the `buffer` type on Python 2 as more things
support it and it can easily be coerced to a `memoryview`. On Python 3
use a `memoryview` as there is no `buffer` type and `memoryview`s work
well on Python 3. Ensure the Python 3 `memoryview` is flattened and of
`unsigned char` type, which is in line with the `memoryview`-coerced
Python 2 buffer. More importantly we generally want to work with
flattened byte types anyways, so this provides us a consistent path to
getting there while still providing a view onto the underlying data
(without copying on either Python).
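A minimal sketch of the compatibility function described above (assuming C-contiguous input; the real implementation also handles various `ndarray` corner cases):

```python
import sys

PY2 = sys.version_info[0] == 2


def to_buffer(v):
    if PY2:  # pragma: no cover
        # ``buffer`` is already a flat, byte-oriented view on Python 2
        return buffer(v)  # noqa: F821 -- ``buffer`` only exists on PY2
    # on Python 3, flatten and view as unsigned char (requires contiguity)
    return memoryview(v).cast('B')
```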
Effectively `to_buffer` is a refactoring of the functionality in
`buffer_tobytes`. Namely it handles taking a view of the underlying data
and ensuring that view is flattened, contiguous, and cast to `unsigned
char`. Though it doesn't actually perform the copy yet.

Hence `buffer_tobytes` leverages `to_buffer` to ensure the data is ready
to copy. All that is needed is coercion to a `memoryview`. Once coerced
the `tobytes` method can be called to return the `bytes` needed for the
encoding or decoding operation. This causes a copy, but is unavoidable
at this stage.
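Sketched against the `to_buffer` above, the wrapper is then roughly:

```python
def buffer_tobytes(v):
    # ``to_buffer`` has already flattened and cast the view;
    # ``tobytes`` performs the one unavoidable copy
    return memoryview(to_buffer(v)).tobytes()
```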
Make use of `to_buffer` in BZ2's `encode` function to avoid a copy to
`bytes` due to non-contiguous data. This ensures the view of the data is
a flattened one. Also removes the need for functions like
`handle_datetime`, which this drops.
Ensures that we are working with a consistent view onto the underlying
data with both Python 2 and Python 3. The `buffer` or `memoryview`
returned on Python 2 or Python 3 respectively can be comfortably used
with `BytesIO` and `GzipFile`'s `write` with the minimal necessary
overhead. Also removes the need for functions like `handle_datetime`, as
`to_buffer` already ensures an unsigned character view of the data.
Make use of `to_buffer` in LZMA's `encode` function to avoid a copy to
`bytes` due to non-contiguous data. This ensures the view of the data is
a flattened one. Also drop the need for `handle_datetime` as `to_buffer`
already takes an unsigned character view of the data.
Ensures that we are working with a consistent view onto the underlying
data with both Python 2 and Python 3. The `buffer` or `memoryview`
returned on Python 2 or Python 3 respectively can be comfortably used
with `zlib`'s `compress` and `decompress` functions with the minimal
necessary overhead. Also drops the need for `handle_datetime` as
`to_buffer` already takes an unsigned character view of the data.
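As a rough sketch of the pattern these codec updates follow (free functions here for brevity; the real codecs are classes with `encode`/`decode` methods):

```python
import zlib


def zlib_encode(buf, level=1):
    # ``zlib.compress`` accepts any buffer-protocol object, so passing
    # the view directly avoids an intermediate ``bytes`` copy
    return zlib.compress(to_buffer(buf), level)


def zlib_decode(buf):
    return zlib.decompress(to_buffer(buf))
```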
The `to_buffer` function already tries to coerce `ndarray`s into views
of `np.uint8` data. So there is no need to handle this through this
function any more.
@alimanfoo (Member) commented:

This is great and should definitely avoid some memory copies on Python 2!

Two thoughts...

The code to raise a `ValueError` if an object array is passed could reasonably live within the to_buffer() function, I think. Taking a buffer of an object array is pretty much always going to lead to bad things.

I think we should only view as a different dtype when we have to, otherwise leave the dtype of a ndarray and/or memoryview/buffer as-is. It is possible that some compressors might want to access this information, it is accessible via the buffer interface. E.g., currently the blosc compressor will make use of the buffer's itemsize to determine how to apply the shuffle filter. The blosc compressor doesn't make use of the to_buffer() function so this is not directly relevant, but is just an illustration of why a compressor might want dtype information to be preserved where possible. Hence we should probably only make a view for datetime/timedelta arrays, and when we do we should preserve the itemsize.
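A quick illustration of the dtype information at stake (Python 3):

```python
import numpy as np

a = np.zeros(4, dtype='f8')

memoryview(a).itemsize            # 8 -- a compressor (e.g. blosc's shuffle
                                  # filter) could exploit this
memoryview(a).cast('B').itemsize  # 1 -- lost after an unconditional cast
```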

@jakirkham (Member Author) commented:

Thanks for the feedback.

Yeah, hadn't meant to optimize Python 2 necessarily, but hopefully the code simplification is appreciated nonetheless. Would expect some performance bump on Python 3 as well. There are some other benefits outlined below.

So here's the really cool thing about object arrays. Basically a memoryview cast to unsigned char type and flattened has the same layout as calling tobytes on the object array, but provides a view instead. Example below on Python 3. On Python 2, casting and flattening are not options with memoryviews. However, the buffer type has the same effect (additionally marking the data as read-only, which we could extend to Python 3).

In [1]: import numpy as np

In [2]: a = np.array(["22", 2+0j, 1, 1.3], dtype=object)

In [3]: np.asarray(memoryview(a.tobytes()))
Out[3]: 
array([160,  64, 228,   6,   1,   0,   0,   0, 240, 112,  83,  19,   1,
         0,   0,   0,  80,  72, 143,   5,   1,   0,   0,   0,  32, 237,
        28,  19,   1,   0,   0,   0], dtype=uint8)

In [4]: np.asarray(memoryview(a).cast('B'))
Out[4]: 
array([160,  64, 228,   6,   1,   0,   0,   0, 240, 112,  83,  19,   1,
         0,   0,   0,  80,  72, 143,   5,   1,   0,   0,   0,  32, 237,
        28,  19,   1,   0,   0,   0], dtype=uint8)

The advantage here is that we are able to get a view equivalent to calling tobytes on the object array, but without copying to a bytes instance yet. This of course extends to all other types as well. If the copy happens anyways, we simply make sure not to cause it any earlier. However there is the potential to exploit this view for higher performance.

That's a good point regarding typing. If we think to_buffer might be useful in other cases, I would be happy to soften this unsigned char constraint. At least the CPython-included compressors (going off of the Python 3 implementations) honor the buffer protocol for getting input data (a nice benefit of skipping a tobytes call 😉). However, once they get the Py_buffer object they just act on the data as char* and seem to ignore the itemsize and format attributes. So it shouldn't matter for the cases changed here, but am happy to look more closely if you think it will be useful for other use cases.

@alimanfoo (Member) commented Nov 9, 2018 via email

Limits explicit casting to `unsigned char` to datetime and timedelta
data only. Preserves the types of the underlying data in all other cases.
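A minimal sketch of the relaxed casting (helper name hypothetical):

```python
import numpy as np


def _cast_if_needed(a):
    # datetime64/timedelta64 cannot be exported via the buffer protocol,
    # so view them as unsigned integers of the same itemsize; all other
    # dtypes pass through untouched, preserving their itemsize
    if a.dtype.kind in 'Mm':
        return a.view('u%d' % a.dtype.itemsize)
    return a
```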
@jakirkham (Member Author) commented:

Have relaxed the casting in the latest commit. Please let me know what you think.

@jakirkham (Member Author) commented:

As to the object array case, think you are right that this is what is happening under the hood. That said, I must still be missing something. The same codecs handling various object arrays (e.g. Categorize, JSON, MsgPack, Pickle, Vlen*) are calling buffer_tobytes, which has been using tobytes on those NumPy arrays (and we are reproducing identical results with to_buffer). IOW do these codecs have bugs? At least a few of them appear to be using some sort of legacy implementation, in which case we know they were broken (hence deprecated and being replaced), but have not verified they are all like this. In any event, if we raise for object type arrays in to_buffer, we just cause some tests for these codecs to fail.

@alimanfoo (Member) commented Nov 9, 2018 via email

@jakirkham (Member Author) commented Nov 9, 2018

Thanks for the clarification. Went through that code just to make sure I understood it correctly. Agree that doesn't seem to be the cause.

Looking more closely at the failures it appears that all of them are coming from tests named test_backwards_compatibility in different codec tests. They all seem to trace back to this line.

Edit: Yeah, it appears some of the fixtures have object arrays in them. IDK if that is expected. In any event, those object arrays are getting passed to buffer_tobytes as part of the tests.

As encoding `object` arrays to `bytes` does not really make sense (one
merely gets the raw object pointers), raise if any data provided is of
`object` type. This can be checked easily with an `ndarray`. In other
cases, it can be checked through a `memoryview` of the data. Though
there are some edge cases on Python 2 which need special handling.
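A sketch of the guard described (function name hypothetical):

```python
import numpy as np


def _guard_object_array(v):
    # an ``object`` array's buffer exposes raw pointers, never the data
    if isinstance(v, np.ndarray) and v.dtype == object:
        raise ValueError('cannot obtain buffer of object array')
```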
When running the `check_backwards_compatibility` test, make sure not to
call `buffer_tobytes` on `object` arrays from fixture data. Casting
these to bytes doesn't really make sense, as it will return a
representation of the pointers in the underlying buffer. Not only are
the pointers something we do not want to be comparing, they are also an
internal detail of NumPy and Python that isn't reliable either. So make
sure to pass the `object` arrays through as-is.
Includes a simple check involving an `object` array and a `memoryview`
of the `object` array to ensure that these cases are being caught and
raised by `buffer_tobytes`.
@alimanfoo (Member) left a comment

Some minor comments, generally awesome how this is simplifying the codec implementations!

As all of these functions coerce their buffers with `to_buffer` and
`to_buffer` now checks for `object` type and raises, there is no need
for an additional check before encoding data in these codecs. So go
ahead and drop this code.
This preserves the itemsize of the underlying type, which could be handy
information for compression algorithms to take advantage of (if they are
able to).
When attempting to coerce an object to a `memoryview` to determine the
buffer's type on Python 2, just use a `try...except` to ignore all cases
that fail. These are generally things like `array` or `mmap`, which
cannot hold object types anyways. So they can be presumed safe and ignored.

We are only interested in checking for things like `memoryviews` of
other things that already contain `object`s or some third-party
`ndarray`-like class that holds `object`s. If any of these show up,
throw a `ValueError` per usual.
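A sketch of that check (name hypothetical):

```python
def _contains_objects(v):
    try:
        mv = memoryview(v)
    except TypeError:
        # e.g. ``array`` or ``mmap`` on Python 2; these cannot hold
        # Python objects, so a failed coercion is presumed safe
        return False
    # NumPy exports ``object`` arrays with buffer format 'O'
    return mv.format == 'O'
```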
Simply use a small anonymous memory segment from `mmap` to test
`buffer_tobytes`. This should be good for exercising a case similar to
the memory-mapped files that Zarr users may want. Should add that
CPython automatically cleans this up for us once all reference counts
go to zero.

ref: https://stackoverflow.com/a/50460828
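A sketch of the test (using the `buffer_tobytes` sketched above):

```python
import mmap

# anonymous, in-memory mapping; freed once all references are dropped
m = mmap.mmap(-1, 10)
m.write(b'0123456789')

assert buffer_tobytes(m) == b'0123456789'
```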
Make sure the unicode `array`s are being cast to unsigned integers of
some size.
Provides a Python 2/3 `getbuffer` function that tries to mimic NumPy's
`getbuffer` functionality on Python 3 and simply uses NumPy's
implementation on Python 2.
Works around some pain points with unicode `array` types on Windows
specifically.
Perform both `memoryview` casts for unicode `array`s one after the
other. This ensures that the `memoryview` has the expected replacement
type before taking the `ndarray` view. The behavior for an end user
appears to be essentially identical. However it's arguably nicer to
ensure the underlying `memoryview` has the intended type and shape.
Updates the `ndarray` casting block at the end to be Python 2 only, as
the unicode `array` case (the only one in need of casting on Python 3)
is handled before constructing the `ndarray`. So there is no need for
additional casting of `ndarray`s on Python 3.
If coercion to a `memoryview` does not work on Python 3, re-raise the
original `TypeError` exception.
If an object is provided that does not implement either buffer protocol,
a `TypeError` should be raised. This adds such a test to ensure that
`TypeError`s are being raised for non-buffer-supporting types.
@jakirkham (Member Author) commented:

Any thoughts on this @zarr-developers/core-devs?

@alimanfoo alimanfoo added this to the 0.6.0 milestone Nov 15, 2018
@alimanfoo (Member) commented:

Hi @jakirkham, just taking a look at this. I very much appreciate the work that has gone in here, and there are technical improvements that we definitely want to get into the codebase. But I have to say I am finding the logic hard to follow, and I'm a bit concerned that others will also find it hard to follow if we need to revisit this in future, which we will do at least when we drop PY2 support.

Don't take this as a final opinion, as I'm still working on understanding everything that's going on here, but FWIW there are two general points regarding ease of comprehension that I think it would be really valuable to find a solution for.

The first is that it's not immediately clear what logic is specific to PY2 compatibility. So when we come to drop PY2 support, it will not be a question of just deleting all if PY2: conditional branches. I think it would really help if we could get a clean separation between PY2 and PY3 code paths.

The second is that there is some sense of circularity, at least in how the functions are named. E.g., the function to_buffer() internally calls ndarray_from_buffer(). I think we need to get some clarity on what each of these helper functions is accepting as inputs and returning as outputs, and sort out the naming so it's more intuitive.

I don't have an immediate proposal on how to solve either of these, but wanted to raise for discussion. I'll give it some further thought.

This is no longer used, as we have switched to `np.getbuffer` on Python 2
to ensure that read-write buffers are used where possible, as with
`memoryview` on Python 3.
This was only being used as a convenience function for one test, which
results in a bit of indirection and can easily be inlined. So just
inline it and drop the `getbuffer` compatibility function.
@jakirkham (Member Author) commented:

Thanks @alimanfoo. Appreciate the feedback.

Have to say I'm sorry this PR has become harder to understand. This is my fault, as it started out addressing a small point but began to sprawl as we saw further simplifications could be made. Not to mention, as work went into this, it became clear there was a better direction we could go in.

Think the most important change here is ndarray_from_buffer. That function has become a general-purpose buffer-accepting function that will return an ndarray from any kind of buffer on Python 2/3. It tries, as far as possible, to return an ndarray that matches the shape and type of the original buffer. As such, many of the existing compat functions have either been dropped or largely replaced by calls to it.

However there are a number of situations that, based on our discussion, we wanted to gather into one place. For example, we wanted to check for object type in one place. Also we had to handle the datetime/timedelta workaround. Finally, some codecs expect a flattened array. So this logic has fallen into to_buffer. The name of the function is not great. The matter is complicated further as not every use case in Numcodecs needs, or can even have, these checks. Am half tempted to move this stuff back to the codecs, as it is relatively light and it appears having it in one place is confusing to others. Though am open to suggestions on how to improve this.

As to the Python 2/3 point, I think the confusion over utility functions is muddling this point. In actuality the compat code has shrunk quite significantly. There is only compat code in the compat.py module and its associated test suite (as opposed to throughout the codecs, as was the case before). In fact nearly all of it lives in ndarray_from_buffer and has pragma: ... no cover lines and/or PY2 checks for the relevant Python version, which generally improves usability and readability elsewhere. IMHO this is one of the biggest wins and was part of the objective of this PR (the other being to cut down on copies). Have tried to push a couple minor commits to highlight this further. Not sure how to improve this more, but open to suggestions.

Hope that helps provide some bearings on this PR. Sorry again that this PR has become overgrown. Would just like to stress this PR is trying to do two things: consolidate handling of buffers, Python 2/3 differences, etc. by turning them all into ndarrays, and cut down on copies of the data.

As the buffer is coerced by NumPy into an array without copying, NumPy
needs to be able to handle the data type. However, on Windows the
unicode array has UCS2, whereas NumPy only supports UCS4. To fix this
issue, just cast the `memoryview` to `uint8`. We don't particularly care
about the type for this test, only that the data is shared. So this
should work around the issue and allow the check to occur.
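A sketch of the workaround:

```python
import array

import numpy as np

a = array.array('u', u'foo')
# 'u' is UCS2 on Windows builds, which NumPy cannot view directly;
# casting the memoryview to unsigned bytes sidesteps the issue
v = np.asarray(memoryview(a).cast('B'))

assert v.dtype == np.uint8
```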
@alimanfoo (Member) commented:

Thanks @jakirkham, no problem at all. This is a really hard one to get right, there are so many moving parts - PY2 quirks, numpy quirks, compressor quirks. Very difficult to find a clean solution that accommodates all of these in an easy-to-digest way.

FWIW I think the key thing is that a codec should be able to express what type of input it requires, in a completely clear and obvious way. E.g., if a codec needs an object exporting C-contiguous memory via the new-style buffer interface, there should be an appropriately-named function for handling that. On the other hand, if a codec just needs a numpy array, but doesn't care if it's C-contiguous, there should be a function for that.

@jakirkham (Member Author) commented:

Yeah, that makes sense.

The other tricky part in my mind is finding the right number of utility functions. Certainly the naming in this PR needs work.

FWIW I wound up rolling a small library called cybuffer that takes anything supporting either buffer protocol and creates an object that exports buffer- and memoryview-compatible interfaces and works on Python 2/3. It is written in Cython, so we could use it to solve similar issues there, and it is accessible from Python. IDK if we want to use it or not, but it did help me learn more about the problem.

@alimanfoo (Member) commented Nov 27, 2018

FWIW re naming, in #128 I've tentatively gone with ensure_ndarray() to mean make sure I've got a numpy array; and ensure_contiguous_ndarray() to mean make sure I've got a numpy array that can export C contiguous memory via the new-style buffer interface. In both cases, no memory copies will be made under any circumstances.
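Assuming the #128 API lands as described, usage would look roughly like:

```python
from numcodecs.compat import ensure_contiguous_ndarray, ensure_ndarray

a = ensure_ndarray(b'some bytes')  # a zero-copy view as a numpy array
c = ensure_contiguous_ndarray(a)   # raises, rather than copies, if the
                                   # memory is not C-contiguous
```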

Re cybuffer, it looks fierce and I'm sure it will be an invaluable resource, but for numcodecs I think we are doing OK with the approach we have in here / #128?

@jakirkham (Member Author) commented Nov 27, 2018

Right, the naming point was more lamenting the poor naming choices in this PR. Agree that what has been added in PR ( #128 ) is fine.

After the release it may be worth revisiting whether we want to pull content from cybuffer or perhaps use it directly. In particular, the bulk of the content in PR ( #128 )'s ensure_ndarray can be shoved into Buffer, which would be a simplified version of what is in cybuffer's __cinit__.

@jakirkham jakirkham removed this from the 0.6.0 milestone Nov 27, 2018
@jakirkham jakirkham closed this Nov 27, 2018
@jakirkham jakirkham deleted the use_more_buffers branch November 27, 2018 18:52
@alimanfoo (Member) commented:

Thanks @jakirkham, happy to revisit this after release if there's a better way.

@alimanfoo (Member) commented:

Also just wanted to add that thanks primarily to work in this PR I think numcodecs is going to be in a much better place than it was before. I think it will also be a much better foundation for others wanting to contribute more codecs. Beers all round 🍻.
