ARROW-759: [Python] Serializing large class of Python objects in Apache Arrow #965
Conversation
Nice! This looks like a great start. I have a bunch of mostly nitpicks, but overall I'm really interested in fleshing this out into a very strong alternative to pickle when dealing with lots of tables and tensors in a Python collection.
cpp/src/arrow/python/CMakeLists.txt

```
pyarrow.cc
sequence
```

Missing a `.cc` here?
```
#if PY_MAJOR_VERSION >= 3
#define PyInt_FromLong PyLong_FromLong
#endif
```

Could this go in `common.h`?
```
#define PyInt_FromLong PyLong_FromLong
#endif

Status get_value(std::shared_ptr<Array> arr, int32_t index, int32_t type, PyObject* base,
```

`GetValue`?
```
Status DeserializeList(std::shared_ptr<Array> array, int32_t start_idx, int32_t stop_idx,
                       PyObject* base, const std::vector<std::shared_ptr<Tensor>>& tensors, PyObject** out) {
  DESERIALIZE_SEQUENCE(PyList_New, PyList_SetItem)
```

I think you may be able to do this using a template instead of a macro if you wanted, see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pandas_to_arrow.cc#L905

Use `PyList_SET_ITEM` since it skips error checking (you aren't checking the errors anyway)?
OK, I made it a template now, but then I can't use `PyList_SET_ITEM` anymore since it is a macro; I'd suggest we keep it a template for now, and if this becomes a performance bottleneck later, we make it a macro again?

You'd have to pass a C++ lambda. Macro is OK by me.
```
  }               \
}                 \
*out = result;    \
return Status::OK();
```

If this exits prematurely, `result` leaks.
python/pyarrow/serialization.pxi

```
        "the object '{}'".format(obj), obj)
    return dict(serialized_obj, **{"_pytype_": type_id})

def deserialization_callback(serialized_obj):
```

public?
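For readers outside this thread, here is a minimal sketch of the contract this callback pair implements. The `Foo` class and the dispatch logic are hypothetical; only the `"_pytype_"` tagging follows the snippet above:

```python
# Hypothetical sketch of the serialization callback contract; only the
# "_pytype_" tagging mirrors the actual code above.
class Foo:
    def __init__(self, x):
        self.x = x

def serialization_callback(obj):
    # Reduce the object to Arrow-representable data, then tag the result
    # with its type id so deserialization can dispatch on it.
    if isinstance(obj, Foo):
        serialized_obj = {"x": obj.x}
        type_id = "Foo"
        return dict(serialized_obj, **{"_pytype_": type_id})
    raise NotImplementedError(obj)

def deserialization_callback(serialized_obj):
    # Invert serialization_callback by dispatching on the "_pytype_" tag.
    if serialized_obj["_pytype_"] == "Foo":
        return Foo(serialized_obj["x"])
    raise NotImplementedError(serialized_obj)
```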
python/pyarrow/serialization.pxi

```
    check_status(deref(writer).Close())

    for tensor in value.tensors:
        check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))
```

with nogil?
python/pyarrow/serialization.pxi

```
    for tensor in value.tensors:
        check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))

def read_python_object(NativeFile source):
```

I'm wondering a bit why `read_python_object` and `write_python_object` need to be public. If you pass a NativeFile to `serialize_*` then it should write the result to that; otherwise return a Buffer or byte string.
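A runnable sketch of the interface shape being suggested here (`serialize_sequence` is a placeholder, not the final pyarrow API, and pickle merely stands in for the actual Arrow serialization step so the sketch runs):

```python
import pickle

def serialize_sequence(value, sink=None):
    # pickle.dumps is only a stand-in; the real implementation would
    # produce an Arrow buffer instead.
    buf = pickle.dumps(value)
    if sink is not None:
        # With a file-like sink (e.g. a NativeFile), write directly into it.
        sink.write(buf)
        return None
    # Without a sink, hand the result back as a buffer / byte string.
    return buf

# buf = serialize_sequence([1, 2, 3])          # returns a byte string
# serialize_sequence([1, 2, 3], some_sink)     # writes into the sink
```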
python/pyarrow/serialization.pxi

```
    check_status(DeserializeList(deref(value.batch).column(0), 0, deref(value.batch).num_rows(), <PyObject*> base, value.tensors, &result))
    return <object> result

def write_python_object(PythonObject value, int32_t num_tensors, NativeFile sink):
```

Any benefits to doing this in Cython (vs. C++)?
```
def serialization_roundtrip(value, f):
    f.seek(0)
    serialized, num_tensors = pa.lib.serialize_sequence(value)
    pa.lib.write_python_object(serialized, num_tensors, f)
```

Why not `pa.serialize_sequence(value, f)` and `result = pa.deserialize_sequence(f)` below?
Looks like some of my comments got messed up while you were pushing commits -- make sure to expand the "outdated" notes =)

Thanks a lot for the thorough review =) I'll try to fix the comments ASAP.
I'm working on the test failures.

Great, thanks! I gave you access to my fork, feel free to push any changes. I'll switch to working on the huge page table PR and the entry points PR.
I think I got everything; I'll get the build passing and then take a last look.
For future reference, I find that using clang to build and using …
Appears we are missing some DLL exports for Windows (…).
Added some style nitpick comments, mainly with the goal of making the code more understandable to non-Arrow-core developers.
```
namespace py {

void set_serialization_callbacks(PyObject* serialize_callback,
                                 PyObject* deserialize_callback);
```

Can you add descriptive comments to these three functions?

done
```
namespace py {

Status ReadSerializedPythonSequence(std::shared_ptr<io::RandomAccessFile> src,
```

Can you add some descriptive comments here?

done
```
namespace arrow {
namespace py {

#define UPDATE(OFFSET, TAG) \
```

Seems like these macros could also be written as templated functions. Anything that would prevent that?

done
```
/// List containing the data from nested dictionaries in the
/// value list of the dictionary
Status Finish(std::shared_ptr<Array> key_tuple_data,
              std::shared_ptr<Array> key_dict_data,
```

Make the shared_ptr arguments constant references; this should avoid needless copies.

done
@pcmoritz there were several flake8 warnings. In my environment I have this bash function which helps with catching these:

```
function arrow_preflight {
  ARROW_PREFLIGHT_DIR=$HOME/code/arrow/cpp/preflight
  mkdir -p $ARROW_PREFLIGHT_DIR
  pushd $ARROW_PREFLIGHT_DIR
  cmake -GNinja ..
  ninja format
  ninja lint
  popd
  pushd $HOME/code/arrow/python
  flake8 pyarrow
  flake8 --config=.flake8.cython pyarrow
  popd
}
```
@pcmoritz @robertnishihara I'm going to mark these APIs experimental. We might add an additional set of functions to help with determining the amount of output space required, e.g.:

```
import pyarrow as pa

serialized = pa.serialize(obj)

# This would use MockOutputStream
total_bytes = serialized.total_bytes

buf = ...allocate(total_bytes)
serialized.write_to(buf)
```
That sounds good.
I added a minimal version of this. We should return to it in a subsequent patch, harden it with more unit tests, and check more rigorously for memory leaks.

cc @cpcloud
Current benchmarks (I haven't dug in to investigate where the time is being spent):

```
In [1]: import numpy as np

In [2]: import pyarrow as pa

In [3]: import pickle

In [4]: arrays = [np.random.randn(100, 100) for i in range(1000)]

In [5]: %timeit buf = pa.serialize(arrays).to_buffer()
34.8 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: buf = pa.serialize(arrays).to_buffer()

In [8]: %timeit deserialized = pa.deserialize(buf)
2.05 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: pickled = pickle.dumps(arrays)

In [10]: %timeit pickled = pickle.dumps(arrays)
37.1 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit unpickled = pickle.loads(pickled)
11.1 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

So for a simple list of arrays, serialization takes about 10% less time than pickle, but deserialization takes ~80% less time. Since this is only 80 MB of data (1000 arrays × 100 × 100 doubles × 8 bytes), the savings should grow as the total size of the serialized object increases.
It's amazing how this came together; thanks everybody so much for all the help! These are great changes. Concerning the benchmarks: for serialization, it helps a bunch to use the FixedSizeBufferOutputStream and multithreading to write the bytes. Also excited to see how performance improves if we write things to Plasma with the recent hugepages PR ;)

Concerning more rigorous tests, there is a good suite of tests at https://github.com/jsonpickle/jsonpickle/tree/master/tests that we could borrow from or port.
Just fixed a last buglet and will hopefully get a passing Travis CI build (Appveyor will fail because of ARROW-1375). Will merge this; let's open JIRAs for the follow-up work!

+1, you should go ahead and merge it :)
This PR adds the capability to serialize a large class of (nested) Python objects in Apache Arrow. The eventual goal is to evolve this into a more modern version of pickle that will make it possible to read the data from other languages supported by Apache Arrow (and might also be faster).
Currently we support lists, tuples, dicts, strings, numpy objects, Python classes and namedtuples. A fallback to (cloud-)pickle can be provided for objects that cannot be natively represented in Arrow (for example lambdas).
NumPy data within objects is efficiently represented using Arrow's Tensor facilities, and nested Python sequences are represented with Arrow's UnionArray.
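For illustration, here is what such a round trip looks like with the `pa.serialize` / `pa.deserialize` entry points used in the benchmark above (the object contents are made up):

```python
import numpy as np
import pyarrow as pa

# A nested object mixing the supported container and NumPy types.
obj = {
    "name": "example",
    "points": [np.arange(5), np.ones((2, 3))],
    "meta": ("a", 1, {"nested": [1.0, 2.0]}),
}

buf = pa.serialize(obj).to_buffer()  # serialize into an Arrow buffer
restored = pa.deserialize(buf)       # reconstruct the Python object

assert restored["name"] == "example"
assert np.array_equal(restored["points"][1], np.ones((2, 3)))
```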
There are many loose ends that will need to be addressed in follow up PRs.