
ARROW-759: [Python] Serializing large class of Python objects in Apache Arrow #965

Closed · 55 commits

@pcmoritz (Contributor) commented Aug 15, 2017

This PR adds the capability to serialize a large class of (nested) Python objects in Apache Arrow. The eventual goal is to evolve this into a more modern version of pickle that will make it possible to read the data from other languages supported by Apache Arrow (and might also be faster).

Currently we support lists, tuples, dicts, strings, numpy objects, Python classes and namedtuples. A fallback to (cloud-)pickle can be provided for objects that cannot be natively represented in Arrow (for example lambdas).

Numpy data within objects is efficiently represented using Arrow's Tensor facilities, and for nested Python sequences we use Arrow's UnionArray.

There are many loose ends that will need to be addressed in follow up PRs.
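
A rough sketch of the intended user-facing roundtrip, using the pa.serialize / pa.deserialize entry points this work converges on (see the benchmarks later in this thread):

import numpy as np
import pyarrow as pa

# A nested object mixing containers, strings, and numpy data
obj = {"name": "demo", "points": [np.arange(10), np.ones((2, 3))]}

buf = pa.serialize(obj).to_buffer()   # Arrow-native serialization
restored = pa.deserialize(buf)        # reconstruct the Python object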

@pcmoritz changed the title from "[Python] Serializing large class of Python objects in Apache Arrow" to "ARROW-759: [Python] Serializing large class of Python objects in Apache Arrow" on Aug 15, 2017
@wesm (Member) left a review:

Nice! This looks like a great start. I have a bunch of mostly nitpicks, but overall I'm really interested in fleshing this out into a very strong alternative to pickle when dealing with lots of tables and tensors in a Python collection.

pyarrow.cc
sequence
Member:

Missing a .cc here?


#if PY_MAJOR_VERSION >= 3
#define PyInt_FromLong PyLong_FromLong
#endif
Member:

Could this go in common.h?

#define PyInt_FromLong PyLong_FromLong
#endif

Status get_value(std::shared_ptr<Array> arr, int32_t index, int32_t type, PyObject* base,
Member:

GetValue?


Status DeserializeList(std::shared_ptr<Array> array, int32_t start_idx, int32_t stop_idx,
PyObject* base, const std::vector<std::shared_ptr<Tensor>>& tensors, PyObject** out) {
DESERIALIZE_SEQUENCE(PyList_New, PyList_SetItem)
Member:

I think you may be able to do this using a template instead of a macro if you wanted, see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pandas_to_arrow.cc#L905

Member:

Use PyList_SET_ITEM since it skips error checking (you aren't checking the errors anyway)?

@pcmoritz (author):

OK, I made it a template now, but then I can't use PyList_SET_ITEM any more since it is a macro. I'd suggest we keep it a template for now and, if this becomes a performance bottleneck later, make it a macro again?

Member:

You'd have to pass a C++ lambda. Macro is OK by me

} \
} \
*out = result; \
return Status::OK();
Member:

If this exits prematurely, result leaks.

"the object '{}'".format(obj), obj)
return dict(serialized_obj, **{"_pytype_": type_id})

def deserialization_callback(serialized_obj):
Member:

public?
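
(For context: a minimal sketch of how a callback pair like this round-trips a type Arrow cannot represent natively. Foo and these exact signatures are hypothetical; only the "_pytype_" tagging mirrors the code above.)

class Foo:
    def __init__(self, x):
        self.x = x

def serialization_callback(obj):
    # Reduce the object to natively serializable pieces and tag it
    # with its type so deserialization can dispatch on it.
    return {"x": obj.x, "_pytype_": "Foo"}

def deserialization_callback(serialized_obj):
    # Reconstruct the original object from the tagged dict.
    assert serialized_obj["_pytype_"] == "Foo"
    return Foo(serialized_obj["x"])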

check_status(deref(writer).Close())

for tensor in value.tensors:
check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))
Member:

with nogil?

for tensor in value.tensors:
check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))

def read_python_object(NativeFile source):
Member:

I'm wondering a bit why read_python_object and write_python_object need to be public. If you pass a NativeFile to serialize_*, then it should write the result to that; otherwise, return a Buffer or byte string.

check_status(DeserializeList(deref(value.batch).column(0), 0, deref(value.batch).num_rows(), <PyObject*> base, value.tensors, &result))
return <object> result

def write_python_object(PythonObject value, int32_t num_tensors, NativeFile sink):
Member:

Any benefits to doing this in Cython (vs. C++)?

def serialization_roundtrip(value, f):
f.seek(0)
serialized, num_tensors = pa.lib.serialize_sequence(value)
pa.lib.write_python_object(serialized, num_tensors, f)
Member:

Why not

pa.serialize_sequence(value, f)

and result = pa.deserialize_sequence(f) below?
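
(A sketch of the test helper with that surface; hypothetical, simply mirroring the proposal above and assuming the test module's import pyarrow as pa.)

def serialization_roundtrip(value, f):
    # Write straight to the NativeFile instead of exposing the
    # intermediate PythonObject and tensor count.
    f.seek(0)
    pa.serialize_sequence(value, f)
    f.seek(0)
    return pa.deserialize_sequence(f)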

@wesm (Member) commented Aug 16, 2017

Looks like some of my comments got messed up while you were pushing commits -- make sure to expand the "outdated" notes =)

@pcmoritz (author):

Thanks a lot for the thorough review =)

I'll try to fix the comments ASAP

@pcmoritz force-pushed the python-serialization branch 2 times, most recently from 7b32fbf to 5b03cd1 (August 16, 2017)
@wesm (Member) commented Aug 18, 2017

I'm working on the test failures

@pcmoritz (author):

Great, thanks! I gave you access to my fork; feel free to push any changes. I'll switch to working on the huge page table PR and the entry points PR.

@wesm (Member) commented Aug 18, 2017

I think I got everything; I'll get the build passing and then take a last look

@wesm (Member) commented Aug 18, 2017

For future reference, I find that building with clang and -DARROW_FLAGS="-Werror -Wconversion -Wno-sign-conversion" helps catch the common issues that cause failures in MSVC. Using clang with -Werror also catches unchecked Status returns.

@wesm (Member) commented Aug 19, 2017

Appears we are missing some DLL exports for Windows (ARROW_EXPORT). It's getting too late here tonight; I will take a look tomorrow.

@xhochy (Member) left a review:

Added some style nitpick comments, mainly with the goal of making the code more understandable to non-core Arrow developers.

namespace py {

void set_serialization_callbacks(PyObject* serialize_callback,
PyObject* deserialize_callback);
Member:

Can you add descriptive comments to these three functions?

Member:

done


namespace py {

Status ReadSerializedPythonSequence(std::shared_ptr<io::RandomAccessFile> src,
Member:

Can you add some descriptive comments here?

Member:

done

namespace arrow {
namespace py {

#define UPDATE(OFFSET, TAG) \
Member:

Seems like these macros could also be written as templated functions. Anything that would prevent that?

Member:

done

/// List containing the data from nested dictionaries in the
/// value list of the dictionary
Status Finish(std::shared_ptr<Array> key_tuple_data,
std::shared_ptr<Array> key_dict_data,
Member:

Make the shared_ptr arguments constant references; this avoids needless copies.

Member:

done

@wesm (Member) commented Aug 19, 2017

@pcmoritz there were several flake8 warnings. In my environment I have this bash function, which helps with catching these:

function arrow_preflight {
    ARROW_PREFLIGHT_DIR=$HOME/code/arrow/cpp/preflight
    mkdir -p $ARROW_PREFLIGHT_DIR
    pushd $ARROW_PREFLIGHT_DIR
    cmake -GNinja ..
    ninja format
    ninja lint
    popd
    pushd $HOME/code/arrow/python
    flake8 pyarrow
    flake8 --config=.flake8.cython pyarrow
    popd
}

@wesm (Member) commented Aug 19, 2017

@pcmoritz @robertnishihara I'm going to mark these APIs experimental. We might add an additional set of functions to help with determining the amount of output space required, e.g.:

import pyarrow as pa

serialized = pa.serialize(obj)

# This would use MockOutputStream
total_bytes = serialized.total_bytes

buf = ...allocate(total_bytes)
serialized.write_to(buf)
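
(For reference, a sketch of the measuring idea using a byte-counting sink. MockOutputStream is the Arrow class named above, though whether it is exposed to Python in exactly this form is an assumption.)

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["f0"])

sink = pa.MockOutputStream()    # counts bytes written, stores nothing
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
total_bytes = sink.size()       # output size needed, with no allocation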

@robertnishihara (Contributor):

That sounds good.

@wesm (Member) commented Aug 19, 2017

I added a minimal version of this. We should return to it in a subsequent patch to harden it with more unit tests and check more rigorously for memory leaks.

@wesm (Member) commented Aug 19, 2017

cc @cpcloud

@wesm (Member) commented Aug 19, 2017

Current benchmarks (I haven't dug in to investigate where the time is being spent):

In [1]: import numpy as np

In [2]: import pyarrow as pa

In [3]: import pickle

In [4]: arrays = [np.random.randn(100, 100) for i in range(1000)]

In [5]: %timeit buf = pa.serialize(arrays).to_buffer()
34.8 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: buf = pa.serialize(arrays).to_buffer()

In [8]: %timeit deserialized = pa.deserialize(buf)
2.05 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: pickled = pickle.dumps(arrays)

In [10]: %timeit pickled = pickle.dumps(arrays)
37.1 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit unpickled = pickle.loads(pickled)
11.1 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for a simple list of arrays, serialization takes about 10% less time, while deserialization takes ~80% less time. Since this is only 80 MB of data, the savings will grow as the total size of the serialized object increases.
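
(The 80 MB figure follows directly from the shapes used above:)

num_arrays, rows, cols = 1000, 100, 100
bytes_per_float64 = 8
total_mb = num_arrays * rows * cols * bytes_per_float64 / 1e6   # 80.0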

@pcmoritz (author):

It's amazing how this came together, thanks everybody so much for all the help! These are great changes.

Concerning the benchmarks: for serialization, it helps a bunch to use the FixedSizeBufferOutputStream and multithreading to write the bytes. I'm also excited to see how performance improves once we write things to Plasma with the recent hugepages PR ;)

@pcmoritz (author):

Concerning more rigorous tests: there is a good suite of tests at https://github.com/jsonpickle/jsonpickle/tree/master/tests that we could borrow from or port.

@wesm (Member) commented Aug 20, 2017

Just fixed a last buglet and will hopefully get a passing Travis CI build (Appveyor will fail because of ARROW-1375). I'll merge this; let's open JIRAs for the follow-up work!

@pcmoritz (author):

+1 you should go ahead and merge it :)

@asfgit closed this in b50f235 on Aug 20, 2017
@wesm deleted the python-serialization branch on August 20, 2017