
ARROW-759: [Python] Serializing large class of Python objects in Apache Arrow #965

Closed · 55 commits

@pcmoritz (Contributor) commented Aug 15, 2017

This PR adds the capability to serialize a large class of (nested) Python objects in Apache Arrow. The eventual goal is to evolve this into a more modern version of pickle that will make it possible to read the data from other languages supported by Apache Arrow (and might also be faster).

Currently we support lists, tuples, dicts, strings, numpy objects, Python classes and namedtuples. A fallback to (cloud-)pickle can be provided for objects that cannot be natively represented in Arrow (for example lambdas).

Numpy data within objects is efficiently represented using Arrow's Tensor facilities, and for nested Python sequences we use Arrow's UnionArray.

There are many loose ends that will need to be addressed in follow up PRs.
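
A rough sketch of the intended user-facing roundtrip, using the pa.serialize / pa.deserialize entry points this work converges on (see the benchmarks later in this thread):

import numpy as np
import pyarrow as pa

# A nested object mixing containers, strings, and numpy data
obj = {"name": "demo", "points": [np.arange(10), np.ones((2, 3))]}

buf = pa.serialize(obj).to_buffer()   # Arrow-native serialization
restored = pa.deserialize(buf)        # reconstruct the Python object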

@pcmoritz changed the title from "[Python] Serializing large class of Python objects in Apache Arrow" to "ARROW-759: [Python] Serializing large class of Python objects in Apache Arrow" on Aug 15, 2017
@wesm (Member) left a review:

Nice! This looks like a great start. I have a bunch of mostly nitpicks, but overall I'm really interested in fleshing this out into a very strong alternative to pickle when dealing with lots of tables and tensors in a Python collection.

pyarrow.cc
sequence
Member:

Missing a .cc here?


#if PY_MAJOR_VERSION >= 3
#define PyInt_FromLong PyLong_FromLong
#endif
Member:

Could this go in common.h?

#define PyInt_FromLong PyLong_FromLong
#endif

Status get_value(std::shared_ptr<Array> arr, int32_t index, int32_t type, PyObject* base,
Member:

GetValue?


Status DeserializeList(std::shared_ptr<Array> array, int32_t start_idx, int32_t stop_idx,
PyObject* base, const std::vector<std::shared_ptr<Tensor>>& tensors, PyObject** out) {
DESERIALIZE_SEQUENCE(PyList_New, PyList_SetItem)
Member:

I think you may be able to do this using a template instead of a macro if you wanted, see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pandas_to_arrow.cc#L905

Member:

Use PyList_SET_ITEM since it skips error checking (you aren't checking the errors anyway)?

@pcmoritz (author):

OK, I made it a template now, but then I can't use PyList_SET_ITEM any more since it is a macro. I'd suggest we keep it a template for now and, if this becomes a performance bottleneck later, make it a macro again?

Member:

You'd have to pass a C++ lambda. Macro is OK by me

} \
} \
*out = result; \
return Status::OK();
Member:

If this exits prematurely, result leaks.

"the object '{}'".format(obj), obj)
return dict(serialized_obj, **{"_pytype_": type_id})

def deserialization_callback(serialized_obj):
Member:

public?
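
(For context: a minimal sketch of how a callback pair like this round-trips a type Arrow cannot represent natively. Foo and these exact signatures are hypothetical; only the "_pytype_" tagging mirrors the code above.)

class Foo:
    def __init__(self, x):
        self.x = x

def serialization_callback(obj):
    # Reduce the object to natively serializable pieces and tag it
    # with its type so deserialization can dispatch on it.
    return {"x": obj.x, "_pytype_": "Foo"}

def deserialization_callback(serialized_obj):
    # Reconstruct the original object from the tagged dict.
    assert serialized_obj["_pytype_"] == "Foo"
    return Foo(serialized_obj["x"])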

check_status(deref(writer).Close())

for tensor in value.tensors:
check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))
Member:

with nogil?

for tensor in value.tensors:
check_status(WriteTensor(deref(tensor), stream.get(), &metadata_length, &body_length))

def read_python_object(NativeFile source):
Member:

I'm wondering a bit why read_python_object and write_python_object need to be public. If you pass a NativeFile to serialize_*, then it should write the result to that; otherwise, return a Buffer or byte string.

check_status(DeserializeList(deref(value.batch).column(0), 0, deref(value.batch).num_rows(), <PyObject*> base, value.tensors, &result))
return <object> result

def write_python_object(PythonObject value, int32_t num_tensors, NativeFile sink):
Member:

Any benefits to doing this in Cython (vs. C++)?

def serialization_roundtrip(value, f):
f.seek(0)
serialized, num_tensors = pa.lib.serialize_sequence(value)
pa.lib.write_python_object(serialized, num_tensors, f)
Member:

Why not

pa.serialize_sequence(value, f)

and result = pa.deserialize_sequence(f) below?
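
(A sketch of the test helper with that surface; hypothetical, simply mirroring the proposal above and assuming the test module's import pyarrow as pa.)

def serialization_roundtrip(value, f):
    # Write straight to the NativeFile instead of exposing the
    # intermediate PythonObject and tensor count.
    f.seek(0)
    pa.serialize_sequence(value, f)
    f.seek(0)
    return pa.deserialize_sequence(f)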

@wesm (Member) commented Aug 16, 2017

Looks like some of my comments got messed up while you were pushing commits -- make sure to expand the "outdated" notes =)

@pcmoritz (author):

Thanks a lot for the thorough review =)

I'll try to fix the comments ASAP

@pcmoritz force-pushed the python-serialization branch 2 times, most recently from 7b32fbf to 5b03cd1 (August 16, 2017)
@wesm (Member) commented Aug 18, 2017

I'm working on the test failures

@pcmoritz (author):

Great, thanks! I gave you access to my fork; feel free to push any changes. I'll switch to working on the huge page table PR and the entry points PR.

@wesm (Member) commented Aug 18, 2017

I think I got everything; I'll get the build passing and then take a last look

@wesm (Member) commented Aug 18, 2017

For future reference, I find that building with clang and -DARROW_FLAGS="-Werror -Wconversion -Wno-sign-conversion" helps catch the common issues that cause failures in MSVC. Using clang with -Werror also catches unchecked Status returns.

@wesm (Member) commented Aug 19, 2017

Appears we are missing some DLL exports for Windows (ARROW_EXPORT). It's getting too late here tonight; I will take a look tomorrow.

@xhochy (Member) left a review:

Added some style nitpick comments, mainly with the goal of making the code more understandable to non-core Arrow developers.

namespace py {

void set_serialization_callbacks(PyObject* serialize_callback,
PyObject* deserialize_callback);
Member:

Can you add descriptive comments to these three functions?

Member:

done


namespace py {

Status ReadSerializedPythonSequence(std::shared_ptr<io::RandomAccessFile> src,
Member:

Can you add some descriptive comments here?

Member:

done

namespace arrow {
namespace py {

#define UPDATE(OFFSET, TAG) \
Member:

Seems like these macros could also be written as templated functions. Anything that would prevent that?

Member:

done

/// List containing the data from nested dictionaries in the
/// value list of the dictionary
Status Finish(std::shared_ptr<Array> key_tuple_data,
std::shared_ptr<Array> key_dict_data,
Member:

Make the shared_ptr arguments constant references; this avoids needless copies.

Member:

done

@wesm (Member) commented Aug 19, 2017

@pcmoritz there were several flake8 warnings. In my environment I have this bash function, which helps with catching these:

function arrow_preflight {
    ARROW_PREFLIGHT_DIR=$HOME/code/arrow/cpp/preflight
    mkdir -p $ARROW_PREFLIGHT_DIR
    pushd $ARROW_PREFLIGHT_DIR
    cmake -GNinja ..
    ninja format
    ninja lint
    popd
    pushd $HOME/code/arrow/python
    flake8 pyarrow
    flake8 --config=.flake8.cython pyarrow
    popd
}

@wesm (Member) commented Aug 19, 2017

@pcmoritz @robertnishihara I'm going to mark these APIs experimental. We might add an additional set of functions to help with determining the amount of output space required, e.g.:

import pyarrow as pa

serialized = pa.serialize(obj)

# This would use MockOutputStream
total_bytes = serialized.total_bytes

buf = ...allocate(total_bytes)
serialized.write_to(buf)
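
(For reference, a sketch of the measuring idea using a byte-counting sink. MockOutputStream is the Arrow class named above, though whether it is exposed to Python in exactly this form is an assumption.)

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["f0"])

sink = pa.MockOutputStream()    # counts bytes written, stores nothing
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
total_bytes = sink.size()       # output size needed, with no allocation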

@robertnishihara (Contributor):

That sounds good.

@wesm (Member) commented Aug 19, 2017

I added a minimal version of this. We should return to it in a subsequent patch to harden it with more unit tests and check more rigorously for memory leaks.

@wesm (Member) commented Aug 19, 2017

cc @cpcloud

@wesm (Member) commented Aug 19, 2017

Current benchmarks (I haven't dug in to investigate where the time is being spent):

In [1]: import numpy as np

In [2]: import pyarrow as pa

In [3]: import pickle

In [4]: arrays = [np.random.randn(100, 100) for i in range(1000)]

In [5]: %timeit buf = pa.serialize(arrays).to_buffer()
34.8 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: buf = pa.serialize(arrays).to_buffer()

In [8]: %timeit deserialized = pa.deserialize(buf)
2.05 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: pickled = pickle.dumps(arrays)

In [10]: %timeit pickled = pickle.dumps(arrays)
37.1 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit unpickled = pickle.loads(pickled)
11.1 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for a simple list of arrays, serialization takes about 10% less time, while deserialization takes ~80% less time. Since this is only 80 MB of data, the savings will grow as the total size of the serialized object increases.
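
(The 80 MB figure follows directly from the shapes used above:)

num_arrays, rows, cols = 1000, 100, 100
bytes_per_float64 = 8
total_mb = num_arrays * rows * cols * bytes_per_float64 / 1e6   # 80.0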

@pcmoritz (author):

It's amazing how this came together, thanks everybody so much for all the help! These are great changes.

Concerning the benchmarks: for serialization, it helps a bunch to use the FixedSizeBufferOutputStream and multithreading to write the bytes. I'm also excited to see how performance improves once we write things to Plasma with the recent hugepages PR ;)

@pcmoritz (author):

Concerning more rigorous tests: there is a good suite of tests at https://github.com/jsonpickle/jsonpickle/tree/master/tests that we could borrow from or port.

@wesm (Member) commented Aug 20, 2017

Just fixed a last buglet and will hopefully get a passing Travis CI build (Appveyor will fail because of ARROW-1375). I'll merge this; let's open JIRAs for the follow-up work!

@pcmoritz (author):

+1 you should go ahead and merge it :)

@asfgit closed this in b50f235 on Aug 20, 2017
@wesm deleted the python-serialization branch on August 20, 2017