ARROW-2660: [Python] Experimental zero-copy pickling #2161
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #2161      +/-   ##
==========================================
+ Coverage   84.39%   84.41%   +0.01%     
==========================================
  Files         293      293              
  Lines       44820    44841      +21     
==========================================
+ Hits        37826    37851      +25     
  Misses       6963     6963              
+ Partials       31       27       -4

Continue to review the full report at Codecov.
Force-pushed from 074ca9d to d444d1e
@pitrou this is cool. Do you see any reason not to rebase and merge this for 0.10.0?
There were changes to the PEP lately, I must adapt the code first :-)
Zero-copy pickling of buffers and buffer-based objects will be possible using PEP 574 (if/when accepted). The PyPI backport "pickle5" helps us test that possibility.
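For context, the intended usage looks roughly like this (a sketch against this branch plus "pip install pickle5"; pa.py_buffer is simply a convenient way to obtain a pa.Buffer):

>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"some large payload")
>>> buffers = []
# With protocol 5, buffer_callback collects PickleBuffer objects instead of
# copying the buffer contents into the pickle stream (out-of-band pickling).
>>> payload = pickle.dumps(buf, protocol=5, buffer_callback=buffers.append)
# The collected buffers are handed back at load time, avoiding the copy.
>>> restored = pickle.loads(payload, buffers=buffers)
>>> restored.to_pybytes() == buf.to_pybytes()
True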
Force-pushed from d444d1e to 50f0491
I've fixed the code for the latest PEP updates and rebased.
pushd python

pip install pickle5
This install call is redundant with the one below.
Actually, no, the second one is in a distinct virtualenv where we install the wheel we just built.
Ah, I got confused by the free-standing pip install here with no other packages. This is then just because we have no conda package for it yet?
Probably, but it's very quick to compile anyway and there are no non-Python dependencies.
FWIW we did add a package to conda-forge. Though it's true this is quite fast to build.
Quick benchmark:
>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'ints': range(100000), 'strs': [str(i) for i in range(100000)]})
>>> table = pa.Table.from_pandas(df)
# Pickling a Pandas dataframe is slow, no difference with protocol 5
>>> %timeit pickle.loads(pickle.dumps(df))
29.1 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pickle.loads(pickle.dumps(df, protocol=5))
29.1 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Pickling an Arrow table is faster
>>> %timeit pickle.loads(pickle.dumps(table))
3.33 ms ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ... even faster with the new protocol 5
>>> %timeit pickle.loads(pickle.dumps(table, protocol=5))
526 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ... and even faster with zero-copy buffers
>>> %timeit buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append); pickle.loads(serd, buffers=buffers)
154 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# There are 5 exported buffers for this table, and the serialized pickle stream is around 1kB
>>> buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7f5a3096fac8>,
 <pickle.PickleBuffer at 0x7f5a3096f348>,
 <pickle.PickleBuffer at 0x7f5a27630948>,
 <pickle.PickleBuffer at 0x7f5a27630c48>,
 <pickle.PickleBuffer at 0x7f5a27630dc8>]
>>> len(serd)
1053
# Note that currently our exported buffers don't expose type information
>>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
[('b', (800000,)),
 ('b', (12500,)),
 ('b', (400004,)),
 ('b', (488890,)),
 ('b', (800000,))]
I created ARROW-2913 for the issue that exported buffers lose the data type. @mrocklin your informed opinion on that one could be useful. (note it doesn't block this PR)
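Until that's fixed, a manual workaround is to reinterpret the raw bytes yourself. A sketch only: which exported buffer holds which column's values is an assumption here, taking the first one to be the int64 data from the benchmark above.

>>> mv = memoryview(buffers[0])   # assumed: the int64 column's data buffer
>>> mv.format, mv.nbytes
('b', 800000)
# cast() reinterprets the untyped bytes as int64 without copying
>>> ints = mv.cast('q')
>>> len(ints)
100000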
Nice and informative benchmarks. The copying of memory when unpickling NumPy arrays etc. has been a long-standing gripe of mine.
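The new protocol should eventually help there too. A sketch, assuming a NumPy version that implements the protocol 5 hooks (plain ndarray pickling still copies today):

>>> import pickle5 as pickle
>>> import numpy as np
>>> arr = np.arange(1_000_000)
>>> buffers = []
# The array data travels out-of-band instead of being copied into the stream.
>>> payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
>>> arr2 = pickle.loads(payload, buffers=buffers)
>>> (arr2 == arr).all()
True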
This looks really nice. @pitrou What is the best way to stay updated on the status of a PEP? Poll the website? I guess we wait with merging until the PEP is accepted?
If you don't want to read python-dev, then you can indeed just poll the website. Or we could merge already, as @wesm proposes. The changes should be transparent if you don't use protocol 5.
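Concretely: with the default protocol nothing changes, and the data is simply copied into the stream as before (a sketch against this branch):

>>> import pickle
>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"some bytes")
# No protocol 5, no buffer_callback: behaves like any other picklable object.
>>> pickle.loads(pickle.dumps(buf)).to_pybytes()
b'some bytes'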