ARROW-2660: [Python] Experimental zero-copy pickling #2161
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #2161      +/-   ##
==========================================
+ Coverage   84.39%   84.41%   +0.01%     
==========================================
  Files         293      293              
  Lines       44820    44841      +21     
==========================================
+ Hits        37826    37851      +25     
  Misses       6963     6963              
+ Partials       31       27       -4

Continue to review the full report at Codecov.
Force-pushed from 074ca9d to d444d1e
@pitrou this is cool. Do you see any reason not to rebase and merge this for 0.10.0?
There were changes to the PEP lately, I must adapt the code first :-)
Zero-copy pickling of buffers and buffer-based objects will be possible using PEP 574 (if/when accepted). The PyPI backport "pickle5" helps us test that possibility.
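For context, the intended usage looks roughly like this (a sketch against this branch plus "pip install pickle5"; pa.py_buffer is simply a convenient way to obtain a pa.Buffer):

>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"some large payload")
>>> buffers = []
# With protocol 5, buffer_callback collects PickleBuffer objects instead of
# copying the buffer contents into the pickle stream (out-of-band pickling).
>>> payload = pickle.dumps(buf, protocol=5, buffer_callback=buffers.append)
# The collected buffers are handed back at load time, avoiding the copy.
>>> restored = pickle.loads(payload, buffers=buffers)
>>> restored.to_pybytes() == buf.to_pybytes()
True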
Force-pushed from d444d1e to 50f0491
I've fixed the code for the latest PEP updates and rebased.
pushd python

pip install pickle5
This install call is redundant with the one below.
Actually, no, the second one is in a distinct virtualenv where we install the wheel we just built.
Ah, I got confused by the free-standing pip install here with no other packages. This is then just because we have no conda package for it yet?
Probably, but it's very quick to compile anyway and there are no non-Python dependencies.
FWIW we did add a package to conda-forge. Though it's true this is quite fast to build.
Quick benchmark:
>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'ints': range(100000), 'strs': [str(i) for i in range(100000)]})
>>> table = pa.Table.from_pandas(df)
# Pickling a Pandas dataframe is slow, no difference with protocol 5
>>> %timeit pickle.loads(pickle.dumps(df))
29.1 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pickle.loads(pickle.dumps(df, protocol=5))
29.1 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Pickling an Arrow table is faster
>>> %timeit pickle.loads(pickle.dumps(table))
3.33 ms ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ... even faster with the new protocol 5
>>> %timeit pickle.loads(pickle.dumps(table, protocol=5))
526 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ... and even faster with zero-copy buffers
>>> %timeit buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append); pickle.loads(serd, buffers=buffers)
154 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# There are 5 exported buffers for this table, and the serialized pickle stream is around 1kB
>>> buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7f5a3096fac8>,
 <pickle.PickleBuffer at 0x7f5a3096f348>,
 <pickle.PickleBuffer at 0x7f5a27630948>,
 <pickle.PickleBuffer at 0x7f5a27630c48>,
 <pickle.PickleBuffer at 0x7f5a27630dc8>]
>>> len(serd)
1053
# Note that currently our exported buffers don't expose type information
>>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
[('b', (800000,)),
 ('b', (12500,)),
 ('b', (400004,)),
 ('b', (488890,)),
 ('b', (800000,))]
I created ARROW-2913 for the issue that exported buffers lose the data type. @mrocklin your informed opinion on that one could be useful. (note it doesn't block this PR)
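Until that's fixed, a manual workaround is to reinterpret the raw bytes yourself. A sketch only: which exported buffer holds which column's values is an assumption here, taking the first one to be the int64 data from the benchmark above.

>>> mv = memoryview(buffers[0])   # assumed: the int64 column's data buffer
>>> mv.format, mv.nbytes
('b', 800000)
# cast() reinterprets the untyped bytes as int64 without copying
>>> ints = mv.cast('q')
>>> len(ints)
100000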
Nice and informative benchmarks. The copying of memory when unpickling NumPy arrays etc. has been a long-standing gripe of mine.
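The new protocol should eventually help there too. A sketch, assuming a NumPy version that implements the protocol 5 hooks (plain ndarray pickling still copies today):

>>> import pickle5 as pickle
>>> import numpy as np
>>> arr = np.arange(1_000_000)
>>> buffers = []
# The array data travels out-of-band instead of being copied into the stream.
>>> payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
>>> arr2 = pickle.loads(payload, buffers=buffers)
>>> (arr2 == arr).all()
True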
This looks really nice. @pitrou What is the best way to stay updated on the status of a PEP? Poll the website? I guess we wait with merging until the PEP is accepted?
If you don't want to read python-dev, then you can indeed just poll the website. Or we could merge already, as @wesm proposes. The changes should be transparent if you don't use protocol 5.
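Concretely: with the default protocol nothing changes, and the data is simply copied into the stream as before (a sketch against this branch):

>>> import pickle
>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"some bytes")
# No protocol 5, no buffer_callback: behaves like any other picklable object.
>>> pickle.loads(pickle.dumps(buf)).to_pybytes()
b'some bytes'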