ENH: GH17054: read_html() handles rowspan/colspan and infers headers #17089

Closed
wants to merge 145 commits into from
145 commits
4bf2f2e
ENH: GH17054: read_html() handles rowspan/colspan and infers headers
jowens Jul 26, 2017
80d9c2b
in python 3, lambdas no longer take tuples as args. thanks pep 3113.
jowens Jul 27, 2017
26d1f6a
fixing lint error
jowens Jul 27, 2017
37af4ea
in python3, zip does not return a list, so list(zip(...))
jowens Jul 27, 2017
86dee93
Merge branch 'master' into read_html_with_colspan_rowspan
jowens Aug 29, 2017
d3eca72
Merge branch 'master' into read_html_with_colspan_rowspan
jowens Sep 6, 2017
f064562
documentation changes only
jowens Sep 6, 2017
67c8a59
Merge branch 'read_html_with_colspan_rowspan' of github.com:jowens/pa…
jowens Sep 6, 2017
5a38278
documentation changes only
jowens Sep 7, 2017
39f7814
documentation changes only, limited to 80 cols
jowens Sep 7, 2017
531863f
more documentation edits
jowens Sep 8, 2017
818d394
minor documentation edits
jowens Sep 9, 2017
f3a6aa3
better return type explanation in code, added issue number to tests
jowens Sep 9, 2017
2f904b2
cleaning up legacy documentation issues
jowens Sep 18, 2017
f4e7592
remove 'if'
jowens Sep 18, 2017
293d9e4
newlines for clarity
jowens Sep 18, 2017
efabae4
DOC: whatsnew typos
jreback Jul 26, 2017
552677f
ENH: GH17054: read_html() handles rowspan/colspan and infers headers
jowens Jul 26, 2017
1aacf17
TST: Check more error messages in tests (#17075)
gfyoung Jul 26, 2017
359890f
BUG: Respect dtype when calling pivot_table with margins=True
toobaz Jul 26, 2017
3fd2612
MAINT: Add missing space in parsers.pyx
gfyoung Jul 27, 2017
76249bf
MAINT: Add missing paren around print statement
gfyoung Jul 27, 2017
77d16d4
DOC: fix typos in missing.rst
jreback Jul 27, 2017
bd50a4f
in python 3, lambdas no longer take tuples as args. thanks pep 3113.
jowens Jul 27, 2017
452e08d
fixing lint error
jowens Jul 27, 2017
ecfaa4c
in python3, zip does not return a list, so list(zip(...))
jowens Jul 27, 2017
69cd83c
DOC: further clean-up null/na changes (#17113)
jorisvandenbossche Jul 29, 2017
1e5cfa1
BUG: Allow pd.unique to accept tuple of strings (#17108)
mroeschke Jul 30, 2017
c502dba
BUG: Allow Series with same name with crosstab (#16028)
mroeschke Jul 30, 2017
2155c3e
COMPAT: make sure use_inf_as_null is deprecated (#17126)
jreback Aug 1, 2017
3ed9f53
CI: bump version of xlsxwriter to 0.5.2 (#17142)
jreback Aug 1, 2017
9a50c21
DOC: Clean up instructions in ISSUE_TEMPLATE (#17146)
gfyoung Aug 1, 2017
5759eff
Add missing space to the NotImplementedError's message for compound d…
FKint Aug 1, 2017
3855039
DOC: (de)type the return value of concat (#17079) (#17119)
jebob Aug 1, 2017
d7cb627
BUG: Thoroughly dedup column names in read_csv (#17095)
gfyoung Aug 1, 2017
9d32df6
DOC: Additions/updates to documentation (#17150)
alanyee Aug 2, 2017
5ce00e1
ENH: add to/from_parquet with pyarrow & fastparquet (#15838)
jreback Aug 2, 2017
9aadb64
DOC: doc typos, xref #15838
jreback Aug 2, 2017
89fa421
TST: test for categorical index monotonicity (#17152)
jreback Aug 3, 2017
ccdae36
MAINT: Remove non-standard and inconsistently-used imports (#17085)
jbrockmendel Aug 3, 2017
5b42bdf
DOC: typos in whatsnew
Aug 3, 2017
56957cf
DOC: whatsnew 0.21.0 fixes
jreback Aug 3, 2017
d2e21c3
BUG: Fix CSV parsing of singleton list header (#17090)
Aug 3, 2017
20487bf
ENH: Support strings containing '%' in add_prefix/add_suffix (#17151)…
jschendel Aug 3, 2017
b4b4c77
REF: repr - allow block to override values that get formatted (#17143)
jorisvandenbossche Aug 4, 2017
b720f0d
MAINT: Drop unnecessary newlines in issue template
gfyoung Aug 7, 2017
43dab45
remove direct import of nan
jbrockmendel Aug 7, 2017
94a734a
use == to test String equality (#17171)
jhelie Aug 7, 2017
e143ee1
ENH: Add warning when setting into nonexistent attribute (#16951)
deniederhut Aug 7, 2017
5a523bb
DOC: added string processing comparison with SAS (#16497)
natethedrummer Aug 7, 2017
0bfad7c
CLN: remove unused get methods in internals (#17169)
jbrockmendel Aug 7, 2017
a4e4909
TST: Partial Boolean DataFrame Indexing (#17186)
mroeschke Aug 7, 2017
e8fab8a
CLN: Reformat docstring for IPython fixture
gfyoung Aug 7, 2017
d089d44
Define Series.plot and Series.hist in class definition (#17199)
jbrockmendel Aug 8, 2017
b09b274
BUG: support pandas objects in iloc with old numpy versions (#17194)
toobaz Aug 8, 2017
cc8c5d7
Implement _make_accessor classmethod for PandasDelegate (#17166)
jbrockmendel Aug 8, 2017
df9710b
Create ABCDateOffset (#17165)
jbrockmendel Aug 9, 2017
e71e6d7
BUG: resample and apply modify the index type for empty Series (#17149)
discort Aug 9, 2017
e9c7f29
DOC: Updated NDFrame.astype docs (#17203)
topper-123 Aug 9, 2017
38293d3
MAINT: Minor touch-ups to GitHub PULL_REQUEST_TEMPLATE (#17207)
dhimmel Aug 9, 2017
7280e6c
CLN: replace %s syntax with .format in core.computation (#17209)
jschendel Aug 10, 2017
421dcf4
Bugfix for multilevel columns with empty strings in Python 2 (#17099)
chrisjbillington Aug 10, 2017
d5733ee
CLN/ASV clean-up frame stat ops benchmarks (#17205)
jorisvandenbossche Aug 10, 2017
9f69583
BUG: Rolling apply on DataFrame with Datetime index returns NaN (#17156)
FXocena Aug 10, 2017
1e1ce40
CLN: Remove import exception handling (#17218)
dhimmel Aug 10, 2017
a1509dc
MAINT: Remove extra the's in deprecation messages (#17222)
gfyoung Aug 11, 2017
6788533
DOC: Patch docs in _decorators.py
gfyoung Aug 11, 2017
619e031
CLN: replace %s syntax with .format in pandas.util (#17224)
jschendel Aug 11, 2017
9e26997
Add 'See also' sections (#17223)
topper-123 Aug 11, 2017
a7311d2
move pivot_table doc-string to DataFrame (#17174)
jbrockmendel Aug 11, 2017
1ac9ede
Remove import of pandas as pd in core.window (#17233)
jbrockmendel Aug 12, 2017
a2d8d23
TST: Move more frame tests to SharedWithSparse (#17227)
kernc Aug 12, 2017
013b983
REF: _get_objs_combined_axis (#17217)
toobaz Aug 12, 2017
fddb66d
ENH/PERF: Remove frequency inference from .dt accessor (#17210)
cpcloud Aug 14, 2017
2e55156
Fix apparent typo in tests (#17247)
jbrockmendel Aug 14, 2017
b49446e
COMPAT: avoid calling getsizeof() on PyPy
mattip Aug 15, 2017
536b761
CLN: replace %s syntax with .format in pandas.core.reshape (#17252)
jschendel Aug 15, 2017
a1ff671
ENH: Infer compression from non-string paths (#17206)
dhimmel Aug 15, 2017
df1b0dc
Fix bugs in IntervalIndex.is_non_overlapping_monotonic (#17238)
jschendel Aug 15, 2017
8fe1cc3
BUG: Fix behavior of argmax and argmin with inf (#16449) (#16449)
DGrady Aug 15, 2017
357e7ae
CLN: Remove have_pytz (#17266)
jbrockmendel Aug 16, 2017
aa97aa6
CLN: replace %s syntax with .format in core.dtypes and core.sparse (#…
jschendel Aug 17, 2017
a618bec
Replace imports of * with explicit imports (#17269)
jbrockmendel Aug 17, 2017
db3ea2f
TST: pytest deprecation warnings GH17197 (#17253)
swyoon Aug 17, 2017
de60666
Handle more date/datetime/time formats (#15871)
Winand Aug 18, 2017
0bbda54
DOC: add example on json_normalize (#16438)
zzgao Aug 18, 2017
c148dd2
BUG: Have object dtype for empty Categorical.categories (#17249)
TomAugspurger Aug 19, 2017
155c11a
CLN: replace %s syntax with .format in pandas.tseries (#17290)
jschendel Aug 19, 2017
e4aeed2
TST: parameterize consistency tests for rolling/expanding windows (#1…
jreback Aug 19, 2017
db11418
FIX: define `DataFrame.items` for all versions of python (#17214)
tacaswell Aug 19, 2017
a256e26
PERF: Update ASV publish config (#17293)
TomAugspurger Aug 20, 2017
75d46a6
DOC: Expand docstrings for head / tail methods (#16941)
yosukeBaya4 Aug 21, 2017
172abfb
MAINT: Use set literal for unsupported + depr args
gfyoung Aug 21, 2017
1982aca
DOC: Add proper docstring to maybe_convert_indices
gfyoung Aug 21, 2017
393bb19
DOC: Improving docstring of take method (#16948)
matagus Aug 21, 2017
595e0a4
BUG: Fixed regex in asv.conf.json (#17300)
TomAugspurger Aug 21, 2017
6a45d36
Remove unnecessary usage of _TSObject (#17297)
jbrockmendel Aug 21, 2017
5f077f3
BUG: clip should handle null values
mgasvoda Aug 21, 2017
a10fa92
BUG: fillna returns frame when inplace=True if value is a dict (#1615…
Aug 21, 2017
8dfb95b
CLN: Index.append() refactoring (#16236)
toobaz Aug 22, 2017
8326c83
DEPS: set min versions (#17002)
jreback Aug 22, 2017
8fbd8f8
CLN: replace %s syntax with .format in core.tools, algorithms.py, bas…
jschendel Aug 22, 2017
3625190
BUG: Fix strange behaviour of Series.iloc on MultiIndex Series (#1714…
Aug 22, 2017
7364711
DOC: Add module doc-string to tseries/api.py
gfyoung Aug 23, 2017
e5797fa
MAINT: Clean up docs in pandas/errors/__init__.py
gfyoung Aug 23, 2017
9be531a
CLN: replace %s syntax with .format in missing.py, nanops.py, ops.py …
jschendel Aug 24, 2017
a9574b0
Make pd.Period immutable (#17239)
jbrockmendel Aug 24, 2017
3e31383
Bug: groupby multiindex levels equals rows (#16859)
Aug 24, 2017
e5030b3
BUG: Cannot use tz-aware origin in to_datetime (#16842)
ivybae Aug 24, 2017
7be53ed
Replace usage of total_seconds compat func with timedelta method (#17…
jbrockmendel Aug 25, 2017
f4adbb9
CLN: replace %s syntax with .format in core/indexing.py (#17357)
cbertinato Aug 28, 2017
b1b3325
DOC: Point to dev-docs in issue template (#17353)
gfyoung Aug 28, 2017
76cc924
CLN: remove total_seconds compat from json (#17341)
chris-b1 Aug 29, 2017
0309dae
CLN: Move test_intersect_str_dates (#17366)
jschendel Aug 29, 2017
c523bfc
BUG: Respect dups in reindexing CategoricalIndex (#17355)
gfyoung Aug 29, 2017
5a6f2ac
Unify Index._dir_* with Series implementation (#17117)
jbrockmendel Aug 29, 2017
ce8ccba
BUG: make order of index from pd.concat deterministic (#17364)
toobaz Aug 29, 2017
a585e09
Fix typo that causes several NaT methods to have incorrect docstrings…
jbrockmendel Aug 29, 2017
8199559
CLN: replace %s syntax with .format in io/formats/format.py (#17358)
cbertinato Aug 30, 2017
6ec1044
PKG: Added pyproject.toml for PEP 518 (#16745)
TomAugspurger Aug 30, 2017
c33af56
DOC: Update Overview page in documentation (#17368)
iuliakhomenko Aug 30, 2017
0f8205c
API: Have MultiIndex consturctors always return a MI (#17236)
TomAugspurger Aug 30, 2017
54f68b4
CLN: replace %s syntax with .format in io/formats/css.py, excel.py, p…
cbertinato Aug 31, 2017
b717ebc
BUG: not correctly using OrderedDict in test_series_apply (#17384)
sylviawhoa Aug 31, 2017
b61af0e
Remove boxplot from _dataframe_apply_whitelist (#17381)
jbrockmendel Aug 31, 2017
c80e8d0
API: Localize Series when calling to_datetime with utc=True (#6415) (…
mroeschke Sep 1, 2017
3a0dc92
TST: Enable tests in test_tools.py (#17405)
jschendel Sep 1, 2017
365f2fe
TST: remove tests and docs for legacy (pre 0.12) hdf5 support (#17404)
topper-123 Sep 1, 2017
d994323
Tslib unused (#17402)
jbrockmendel Sep 1, 2017
e94e572
DOC: Cleaned references to pandas <v0.12 in docs (#17375)
topper-123 Sep 2, 2017
6a02ffa
Remove unused _day and _month attrs (#17431)
jbrockmendel Sep 4, 2017
519c57f
DOC: Clean-up references to v12 to v14 (both included) (#17420)
topper-123 Sep 5, 2017
f22b895
BUG: Plotting Timedelta on y-axis #16953 (#17430)
s-weigand Sep 6, 2017
8edd85a
COMPAT: handle pyarrow deprecation of timestamps_to_ms in .from_panda…
jreback Sep 6, 2017
047727a
DOC/TST: Add examples to MultiIndex.get_level_values + related change…
topper-123 Sep 6, 2017
91a2300
documentation changes only
jowens Sep 6, 2017
41058ab
documentation changes only
jowens Sep 7, 2017
4926913
documentation changes only, limited to 80 cols
jowens Sep 7, 2017
14235ec
more documentation edits
jowens Sep 8, 2017
196c835
minor documentation edits
jowens Sep 9, 2017
fed4b03
better return type explanation in code, added issue number to tests
jowens Sep 9, 2017
c2d9cc6
cleaning up legacy documentation issues
jowens Sep 18, 2017
d4b213b
remove 'if'
jowens Sep 18, 2017
b16f6d5
newlines for clarity
jowens Sep 18, 2017
092889a
Merge branch 'read_html_with_colspan_rowspan' of github.com:jowens/pa…
jowens Sep 20, 2017
ENH: add to/from_parquet with pyarrow & fastparquet (#15838)
jreback authored and jowens committed Sep 20, 2017
commit 5ce00e1d2a3fdfc24e9ca36bc36120aeb1ef7ed9
1 change: 1 addition & 0 deletions ci/install_travis.sh
Expand Up @@ -153,6 +153,7 @@ fi
echo
echo "[removing installed pandas]"
conda remove pandas -y --force
pip uninstall -y pandas

if [ "$BUILD_TEST" ]; then

Expand Down
2 changes: 1 addition & 1 deletion ci/requirements-2.7.sh
Expand Up @@ -4,4 +4,4 @@ source activate pandas

echo "install 27"

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet
4 changes: 2 additions & 2 deletions ci/requirements-3.5.sh
Expand Up @@ -4,8 +4,8 @@ source activate pandas

echo "install 35"

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

# pip install python-dateutil to get latest
conda remove -n pandas python-dateutil --force
pip install python-dateutil

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
2 changes: 1 addition & 1 deletion ci/requirements-3.5_OSX.sh
Expand Up @@ -4,4 +4,4 @@ source activate pandas

echo "install 35_OSX"

conda install -n pandas -c conda-forge feather-format==0.3.1
conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet
1 change: 1 addition & 0 deletions ci/requirements-3.6.pip
@@ -0,0 +1 @@
brotlipy
2 changes: 2 additions & 0 deletions ci/requirements-3.6.run
Expand Up @@ -17,6 +17,8 @@ pymysql
feather-format
pyarrow
psycopg2
python-snappy
fastparquet
beautifulsoup4
s3fs
xarray
Expand Down
2 changes: 1 addition & 1 deletion ci/requirements-3.6_DOC.sh
Expand Up @@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"

pip install pandas-gbq

conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc
conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc fastparquet

conda install -n pandas -c r r rpy2 --yes
2 changes: 2 additions & 0 deletions ci/requirements-3.6_WIN.run
Expand Up @@ -13,3 +13,5 @@ numexpr
pytables
matplotlib
blosc
fastparquet
pyarrow
1 change: 1 addition & 0 deletions doc/source/install.rst
Expand Up @@ -237,6 +237,7 @@ Optional Dependencies
* `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
* `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
* `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
* ``Apache Parquet Format``, either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6), necessary for parquet-based storage. `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ are available for compression support.
* `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

* `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
Expand Down
82 changes: 78 additions & 4 deletions doc/source/io.rst
Expand Up @@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
Expand Down Expand Up @@ -209,7 +210,7 @@ buffer_lines : int, default None
.. deprecated:: 0.19.0

Argument removed because its value is not respected by the parser

compact_ints : boolean, default False
.. deprecated:: 0.19.0

Expand Down Expand Up @@ -4087,7 +4088,7 @@ control compression: ``complevel`` and ``complib``.
``complevel`` specifies if and how hard data is to be compressed.
``complevel=0`` and ``complevel=None`` disables
compression and ``0<complevel<10`` enables compression.

``complib`` specifies which compression library to use. If nothing is
specified the default library ``zlib`` is used. A
compression library usually optimizes for either good
Expand All @@ -4102,9 +4103,9 @@ control compression: ``complevel`` and ``complib``.
- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

.. versionadded:: 0.20.2

Support for alternative blosc compressors:

- `blosc:blosclz <http://www.blosc.org/>`_ This is the
default compressor for ``blosc``
- `blosc:lz4
Expand Down Expand Up @@ -4545,6 +4546,79 @@ Read from a feather file.
import os
os.remove('example.feather')


.. _io.parquet:

Parquet
-------

.. versionadded:: 0.21.0

`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
make reading and writing data frames efficient, and to make sharing data across data analysis
languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
dtypes, including extension dtypes such as datetime with tz.

Several caveats.

- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame`` and will raise an
  error if a non-default one is provided. You can ``.reset_index()`` to store the index as columns, or
  ``.reset_index(drop=True)`` to discard it.
- Duplicate column names and non-string column names are not supported.
- Categorical dtypes are currently not supported (for ``pyarrow``).
- Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
  on an attempt at serialization.
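
The column-name caveat above can be checked before writing. The following is a minimal sketch, not part of this PR: ``check_parquet_columns`` is a hypothetical helper name, and it only inspects the labels, not the dtypes or the index.

```python
from collections import Counter


def check_parquet_columns(columns):
    """Hypothetical helper: reject column labels the caveats rule out."""
    columns = list(columns)

    # Non-string column names are not supported.
    non_string = [c for c in columns if not isinstance(c, str)]
    if non_string:
        raise ValueError("non-string column names: {!r}".format(non_string))

    # Duplicate column names are not supported either.
    duplicates = sorted(name for name, n in Counter(columns).items() if n > 1)
    if duplicates:
        raise ValueError("duplicate column names: {!r}".format(duplicates))

    return columns
```

With a ``DataFrame`` this could be called as ``check_parquet_columns(df.columns)`` before ``df.to_parquet(...)``; the engines raise their own errors regardless.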

You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``.
If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
then ``pyarrow`` is tried first, falling back to ``fastparquet``.

See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.

.. note::

   These engines are very similar and should read/write nearly identical parquet format files.
   They differ in their underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library).
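
The engine fallback described above can be sketched in plain Python. This is an illustration under assumptions, not the code in ``pandas/io/parquet.py``: ``resolve_engine`` and its ``option``, ``candidates`` and ``is_installed`` parameters are hypothetical names, and "installed" is approximated with ``importlib.util.find_spec``.

```python
import importlib.util


def resolve_engine(engine="auto", option="auto",
                   candidates=("pyarrow", "fastparquet"),
                   is_installed=None):
    """Hypothetical sketch of the 'auto' engine fallback."""
    if is_installed is None:
        # Assumption: a library counts as installed if its module can be found.
        def is_installed(name):
            return importlib.util.find_spec(name) is not None

    # An explicit engine argument wins; 'auto' defers to the option value.
    if engine == "auto":
        engine = option
    if engine != "auto":
        return engine

    # Both are 'auto': try pyarrow first, then fall back to fastparquet.
    for name in candidates:
        if is_installed(name):
            return name
    raise ImportError("no parquet engine found; "
                      "install pyarrow or fastparquet")
```

Passing ``is_installed`` explicitly makes the order of preference easy to exercise without either library present.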

.. ipython:: python

   df = pd.DataFrame({'a': list('abc'),
                      'b': list(range(1, 4)),
                      'c': np.arange(3, 6).astype('u1'),
                      'd': np.arange(4.0, 7.0, dtype='float64'),
                      'e': [True, False, True],
                      'f': pd.date_range('20130101', periods=3),
                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
                      'h': pd.date_range('20130101', periods=3, freq='ns')})

   df
   df.dtypes

Write to a parquet file.

.. ipython:: python

   df.to_parquet('example_pa.parquet', engine='pyarrow')
   df.to_parquet('example_fp.parquet', engine='fastparquet')

Read from a parquet file.

.. ipython:: python

   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')

   result.dtypes

.. ipython:: python
   :suppress:

   import os
   os.remove('example_pa.parquet')
   os.remove('example_fp.parquet')

.. _io.sql:

SQL Queries
Expand Down
3 changes: 3 additions & 0 deletions doc/source/options.rst
Expand Up @@ -414,6 +414,9 @@ io.hdf.default_format None default format writing format,
'table'
io.hdf.dropna_table True drop ALL nan rows when appending
to a table
io.parquet.engine None The engine to use as a default for
parquet reading and writing. If None
then try 'pyarrow' and 'fastparquet'
mode.chained_assignment warn Raise an exception, warn, or no
action if trying to use chained
assignment, The default is warn
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Expand Up @@ -79,6 +79,7 @@ Other Enhancements
- :func:`date_range` now accepts 'YS' in addition to 'AS' as an alias for start of year (:issue:`9313`)
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
- :func:`read_html` handles colspan and rowspan attributes and attempts to infer a header if the header is not explicitly specified (:issue:`17054`)
- Integration with Apache Parquet, including a new top-level ``pd.read_parquet()`` and ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.

Add these lines back.

@gfyoung gfyoung Sep 6, 2017

Also, your whatsnew has a merge conflict, so you definitely need to fix that.

.. _whatsnew_0210.api_breaking:

Expand Down
12 changes: 12 additions & 0 deletions pandas/core/config_init.py
Expand Up @@ -465,3 +465,15 @@ def _register_xlsx(engine, other):
except ImportError:
# fallback
_register_xlsx('openpyxl', 'xlsxwriter')

# Set up the io.parquet specific configuration.
parquet_engine_doc = """
: string
The default parquet reader/writer engine. Available options:
'auto', 'pyarrow', 'fastparquet', the default is 'auto'
"""

with cf.config_prefix('io.parquet'):
    cf.register_option(
        'engine', 'auto', parquet_engine_doc,
        validator=is_one_of_factory(['auto', 'pyarrow', 'fastparquet']))
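
The ``validator`` argument above restricts the option to a fixed set of values. As a rough sketch of what such a validator factory does (``is_one_of`` below is a hypothetical stand-in written for illustration, not the pandas ``is_one_of_factory`` implementation, and the error text is an assumption):

```python
def is_one_of(legal_values):
    """Hypothetical stand-in for a validator factory."""
    legal_values = list(legal_values)

    def validate(value):
        # Reject anything outside the registered set of values.
        if value not in legal_values:
            raise ValueError("Value must be one of {}".format(
                "|".join(map(str, legal_values))))

    return validate


validate_engine = is_one_of(["auto", "pyarrow", "fastparquet"])
validate_engine("pyarrow")  # accepted silently
```

Registering the option with a validator of this kind means a misspelled engine name fails when the option is set rather than later, at read or write time.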
24 changes: 24 additions & 0 deletions pandas/core/frame.py
Expand Up @@ -1598,6 +1598,30 @@ def to_feather(self, fname):
from pandas.io.feather_format import to_feather
to_feather(self, fname)

    def to_parquet(self, fname, engine='auto', compression='snappy',
                   **kwargs):
        """
        Write a DataFrame to the binary parquet format.

        .. versionadded:: 0.21.0

        Parameters
        ----------
        fname : str
            string file path
        engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
            Parquet library to use. If 'auto', then the option
            'io.parquet.engine' is used. If that option is also 'auto',
            then the first library found to be installed is used
            ('pyarrow', then 'fastparquet').
        compression : str, optional, default 'snappy'
            compression method, includes {'gzip', 'snappy', 'brotli'}
        kwargs
            Additional keyword arguments passed to the engine
        """
        from pandas.io.parquet import to_parquet
        to_parquet(self, fname, engine,
                   compression=compression, **kwargs)

@Substitution(header='Write out column names. If a list of string is given, \
it is assumed to be aliases for the column names')
@Appender(fmt.docstring_to_string, indents=1)
Expand Down
1 change: 1 addition & 0 deletions pandas/io/api.py
Expand Up @@ -13,6 +13,7 @@
from pandas.io.sql import read_sql, read_sql_table, read_sql_query
from pandas.io.sas import read_sas
from pandas.io.feather_format import read_feather
from pandas.io.parquet import read_parquet
from pandas.io.stata import read_stata
from pandas.io.pickle import read_pickle, to_pickle
from pandas.io.packers import read_msgpack, to_msgpack
Expand Down
4 changes: 2 additions & 2 deletions pandas/io/feather_format.py
Expand Up @@ -19,7 +19,7 @@ def _try_import():
"you can install via conda\n"
"conda install feather-format -c conda-forge\n"
"or via pip\n"
"pip install feather-format\n")
"pip install -U feather-format\n")

try:
feather.__version__ >= LooseVersion('0.3.1')
Expand All @@ -29,7 +29,7 @@ def _try_import():
"you can install via conda\n"
"conda install feather-format -c conda-forge"
"or via pip\n"
"pip install feather-format\n")
"pip install -U feather-format\n")

return feather

Expand Down