ENH: add calamine excel reader (close #50395) (#54998)
dimastbk authored Sep 12, 2023
1 parent 705d431 commit 79067a7
Showing 20 changed files with 290 additions and 58 deletions.
1 change: 1 addition & 0 deletions ci/deps/actions-310.yaml
@@ -46,6 +46,7 @@ dependencies:
- pymysql>=1.0.2
- pyreadstat>=1.1.5
- pytables>=3.7.0
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
1 change: 1 addition & 0 deletions ci/deps/actions-311-downstream_compat.yaml
@@ -47,6 +47,7 @@ dependencies:
- pymysql>=1.0.2
- pyreadstat>=1.1.5
- pytables>=3.7.0
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
1 change: 1 addition & 0 deletions ci/deps/actions-311.yaml
@@ -46,6 +46,7 @@ dependencies:
- pymysql>=1.0.2
- pyreadstat>=1.1.5
# - pytables>=3.7.0, 3.8.0 is first version that supports 3.11
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
1 change: 1 addition & 0 deletions ci/deps/actions-39-minimum_versions.yaml
@@ -48,6 +48,7 @@ dependencies:
- pymysql=1.0.2
- pyreadstat=1.1.5
- pytables=3.7.0
- python-calamine=0.1.6
- pyxlsb=1.0.9
- s3fs=2022.05.0
- scipy=1.8.1
1 change: 1 addition & 0 deletions ci/deps/actions-39.yaml
@@ -46,6 +46,7 @@ dependencies:
- pymysql>=1.0.2
- pyreadstat>=1.1.5
- pytables>=3.7.0
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
1 change: 1 addition & 0 deletions ci/deps/circle-310-arm64.yaml
@@ -47,6 +47,7 @@ dependencies:
- pymysql>=1.0.2
# - pyreadstat>=1.1.5 not available on ARM
- pytables>=3.7.0
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
1 change: 1 addition & 0 deletions doc/source/getting_started/install.rst
@@ -281,6 +281,7 @@ xlrd 2.0.1 excel Reading Excel
xlsxwriter                3.0.3              excel           Writing Excel
openpyxl                  3.0.10             excel           Reading / writing for xlsx files
pyxlsb                    1.0.9              excel           Reading for xlsb files
python-calamine           0.1.6              excel           Reading for xls/xlsx/xlsb/ods files
========================= ================== =============== =============================================================

HTML
23 changes: 21 additions & 2 deletions doc/source/user_guide/io.rst
@@ -3453,7 +3453,8 @@ Excel files
The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
can be read using ``xlrd``. Binary Excel (``.xlsb``)
-files can be read using ``pyxlsb``.
+files can be read using ``pyxlsb``. All formats can be read
+using the :ref:`calamine<io.calamine>` engine.
The :meth:`~DataFrame.to_excel` instance method is used for
saving a ``DataFrame`` to Excel. Generally the semantics are
similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3494,6 +3495,9 @@ using internally.

* For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files.

* For the engine calamine, pandas is using :func:`python_calamine.load_workbook`
to read in (``.xlsx``), (``.xlsm``), (``.xls``), (``.xlsb``), (``.ods``) files.

.. code-block:: python

   # Returns a DataFrame
@@ -3935,7 +3939,8 @@ The :func:`~pandas.read_excel` method can also read binary Excel files
using the ``pyxlsb`` module. The semantics and features for reading
binary Excel files mostly match what can be done for `Excel files`_ using
``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types
-in files and will return floats instead.
+in files and will return floats instead (you can use :ref:`calamine<io.calamine>`
+if you need to recognize datetime types).

.. code-block:: python
@@ -3947,6 +3952,20 @@ in files and will return floats instead.
Currently pandas only supports *reading* binary Excel files. Writing
is not implemented.
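
The floats that ``pyxlsb`` returns for datetime cells are Excel serial day numbers. As a rough illustration of what a caller would have to do by hand when not using calamine, the sketch below converts such a serial to a ``datetime``. The helper name ``excel_serial_to_datetime`` is hypothetical, and the code assumes a workbook that uses the 1900 date system and serial values of 61 or more.

```python
from datetime import datetime, timedelta

# Assumption: the workbook uses Excel's 1900 date system. For serials >= 61,
# day 0 corresponds to 1899-12-30 (this absorbs Excel's fictitious 1900-02-29).
_EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Hypothetical helper: convert an Excel serial day number to a datetime."""
    if serial < 61:
        raise ValueError("serials below 61 need special handling and are not covered")
    # The integer part counts days; the fractional part is the time of day.
    return _EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(45181.5))
```

With pandas available, the same conversion could be done on a whole column, but the engine-level datetime recognition that calamine provides avoids this step entirely.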

.. _io.calamine:

Calamine (Excel and ODS files)
------------------------------

The :func:`~pandas.read_excel` method can read Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``)
and OpenDocument spreadsheets (``.ods``) using the ``python-calamine`` module.
This module is a binding for the Rust library `calamine <https://crates.io/crates/calamine>`__
and is faster than other engines in most cases. The optional dependency ``python-calamine`` needs to be installed.

.. code-block:: python

   # Returns a DataFrame
   pd.read_excel("path_to_file.xlsb", engine="calamine")

.. _io.clipboard:

23 changes: 20 additions & 3 deletions doc/source/whatsnew/v2.2.0.rst
@@ -14,10 +14,27 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

-.. _whatsnew_220.enhancements.enhancement1:
+.. _whatsnew_220.enhancements.calamine:

-enhancement1
-^^^^^^^^^^^^
+Calamine engine for :func:`read_excel`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``calamine`` engine was added to :func:`read_excel`.
It uses ``python-calamine``, which provides Python bindings for the Rust library `calamine <https://crates.io/crates/calamine>`__.
This engine supports Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``) and OpenDocument spreadsheets (``.ods``) (:issue:`50395`).

There are two advantages of this engine:

1. Calamine is often faster than other engines; some benchmarks show it running up to 5x faster than 'openpyxl', 20x faster than 'odf', 4x faster than 'pyxlsb', and 1.5x faster than 'xlrd'.
   However, 'openpyxl' and 'pyxlsb' are faster when reading only a few rows from large files, because they iterate over rows lazily.
2. Calamine supports the recognition of datetime in ``.xlsb`` files, unlike 'pyxlsb' which is the only other engine in pandas that can read ``.xlsb`` files.

.. code-block:: python

   pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see :ref:`io.calamine` in the user guide on IO tools.

.. _whatsnew_220.enhancements.enhancement2:
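
As a usage sketch for the engine described in this release note: because ``python-calamine`` is an optional dependency, calling code may want to fall back to pandas' default engine selection when it is not installed. The helper name ``read_spreadsheet`` is hypothetical, and the sketch assumes a pandas version that includes this commit and that a missing optional dependency surfaces as ``ImportError``.

```python
from __future__ import annotations


def read_spreadsheet(path: str, sheet_name: str | int = 0):
    """Hypothetical helper: prefer the calamine engine, else use pandas' default."""
    import pandas as pd  # imported lazily so this module loads without pandas

    try:
        return pd.read_excel(path, sheet_name=sheet_name, engine="calamine")
    except ImportError:
        # Assumption: python-calamine is not installed, so let pandas choose
        # an engine based on the file type instead.
        return pd.read_excel(path, sheet_name=sheet_name)
```

Note that the fallback path loses the datetime recognition for ``.xlsb`` files mentioned above, so results may differ between the two branches.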

1 change: 1 addition & 0 deletions environment.yml
@@ -47,6 +47,7 @@ dependencies:
- pymysql>=1.0.2
- pyreadstat>=1.1.5
- pytables>=3.7.0
- python-calamine>=0.1.6
- pyxlsb>=1.0.9
- s3fs>=2022.05.0
- scipy>=1.8.1
2 changes: 2 additions & 0 deletions pandas/compat/_optional.py
@@ -37,6 +37,7 @@
"pyarrow": "7.0.0",
"pyreadstat": "1.1.5",
"pytest": "7.3.2",
"python-calamine": "0.1.6",
"pyxlsb": "1.0.9",
"s3fs": "2022.05.0",
"scipy": "1.8.1",
@@ -62,6 +63,7 @@
"lxml.etree": "lxml",
"odf": "odfpy",
"pandas_gbq": "pandas-gbq",
"python_calamine": "python-calamine",
"sqlalchemy": "SQLAlchemy",
"tables": "pytables",
}
10 changes: 5 additions & 5 deletions pandas/core/config_init.py
@@ -513,11 +513,11 @@ def use_inf_as_na_cb(key) -> None:
auto, {others}.
"""

-_xls_options = ["xlrd"]
-_xlsm_options = ["xlrd", "openpyxl"]
-_xlsx_options = ["xlrd", "openpyxl"]
-_ods_options = ["odf"]
-_xlsb_options = ["pyxlsb"]
+_xls_options = ["xlrd", "calamine"]
+_xlsm_options = ["xlrd", "openpyxl", "calamine"]
+_xlsx_options = ["xlrd", "openpyxl", "calamine"]
+_ods_options = ["odf", "calamine"]
+_xlsb_options = ["pyxlsb", "calamine"]


with cf.config_prefix("io.excel.xls"):
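
To make the effect of these option lists concrete, here is a minimal standalone sketch of the validation they imply: each extension accepts ``"auto"`` plus the engines listed for it. The names ``READER_OPTIONS`` and ``validate_reader`` are hypothetical and are not part of pandas; in pandas itself the lists feed the ``io.excel.<ext>.reader`` options, which would presumably be set with something like ``pd.set_option("io.excel.xlsb.reader", "calamine")``.

```python
# Allowed reader engines per extension, copied from the updated lists in this hunk.
READER_OPTIONS = {
    "xls": ["xlrd", "calamine"],
    "xlsm": ["xlrd", "openpyxl", "calamine"],
    "xlsx": ["xlrd", "openpyxl", "calamine"],
    "ods": ["odf", "calamine"],
    "xlsb": ["pyxlsb", "calamine"],
}


def validate_reader(ext: str, engine: str) -> str:
    """Hypothetical check: accept "auto" or any engine listed for the extension."""
    allowed = ["auto", *READER_OPTIONS[ext]]
    if engine not in allowed:
        raise ValueError(f"{engine!r} is not a valid reader for .{ext}; choose from {allowed}")
    return engine


print(validate_reader("xlsb", "calamine"))
```

After this change ``calamine`` is the only engine name that validates for every extension in the table.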
16 changes: 11 additions & 5 deletions pandas/io/excel/_base.py
@@ -159,13 +159,15 @@
of dtype conversion.
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
-Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".
+Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb", "calamine".
Engine compatibility :
- "xlrd" supports old-style Excel files (.xls).
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.
- "calamine" supports Excel (.xls, .xlsx, .xlsm, .xlsb)
and OpenDocument (.ods) file formats.
.. versionchanged:: 1.2.0
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
@@ -394,7 +396,7 @@ def read_excel(
| Callable[[str], bool]
| None = ...,
dtype: DtypeArg | None = ...,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
converters: dict[str, Callable] | dict[int, Callable] | None = ...,
true_values: Iterable[Hashable] | None = ...,
false_values: Iterable[Hashable] | None = ...,
@@ -433,7 +435,7 @@ def read_excel(
| Callable[[str], bool]
| None = ...,
dtype: DtypeArg | None = ...,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
converters: dict[str, Callable] | dict[int, Callable] | None = ...,
true_values: Iterable[Hashable] | None = ...,
false_values: Iterable[Hashable] | None = ...,
@@ -472,7 +474,7 @@ def read_excel(
| Callable[[str], bool]
| None = None,
dtype: DtypeArg | None = None,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = None,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = None,
converters: dict[str, Callable] | dict[int, Callable] | None = None,
true_values: Iterable[Hashable] | None = None,
false_values: Iterable[Hashable] | None = None,
@@ -1456,13 +1458,15 @@ class ExcelFile:
.xls, .xlsx, .xlsb, .xlsm, .odf, .ods, or .odt file.
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
-Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``
+Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``, ``calamine``
Engine compatibility :
- ``xlrd`` supports old-style Excel files (.xls).
- ``openpyxl`` supports newer Excel file formats.
- ``odf`` supports OpenDocument file formats (.odf, .ods, .odt).
- ``pyxlsb`` supports Binary Excel files.
- ``calamine`` supports Excel (.xls, .xlsx, .xlsm, .xlsb)
and OpenDocument (.ods) file formats.
.. versionchanged:: 1.2.0
@@ -1498,6 +1502,7 @@ class ExcelFile:
... df1 = pd.read_excel(xls, "Sheet1") # doctest: +SKIP
"""

from pandas.io.excel._calamine import CalamineReader
from pandas.io.excel._odfreader import ODFReader
from pandas.io.excel._openpyxl import OpenpyxlReader
from pandas.io.excel._pyxlsb import PyxlsbReader
@@ -1508,6 +1513,7 @@ class ExcelFile:
"openpyxl": OpenpyxlReader,
"odf": ODFReader,
"pyxlsb": PyxlsbReader,
"calamine": CalamineReader,
}

def __init__(
127 changes: 127 additions & 0 deletions pandas/io/excel/_calamine.py
@@ -0,0 +1,127 @@
from __future__ import annotations

from datetime import (
    date,
    datetime,
    time,
    timedelta,
)
from typing import (
    TYPE_CHECKING,
    Any,
    Union,
    cast,
)

from pandas._typing import Scalar
from pandas.compat._optional import import_optional_dependency
from pandas.util._decorators import doc

import pandas as pd
from pandas.core.shared_docs import _shared_docs

from pandas.io.excel._base import BaseExcelReader

if TYPE_CHECKING:
    from python_calamine import (
        CalamineSheet,
        CalamineWorkbook,
    )

    from pandas._typing import (
        FilePath,
        ReadBuffer,
        StorageOptions,
    )

_CellValueT = Union[int, float, str, bool, time, date, datetime, timedelta]


class CalamineReader(BaseExcelReader["CalamineWorkbook"]):
    @doc(storage_options=_shared_docs["storage_options"])
    def __init__(
        self,
        filepath_or_buffer: FilePath | ReadBuffer[bytes],
        storage_options: StorageOptions | None = None,
        engine_kwargs: dict | None = None,
    ) -> None:
        """
        Reader using calamine engine (xlsx/xls/xlsb/ods).

        Parameters
        ----------
        filepath_or_buffer : str, path to be parsed or
            an open readable stream.
        {storage_options}
        engine_kwargs : dict, optional
            Arbitrary keyword arguments passed to excel engine.
        """
        import_optional_dependency("python_calamine")
        super().__init__(
            filepath_or_buffer,
            storage_options=storage_options,
            engine_kwargs=engine_kwargs,
        )

    @property
    def _workbook_class(self) -> type[CalamineWorkbook]:
        from python_calamine import CalamineWorkbook

        return CalamineWorkbook

    def load_workbook(
        self, filepath_or_buffer: FilePath | ReadBuffer[bytes], engine_kwargs: Any
    ) -> CalamineWorkbook:
        from python_calamine import load_workbook

        return load_workbook(
            filepath_or_buffer, **engine_kwargs  # type: ignore[arg-type]
        )

    @property
    def sheet_names(self) -> list[str]:
        from python_calamine import SheetTypeEnum

        return [
            sheet.name
            for sheet in self.book.sheets_metadata
            if sheet.typ == SheetTypeEnum.WorkSheet
        ]

    def get_sheet_by_name(self, name: str) -> CalamineSheet:
        self.raise_if_bad_sheet_by_name(name)
        return self.book.get_sheet_by_name(name)

    def get_sheet_by_index(self, index: int) -> CalamineSheet:
        self.raise_if_bad_sheet_by_index(index)
        return self.book.get_sheet_by_index(index)

    def get_sheet_data(
        self, sheet: CalamineSheet, file_rows_needed: int | None = None
    ) -> list[list[Scalar]]:
        def _convert_cell(value: _CellValueT) -> Scalar:
            if isinstance(value, float):
                val = int(value)
                if val == value:
                    return val
                else:
                    return value
            elif isinstance(value, date):
                return pd.Timestamp(value)
            elif isinstance(value, timedelta):
                return pd.Timedelta(value)
            elif isinstance(value, time):
                # cast needed here because Scalar doesn't include datetime.time
                return cast(Scalar, value)

            return value

        rows: list[list[_CellValueT]] = sheet.to_python(skip_empty_area=False)
        data: list[list[Scalar]] = []

        for row in rows:
            data.append([_convert_cell(cell) for cell in row])
            if file_rows_needed is not None and len(data) >= file_rows_needed:
                break

        return data
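
The cell normalization in ``get_sheet_data`` can be tried out without pandas or ``python-calamine`` installed. The following is a simplified sketch, not the committed code: ``convert_cell`` and ``convert_rows`` are hypothetical names, ``datetime`` stands in for ``pd.Timestamp``, and ``timedelta`` and ``time`` values are passed through unchanged rather than converted.

```python
from __future__ import annotations

from datetime import date, datetime


def convert_cell(value):
    """Simplified stand-in for _convert_cell (assumption: stdlib types only)."""
    if isinstance(value, float):
        # Whole-number floats become ints, mirroring the reader above.
        as_int = int(value)
        return as_int if as_int == value else value
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        # The reader wraps dates in pd.Timestamp; here a midnight datetime is used.
        return datetime(value.year, value.month, value.day)
    return value


def convert_rows(rows, rows_needed: int | None = None):
    """Convert rows one by one and stop once enough rows have been collected."""
    data = []
    for row in rows:
        data.append([convert_cell(cell) for cell in row])
        if rows_needed is not None and len(data) >= rows_needed:
            break
    return data


sample = [[1.0, 2.5, "a", True], [date(2023, 9, 12)], [3.0]]
print(convert_rows(sample, rows_needed=2))
```

The early ``break`` only limits how many rows are converted. In the reader above, ``sheet.to_python`` has already materialized the whole sheet by that point, which fits the release note's remark that lazily iterating engines can be faster when only a few rows are needed.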