ENH: Add calamite engine to read_excel #50581

Closed
wants to merge 48 commits into from
48 commits
30da9a4
ENH: add calamite excel reader and modify test to include engine
kostyafarber Jan 5, 2023
a47d3fb
Merge branch 'main' into issue-50395
kostyafarber Jan 5, 2023
6c1dd87
Merge branch 'main' into issue-50395
kostyafarber Jan 5, 2023
fd06ad9
fix deps for python-calamine
kostyafarber Jan 5, 2023
8b6200a
Merge branch 'main' into issue-50395
kostyafarber Jan 5, 2023
6a8d822
fix deps for python-calamine, add as pip package
kostyafarber Jan 5, 2023
efcb2fc
ENH: fix typo in engine declaration, add import_optional_dependency, …
kostyafarber Jan 7, 2023
e1105de
Merge branch 'main' into issue-50395
kostyafarber Jan 11, 2023
6b50e0c
calamite -> calamine, updated some tests for calamine
dimastbk Jan 11, 2023
0784733
calamine excel engine: skip tests with datetime
dimastbk Jan 12, 2023
5971199
Merge branch 'main' into issue-50395
kostyafarber Jan 12, 2023
655318b
Merge branch 'issue-50395' into issue-50395-dima
dimastbk Jan 18, 2023
cc049cf
Merge branch 'main' into issue-50395
kostyafarber Jan 20, 2023
038133e
ENH: change reader filename match library, fix typo in engine name in…
kostyafarber Jan 22, 2023
52c2cbd
Merge branch 'issue-50395' into issue-50395-dima
kostyafarber Jan 22, 2023
2dc5e02
Merge pull request #1 from dimastbk/issue-50395-dima
kostyafarber Jan 22, 2023
2076e11
Merge pull request #2 from dimastbk/issue-50395-dima-skip-tests
kostyafarber Jan 22, 2023
6b0a7ac
Merge branch 'main' into issue-50395
kostyafarber Jan 22, 2023
a614089
ENH: add back get_sheet_by_index
kostyafarber Jan 22, 2023
256f9f9
Merge branch 'main' into issue-50395
kostyafarber Jan 23, 2023
eee8b4e
Merge branch 'main' into issue-50395
kostyafarber Jan 23, 2023
9fc2209
ENH: fix mypy and trailing whitespace
kostyafarber Jan 24, 2023
cf1268a
Merge branch 'main' into issue-50395
kostyafarber Jan 24, 2023
bebfec5
Merge branch 'main' into issue-50395
kostyafarber Jan 24, 2023
9019904
added conversion date/time/float, support file_rows_needed, fixed sup…
dimastbk Jan 26, 2023
08a5616
Update test_readers.py
dimastbk Jan 26, 2023
677a224
Merge pull request #3 from dimastbk/issue-50395
kostyafarber Jan 26, 2023
8c55e5d
Merge branch 'main' into issue-50395
kostyafarber Jan 26, 2023
d817999
update python-calamine to 0.0.7
dimastbk Jan 27, 2023
12aaf19
Merge branch 'main' into issue-50395
kostyafarber Jan 28, 2023
255e8fb
Merge pull request #4 from dimastbk/issue-50395
kostyafarber Jan 28, 2023
500fa9f
Merge branch 'main' into issue-50395
kostyafarber Jan 29, 2023
5d94728
fix review: use CalamineReader/CalamineSheet
dimastbk Jan 31, 2023
15874c3
fixed pyright, fixed docs in __init__
dimastbk Feb 6, 2023
89ae49e
Merge pull request #5 from dimastbk/issue-50395
kostyafarber Feb 7, 2023
33e5b7e
Merge branch 'main' into issue-50395
kostyafarber Feb 7, 2023
85d31ec
Merge branch 'main' into issue-50395
kostyafarber Mar 23, 2023
a0d4193
Merge branch 'main' into issue-50395
kostyafarber Mar 25, 2023
a6b6fb2
bump python-calamine to 0.1.0
dimastbk Mar 26, 2023
0a431c5
_ValueT -> _CellValueT
dimastbk Mar 29, 2023
745cd09
Merge pull request #6 from dimastbk/issue-50395
kostyafarber Mar 29, 2023
942a16a
Merge branch 'main' into issue-50395
kostyafarber Apr 1, 2023
8803ca9
Merge branch 'main' into issue-50395
kostyafarber Apr 2, 2023
2f5ffba
added xfail to tests, small fixes
dimastbk Apr 3, 2023
b8b1a9a
Merge pull request #7 from dimastbk/issue-50395
kostyafarber Apr 8, 2023
f5ab40d
Merge branch 'main' into issue-50395
kostyafarber Apr 8, 2023
02c2e7f
bump calamine to 0.1.1, update tests (472 passed, 75 xfailed), update…
dimastbk May 1, 2023
74a3e70
Merge pull request #8 from dimastbk/issue-50395
kostyafarber May 1, 2023
1 change: 1 addition & 0 deletions ci/deps/actions-310.yaml
@@ -55,4 +55,5 @@ dependencies:
  - zstandard>=0.15.2

  - pip:
    - python-calamine
    - tzdata>=2022.1
1 change: 1 addition & 0 deletions ci/deps/actions-311.yaml
@@ -55,4 +55,5 @@ dependencies:
  - zstandard>=0.15.2

  - pip:
    - python-calamine>=0.1.1
    - tzdata>=2022.1
1 change: 1 addition & 0 deletions ci/deps/actions-38-downstream_compat.yaml
@@ -70,4 +70,5 @@ dependencies:
  - py

  - pip:
    - python-calamine
    - tzdata>=2022.1
1 change: 1 addition & 0 deletions ci/deps/actions-38-minimum_versions.yaml
@@ -59,4 +59,5 @@ dependencies:

  - pip:
    - pyqt5==5.15.1
    - python-calamine==0.1.1
    - tzdata==2022.1
1 change: 1 addition & 0 deletions ci/deps/actions-38.yaml
@@ -55,4 +55,5 @@ dependencies:
  - zstandard>=0.15.2

  - pip:
    - python-calamine
    - tzdata>=2022.1
1 change: 1 addition & 0 deletions ci/deps/actions-39.yaml
@@ -55,4 +55,5 @@ dependencies:
  - zstandard>=0.15.2

  - pip:
    - python-calamine
    - tzdata>=2022.1
3 changes: 3 additions & 0 deletions ci/deps/circle-38-arm64.yaml
@@ -54,3 +54,6 @@ dependencies:
  - xlrd>=2.0.1
  - xlsxwriter>=1.4.3
  - zstandard>=0.15.2

  - pip:
    - python-calamine
1 change: 1 addition & 0 deletions doc/source/getting_started/install.rst
@@ -345,6 +345,7 @@ xlrd 2.0.1 excel Reading Excel
xlsxwriter                1.4.3              excel           Writing Excel
openpyxl                  3.0.7              excel           Reading / writing for xlsx files
pyxlsb                    1.0.8              excel           Reading for xlsb files
python-calamine           0.1.1              excel           Reading for xls/xlsx/xlsb/ods files
========================= ================== =============== =============================================================

HTML
4 changes: 3 additions & 1 deletion doc/source/user_guide/io.rst
@@ -3420,7 +3420,9 @@ Excel files
The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
can be read using ``xlrd``. Binary Excel (``.xlsb``)
-files can be read using ``pyxlsb``.
+files can be read using ``pyxlsb``. All of these formats can also be read using ``python-calamine``,
Review comment (Member):
Is the datetime issue the only limitation? If so we can probably be more explicit and say something like python-calamine can be used to read all formats, but specifically does not support reading datetimes from .xls and .xlsb formats

Reply (Contributor):
Datetime is the main limitation, but there are a few more bugs, #50581 (comment). I suppressed them all with pytest.xfail, but should I write about them in documentation?

+but this library has some limitations and behaves differently from other libraries;
+for example, it cannot detect dates in some formats (``.xls`` and ``.xlsb``).
The :meth:`~DataFrame.to_excel` instance method is used for
saving a ``DataFrame`` to Excel. Generally the semantics are
similar to working with :ref:`csv<io.read_csv_table>` data.
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v2.1.0.rst
@@ -86,8 +86,9 @@ Other enhancements
- Improved error message when creating a DataFrame with empty data (0 rows), no index and an incorrect number of columns. (:issue:`52084`)
- :meth:`DataFrame.applymap` now uses the :meth:`~api.extensions.ExtensionArray.map` method of underlying :class:`api.extensions.ExtensionArray` instances (:issue:`52219`)
- :meth:`arrays.SparseArray.map` now supports ``na_action`` (:issue:`52096`).
- :meth:`Categorical.map` and :meth:`CategoricalIndex.map` now have a ``na_action`` parameter (:issue:`44279`)
- Added ``calamine`` as an engine to ``read_excel`` (:issue:`50395`)
- Add dtype of categories to ``repr`` information of :class:`CategoricalDtype` (:issue:`52179`)
-

.. ---------------------------------------------------------------------------
.. _whatsnew_210.notable_bug_fixes:
1 change: 1 addition & 0 deletions environment.yml
@@ -117,5 +117,6 @@ dependencies:

  - pip:
    - sphinx-toggleprompt
    - python-calamine
    - typing_extensions; python_version<"3.11"
    - tzdata>=2022.1
2 changes: 2 additions & 0 deletions pandas/compat/_optional.py
@@ -37,6 +37,7 @@
"pyarrow": "7.0.0",
"pyreadstat": "1.1.2",
"pytest": "7.0.0",
"python-calamine": "0.1.1",
"pyxlsb": "1.0.8",
"s3fs": "2021.08.0",
"scipy": "1.7.1",
@@ -64,6 +65,7 @@
"lxml.etree": "lxml",
"odf": "odfpy",
"pandas_gbq": "pandas-gbq",
"python_calamine": "python-calamine",
"snappy": "python-snappy",
"sqlalchemy": "SQLAlchemy",
"tables": "pytables",
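The two tables touched in `pandas/compat/_optional.py` answer different questions: one gives a minimum version, the other translates an import name into the name a user installs. The sketch below shows how such tables might be consulted. `import_optional` and `minimum_version` are hypothetical helpers written for this note, not pandas' `import_optional_dependency`, and the lookup of the version table by import name is an assumption. Under that assumption, a hyphenated key such as `"python-calamine"` would not match the module name `python_calamine`, so it may be worth checking which spelling the version table expects.

```python
import importlib

# Hypothetical tables that mirror the shape of the entries in the diff.
MIN_VERSIONS = {"python-calamine": "0.1.1", "pyxlsb": "1.0.8"}
INSTALL_NAMES = {"python_calamine": "python-calamine", "odf": "odfpy"}


def import_optional(name, install_names=INSTALL_NAMES):
    """Import ``name`` or raise an ImportError that names the package to install."""
    install_name = install_names.get(name, name)
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            f"Missing optional dependency '{install_name}'. "
            f"Use pip or conda to install {install_name}."
        ) from None


def minimum_version(import_name, versions=MIN_VERSIONS):
    """Look up a minimum version by import name (an assumption of this sketch)."""
    return versions.get(import_name)
```

With these tables, `minimum_version("python_calamine")` finds no entry, while `minimum_version("pyxlsb")` does.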
10 changes: 5 additions & 5 deletions pandas/core/config_init.py
@@ -503,11 +503,11 @@ def use_inf_as_na_cb(key) -> None:
auto, {others}.
"""

-_xls_options = ["xlrd"]
-_xlsm_options = ["xlrd", "openpyxl"]
-_xlsx_options = ["xlrd", "openpyxl"]
-_ods_options = ["odf"]
-_xlsb_options = ["pyxlsb"]
+_xls_options = ["xlrd", "calamine"]
+_xlsm_options = ["xlrd", "openpyxl", "calamine"]
+_xlsx_options = ["xlrd", "openpyxl", "calamine"]
+_ods_options = ["odf", "calamine"]
+_xlsb_options = ["pyxlsb", "calamine"]


with cf.config_prefix("io.excel.xls"):
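The option lists in `config_init.py` decide which reader names are accepted for each file extension, with `auto` accepted everywhere according to the doc template in the same hunk. A minimal sketch of that validation follows. `READER_OPTIONS` and `validate_reader` are hypothetical names used for illustration and are not part of pandas.

```python
# Reader options per extension, copied from the updated lists in the diff.
READER_OPTIONS = {
    "xls": ["xlrd", "calamine"],
    "xlsm": ["xlrd", "openpyxl", "calamine"],
    "xlsx": ["xlrd", "openpyxl", "calamine"],
    "ods": ["odf", "calamine"],
    "xlsb": ["pyxlsb", "calamine"],
}


def validate_reader(ext: str, reader: str) -> str:
    """Return ``reader`` if it is "auto" or listed for ``ext``; raise otherwise."""
    ext = ext.lower().lstrip(".")
    if ext not in READER_OPTIONS:
        raise ValueError(f"Unknown extension: {ext!r}")
    allowed = ["auto", *READER_OPTIONS[ext]]
    if reader not in allowed:
        raise ValueError(
            f"{reader!r} is not a valid reader for .{ext}; choose from {allowed}"
        )
    return reader
```

After this change `calamine` is the only reader name that passes for all five extensions.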
16 changes: 11 additions & 5 deletions pandas/io/excel/_base.py
@@ -149,13 +149,15 @@
of dtype conversion.
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
-Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".
+Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb", "calamine".
Engine compatibility :

- "xlrd" supports old-style Excel files (.xls).
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.
- "calamine" supports Excel (.xls, .xlsx, .xlsm, .xlsb)
and OpenDocument (.ods) file formats.

.. versionchanged:: 1.2.0
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
Expand Down Expand Up @@ -375,7 +377,7 @@ def read_excel(
| Callable[[str], bool]
| None = ...,
dtype: DtypeArg | None = ...,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
converters: dict[str, Callable] | dict[int, Callable] | None = ...,
true_values: Iterable[Hashable] | None = ...,
false_values: Iterable[Hashable] | None = ...,
@@ -414,7 +416,7 @@ def read_excel(
| Callable[[str], bool]
| None = ...,
dtype: DtypeArg | None = ...,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
converters: dict[str, Callable] | dict[int, Callable] | None = ...,
true_values: Iterable[Hashable] | None = ...,
false_values: Iterable[Hashable] | None = ...,
@@ -453,7 +455,7 @@ def read_excel(
| Callable[[str], bool]
| None = None,
dtype: DtypeArg | None = None,
-engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = None,
+engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = None,
converters: dict[str, Callable] | dict[int, Callable] | None = None,
true_values: Iterable[Hashable] | None = None,
false_values: Iterable[Hashable] | None = None,
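The `Literal[...]` annotation on `engine` is checked by static type checkers only; the annotation itself does not reject a misspelled name at run time. As an illustration, the sketch below derives a run-time check from the same set of names. `EngineName` and `check_engine` are hypothetical and are not taken from the pandas code base.

```python
from typing import Literal, Optional, get_args

# The same engine names as the updated annotation in the diff.
EngineName = Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"]


def check_engine(engine: Optional[str]) -> Optional[str]:
    """Hypothetical run-time check: accept None or one of the Literal values."""
    if engine is not None and engine not in get_args(EngineName):
        raise ValueError(f"Unknown engine: {engine}")
    return engine
```

A check like this would have caught the early `calamite` spelling that a later commit in this PR corrects.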
@@ -1418,13 +1420,15 @@ class ExcelFile:
.xls, .xlsx, .xlsb, .xlsm, .odf, .ods, or .odt file.
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
-Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``
+Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``, ``calamine``
Engine compatibility :

- ``xlrd`` supports old-style Excel files (.xls).
- ``openpyxl`` supports newer Excel file formats.
- ``odf`` supports OpenDocument file formats (.odf, .ods, .odt).
- ``pyxlsb`` supports Binary Excel files.
- ``calamine`` supports Excel (.xls, .xlsx, .xlsm, .xlsb)
and OpenDocument (.ods) file formats.

.. versionchanged:: 1.2.0

@@ -1452,6 +1456,7 @@ class ExcelFile:
This is not supported, switch to using ``openpyxl`` instead.
"""

from pandas.io.excel._calamine import CalamineReader
from pandas.io.excel._odfreader import ODFReader
from pandas.io.excel._openpyxl import OpenpyxlReader
from pandas.io.excel._pyxlsb import PyxlsbReader
@@ -1462,6 +1467,7 @@ class ExcelFile:
"openpyxl": OpenpyxlReader,
"odf": ODFReader,
"pyxlsb": PyxlsbReader,
"calamine": CalamineReader,
}

def __init__(
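Registering the reader amounts to one new entry in a name-to-class mapping, which is why the change to `ExcelFile` is so small. The pattern can be sketched as below. All class and function names here are hypothetical stand-ins, and the error message is an assumption rather than a quote from pandas.

```python
class BaseReader:
    """Hypothetical stand-in for a reader base class."""

    def __init__(self, source):
        self.source = source


class XlrdLikeReader(BaseReader):
    """Placeholder for an existing engine."""


class CalamineLikeReader(BaseReader):
    """Placeholder for the newly registered engine."""


# Engine name -> reader class; adding an engine is one new entry.
ENGINES = {"xlrd": XlrdLikeReader, "calamine": CalamineLikeReader}


def make_reader(source, engine):
    """Instantiate the reader registered under ``engine``."""
    try:
        reader_cls = ENGINES[engine]
    except KeyError:
        raise ValueError(f"Unknown engine: {engine}") from None
    return reader_cls(source)
```

Callers select a reader by name only, so they need no knowledge of the concrete classes.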
99 changes: 99 additions & 0 deletions pandas/io/excel/_calamine.py
@@ -0,0 +1,99 @@
from __future__ import annotations

from datetime import (
    date,
    datetime,
    time,
)
from typing import (
    TYPE_CHECKING,
    Union,
)

from pandas.compat._optional import import_optional_dependency
from pandas.util._decorators import doc

import pandas as pd
from pandas.core.shared_docs import _shared_docs

from pandas.io.excel._base import BaseExcelReader

if TYPE_CHECKING:
    from pandas._typing import (
        FilePath,
        ReadBuffer,
        Scalar,
        StorageOptions,
    )

_CellValueT = Union[int, float, str, bool, time, date, datetime]


class CalamineReader(BaseExcelReader):
    @doc(storage_options=_shared_docs["storage_options"])
    def __init__(
        self,
        filepath_or_buffer: FilePath | ReadBuffer[bytes],
        storage_options: StorageOptions = None,
    ) -> None:
        """
        Reader using calamine engine (xlsx/xls/xlsb/ods).

        Parameters
        ----------
        filepath_or_buffer : str, path to be parsed or
            an open readable stream.
        {storage_options}
        """
        import_optional_dependency("python_calamine")
        super().__init__(filepath_or_buffer, storage_options=storage_options)

    @property
    def _workbook_class(self):
        from python_calamine import CalamineWorkbook

        return CalamineWorkbook

    def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]):
        from python_calamine import load_workbook

        return load_workbook(filepath_or_buffer)  # type: ignore[arg-type]

    @property
    def sheet_names(self) -> list[str]:
        return self.book.sheet_names  # pyright: ignore

    def get_sheet_by_name(self, name: str):
        self.raise_if_bad_sheet_by_name(name)
        return self.book.get_sheet_by_name(name)  # pyright: ignore

    def get_sheet_by_index(self, index: int):
        self.raise_if_bad_sheet_by_index(index)
        return self.book.get_sheet_by_index(index)  # pyright: ignore

    def get_sheet_data(
        self, sheet, file_rows_needed: int | None = None
    ) -> list[list[Scalar]]:
        def _convert_cell(value: _CellValueT) -> Scalar:
            if isinstance(value, float):
                val = int(value)
                if val == value:
                    return val
                else:
                    return value
            elif isinstance(value, date):
                return pd.Timestamp(value)
            elif isinstance(value, time):
                return value.isoformat()

            return value

        rows: list[list[_CellValueT]] = sheet.to_python(skip_empty_area=False)
        data: list[list[Scalar]] = []

        for row in rows:
            data.append([_convert_cell(cell) for cell in row])
            if file_rows_needed is not None and len(data) >= file_rows_needed:
                break

        return data
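The cell conversion in `get_sheet_data` can be exercised without pandas or python-calamine. The standard-library sketch below follows the same branches, with two deliberate substitutions that are not in the PR: a plain `datetime` stands in for `pd.Timestamp`, and `float.is_integer()` replaces the `int(value) == value` comparison so that `nan` and `inf` pass through instead of raising. `convert_cell` and `take_rows` are hypothetical names.

```python
from datetime import date, datetime, time


def convert_cell(value):
    """Normalise one cell value, following the branches of ``_convert_cell``."""
    if isinstance(value, float):
        # is_integer() is False for nan and inf, so those are returned as-is,
        # whereas int(value) would raise for them.
        return int(value) if value.is_integer() else value
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        # Stand-in for pd.Timestamp(value): promote a plain date to midnight.
        return datetime(value.year, value.month, value.day)
    if isinstance(value, time):
        return value.isoformat()
    return value


def take_rows(rows, rows_needed=None):
    """Convert rows in order, stopping once ``rows_needed`` rows are collected."""
    data = []
    for row in rows:
        data.append([convert_cell(cell) for cell in row])
        if rows_needed is not None and len(data) >= rows_needed:
            break
    return data
```

For example, `take_rows([[1.0, "a"], [2.5, "b"], [3.0, "c"]], rows_needed=2)` keeps the first two rows and turns `1.0` into the integer `1` while leaving `2.5` as a float.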