Description
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
Problem description
openpyxl
has fairly high memory consumption by default. To address this for write-only operations, they recommend two things:
- Install the
lxml
package, which is more optimized than the default xml parser. - Use the
write_only
mode on the workbook.
In pandas/pandas/io/excel/_openpyxl.py
we have the a comment which states intent - # Create workbook object with default optimized_write=True.
However openpyxl
depreciated the optimized_write
option in version 2.4 and now recommends setting write_only=True
on workbook creation. Pandas requires 2.6 so the result is that writes to excel use more memory than necessary. Switching to write_only=True
has other ramifications, like only being able to write out once, so would need to be tested.
Proposed solution
- Print out a warning about
lxml
being missing if writing to excel in an environment that doesn't have it installed. - Use the
write_only
mode if possible.
other context
Running on a GCP cloud function, I had a df = pd.DataFrame()
of size ~150mb. df.to_excel()
without lxml used >2gb memory, crashing the function. With lxml it used ~1.4gb memory, returning successfully but getting too close. Adding in write-only should lower that by an order of magnitude to <100mb.