-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: df.to_excel() with openpyxl engine doesn't use write-optimized mode, resulting in higher memory consumption #41681
Comments
Thanks for looking into this @Kharms. This all sounds quite reasonable, would you be interested in putting up a pr? |
I'm happy to take this on. The caveats of using lxml and
The first doesn't seem to be an issue given that pandas already removes the sheets when it initializes a |
I've created a pull request as shown above. The main thing I'm concerned about is how it handles lxml not being present - excepting an |
I would like to tackle this issue if we think that its still relevant and Jordan is no longer interested in taking it on. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
Problem description
openpyxl
has fairly high memory consumption by default. To address this for write-only operations, they recommend two things:lxml
package, which is more optimized than the default xml parser.write_only
mode on the workbook.In
pandas/pandas/io/excel/_openpyxl.py
we have the a comment which states intent -# Create workbook object with default optimized_write=True.
However
openpyxl
depreciated theoptimized_write
option in version 2.4 and now recommends settingwrite_only=True
on workbook creation. Pandas requires 2.6 so the result is that writes to excel use more memory than necessary. Switching towrite_only=True
has other ramifications, like only being able to write out once, so would need to be tested.Proposed solution
lxml
being missing if writing to excel in an environment that doesn't have it installed.write_only
mode if possible.other context
Running on a GCP cloud function, I had a
df = pd.DataFrame()
of size ~150mb.df.to_excel()
without lxml used >2gb memory, crashing the function. With lxml it used ~1.4gb memory, returning successfully but getting too close. Adding in write-only should lower that by an order of magnitude to <100mb.The text was updated successfully, but these errors were encountered: