Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df.to_excel() with openpyxl engine doesn't use write-optimized mode, resulting in higher memory consumption #41681

Open
2 tasks done
Kharms opened this issue May 26, 2021 · 4 comments
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance

Comments

@Kharms
Copy link

Kharms commented May 26, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.


Problem description

openpyxl has fairly high memory consumption by default. To address this for write-only operations, they recommend two things:

  1. Install the lxml package, which is more optimized than the default xml parser.
  2. Use the write_only mode on the workbook.

In pandas/pandas/io/excel/_openpyxl.py we have the a comment which states intent - # Create workbook object with default optimized_write=True.

However openpyxl depreciated the optimized_write option in version 2.4 and now recommends setting write_only=True on workbook creation. Pandas requires 2.6 so the result is that writes to excel use more memory than necessary. Switching to write_only=True has other ramifications, like only being able to write out once, so would need to be tested.

Proposed solution

  1. Print out a warning about lxml being missing if writing to excel in an environment that doesn't have it installed.
  2. Use the write_only mode if possible.

other context

Running on a GCP cloud function, I had a df = pd.DataFrame() of size ~150mb. df.to_excel() without lxml used >2gb memory, crashing the function. With lxml it used ~1.4gb memory, returning successfully but getting too close. Adding in write-only should lower that by an order of magnitude to <100mb.

@Kharms Kharms added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2021
@mzeitlin11
Copy link
Member

Thanks for looking into this @Kharms. This all sounds quite reasonable, would you be interested in putting up a pr?

@lithomas1 lithomas1 added IO Excel read_excel, to_excel Performance Memory or execution speed performance and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2021
@mzeitlin11 mzeitlin11 added this to the Contributions Welcome milestone May 26, 2021
@jordanrmerrick
Copy link

I'm happy to take this on. The caveats of using lxml and write_only mode are:

  1. Worksheets are not initialized automatically, we have to use create_sheet on the book.
  2. Rows can only be appended, we can't go back to arbitrary locations to write data.
  3. The book can only be saved once.
  4. Everything that shows up in the file needs to be created before being added (like frozen cells).

The first doesn't seem to be an issue given that pandas already removes the sheets when it initializes a Workbook() object. Are there use cases that would break with no arbitrary cell writing, or only being able to save the book once?

@jordanrmerrick
Copy link

I've created a pull request as shown above. The main thing I'm concerned about is how it handles lxml not being present - excepting an ImportError and initializing a regular workbook. Further, right now there isn't an option for the user to disable write_only mode except by removing lxml from their environment. Should I add an option to disable it, and if so, at what level would be appropriate (e.g. at the lowest level of the OpenpyxlWriter class, or higher up at df.to_excel(write_only=True))?

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rmhowe425
Copy link
Contributor

@mroeschke @jordanrmerrick

I would like to tackle this issue if we think that its still relevant and Jordan is no longer interested in taking it on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants