Skip to content

BUG: df.to_excel() with openpyxl engine doesn't use write-optimized mode, resulting in higher memory consumption #41681

Open
@Kharms

Description

@Kharms
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.


Problem description

openpyxl has fairly high memory consumption by default. To address this for write-only operations, they recommend two things:

  1. Install the lxml package, which is more optimized than the default xml parser.
  2. Use the write_only mode on the workbook.

In pandas/pandas/io/excel/_openpyxl.py we have the a comment which states intent - # Create workbook object with default optimized_write=True.

However openpyxl depreciated the optimized_write option in version 2.4 and now recommends setting write_only=True on workbook creation. Pandas requires 2.6 so the result is that writes to excel use more memory than necessary. Switching to write_only=True has other ramifications, like only being able to write out once, so would need to be tested.

Proposed solution

  1. Print out a warning about lxml being missing if writing to excel in an environment that doesn't have it installed.
  2. Use the write_only mode if possible.

other context

Running on a GCP cloud function, I had a df = pd.DataFrame() of size ~150mb. df.to_excel() without lxml used >2gb memory, crashing the function. With lxml it used ~1.4gb memory, returning successfully but getting too close. Adding in write-only should lower that by an order of magnitude to <100mb.

Metadata

Metadata

Assignees

No one assigned

    Labels

    IO Excelread_excel, to_excelPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions