BUG: Improve memory usage with openpyxl #41767

jordanrmerrick · 2021-06-01T15:18:58Z

closes BUG: df.to_excel() with openpyxl engine doesn't use write-optimized mode, resulting in higher memory consumption #41681
tests added / passed
Ensure all linting tests pass, see here for how to run them

pep8speaks · 2021-06-01T15:26:13Z

Hello @jordanrmerrick! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-09 18:56:54 UTC

pandas/__init__.py

Removed a setting that wasn't supposed to be there, apologies for missing it.

datapythonista

Thanks for the contribution @jordanrmerrick

Since this is fixing a bug, the first step would be to write a test that fails with the current implementation, so we can see whether it passes after the code changes. In this case that may be a bit trickier, since from what I understand the bug is excessive memory consumption. But in any case, without the test, how do we know this implementation makes sense?

And we also need a release note.

datapythonista · 2021-06-01T23:00:42Z

pandas/io/excel/_openpyxl.py

+            try:
+                # Sheets are not automatically created in the workbook
+                self.book = Workbook(write_only=True)
+            except ImportError:


If you want to know if lxml is installed, use import lxml. This ImportError can be generated by different missing dependencies.

Sounds good, I'll make that change now.
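For reference, a minimal sketch of the kind of explicit check suggested above. The helper name is_installed is hypothetical and not part of this PR; it only illustrates testing for one specific module rather than catching an ImportError that could come from any missing dependency. pandas also has its own helper for optional dependencies (import_optional_dependency in pandas.compat._optional), which may be the better fit inside the library.

```python
import importlib


def is_installed(module_name: str) -> bool:
    """Return True if the named module can be imported (hypothetical helper)."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here.
        return False
    return True


# The result for lxml depends on the local environment.
print("lxml available:", is_installed("lxml"))
```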

datapythonista · 2021-06-01T23:01:25Z

pandas/io/excel/_openpyxl.py

+                # Sheets are not automatically created in the workbook
+                self.book = Workbook(write_only=True)
+            except ImportError:
+                print("Warning: lxml is not installed")


If we want to send a warning to the user, we should use proper warnings; print is not the way to do it.

I thought so. Can I ask if there is a standard that pandas uses for passing errors to the user? I think this also brings up an interesting point: if write_only mode is unavailable, should that raise an error and stop, or just initialize a read/write workbook instead?

Python has a warnings module; you can grep the code for examples.

My understanding is that we are currently using unnecessary memory when writing Excel files. I think what we want is to try to save memory, but if that's not possible, continue with the operation anyway. Warning the user that they could save memory by installing a library seems reasonable. But I don't think it makes sense to fail if the user has enough memory for the operation.

Your understanding is correct! I agree, a warning seems reasonable enough. I'm mainly asking because I'm new to contributing to pandas and still trying to get a feel for best practices within the library :)
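A rough sketch of the warn-and-fall-back behavior agreed on above, using the standard warnings module. The function name use_write_only and the message text are made up for illustration, and it assumes (as this thread does) that write_only mode depends on lxml being installed.

```python
import warnings


def use_write_only(lxml_installed: bool) -> bool:
    """Decide whether to request write_only mode (hypothetical helper)."""
    if lxml_installed:
        return True
    # Warn instead of printing or raising: the write can still succeed,
    # it just may use more memory than necessary.
    warnings.warn(
        "lxml is not installed; falling back to a regular workbook, "
        "which may use more memory.",
        UserWarning,
        stacklevel=2,
    )
    return False
```

The caller would then pass the returned flag on when creating the workbook. stacklevel=2 points the warning at the caller's line rather than at the helper itself.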

datapythonista · 2021-06-01T23:02:23Z

pandas/io/excel/_openpyxl.py


    def save(self):
        """
        Save workbook to disk.
        """
-        self.book.save(self.handles.handle)
+        # TODO: Handle errors from saving more than once


What are we supposed to do with this TODO? Are you fixing it later in this PR, or do we need to create an issue?

That was a TODO for me! I'm committing it today.

jordanrmerrick · 2021-06-02T21:10:21Z

pandas/io/excel/_openpyxl.py

+    # Because this is used a few times in the class,
+    # it's declared as a variable
+    # Maybe move to within __init__?
+


Forgot to remove this, I ended up moving the variable into __init__. Should it remain there?

jordanrmerrick · 2021-06-02T21:13:07Z

pandas/io/excel/_openpyxl.py

+            col = (col - mod) // 26
+
+        return "%s%d" % (col_name, row)
+


Because of how merging cells works in write_only mode, we need the row and column in Excel notation rather than as numeric indices. This converts a row and column number to Excel format (e.g. A1, XB19, etc.).
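A self-contained sketch of that conversion, for illustration only. It is not the code from this PR: the function name to_excel_ref is hypothetical, and it assumes 1-based row and column numbers. openpyxl also provides get_column_letter in openpyxl.utils for the column part.

```python
def to_excel_ref(row: int, col: int) -> str:
    """Convert 1-based (row, col) numbers to an Excel reference such as 'A1'."""
    if row < 1 or col < 1:
        raise ValueError("row and col must be >= 1")
    letters = ""
    while col > 0:
        # Columns are a bijective base-26 number: A..Z, AA..AZ, BA.., so
        # shift by one before dividing to map 1..26 onto 0..25.
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return f"{letters}{row}"


print(to_excel_ref(1, 1))     # A1
print(to_excel_ref(19, 626))  # XB19
```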

jordanrmerrick · 2021-06-02T21:16:59Z

pandas/io/excel/_openpyxl.py

+                    if style_kwargs:
+                        first_row = startrow + cell.row + 1
+                        last_row = startrow + cell.mergestart + 1
+                        first_col = startcol + cell.col + 1
+                        last_col = startcol + cell.mergeend + 1
+
+                        for row in range(first_row, last_row + 1):
+                            for col in range(first_col, last_col + 1):
+                                if row == first_row and col == first_col:
+                                    # Ignore first cell. It is already handled.
+                                    continue
+                                xcell = wks.cell(column=col, row=row)
+                                for k, v in style_kwargs.items():
+                                    setattr(xcell, k, v)


Haven't finished this part yet. I'm still trying to figure out a way to apply the style to the other cells in the merged range. I'm not entirely sure there is one, but I'll spend more time prodding around.

If I can't find a reasonable workaround, does this have serious ramifications? I imagine probably not given that it's just styling the underlying cells, but maybe I'm missing something.
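For anyone following along, the nested loop in the snippet above visits every cell of the merged range except the top-left one, which has already been styled. A standalone sketch of just that iteration, with no openpyxl dependency (the function name is hypothetical):

```python
from typing import Iterator, Tuple


def merged_range_cells(
    first_row: int, last_row: int, first_col: int, last_col: int
) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) for each cell in the range, skipping the top-left cell."""
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            if row == first_row and col == first_col:
                # The first cell is already handled by the caller.
                continue
            yield row, col


print(list(merged_range_cells(1, 2, 1, 2)))  # [(1, 2), (2, 1), (2, 2)]
```

How to apply the style to each of these cells in write_only mode is the open question; the sketch only covers which cells are involved.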

jordanrmerrick · 2021-06-02T21:18:27Z

The above commit won't work right now, I'm just adding it so the progress is visible. I still need to change how the cells are styled before it should work.

github-actions · 2021-07-05T00:02:26Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

rhshadrach · 2021-07-29T20:32:48Z

@jordanrmerrick Friendly ping to see if you're interested in continuing this PR.

mroeschke · 2021-08-17T02:14:55Z

Thanks for the contribution, but it appears this PR has gone stale. Closing due to inactivity but if you're interested in continuing we would be happy to reopen.

jordanrmerrick · 2021-09-09T18:56:05Z

Hello, unfortunately I had to take an extended hiatus from pretty much all my personal coding work. I'm thankfully able to pick it up again, and this is a big priority of mine!

If you're able to reopen this PR, I would appreciate it. I'll be working on this PR consistently.

jordanrmerrick · 2021-10-05T14:44:38Z

Hi, posting some updates here:

I'm still working on this, but progress has been slow. The way write_only workbooks work is causing complications with how to add styles (fonts, colors, etc.) to cells in a way that works with the existing methods in Pandas' code. Merged cells are also proving to be quite tricky.

It's taking so long to add to this because I think I'm going to have to rewrite a lot of Pandas' implementation of openpyxl and I'm trying to plan it out beforehand. Sorry about that! I should be pushing ~~some~~ code in the next day or so.

jreback · 2022-01-16T18:07:31Z

closing as stale but ping if you want to continue

jordanrmerrick added 9 commits May 31, 2021 13:24

PERF

089b810

changed import error message to be more concise

300bbc7

some comments changed

c793a50

Merge branch 'pandas-dev:master' into write_only

84b7c46

inline with pandas linting

16b26d6

changed todo for save function

1170a24

Merge branch 'master' of github.com:jordanrmerrick/pandas into write_…

d5723e8

…only

Merge branch 'write_only' of github.com:jordanrmerrick/pandas into wr…

3514db2

…ite_only

Delete miniconda.sh

9b7c119

Accidentally added this

jordanrmerrick added 2 commits June 1, 2021 11:26

Fixed Dockerfile

16de5ce

Fixed github_username

06384d0

jbrockmendel reviewed Jun 1, 2021

jordanrmerrick added 2 commits June 1, 2021 14:47

Fixed __init__.py

b931796

Removed a setting that wasn't supposed to be there, apologies for missing it.

Fixed __init__.py

5f95f30

Removed a setting that wasn't supposed to be there, apologies for missing it.

datapythonista requested changes Jun 1, 2021

datapythonista added the Bug, IO Excel (read_excel, to_excel), and Performance (Memory or execution speed performance) labels Jun 1, 2021

datapythonista changed the title from "Write only" to "BUG: Improve memory usage with openpyxl" Jun 1, 2021

jordanrmerrick added 2 commits June 2, 2021 21:07

changes in write_cells for write_only mode

806e544

Merge branch 'write_only' of github.com:jordanrmerrick/pandas into wr…

eab9a1c

…ite_only

jordanrmerrick commented Jun 2, 2021

lithomas1 removed the Bug label Jun 3, 2021

jordanrmerrick added 2 commits June 3, 2021 17:52

Merge branch 'master' of github.com:jordanrmerrick/pandas into write_…

f216313

…only

commenting out post-merge styling to check if rest of the code works

193a0a1

github-actions bot added the Stale label Jul 5, 2021

mroeschke closed this Aug 17, 2021

mroeschke reopened this Sep 9, 2021

jreback closed this Jan 16, 2022
