-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Improve memory usage with openpyxl #41767
Conversation
jordanrmerrick
commented
Jun 1, 2021
- closes BUG: df.to_excel() with openpyxl engine doesn't use write-optimized mode, resulting in higher memory consumption #41681
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
Accidentally added this
Hello @jordanrmerrick! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-09-09 18:56:54 UTC |
Removed a setting that wasn't supposed to be there, apologies for missing it.
Removed a setting that wasn't supposed to be there, apologies for missing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @jordanrmerrick
Since this is fixing a bug, the first thing would be to write a test that it fails with the current implementation. So we can see if the test is fixed after making the code changes. In this case can be a bit trickier, since for what I understand the bug is too high memory consumption. But in any case, if we don't have the test, how do we know this implementation makes sense?
And we also need a release note.
pandas/io/excel/_openpyxl.py
Outdated
try: | ||
# Sheets are not automatically created in the workbook | ||
self.book = Workbook(write_only=True) | ||
except ImportError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to know if lxml
is installed, use import lxml
. This ImportError
can be generated by different missing dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to know if
lxml
is installed, useimport lxml
. ThisImportError
can be generated by different missing dependencies.
Sounds good, I'll make that change now.
pandas/io/excel/_openpyxl.py
Outdated
# Sheets are not automatically created in the workbook | ||
self.book = Workbook(write_only=True) | ||
except ImportError: | ||
print("Warning: lxml is not installed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to send warning to the user, we better use proper warnings, this is not the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to send warning to the user, we better use proper warnings, this is not the way.
I thought so, can I ask if there is a standard that pandas uses for passing errors to the user? I think this also brings up an interesting point; if write_only
mode is unavailable, should that throw an error and stop the code, or just initialize a read/write workbook instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python has a warning
module, you can grep in the code for samples.
My understanding is that we are currently using unnecessary memory when reading Excel files. I think what we want is to try to save memory, but if it's not possible continue to open the file anyway. Warning the user that could be saving memory in the operation by installing a library seems reasonable. But I don't think it makes sense to fail if the user has enough memory for the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding is correct! I agree, a warning seems reasonable enough. I'm mainly asking because I'm new to contributing to pandas and still trying to get a feel for best practices within the library :)
pandas/io/excel/_openpyxl.py
Outdated
|
||
def save(self): | ||
""" | ||
Save workbook to disk. | ||
""" | ||
self.book.save(self.handles.handle) | ||
# TODO: Handle errors from saving more than once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we supposed to do with this TODO? Are you fixing it later in this PR, or do we need to create an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we supposed to do with this TODO? Are you fixing it later in this PR, or do we need to create an issue?
That was a TODO for me! I'm committing it today.
# Because this is used a few times in the class, | ||
# it's declared as a variable | ||
# Maybe move to within __init__? | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to remove this, I ended up moving the variable into __init__
. Should it remain there?
col = (col - mod) // 26 | ||
|
||
return "%s%d" % (col_name, row) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the nature of how merging cells in write_only
mode works, we need the row and column in Excel notation rather than its index. This converts a row and column number to excel format (e.g. A1
, XB19
, etc.).
if style_kwargs: | ||
first_row = startrow + cell.row + 1 | ||
last_row = startrow + cell.mergestart + 1 | ||
first_col = startcol + cell.col + 1 | ||
last_col = startcol + cell.mergeend + 1 | ||
|
||
for row in range(first_row, last_row + 1): | ||
for col in range(first_col, last_col + 1): | ||
if row == first_row and col == first_col: | ||
# Ignore first cell. It is already handled. | ||
continue | ||
xcell = wks.cell(column=col, row=row) | ||
for k, v in style_kwargs.items(): | ||
setattr(xcell, k, v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't finished this part yet. I'm still trying to figure out a way to apply the style to the other cells in the merged range. I'm not entirely sure there is one, but I'll spend more time prodding around.
If I can't find a reasonable workaround, does this have serious ramifications? I imagine probably not given that it's just styling the underlying cells, but maybe I'm missing something.
The above commit won't work right now, I'm just adding it so the progress is visible. I still need to change how the cells are styled before it should work. |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@jordanrmerrick Friendly ping to see if you're interested in continuing this PR. |
Thanks for the contribution, but it appears this PR has gone stale. Closing due to inactivity but if you're interested in continuing we would be happy to reopen. |
Hello, unfortunately I had to take an extended hiatus from pretty much all my personal coding work. I'm thankfully able to pick it up again, and this is a big priority of mine! If you're able to reopen this PR, I would appreciate it. I'll be working on this PR consistently. |
Hi, posting some updates here: I'm still working on this, but progress has been slow. The way It's taking so long to add to this because I think I'm going to have to rewrite a lot of Pandas' implementation of openpyxl and I'm trying to plan it out beforehand. Sorry about that! I should be pushing |
closing as stale but ping if you want to continue |