Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Improve memory usage with openpyxl #41767

Closed
wants to merge 17 commits into from

Conversation

jordanrmerrick
Copy link

@pep8speaks
Copy link

pep8speaks commented Jun 1, 2021

Hello @jordanrmerrick! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-09 18:56:54 UTC

pandas/__init__.py Outdated Show resolved Hide resolved
Removed a setting that wasn't supposed to be there, apologies for missing it.
Removed a setting that wasn't supposed to be there, apologies for missing it.
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @jordanrmerrick

Since this is fixing a bug, the first thing would be to write a test that it fails with the current implementation. So we can see if the test is fixed after making the code changes. In this case can be a bit trickier, since for what I understand the bug is too high memory consumption. But in any case, if we don't have the test, how do we know this implementation makes sense?

And we also need a release note.

try:
# Sheets are not automatically created in the workbook
self.book = Workbook(write_only=True)
except ImportError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to know if lxml is installed, use import lxml. This ImportError can be generated by different missing dependencies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to know if lxml is installed, use import lxml. This ImportError can be generated by different missing dependencies.

Sounds good, I'll make that change now.

# Sheets are not automatically created in the workbook
self.book = Workbook(write_only=True)
except ImportError:
print("Warning: lxml is not installed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to send warning to the user, we better use proper warnings, this is not the way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to send warning to the user, we better use proper warnings, this is not the way.

I thought so, can I ask if there is a standard that pandas uses for passing errors to the user? I think this also brings up an interesting point; if write_only mode is unavailable, should that throw an error and stop the code, or just initialize a read/write workbook instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python has a warning module, you can grep in the code for samples.

My understanding is that we are currently using unnecessary memory when reading Excel files. I think what we want is to try to save memory, but if it's not possible continue to open the file anyway. Warning the user that could be saving memory in the operation by installing a library seems reasonable. But I don't think it makes sense to fail if the user has enough memory for the operation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct! I agree, a warning seems reasonable enough. I'm mainly asking because I'm new to contributing to pandas and still trying to get a feel for best practices within the library :)


def save(self):
"""
Save workbook to disk.
"""
self.book.save(self.handles.handle)
# TODO: Handle errors from saving more than once
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we supposed to do with this TODO? Are you fixing it later in this PR, or do we need to create an issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we supposed to do with this TODO? Are you fixing it later in this PR, or do we need to create an issue?

That was a TODO for me! I'm committing it today.

@datapythonista datapythonista added Bug IO Excel read_excel, to_excel Performance Memory or execution speed performance labels Jun 1, 2021
@datapythonista datapythonista changed the title Write only BUG: Improve memory usage with openpyxl Jun 1, 2021
# Because this is used a few times in the class,
# it's declared as a variable
# Maybe move to within __init__?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to remove this, I ended up moving the variable into __init__. Should it remain there?

col = (col - mod) // 26

return "%s%d" % (col_name, row)

Copy link
Author

@jordanrmerrick jordanrmerrick Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the nature of how merging cells in write_only mode works, we need the row and column in Excel notation rather than its index. This converts a row and column number to excel format (e.g. A1, XB19, etc.).

Comment on lines +544 to +557
if style_kwargs:
first_row = startrow + cell.row + 1
last_row = startrow + cell.mergestart + 1
first_col = startcol + cell.col + 1
last_col = startcol + cell.mergeend + 1

for row in range(first_row, last_row + 1):
for col in range(first_col, last_col + 1):
if row == first_row and col == first_col:
# Ignore first cell. It is already handled.
continue
xcell = wks.cell(column=col, row=row)
for k, v in style_kwargs.items():
setattr(xcell, k, v)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished this part yet. I'm still trying to figure out a way to apply the style to the other cells in the merged range. I'm not entirely sure there is one, but I'll spend more time prodding around.

If I can't find a reasonable workaround, does this have serious ramifications? I imagine probably not given that it's just styling the underlying cells, but maybe I'm missing something.

@jordanrmerrick
Copy link
Author

The above commit won't work right now, I'm just adding it so the progress is visible. I still need to change how the cells are styled before it should work.

@lithomas1 lithomas1 removed the Bug label Jun 3, 2021
@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 5, 2021
@rhshadrach
Copy link
Member

@jordanrmerrick Friendly ping to see if you're interested in continuing this PR.

@mroeschke
Copy link
Member

Thanks for the contribution, but it appears this PR has gone stale. Closing due to inactivity but if you're interested in continuing we would be happy to reopen.

@mroeschke mroeschke closed this Aug 17, 2021
@jordanrmerrick
Copy link
Author

jordanrmerrick commented Sep 9, 2021

Thanks for the contribution, but it appears this PR has gone stale. Closing due to inactivity but if you're interested in continuing we would be happy to reopen.

Hello, unfortunately I had to take an extended hiatus from pretty much all my personal coding work. I'm thankfully able to pick it up again, and this is a big priority of mine!

If you're able to reopen this PR, I would appreciate it. I'll be working on this PR consistently.

@mroeschke mroeschke reopened this Sep 9, 2021
@jordanrmerrick
Copy link
Author

Hi, posting some updates here:

I'm still working on this, but progress has been slow. The way write_only workbooks work is causing complications with how to add styles (fonts, colors, etc.) to cells in a way that works with the existing methods in Pandas' code. Merged cells are also proving to be quite tricky.

It's taking so long to add to this because I think I'm going to have to rewrite a lot of Pandas' implementation of openpyxl and I'm trying to plan it out beforehand. Sorry about that! I should be pushing some code in the next day or so.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

closing as stale but ping if you want to continue

@jreback jreback closed this Jan 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance Stale
Projects
None yet
8 participants