Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: GH17054: read_html() handles rowspan/colspan and infers headers #17089

Closed
wants to merge 145 commits into from

Conversation

jowens
Copy link

@jowens jowens commented Jul 26, 2017

@jowens jowens mentioned this pull request Jul 26, 2017
4 tasks
@gfyoung gfyoung added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jul 26, 2017
@jowens
Copy link
Author

jowens commented Jul 27, 2017

@gfyoung can you help me reconcile failing tests in the CI checks vs. clean tests on my machine?

$ py.test-2.7 pandas/tests/io/test_html.py
============================= test session starts ==============================
platform darwin -- Python 2.7.13, pytest-3.1.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/jowens/Documents/src/pandas-jowens, inifile: setup.cfg
collected 76 items

pandas/tests/io/test_html.py ............................................................................

========================== 76 passed in 49.13 seconds ==========================

@gfyoung
Copy link
Member

gfyoung commented Jul 27, 2017

@jowens : Judging from your Travis build output, here's what I see:

  • You have a style-check error on the Python 2.7 machine. However, all tests are passing. That indicates your patch is working for Python 2.7. You should patch this first to get that build working at least.

  • Your changes are failing for Python 3.x, meaning that they are perhaps not compatible with Python 3. I would suggest that you create a separate environment using Python 3 as your interpreter (I see you are using Python 2), install your branch, and rerun tests to see if you can replicate the failures.

If you can, then continue developing in the Python 3.x environment to get tests passing. If you can't replicate, then that's going to be tricky to debug, but we can address that if we get there.

@jowens
Copy link
Author

jowens commented Jul 27, 2017

Can you point me to the "style-check error"?

@gfyoung
Copy link
Member

gfyoung commented Jul 27, 2017

The error can be found here. BTW, you can also see the error yourself by running the style-check command in the PR description:

git diff upstream/master -u -- "*.py" | flake8 --diff

@jowens
Copy link
Author

jowens commented Jul 27, 2017

Yeah, I did that before checking in:

$ git diff upstream/master -u -- "*.py" | flake8-2.7 --diff
$

Not sure what went wrong. I'm on it!

@jowens
Copy link
Author

jowens commented Jul 27, 2017

Cool! @gfyoung what's next?

@jowens
Copy link
Author

jowens commented Aug 29, 2017

@gfyoung any chance of moving this forward? I'd prefer not to have a private build of Pandas for the rest of my life.

@gfyoung
Copy link
Member

gfyoung commented Aug 29, 2017

@jowens : Sorry about that! My inbox is very full most of the time, so I think I just missed your comment about a month ago.

Let's do this:

  • Rebase onto master (and fix the merge conflict) to see if tests still pass
  • Ping @TomAugspurger for further review of the PR

@jowens
Copy link
Author

jowens commented Aug 29, 2017

@gfyoung FWIW the merge conflict is only "whatsnew"

@jowens
Copy link
Author

jowens commented Aug 29, 2017

@gfyoung because I really don't want to get this wrong and have a commit that's gonna be enormous, can you sketch out how I "rebase onto master" properly?

@gfyoung
Copy link
Member

gfyoung commented Aug 29, 2017

@jowens: True, but it's been a month since you've run the tests on Travis, Appveyor, etc. I just want to make sure that tests still pass.

@gfyoung
Copy link
Member

gfyoung commented Aug 29, 2017

@gfyoung because I really don't want to get this wrong and have a commit that's gonna be enormous, can you sketch out how I "rebase onto master" properly?

I would recommend to try using GitHub's UI editor for resolving the merge conflict. If that doesn't work out, I can give you further suggestions, but they will be all local (and probably) less visual.

@gfyoung
Copy link
Member

gfyoung commented Aug 29, 2017

Don't worry about rebasing onto master yet. Let's make sure this PR is merge-able first.

@jowens
Copy link
Author

jowens commented Aug 29, 2017

@gfyoung OK, looks like it's running and conflict-free. Stay tuned!

@jowens
Copy link
Author

jowens commented Aug 29, 2017

@TomAugspurger can you help me take the next step with this PR? Thanks @gfyoung!

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 30, 2017 via email

@jowens
Copy link
Author

jowens commented Sep 6, 2017

@TomAugspurger @gfyoung could either one of you guys please take a look?

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add these lines back.

Copy link
Member

@gfyoung gfyoung Sep 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, your whatsnew has a merge conflict, so you definitely need to fix that.


Parameters
----------
rows : iterable of node-like
A list of row elements.
doc : tree-like
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As tree-like is not a Python data-type, we should re-specify this.

Copy link
Member

@gfyoung gfyoung Sep 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger : If you think "node-like" or "tree-like" is OK for this documentation, then we can ignore this comment (and ones similar to that). However, IMO we should have proper data types in the doc-strings if possible.

def _parse_td(self, obj):
"""Return the td elements from a row element.
def _equals_tag(self, obj, tag):
"""Returns whether an individual DOM node matches a tag

Parameters
----------
obj : node-like
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take advantage of this time to re-specify this as a proper data type if possible.


Parameters
----------
obj : node-like
A DOM node.

tag : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string --> str

columns : list of node-like
These are the elements of each row, i.e., the columns.
boolean
Does the object match tag 'tag'?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just say "boolean indicating the object matches the tag tag"

@jowens
Copy link
Author

jowens commented Sep 25, 2017

@TomAugspurger Next steps?

@jowens
Copy link
Author

jowens commented Oct 2, 2017

@TomAugspurger @gfyoung What can I do to move this forward?

@gfyoung
Copy link
Member

gfyoung commented Oct 2, 2017

@TomAugspurger : Can you take a look at this? I've already approved, and @jowens and I are awaiting your comments before proceeding.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall things look pretty good. I don't completely understand the logic in _expand_colspan_rowspan, and there were several lines that seem to be untested (perhaps they're dead code now?).

Could you attempt to refactor those lmaps to use list-comprehensions and chain? Could you add some more tests for

  • table with just a thead, but no body
  • table with just a tbody, but each row is a th

And cases where rowspan / colspan vary for each of those.

raise_with_traceback, binary_type)
from pandas import Series
from pandas.core.common import AbstractMethodError
from pandas.core.common import (AbstractMethodError, flatten)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the parens

defined by subclasses.
match : str or regular expression
The text to search for in the DOM tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove these blank lines between parameters


Returns
-------
tuple of (header, body, footer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this line (multiple returns are always tuples)

footer = self._parse_raw_tfoot(table)
def _build_table(self, table_html):
header, body, footer = self._parse_raw_thead_tbody_tfoot(table_html)
# the above "footer" actually produces a footer. The below "footer"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the plan here? Do we have a GH issue open for replacing the old with the new? i.e., what does the person reading the code 3 years from now need to know whether this has been resolved?

while all(self._equals_tag(t, 'th') for t in
self._extract_td(body_rows[-1])):
# this row should be a footer row, move it from body to footer
footer_rows.insert(0, body_rows.pop())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be untested.

# splitting criterion: if all tags within a row are th, it's part
# of the header/footer
while all(self._equals_tag(t, 'th') for t in
self._extract_td(body_rows[0])):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with a table of all "headers". Does the condition need to be while body_rows and all(...) (is it possible to get to this situation, and do you have a test for it)?

for col in extracted_row]
# expand cols using col_colspans
# maybe this can be done with a list comprehension, dunno
cols = list(zip(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is equivalent to

list(chain.from_iterable([text] * nc for text, nc in zip(cols_text, col_colspans)))

which I find a bit more readable. Can you confirm if that works? (import chain from itertools).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, reading a bit further, maybe not? But I think that's close. Could you post a sample of col_colspans, col_rowspans, and cols_texts? Are they all just flat lists of ints / strs?

return element.find_all('tr')

def _extract_th(self, element):
return element.find_all('th')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untested?

return table.xpath('.//tfoot')

def _parse_raw_thead(self, table):
expr = './/thead'
thead = table.xpath(expr)
res = []
if thead:
trs = self._parse_tr(thead[0])
trs = self._extract_tr(thead[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untested?

the header, otherwise the function attempts to find the header within
the body (by putting rows with only ``<th>`` elements into the header).

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a versionmodified, though it may not be necessary.

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

needs to be rebased

@jowens
Copy link
Author

jowens commented Nov 12, 2017

@TomAugspurger made a lot of good suggestions that I simply haven't had time to get to yet. Is there an impending deadline here?

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

@jowens nope. just pinging on older pr's. rebase / fixup when you can.

@ostrokach
Copy link

@jowens I just tried out this branch for extracting tables from PubMed Central XML files, and and it works much much better than the current read_html implementation (which was basically unusable)! Thank you for your work guys! Hope this can make it into the master branch soon :).

@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

can you git merge master resolve conflicts and push again (and move whatsnew to 0.23.0)

@jowens
Copy link
Author

jowens commented Feb 11, 2018

@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them.

@wkerzendorf
Copy link

great work! can't wait for it to get merged

@jowens
Copy link
Author

jowens commented Apr 19, 2018

I'll repeat that "@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them." :)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 19, 2018 via email

@cancan101
Copy link
Contributor

any updates on this PR?

@jowens
Copy link
Author

jowens commented Jun 13, 2018

I'll repeat again that "@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them." :)

@adamhooper
Copy link
Contributor

@jowens Hi! This feature is at the top of my backlog this week. So effectively, I have a full day where I can sit down and incorporate @TomAugspurger's suggestions. Would that be okay with you?

@jowens
Copy link
Author

jowens commented Jun 13, 2018

Yes PLEASE! I would be delighted (and so would the other folks in this thread).

@WillAyd
Copy link
Member

WillAyd commented Aug 17, 2018

Superseded by #21487

@WillAyd WillAyd closed this Aug 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: read_html to handle rowspan, colspan