-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: GH17054: read_html() handles rowspan/colspan and infers headers #17089
Conversation
@gfyoung can you help me reconcile failing tests in the CI checks vs. clean tests on my machine?
|
@jowens : Judging from your Travis build output, here's what I see:
If you can, then continue developing in the Python 3.x environment to get tests passing. If you can't replicate, then that's going to be tricky to debug, but we can address that if we get there. |
Can you point me to the "style-check error"? |
The error can be found here. BTW, you can also see the error yourself by running the style-check command in the PR description:
|
Yeah, I did that before checking in:
Not sure what went wrong. I'm on it! |
Cool! @gfyoung what's next? |
@gfyoung any chance of moving this forward? I'd prefer not to have a private build of Pandas for the rest of my life. |
@jowens : Sorry about that! My inbox is very full most of the time, so I think I just missed your comment about a month ago. Let's do this:
|
@gfyoung FWIW the merge conflict is only "whatsnew" |
@gfyoung because I really don't want to get this wrong and have a commit that's gonna be enormous, can you sketch out how I "rebase onto master" properly? |
@jowens: True, but it's been a month since you've run the tests on Travis, Appveyor, etc. I just want to make sure that tests still pass. |
I would recommend to try using GitHub's UI editor for resolving the merge conflict. If that doesn't work out, I can give you further suggestions, but they will be all local (and probably) less visual. |
Don't worry about rebasing onto master yet. Let's make sure this PR is merge-able first. |
@gfyoung OK, looks like it's running and conflict-free. Stay tuned! |
@TomAugspurger can you help me take the next step with this PR? Thanks @gfyoung! |
I won't have a chance to review this in detail for a while, so if you think
it's good then feel free to merge @gfyoung.
…On Tue, Aug 29, 2017 at 3:14 PM, John Owens ***@***.***> wrote:
@TomAugspurger <https://github.com/tomaugspurger> can you help me take
the next step with this PR? Thanks @gfyoung <https://github.com/gfyoung>!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#17089 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIumSd6yu0bcha3VfFOCHko-9Fx8Aks5sdHErgaJpZM4Okn1p>
.
|
@TomAugspurger @gfyoung could either one of you guys please take a look? |
doc/source/whatsnew/v0.21.0.txt
Outdated
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. | ||
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`) | ||
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`). | ||
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`). | ||
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add these lines back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, your whatsnew
has a merge conflict, so you definitely need to fix that.
pandas/io/html.py
Outdated
|
||
Parameters | ||
---------- | ||
rows : iterable of node-like | ||
A list of row elements. | ||
doc : tree-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As tree-like is not a Python data-type, we should re-specify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger : If you think "node-like" or "tree-like" is OK for this documentation, then we can ignore this comment (and ones similar to that). However, IMO we should have proper data types in the doc-strings if possible.
pandas/io/html.py
Outdated
def _parse_td(self, obj): | ||
"""Return the td elements from a row element. | ||
def _equals_tag(self, obj, tag): | ||
"""Returns whether an individual DOM node matches a tag | ||
|
||
Parameters | ||
---------- | ||
obj : node-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's take advantage of this time to re-specify this as a proper data type if possible.
pandas/io/html.py
Outdated
|
||
Parameters | ||
---------- | ||
obj : node-like | ||
A DOM node. | ||
|
||
tag : string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string --> str
pandas/io/html.py
Outdated
columns : list of node-like | ||
These are the elements of each row, i.e., the columns. | ||
boolean | ||
Does the object match tag 'tag'? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say "boolean indicating the object matches the tag tag
"
…ndas into read_html_with_colspan_rowspan
@TomAugspurger Next steps? |
@TomAugspurger @gfyoung What can I do to move this forward? |
@TomAugspurger : Can you take a look at this? I've already approved, and @jowens and I are awaiting your comments before proceeding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall things look pretty good. I don't completely understand the logic in _expand_colspan_rowspan
, and there were several lines that seem to be untested (perhaps they're dead code now?).
Could you attempt to refactor those lmaps
to use list-comprehensions and chain
? Could you add some more tests for
- table with just a
thead
, but no body - table with just a
tbody
, but each row is ath
And cases where rowspan
/ colspan
vary for each of those.
raise_with_traceback, binary_type) | ||
from pandas import Series | ||
from pandas.core.common import AbstractMethodError | ||
from pandas.core.common import (AbstractMethodError, flatten) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the parens
defined by subclasses. | ||
match : str or regular expression | ||
The text to search for in the DOM tree. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove these blank lines between parameters
|
||
Returns | ||
------- | ||
tuple of (header, body, footer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this line (multiple returns are always tuples)
footer = self._parse_raw_tfoot(table) | ||
def _build_table(self, table_html): | ||
header, body, footer = self._parse_raw_thead_tbody_tfoot(table_html) | ||
# the above "footer" actually produces a footer. The below "footer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the plan here? Do we have a GH issue open for replacing the old with the new? i.e., what does the person reading the code 3 years from now need to know whether this has been resolved?
while all(self._equals_tag(t, 'th') for t in | ||
self._extract_td(body_rows[-1])): | ||
# this row should be a footer row, move it from body to footer | ||
footer_rows.insert(0, body_rows.pop()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be untested.
# splitting criterion: if all tags within a row are th, it's part | ||
# of the header/footer | ||
while all(self._equals_tag(t, 'th') for t in | ||
self._extract_td(body_rows[0])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with a table of all "headers". Does the condition need to be while body_rows and all(...)
(is it possible to get to this situation, and do you have a test for it)?
for col in extracted_row] | ||
# expand cols using col_colspans | ||
# maybe this can be done with a list comprehension, dunno | ||
cols = list(zip( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is equivalent to
list(chain.from_iterable([text] * nc for text, nc in zip(cols_text, col_colspans)))
which I find a bit more readable. Can you confirm if that works? (import chain
from itertools).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, reading a bit further, maybe not? But I think that's close. Could you post a sample of col_colspans
, col_rowspans
, and cols_texts
? Are they all just flat lists of ints / strs?
return element.find_all('tr') | ||
|
||
def _extract_th(self, element): | ||
return element.find_all('th') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Untested?
return table.xpath('.//tfoot') | ||
|
||
def _parse_raw_thead(self, table): | ||
expr = './/thead' | ||
thead = table.xpath(expr) | ||
res = [] | ||
if thead: | ||
trs = self._parse_tr(thead[0]) | ||
trs = self._extract_tr(thead[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Untested?
the header, otherwise the function attempts to find the header within | ||
the body (by putting rows with only ``<th>`` elements into the header). | ||
|
||
.. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a versionmodified
, though it may not be necessary.
needs to be rebased |
@TomAugspurger made a lot of good suggestions that I simply haven't had time to get to yet. Is there an impending deadline here? |
@jowens nope. just pinging on older pr's. rebase / fixup when you can. |
@jowens I just tried out this branch for extracting tables from PubMed Central XML files, and and it works much much better than the current |
can you |
@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them. |
great work! can't wait for it to get merged |
I'll repeat that "@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them." :) |
No rush, we're patient :)
…On Thu, Apr 19, 2018 at 1:31 PM, John Owens ***@***.***> wrote:
I'll repeat that ***@***.*** <https://github.com/TomAugspurger>'s
suggestions are still all good ones, but I haven't had a full day where I
can sit down and do them." :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#17089 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIl6-no2ugOPI4Jct6JpVcLw7hqRyks5tqNgVgaJpZM4Okn1p>
.
|
any updates on this PR? |
I'll repeat again that "@TomAugspurger's suggestions are still all good ones, but I haven't had a full day where I can sit down and do them." :) |
@jowens Hi! This feature is at the top of my backlog this week. So effectively, I have a full day where I can sit down and incorporate @TomAugspurger's suggestions. Would that be okay with you? |
Yes PLEASE! I would be delighted (and so would the other folks in this thread). |
Superseded by #21487 |
git diff upstream/master -u -- "*.py" | flake8 --diff