Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read from multiple <tbody> within a <table> #20891

Merged
merged 2 commits into from
May 1, 2018

Conversation

adamhooper
Copy link
Contributor

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change

return self._parse_raw_data(res)
raw_data = []

if len(tbodies) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PEP8 compliance use the boolean-ness of the sequence, so just if tbodies:

</tbody>
</table>'''
expected = DataFrame({'A': [1, 3], 'B': [2, 4]})
result = self.read_html(StringIO(data))
Copy link
Member

@WillAyd WillAyd May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the indexer to this line and make the subsequent line just tm.assert_frame_equal(result, expected). Obviously doesn't change result just reads better with the rest of the tests

@codecov
Copy link

codecov bot commented May 1, 2018

Codecov Report

Merging #20891 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20891      +/-   ##
==========================================
+ Coverage   91.78%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49337    49338       +1     
==========================================
+ Hits        45285    45286       +1     
  Misses       4052     4052
Flag Coverage Δ
#multiple 90.17% <100%> (ø) ⬆️
#single 41.93% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/html.py 88.82% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f799916...a881c60. Read the comment docs.

@jreback jreback added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels May 1, 2018
@jreback jreback added this to the 0.23.0 milestone May 1, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small changes. ping on green.

@@ -396,6 +396,34 @@ def test_empty_tables(self):
res2 = self.read_html(StringIO(data2))
assert_framelist_equal(res1, res2)

def test_multiple_tbody(self):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here, you can use use # and not a triple-quotes here for the description of the test

@TomAugspurger
Copy link
Contributor

@adamhooper do you know if / when you'll be able to update? I'm cutting the RC soon (hopefully a couple hours from now).

If you're busy now I can push the changes and then we'll get this merged before the RC.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging on green

@TomAugspurger TomAugspurger mentioned this pull request May 1, 2018
71 tasks
@TomAugspurger
Copy link
Contributor

Test failures from travis were unrelated. Fixing in #20906

@TomAugspurger
Copy link
Contributor

Failures should be fixed in masater.

@TomAugspurger TomAugspurger merged commit 926f241 into pandas-dev:master May 1, 2018
@TomAugspurger
Copy link
Contributor

Thanks @adamhooper!

@adamhooper
Copy link
Contributor Author

Er ... finally I'm available to fix things. You beat me to it -- thanks!

@adamhooper adamhooper deleted the issue-20690 branch May 1, 2018 19:05
@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 1, 2018 via email

adamhooper added a commit to adamhooper/pandas that referenced this pull request Jun 14, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
adamhooper added a commit to adamhooper/pandas that referenced this pull request Jun 26, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
adamhooper added a commit to adamhooper/pandas that referenced this pull request Jun 27, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: handle multiple tbody in read_html()
4 participants