-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestReadHtml.test_computer_sales_page - what's it doing? #17074
Comments
Are you referring to the code you're working on for #17054? |
gfyoung
added
IO HTML
read_html, to_html, Styler.apply, Styler.applymap
Testing
pandas testing functions or related to the test suite
labels
Jul 26, 2017
@jowens : For organization, it's preferable that you post such questions in the issue you're tackling so that all of the discussion regarding work on that issue is in one place. I will close this but will copy and paste your description into that issue. |
adamhooper
added a commit
to adamhooper/pandas
that referenced
this issue
Jun 14, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
5 tasks
adamhooper
added a commit
to adamhooper/pandas
that referenced
this issue
Jun 26, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
adamhooper
added a commit
to adamhooper/pandas
that referenced
this issue
Jun 27, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It's just expecting the parser to throw an error? The output from the test code (where it's failing) is at the bottom. It's a pretty weird HTML file.
Now, if I call it with my current in-progress code as
dfs = pd.read_html('computer_sales_page.html', header=[0, 1])
, I see:and if I call it without a header argument (
dfs = pd.read_html('computer_sales_page.html')
), I see:These seem like OK outputs to me. I'm not sure what the original test is supposed to show. I think I'd like to just delete the test if it's supposed to fail (and no longer fails).
The text was updated successfully, but these errors were encountered: