ENH: read_html to handle rowspan, colspan #17054
Comments
Thanks for the detailed issue! I think this is a duplicate of #14267, but I'll close that one. |
@chris-b1 my last effort to provide a PR was pretty much a debacle, so I'm probably not your guy. That being said, since this does seem to be a topic of interest, a little guidance as to how it could be done would help either me or anyone else provide a PR (e.g.: "should probably start with this function"). I don't actually know if this is something that should be "fixed" in pandas or through pandas's setup of the underlying parser(s). |
I haven't done anything with the
In this case, what most likely needs done is modifying step 2 in the presence of
In [8]: from pandas.io.parsers import TextParser
In [14]: df = TextParser([
...: ['a', 'a', 'b'],
...: ['sub1', 'sub2', 'sub2'],
...: [1, 2, 3],
...: [4, 5, 6],
...: ],
...: header=[0, 1]).read()
In [16]: df
Out[16]:
     a         b
  sub1 sub2 sub2
0    1    2    3
1    4    5    6
In [17]: df.columns
Out[17]:
MultiIndex(levels=[['a', 'b'], ['sub1', 'sub2']],
labels=[[0, 0, 1], [0, 1, 1]]) |
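To illustrate the kind of input the `TextParser` call above expects, here is a minimal sketch of turning header cells that carry a colspan into the repeated labels shown in that example. It is not pandas code: `expand_colspans` and the `(text, colspan)` cell tuples are hypothetical names chosen for illustration, and rowspan is deliberately ignored here.

```python
def expand_colspans(row):
    """Repeat each (text, colspan) header cell `colspan` times."""
    expanded = []
    for text, colspan in row:
        expanded.extend([text] * colspan)
    return expanded


# Hypothetical header: "a" spans two columns, "b" spans one.
header_rows = [
    [("a", 2), ("b", 1)],
    [("sub1", 1), ("sub2", 1), ("sub2", 1)],
]

# Same shape as the first two lists passed to TextParser above.
expanded = [expand_colspans(row) for row in header_rows]

# One tuple per column: the keys a two-level column index would hold.
column_keys = list(zip(*expanded))

print(expanded)
print(column_keys)
```

Once every header row has the same width, each column can be paired with all the labels above it, which is presumably what `header=[0, 1]` relies on.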
FYI: All relevant logic appears to be in |
... although there is no current capability for the parser to get attributes (e.g., |
@chris-b1 would you mind eyeballing the following output for the 4 tables on this web page: https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html? This seems to me to be the right pieces to pass to
|
Here's the current output (from trunk).
|
Yeah, at a quick glance that's looking good! |
There is a larger structural problem with the code in that currently, the parsing is divided into three pieces— cc: some of the folks who have recently edited this file for comment/advice: @jreback @brianhuey @gte620v @jorisvandenbossche @hnykda @mjsu @cpcloud |
(For posterity: A lot of the reason that I see there's different pieces for head, body, and foot is basically for flexibility on HTML tables: there might or might not be a head or foot, the body might or might not be declared with |
xref discussion in #17073 : it will be addressed when this issue gets resolved. |
From #17074: @chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It's just expecting the parser to throw an error? The output from the test code (where it's failing) is at the bottom. It's a pretty weird HTML file. Now, if I call it with my current in-progress code as
and if I call it without a header argument (
These seem like OK outputs to me. I'm not sure what the original test is supposed to show. I think I'd like to just delete the test if it's supposed to fail (and no longer fails).
|
@jowens - can you open a PR with your WIP code? Easier to answer these types of questions that way. |
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks:
* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
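The idea that virtual cells created by rowspan/colspan are copies of the real cell's text can be sketched as follows. This is an illustrative stand-alone version, not the code from the pull request: `expand_spans`, the `(text, rowspan, colspan)` tuples, and the sample table are all assumptions, and overlapping spans in malformed HTML are not handled.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of text.

    Every virtual cell produced by a span is a copy of the real cell's text.
    """
    grid = []
    pending = {}  # column index -> (text, number of rows still to fill)
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in pending:
                # This column is covered by a rowspan from an earlier row.
                text, remaining = pending.pop(col)
                out.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, rowspan, colspan = cell
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


# Hypothetical table: "Year" spans two header rows, "Sales" spans two columns.
rows = [
    [("Year", 2, 1), ("Sales", 1, 2)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("2015", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
grid = expand_spans(rows)
for line in grid:
    print(line)
```

Because "Year" is copied into the second header row rather than left empty, no header position is blank, which is the situation that would otherwise lead to "Unnamed: ..." labels.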
Code Sample, a copy-pastable example if possible
This has complex table headings:
read_html output begins with: (row 0 of the output is probably something one would have to manually eliminate)
Problem description
For HTML headings with rowspan and colspan attributes, read_html has undesirable behavior. Basically read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.
Ample discussion here about the analogous pandas+Excel test case: #4679
Relevant web discussions:
This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.
Expected Output
Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the nth <th> entry in the table row regardless of the settings of rowspan/colspan.
It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I'd be happy to know how to do it).
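As a small illustration of the mismatch described above, the sketch below reads rowspan/colspan attributes with Python's standard-library `html.parser` and contrasts them with a left-packed view of the same rows. It does not use pandas, lxml, or bs4; `SpanCellParser` and the sample table are hypothetical, and non-numeric span values are not handled.

```python
from html.parser import HTMLParser


class SpanCellParser(HTMLParser):
    """Collect (text, rowspan, colspan) for every <th>/<td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            attrs = dict(attrs)
            # A missing or valueless attribute counts as a span of 1.
            rowspan = int(attrs.get("rowspan") or 1)
            colspan = int(attrs.get("colspan") or 1)
            self._cell = [[], rowspan, colspan]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


html = """<table>
<tr><th rowspan="2">Year</th><th colspan="2">Sales</th></tr>
<tr><th>Q1</th><th>Q2</th></tr>
<tr><td>2015</td><td>10</td><td>12</td></tr>
</table>"""

parser = SpanCellParser()
parser.feed(html)

# Left-packed view: spans ignored, so "Q1" lands under "Year".
left_packed = [[text for text, _, _ in row] for row in parser.rows]
print(parser.rows)
print(left_packed)
```

In the left-packed view the second header row has only two entries, so "Q1" ends up in column 0 even though it sits under "Sales" in the rendered table; the span values collected alongside each cell are what a parser would need to restore the association.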
Output of pd.show_versions()
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.3.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None