Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: read_html to handle rowspan, colspan #17054

Closed
jowens opened this issue Jul 22, 2017 · 13 comments · Fixed by #21487
Closed

ENH: read_html to handle rowspan, colspan #17054

jowens opened this issue Jul 22, 2017 · 13 comments · Fixed by #21487
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Milestone

Comments

@jowens
Copy link

jowens commented Jul 22, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_html('https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html')[0]

This has complex table headings:

annual_statistical_supplement__2015_-_beneficiary_families_with_oasdi_benefits_in_current-payment_status__5_h_

read_html output begins with:

                                                 Year                            Retired-worker families        Survivor families             Disabled-worker families                                Unnamed: 4_level_0                              Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Unnamed: 10_level_0 Unnamed: 11_level_0  \
                                          Worker only                                  Worker and wife?a  Non-disabled widow only        Widowed mother or father and?                                       Worker only                            Worker, wife,?b and?  Worker and spouse Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
                                                  All                                                Men                    Women                              1?child                                        2?children                              3 or more children                All                Men              Women            1?child  2 or more children Unnamed: 11_level_2
0                                                 NaN                                 Number?(thousands)                      NaN                                  NaN                                               NaN                                             NaN                NaN                NaN                NaN                NaN                 NaN                 NaN
1                                                1945                                                416                      338                                   78                                               181                                              95              86.00              48.00              24.00              .?.?.               .?.?.               .?.?.

(row 0 of the output is probably something one would have to manually eliminate)

Problem description

For HTML headings with rowspan and colspan elements, read_html has undesirable behavior. Basically read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.

Ample discussion here about the analogous pandas+Excel test case: #4679

Relevant web discussions:

This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.

Expected Output

Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the n th <th> entry in the table row regardless of the settings of rowspan/colspan.

It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I'd be happy to know how to do it).

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.US-ASCII LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.3.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Copy link
Contributor

Thanks for the detailed issue! I think handling rowspan and colspan correctly would be a welcome enhancement, agree it could basically work like Excel, either tuple-izing or creating a MultiIndex. Appreciate a PR if you're interested.

duplicate of #14267, but I'll close that one.

@chris-b1 chris-b1 marked this as a duplicate of #14267 Jul 23, 2017
@chris-b1 chris-b1 added Difficulty Intermediate Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jul 23, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Jul 23, 2017
@jowens
Copy link
Author

jowens commented Jul 23, 2017

@chris-b1 my last effort to provide a PR was pretty much a debacle, so I'm probably not your guy. That being said, since this does seem to be a topic of interest, a little guidance as to how it could be done would help either me or anyone else provide a PR (e.g.: "should probably start with this function"). I don't actually know if this is something that should be "fixed" in pandas or through pandas's setup of the underlying parser(s).

@chris-b1
Copy link
Contributor

chris-b1 commented Jul 23, 2017

I haven't done anything with the read_html code, but my understanding is it works like excel, with 3 overall steps:

  1. Read from external data source w/ third party package
  2. (Bulk of logic) Convert that data into a list-of-lists data structure representing the rows
  3. That list-of-lists is passed to TextParser which is generic logic that actually converts the data into a DataFrame

In this case, what most likely needs done is modifying step 2 in the presence of rowspan/colspan, adjusting the data. Can look to read_excel for inspiration, or a simple example below - key things are the padding of data and header keyword. (index_col works the same for index)

In [8]: from pandas.io.parsers import TextParser

In [14]: df = TextParser([
    ...:     ['a', 'a', 'b'],
    ...:     ['sub1', 'sub2', 'sub2'],
    ...:     [1, 2, 3],
    ...:     [4, 5, 6],
    ...:     ],
    ...:     header=[0, 1]).read()

In [16]: df
Out[16]: 
     a         b
  sub1 sub2 sub2
0    1    2    3
1    4    5    6

In [17]: df.columns
Out[17]: 
MultiIndex(levels=[['a', 'b'], ['sub1', 'sub2']],
           labels=[[0, 0, 1], [0, 1, 1]])

@jowens
Copy link
Author

jowens commented Jul 23, 2017

FYI: All relevant logic appears to be in io/html.py in the function _HtmlFrameParser:_parse_raw_thead; it does not rely on the parser chosen.

@jowens
Copy link
Author

jowens commented Jul 23, 2017

... although there is no current capability for the parser to get attributes (e.g., rowspan, colspan) from elements, so that must be added. (There's currently a text_getter that returns a string; we need an analogous attrs_getter that returns a dict with keys=attributes, values=attribute_values.)

@jowens
Copy link
Author

jowens commented Jul 24, 2017

@chris-b1 would you mind eyeballing the following output for the 4 tables on this web page: https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html? This seems to me to be the right pieces to pass to TextParser (as long as I'm returning this from _parse_raw_thead, everything else ought to just work fine):

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Survivor families', u'Survivor families', u'Survivor families', u'Survivor families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families'], [u'Year', u'Worker only', u'Worker only', u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Worker only', u'Worker only', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'Year', u'All', u'Men', u'Women', u'Worker and wife\xa0a', u'Non-disabled widow only', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Worker and spouse']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Family group', u'Families', u'Beneficiaries', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker only', u'Disabled worker, wife, and\u2014', u'Disabled worker, wife, and\u2014'], [u'Monthly family benefit\xa0a (dollars)', u'Men', u'Women', u'Retired worker and wife', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Children only', u'Children only', u'Children only', u'Widow only', u'Widow only'], [u'Monthly family benefit (dollars)', u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]

@jowens
Copy link
Author

jowens commented Jul 24, 2017

Here's the current output (from trunk).

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Survivor families', u'Disabled-worker families'], [u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'All', u'Men', u'Women', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Families', u'Beneficiaries']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker, wife, and\u2014'], [u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Children only', u'Widow only'], [u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]

@chris-b1
Copy link
Contributor

Yeah, at a quick glance that's looking good!

@jowens
Copy link
Author

jowens commented Jul 24, 2017

There is a larger structural problem with the code in that currently, the parsing is divided into three pieces—parse_thead, parse_tbody, and parse_tfoot, each of which has its own custom logic. My current code is focused in parse_thead, where I thought it would be most relevant. However, (a) rowspan and colspan certainly can appear in the body and foot and (b) Wikipedia tables don't have a <thead> at all and so everything gets dumped into the body. So I think—lacking global knowledge about doing a big refactoring like this—that it might be better to have one chunk of code that does all parsing (hopefully in the generic parser code, not in the parser-specific parser code) and that special cases for header and footer might be in that one chunk of code. But this sort of refactoring is likely beyond what I could do well.

cc: some of the folks who have recently edited this file for comment/advice: @jreback @brianhuey @gte620v @jorisvandenbossche @hnykda @mjsu @cpcloud

@jowens
Copy link
Author

jowens commented Jul 24, 2017

(For posterity: A lot of the reason that I see there's different pieces for head, body, and foot is basically for flexibility on HTML tables: there might or might not be a head or foot, the body might or might not be declared with <tbody>, etc. (For Wikipedia tables, no <thead> but rows with <th> and not <td> means we should probably interpret those rows as header rows.) But, there's no documentation as far as I can tell to say, basically, these are the different styles of tables that pandas supports. The conditionals in the parse routines aren't commented so I'm just guessing on which different table behaviors they're handling. Hopefully the current test cases are comprehensive enough to cover 'em.)

@gfyoung
Copy link
Member

gfyoung commented Jul 25, 2017

xref discussion in #17073 : it will be addressed when this issue gets resolved.

@gfyoung
Copy link
Member

gfyoung commented Jul 26, 2017

From #17074:

@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It's just expecting the parser to throw an error? The output from the test code (where it's failing) is at the bottom. It's a pretty weird HTML file.

computer_sales_page_html

Now, if I call it with my current in-progress code as dfs = pd.read_html('computer_sales_page.html', header=[0, 1]), I see:

Index([         (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1'),
                (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1'),
                     (u'Three months ended April?30', u'2013'),
              u'(u'Three months ended April\xa030', '2013').1',
       (u'Three months ended April?30', u'Unnamed: 4_level_1'),
                     (u'Three months ended April?30', u'2012'),
              u'(u'Three months ended April\xa030', '2012').1',
                (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1'),
                       (u'Six months ended April?30', u'2013'),
                u'(u'Six months ended April\xa030', '2013').1',
        (u'Six months ended April?30', u'Unnamed: 10_level_1'),
                       (u'Six months ended April?30', u'2012'),
                u'(u'Six months ended April\xa030', '2012').1',
              (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1')],
      dtype='object')

and if I call it without a header argument (dfs = pd.read_html('computer_sales_page.html')), I see:

Index([   (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1', u'Unnamed: 0_level_2'),
          (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1', u'Unnamed: 1_level_2'),
                      (u'Three months ended April?30', u'2013', u'In millions'),
                u'(u'Three months ended April\xa030', '2013', 'In millions').1',
        (u'Three months ended April?30', u'Unnamed: 4_level_1', u'In millions'),
                      (u'Three months ended April?30', u'2012', u'In millions'),
                u'(u'Three months ended April\xa030', '2012', 'In millions').1',
                 (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2013', u'In millions'),
                  u'(u'Six months ended April\xa030', '2013', 'In millions').1',
         (u'Six months ended April?30', u'Unnamed: 10_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2012', u'In millions'),
                  u'(u'Six months ended April\xa030', '2012', 'In millions').1',
       (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1', u'Unnamed: 13_level_2')],
      dtype='object')

These seem like OK outputs to me. I'm not sure what the original test is supposed to show. I think I'd like to just delete the test if it's supposed to fail (and no longer fails).

____________________ TestReadHtml.test_computer_sales_page _____________________

self = <pandas.tests.io.test_html.TestReadHtml object at 0x1120aa390>

    def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        with tm.assert_raises_regex(ParserError,
                                    r"Passed header=\[0,1\] are "
                                    r"too many rows for this "
                                    r"multi_index of columns"):
>           self.read_html(data, header=[0, 1])

pandas/tests/io/test_html.py:778:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.util.testing._AssertRaisesContextmanager object at 0x1120aab50>
exc_type = None, exc_value = None, trace_back = None

    def __exit__(self, exc_type, exc_value, trace_back):
        expected = self.exception

        if not exc_type:
            exp_name = getattr(expected, "__name__", str(expected))
>           raise AssertionError("{0} not raised.".format(exp_name))
E           AssertionError: ParserError not raised.

pandas/util/testing.py:2491: AssertionError

@chris-b1
Copy link
Contributor

@jowens - can you open a PR with your WIP code? Easier to answer these type of questions that way.

adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 14, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 26, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 27, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
@jreback jreback modified the milestones: Next Major Release, 0.23.2 Jul 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
4 participants