Skip to content

Commit

Permalink
Handle colspan and rowspan
Browse files Browse the repository at this point in the history
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
  • Loading branch information
adamhooper committed Jun 26, 2018
1 parent dbd102c commit c96814d
Show file tree
Hide file tree
Showing 3 changed files with 411 additions and 184 deletions.
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ Other Enhancements
- :func:`Series.mode` and :func:`DataFrame.mode` now support the ``dropna`` parameter which can be used to specify whether NaN/NaT values should be considered (:issue:`17534`)
- :func:`to_csv` now supports ``compression`` keyword when a file handle is passed. (:issue:`21227`)
- :meth:`Index.droplevel` is now implemented also for flat indexes, for compatibility with :class:`MultiIndex` (:issue:`21115`)
- :func:`read_html` handles colspan and rowspan arguments and attempts to infer a header if the header is not explicitly specified (:issue:`17054`)


.. _whatsnew_0240.api_breaking:
Expand Down Expand Up @@ -217,7 +218,7 @@ MultiIndex
I/O
^^^

-
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
-
-

Expand Down
Loading

1 comment on commit c96814d

@jowens
Copy link

@jowens jowens commented on c96814d Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're a hero, @adamhooper

Please sign in to comment.