Feature Request: expose full DOM nodes to converters in read_html #14608
Description
With #13461 bringing converters to read_html, I would be very interested in having the full DOM nodes exposed to them: the added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information).
The old behavior could be emulated by changing the default converter (currently None) to something like:
def default_converter(tag):
    return remove_whitespace(tag.text)
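As a rough, runnable illustration of such a default converter, here is a minimal sketch. It is not the pandas implementation: it assumes the standard library's xml.etree.ElementTree as a stand-in for the parser's DOM nodes, and both remove_whitespace and the sample cell are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def remove_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def default_converter(cell):
    """Emulate the text-only behavior: return the cell's visible text."""
    # itertext() yields the text of the node and of all its descendants,
    # which plays the role of tag.text in the snippet above.
    return remove_whitespace("".join(cell.itertext()))


# Hypothetical cell; a real parser would hand over its own node type.
cell = ET.fromstring("<td>\n  Euro  <a href='/x'>area</a>\n</td>")
print(default_converter(cell))
```

With this as the default, callers who pass no converters would still get plain text, while custom converters would see the whole node.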
Would this feature be of interest, or is there a reason to stick with the current parsing?
P.S.: I am new to pandas and Git(Hub), so please let me know if there is anything wrong with this post.
Edit:
The following code emulates the proposed behavior on an arbitrary example. In general, I would say that first_link, links (and whitespace) would be the most used converters and would be worth predefining, as suggested below.
Sometimes it is useful to parse input tags like <input id="hidObj_12546" value="1" type="hidden">, so we may also want input_id and input_value, or even img_alt. But if the goal is just to cover typical use cases, I would say that links and whitespace are sufficient, with maybe another one for documentation purposes.
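from __future__ import print_function

import re

import pandas as pa

def series_key(tag):
    # Extract the series key from the query string of the cell's first link.
    return re.search(r'KEY=(.*)', tag.a['href']).group(1)

def first_link(tag):
    # Return the target of the first link in the cell.
    return tag.a['href']

def default_conv(tag):
    # Plain cell text (whitespace should still be removed).
    return tag.text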
def changed_read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None,
                      attrs=None, parse_dates=False, tupleize_cols=False, thousands=',', encoding=None,
                      decimal='.', converters=None, na_values=None, keep_default_na=True):
    original_parse_raw_data = pa.io.html._HtmlFrameParser._parse_raw_data

    def new_parse_raw_data(self, rows):
        # Keep the raw cell nodes instead of extracting their text.
        return [list(self._parse_td(row)) for row in rows]

    # Temporarily monkey-patch the parser; restore it even if read_html raises.
    pa.io.html._HtmlFrameParser._parse_raw_data = new_parse_raw_data
    try:
        return pa.read_html(io, match, flavor, header, index_col, skiprows,
                            attrs, parse_dates, tupleize_cols, thousands, encoding,
                            decimal, converters, na_values, keep_default_na)
    finally:
        pa.io.html._HtmlFrameParser._parse_raw_data = original_parse_raw_data
converters = {0: default_conv, 1: series_key, 2: default_conv, 3: default_conv}

ecb_old = pa.read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                       attrs={'class': 'tableopenpage'})
ecb = changed_read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs={'class': 'tableopenpage'}, converters=converters)

print("current read_html\n")
print(ecb_old)
print("\nproposed read_html\n")
print(ecb)
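For readers who want to try the idea without network access or patching pandas internals, the sketch below applies per-column converters to whole cell nodes of a small inline table. It only illustrates the proposal under stated assumptions: xml.etree.ElementTree stands in for the real parser, parse_table is a hypothetical helper (not a pandas function), and the sample table and its URLs are invented.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample standing in for a fetched page.
SAMPLE = """
<table class="tableopenpage">
  <tr><td>GDP</td><td><a href="/quickview.do?KEY=ABC.1">series</a></td></tr>
  <tr><td>  HICP
  </td><td><a href="/quickview.do?KEY=DEF.2">series</a> <a href="/meta">info</a></td></tr>
</table>
"""


def whitespace(cell):
    """Cell text with whitespace collapsed (the text-only default)."""
    return re.sub(r"\s+", " ", "".join(cell.itertext())).strip()


def links(cell):
    """All link targets found anywhere inside the cell."""
    return [a.get("href") for a in cell.iter("a") if a.get("href")]


def first_link(cell):
    """First link target in the cell, or None if the cell has no link."""
    hrefs = links(cell)
    return hrefs[0] if hrefs else None


def parse_table(table, converters=None):
    """Hypothetical helper: pass each cell node to its column's converter."""
    converters = converters or {}
    rows = []
    for tr in table.iter("tr"):
        cells = [c for c in tr if c.tag in ("td", "th")]
        # Columns without an explicit converter fall back to plain text.
        rows.append([converters.get(i, whitespace)(c) for i, c in enumerate(cells)])
    return rows


table = ET.fromstring(SAMPLE.strip())
print(parse_table(table))                   # text only
print(parse_table(table, {1: first_link}))  # column 1 as link targets
```

Because the converter receives the node rather than a string, switching column 1 from first_link to links (or to an attribute reader such as input_value) requires no change to the parsing loop.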