
Feature Request: expose full DOM nodes to converters in read_html #14608

Open
@Amaelb

Description

With #13461 bringing converters to read_html, I would be very interested in having the full DOM nodes exposed to them. The added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information).

The old behavior could be emulated by changing the default converter (currently None) to something like

def default_converter(tag):
    # remove_whitespace stands for any whitespace-stripping helper
    return remove_whitespace(tag.text)
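To make the idea concrete, here is a minimal runnable sketch of such a default converter. It is an illustration only: the standard library's xml.etree.ElementTree is used as a stand-in for the parser's node type (with the bs4 or lxml flavors the node would be a different object), and the body of remove_whitespace is an assumption about what that helper would do.

```python
import xml.etree.ElementTree as ET


def remove_whitespace(text):
    """Collapse runs of whitespace and trim both ends (assumed behavior)."""
    return " ".join(text.split())


def default_converter(cell):
    """Return only the cell's text, approximating a text-only parse."""
    # itertext() yields the text of the node and all descendants in document order.
    return remove_whitespace("".join(cell.itertext()))


# A hand-written, well-formed cell used purely for demonstration.
cell = ET.fromstring('<td>\n  <a href="/series?KEY=ABC">Euro   area</a>\n  GDP </td>')
print(default_converter(cell))
```

With a default like this, callers who pass no converters would keep getting plain text, while custom converters could read attributes from the same node.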

Would this feature be of interest, or is there a reason to stick with the current parsing?

P.S.: I am new to pandas and git(hub), so please let me know if there is anything wrong with this post.

Edit:

The following code emulates the proposed behavior on an arbitrary example. In general, I would say that first_link, links (and whitespace) would be the most used converters and would be worth predefining, as suggested below.

Sometimes it is useful to parse input tags like <input id="hidObj_12546" value="1" type="hidden">, so we may also want input_id and input_value, or even img_alt. But if the goal is just to cover typical use cases, I would say that links and whitespace are sufficient, with maybe one more for documentation purposes.
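As a rough sketch of what the predefined converters named above might look like, the block below implements first_link, links, input_value and img_alt against a single cell node. None of these exist in pandas; the names come from this post and the bodies are assumptions. xml.etree.ElementTree stands in for the real node type so that the example runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def _first_attr(cell, tag, attr):
    """Return `attr` of the first descendant `tag` element, or None."""
    node = cell.find(".//" + tag)
    return node.get(attr) if node is not None else None


def first_link(cell):
    """href of the first <a> in the cell, or None if there is no link."""
    return _first_attr(cell, "a", "href")


def links(cell):
    """href of every <a> in the cell, in document order."""
    return [a.get("href") for a in cell.iter("a") if a.get("href") is not None]


def input_value(cell):
    """value attribute of the first <input> in the cell, or None."""
    return _first_attr(cell, "input", "value")


def img_alt(cell):
    """alt attribute of the first <img> in the cell, or None."""
    return _first_attr(cell, "img", "alt")


# Hand-written sample cell; the markup must be well-formed for ElementTree.
cell = ET.fromstring(
    '<td><a href="/a">A</a> <a href="/b">B</a>'
    '<input id="hidObj_12546" value="1" type="hidden"/></td>'
)
print(first_link(cell), links(cell), input_value(cell), img_alt(cell))
```

Returning None for a missing element (rather than raising) is one possible design choice; it would let a column with some link-free cells still parse.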

from __future__ import (print_function)
import pandas as pa
import re


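def series_key(tag):
    # Extract the series key from the href of the cell's first link.
    return re.search(r'KEY=(.*)', tag.a['href']).group(1)


def first_link(tag):
    # Return the href of the cell's first link.
    return tag.a['href']


def default_conv(tag):
    # Text only, with runs of whitespace collapsed.
    return ' '.join(tag.text.split())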


def changed_read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None,
                        attrs=None, parse_dates=False, tupleize_cols=False, thousands=',', encoding=None,
                        decimal='.', converters=None, na_values=None, keep_default_na=True):

    original_parse_raw_data = pa.io.html._HtmlFrameParser._parse_raw_data

    def new_parse_raw_data(self, rows):
        # Keep the raw cell nodes instead of reducing them to text.
        return [list(self._parse_td(row)) for row in rows]

    # Temporarily patch the private parser method; restore it even on error.
    pa.io.html._HtmlFrameParser._parse_raw_data = new_parse_raw_data
    try:
        df = pa.read_html(io, match, flavor, header, index_col, skiprows,
                        attrs, parse_dates, tupleize_cols, thousands, encoding,
                        decimal, converters, na_values, keep_default_na)
    finally:
        pa.io.html._HtmlFrameParser._parse_raw_data = original_parse_raw_data
    return df


converters = {0: default_conv, 1: series_key, 2: default_conv, 3: default_conv}


ecb_old = pa.read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs={'class': 'tableopenpage'})
ecb = changed_read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs={'class': 'tableopenpage'}, converters=converters)

print("current read_html\n")
print(ecb_old)
print("\nproposed read_html\n")
print(ecb)
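The core of the proposal, stripped of the monkeypatching, is that each converter receives the cell node for its column and that cells without a converter fall back to plain text. The sketch below shows that dispatch on a small hand-written table. It does not use pandas; parse_rows, cell_text and the sample markup are hypothetical, and xml.etree.ElementTree again stands in for the real parser nodes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample table (well-formed so that ElementTree can parse it).
TABLE = """
<table>
  <tr><td> GDP </td><td><a href="/view?KEY=X.1">chart</a></td></tr>
  <tr><td> HICP </td><td><a href="/view?KEY=X.2">chart</a></td></tr>
</table>
""".strip()


def cell_text(cell):
    """Default converter: the cell's text with whitespace collapsed."""
    return " ".join("".join(cell.itertext()).split())


def first_link(cell):
    """href of the first <a> in the cell, or None."""
    anchor = cell.find(".//a")
    return anchor.get("href") if anchor is not None else None


def parse_rows(table, converters=None):
    """Apply converters[i] to the i-th cell node of each row (text by default)."""
    converters = converters or {}
    return [
        [converters.get(i, cell_text)(cell) for i, cell in enumerate(tr.findall("td"))]
        for tr in table.iter("tr")
    ]


table = ET.fromstring(TABLE)
text_only = parse_rows(table)
with_links = parse_rows(table, converters={1: first_link})
print(text_only)
print(with_links)
```

Because the fallback is the text converter, the call without converters gives the same rows a text-only parse would, which is how the old behavior would be preserved.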


Labels: Enhancement, IO HTML
