Feature Request: expose full DOM nodes to converters in html_read

With #13461 bringing `converters` to `read_html`, I would be very interested in having the full DOM nodes exposed to them: the added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information).

The old behavior could be emulated by changing the default converter (currently `None`) to something like
```
def default_converter(tag):
    return remove_whitespace(tag.text)
```
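
To make the idea concrete, here is a minimal runnable sketch of such a default converter. The helper `remove_whitespace` is hypothetical (it is my guess at what whitespace removal should mean here, not pandas' own implementation), and the `SimpleNamespace` object only stands in for a parsed cell node that exposes a `.text` attribute.

```python
from types import SimpleNamespace


def remove_whitespace(text):
    # Hypothetical helper: collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def default_converter(tag):
    # Relies only on the node exposing a .text attribute.
    return remove_whitespace(tag.text)


# Stand-in for a parsed <td> node; a real parser would pass its own node type.
cell = SimpleNamespace(text="  Euro area \n   (19 countries)  ")
print(default_converter(cell))
```

With a default like this, users who pass no converter would still get plain text, while custom converters could read anything else from the node.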

Would this feature be of interest, or is there a reason to stick with the current parsing?

P.S.: I am new to pandas and git(hub), so please let me know if there is anything wrong with this post.



Edit:

The following code emulates the proposed treatment on a random example. I would say that, in general, `first_link`, `links` (and `whitespace`) would be the most used converters and would be worth predefining, as suggested below.

Sometimes it is useful to parse input tags like `<input id="hidObj_12546" value="1" type="hidden">`, so we may also want `input_id` and `input_value`, or even `img_alt`. But if the goal is just to cover typical use cases, I would say that `links` and `whitespace` are sufficient, with maybe one more for documentation purposes.
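
As a rough sketch of what such predefined converters could look like, the snippet below writes `whitespace`, `links`, `first_link` and `input_value` against the ElementTree-style element API (`iter`, `find`, `get`, `itertext`). That is an assumption on my part: it presumes the cell is handed over as an lxml/ElementTree-like element, whereas a BeautifulSoup `Tag` would need `find_all` and item access instead. The sample cell is made up for illustration.

```python
import xml.etree.ElementTree as ET


def whitespace(cell):
    """Return the cell's text with runs of whitespace collapsed."""
    return " ".join("".join(cell.itertext()).split())


def links(cell):
    """Return every href found in the cell, in document order."""
    return [a.get("href") for a in cell.iter("a") if a.get("href") is not None]


def first_link(cell):
    """Return the first href in the cell, or None if the cell has no link."""
    found = links(cell)
    return found[0] if found else None


def input_value(cell):
    """Return the value attribute of the first <input> in the cell, if any."""
    node = cell.find(".//input")
    return node.get("value") if node is not None else None


# Made-up stand-in for a cell node that a parser would pass to a converter.
cell = ET.fromstring(
    '<td> <a href="/quick?KEY=abc">Series\n  A</a> '
    '<a href="/help">help</a>'
    '<input id="hidObj_12546" value="1" type="hidden" /></td>'
)

print(whitespace(cell))
print(links(cell))
print(first_link(cell))
print(input_value(cell))
```

Returning `None` for a missing link or input, rather than raising, is a design choice that lets one converter be applied to a whole column where only some cells contain the element.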
 
<details>

```
from __future__ import (print_function)
import pandas as pa
import re


def series_key(tag):
    return re.search(r'KEY=(.*)', tag.a['href']).group(1)


def first_link(tag):
    return tag.a['href']


def default_conv(tag):
    return tag.text  # TODO: whitespace should also be removed here


def changed_read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None,
                        attrs=None, parse_dates=False, tupleize_cols=False, thousands=',', encoding=None,
                        decimal='.', converters=None, na_values=None, keep_default_na=True):

    parser_cls = pa.io.html._HtmlFrameParser
    original_parse_raw_data = parser_cls._parse_raw_data

    def new_parse_raw_data(self, rows):
        # Keep the raw cell nodes instead of their extracted text,
        # so that converters receive the full DOM node.
        return [list(self._parse_td(row)) for row in rows]

    # Temporarily monkeypatch the private parser; restore it even on error.
    parser_cls._parse_raw_data = new_parse_raw_data
    try:
        dfs = pa.read_html(io, match, flavor, header, index_col, skiprows,
                           attrs, parse_dates, tupleize_cols, thousands, encoding,
                           decimal, converters, na_values, keep_default_na)
    finally:
        parser_cls._parse_raw_data = original_parse_raw_data
    return dfs


converters = {0: default_conv, 1: series_key, 2: default_conv, 3: default_conv}


ecb_old = pa.read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                       attrs={'class': 'tableopenpage'})
ecb = changed_read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs={'class': 'tableopenpage'}, converters=converters)

print("current read_html\n")
print(ecb_old)
print("\nproposed read_html\n")
print(ecb)

```
</details>





Feature Request: expose full DOM nodes to converters in html_read #14608
