Enhancement request: read_html() parameter to control contents of DataFrame #26636

Open
@thedatadoc

The current read_html() method is useful for parsing simple tables; in practice, however, not every HTML table was created with simple parsing in mind. We often need to identify the type or purpose of the data in a table cell. HTML tables generally express this information by assigning a CSS class name to a th or td element.

I propose an optional read_html() string parameter called "get_attribute" that indicates whether the parser should get the text of each cell or the value of one of its attributes. If get_attribute is None, read_html() behaves as it does today: th/td cell text is read into the DataFrame. If it is a string, read_html() retrieves the value of that attribute instead of the cell text. If a th or td element has no matching attribute, the value returned is None.
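For illustration, the semantics described above can be approximated outside pandas today. The following is only a minimal sketch using the standard library's html.parser; the names CellAttributeParser and read_cell_attributes are hypothetical helpers, not part of pandas, and the sketch ignores nested tables, colspan and rowspan.

```python
from html.parser import HTMLParser


class CellAttributeParser(HTMLParser):
    """Collect one attribute value per th/td cell, row by row (hypothetical helper)."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Each <tr> starts a new row of collected values.
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            # dict.get() yields None when the attribute is absent,
            # matching the behavior proposed in this issue.
            self.rows[-1].append(dict(attrs).get(self.attribute))


def read_cell_attributes(html, attribute):
    """Return a list of rows, each a list of attribute values (or None)."""
    parser = CellAttributeParser(attribute)
    parser.feed(html)
    return parser.rows


table = """
<table>
  <tr><th class="label">Item</th><th class="amount">Change</th></tr>
  <tr><td>Widgets</td><td class="negative">-3</td></tr>
</table>
"""

classes = read_cell_attributes(table, "class")
print(classes)
```

The resulting list of rows could be handed to pandas.DataFrame to get a frame of attribute values, which is roughly what a get_attribute parameter would return directly.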

I would use this functionality as follows:

myTableData = pandas.read_html(table)[0]
myTableClasses = pandas.read_html(table, get_attribute='class')[0]

The dataframe myTableClasses would contain information that I could use to map cell styling to the values in myTableData. I imagine it could be useful for other types of information as well, conveyed via other attributes.
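To make the mapping concrete, here is a rough sketch of how a frame of class names might be used to select values. The two frames are built by hand as stand-ins, since get_attribute does not exist yet, and the sketch assumes both frames share the same shape and labels; the class names "negative" and "positive" are made up for the example.

```python
import pandas as pd

# Hand-built stand-ins for myTableData and myTableClasses.
values = pd.DataFrame({"Item": ["Widgets", "Gadgets"], "Change": [-3, 5]})
classes = pd.DataFrame({"Item": [None, None], "Change": ["negative", "positive"]})

# Boolean mask of the cells whose class attribute is "negative".
mask = classes.eq("negative")

# Keep only the flagged values; every other cell becomes missing.
flagged = values.where(mask)
print(flagged)
```

The same mask could drive other operations, such as counting flagged cells per column with mask.sum().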

    Labels

    Enhancement, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)