Enhancement request: read_html() parameter to control contents of DataFrame

The current read_html() method works well for simple tables, but in practice not every HTML table was created with simple parsing in mind. We often need to identify the type or purpose of the data in a table cell. HTML tables commonly express this information by assigning a CSS class name to a th or td element.

I propose an optional read_html() string parameter called "get_attribute" that controls whether the parser reads the text of each cell or the value of one of its attributes. If get_attribute is None, read_html() behaves as it does today: th/td cell text is read into the DataFrame. If it is a string, read_html() retrieves the value of the attribute with that name instead of the cell text. If a th or td element has no matching attribute, the value returned for that cell is None.

I would use this functionality as follows:

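```
import pandas

# table holds the HTML source (or a path/URL) containing the table
myTableData = pandas.read_html( table )[ 0 ]
# get_attribute is the parameter proposed here; it does not exist in pandas today
myTableClasses = pandas.read_html( table, get_attribute = 'class' )[ 0 ]
```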

The DataFrame myTableClasses would contain the class attribute of each cell, which I could use to map cell styling to the values in myTableData. I imagine it would be useful for information conveyed through other attributes as well.
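To make the requested semantics concrete, here is a rough sketch of the behavior using only the standard library's html.parser. It is not how read_html() is implemented and not a proposed patch: the names CellGridParser and read_cells are hypothetical, the sample table is made up for illustration, and the sketch ignores nested tables, colspan and rowspan.

```python
from html.parser import HTMLParser


class CellGridParser(HTMLParser):
    """Collect one value per th/td cell, row by row (hypothetical helper).

    With get_attribute=None the cell text is collected; otherwise the value
    of the named attribute, or None when the cell does not have it.
    """

    def __init__(self, get_attribute=None):
        super().__init__()
        self.get_attribute = get_attribute
        self.rows = []
        self._in_cell = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Each tr starts a new row in the grid.
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            if self.get_attribute is None:
                # Text mode: start buffering the cell's text.
                self._in_cell = True
                self._text = []
            else:
                # Attribute mode: missing attributes become None.
                self.rows[-1].append(dict(attrs).get(self.get_attribute))

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._in_cell:
            self.rows[-1].append("".join(self._text).strip())
            self._in_cell = False


def read_cells(html, get_attribute=None):
    """Return the table as a list of rows (hypothetical helper)."""
    parser = CellGridParser(get_attribute)
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample table for illustration.
table = """
<table>
  <tr><th class="label">Item</th><th class="label">Price</th></tr>
  <tr><td>Apple</td><td class="discount">0.50</td></tr>
</table>
"""

values = read_cells(table)
classes = read_cells(table, get_attribute="class")

# Map styling back to values: pick the text of every cell whose class is "discount".
discounted = [
    value
    for value_row, class_row in zip(values, classes)
    for value, cls in zip(value_row, class_row)
    if cls == "discount"
]
```

Because both grids have the same shape, either one can be wrapped in a pandas.DataFrame and compared cell by cell, which is the kind of mapping the proposed parameter would make possible directly from read_html().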

Enhancement request: read_html() parameter to control contents of DataFrame #26636
