read_html ignores paragraphs in table cells #24766

Open
@sasan00

Description

Code Sample, a copy-pastable example if possible

# Minimal reproduction: several <p> elements inside a single table cell
import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            <p>Field 1</p>
            <p>Field 2</p>
        </td>
        <td>
            <p>Value 1</p>
            <p>Value 2</p>
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Problem description

In the current implementation, read_html ignores the <p> tags inside a table cell, so the paragraph boundaries are lost. As a result, it is not possible to infer that Field 1 has Value 1 and Field 2 has Value 2.

Expected Output

tables[0].iat[0, 0] == 'Field 1\nField 2'
tables[0].iat[0, 1] == 'Value 1\nValue 2'
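
For illustration, here is a minimal sketch of the requested behavior using only the standard library's html.parser. It is not how pandas parses tables, and ParagraphAwareTableParser and read_cells are hypothetical names introduced just for this example. It assumes a single, non-nested table and joins the paragraphs of each cell with a newline, which is enough to pair each field with its value.

```python
from html.parser import HTMLParser


class ParagraphAwareTableParser(HTMLParser):
    """Hypothetical helper: collects table cells, joining <p> blocks with newlines."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row currently being parsed
        self._paras = None   # paragraphs of the cell currently being parsed
        self._buf = []       # raw text fragments of the current paragraph

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self._paras.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._paras, self._buf = [], []
        elif tag == "p" and self._paras is not None:
            self._flush()

    def handle_data(self, data):
        if self._paras is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._paras is not None:
            self._flush()
        elif tag in ("td", "th") and self._paras is not None:
            self._flush()
            self._row.append("\n".join(self._paras))
            self._paras = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_cells(html_text):
    """Return the table as a list of rows (assumes one non-nested table)."""
    parser = ParagraphAwareTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


html = """
<table>
    <tr>
        <td><p>Field 1</p><p>Field 2</p></td>
        <td><p>Value 1</p><p>Value 2</p></td>
    </tr>
</table>
"""

rows = read_cells(html)
# Split each cell back into its paragraphs and pair fields with values.
fields, values = (cell.split("\n") for cell in rows[0])
pairs = dict(zip(fields, values))
print(rows)
print(pairs)
```

If a DataFrame is needed, the rows list can be passed to pd.DataFrame(rows); that step is left out here so the sketch runs without pandas installed.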

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels

Bug, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)
