BUG: QST: pd.read_html gives tables with duplicated columns #56337

Open
@INF800

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# A simple notebook that reproduces the issue, including the dataset that triggers it:
https://colab.research.google.com/drive/1CM4kmgcQ3mEGlX0lXYiONYk25B8YlWo8?usp=sharing

Issue Description

When I call pd.read_html(), I get a table with many duplicated columns, and the resulting table does not make sense.

The following table:

[image]

is parsed as follows (JPM_0000019617-23-000432_TABLE154.html):

|    | 0                             | 1                             | 2                             | 3                           | 4                           | 5                           | 6                           | 7                           | 8                           |   9 |   10 |   11 | 12                        | 13                        | 14                        | 15                        | 16                        | 17                        | 18                        | 19                        | 20                        |
|---:|:------------------------------|:------------------------------|:------------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|----:|-----:|-----:|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|
|  0 | nan                           | nan                           | nan                           | nan                         | nan                         | nan                         | nan                         | nan                         | nan                         | nan |  nan |  nan | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       |
|  1 | nan                           | nan                           | nan                           | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | nan |  nan |  nan | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, |
|  2 | (in millions)                 | (in millions)                 | (in millions)                 | 2023                        | 2023                        | 2023                        | 2022                        | 2022                        | 2022                        | nan |  nan |  nan | 2023                      | 2023                      | 2023                      | nan                       | nan                       | nan                       | 2022                      | 2022                      | 2022                      |
|  3 | Underwriting                  | Underwriting                  | Underwriting                  | nan                         | nan                         | nan                         | nan                         | nan                         | nan                         | nan |  nan |  nan | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       |
|  4 | Equity                        | Equity                        | Equity                        | $                           | 317                         | nan                         | $                           | 230                         | nan                         | nan |  nan |  nan | $                         | 550                       | nan                       | nan                       | nan                       | nan                       | $                         | 472                       | nan                       |
|  5 | Debt                          | Debt                          | Debt                          | 704                         | 704                         | nan                         | 711                         | 711                         | nan                         | nan |  nan |  nan | 1376                      | 1376                      | nan                       | nan                       | nan                       | nan                       | 1685                      | 1685                      | nan                       |
|  6 | Total underwriting            | Total underwriting            | Total underwriting            | 1021                        | 1021                        | nan                         | 941                         | 941                         | nan                         | nan |  nan |  nan | 1926                      | 1926                      | nan                       | nan                       | nan                       | nan                       | 2157                      | 2157                      | nan                       |
|  7 | Advisory                      | Advisory                      | Advisory                      | 492                         | 492                         | nan                         | 645                         | 645                         | nan                         | nan |  nan |  nan | 1236                      | 1236                      | nan                       | nan                       | nan                       | nan                       | 1437                      | 1437                      | nan                       |
|  8 | Total investment banking fees | Total investment banking fees | Total investment banking fees | $                           | 1513                        | nan                         | $                           | 1586                        | nan                         | nan |  nan |  nan | $                         | 3162                      | nan                       | nan                       | nan                       | nan                       | $                         | 3594                      | nan                       |
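The repeated values above look consistent with `colspan` cells being expanded, with each spanned cell's text copied into every column it covers. This has not been verified against the source HTML of the JPM file. The sketch below illustrates that mechanism using only the standard library. The `ColspanExpander` class and the `SAMPLE_HTML` snippet are invented for illustration; this is not pandas' parser, and the snippet is not taken from the filing.

```python
from html.parser import HTMLParser


class ColspanExpander(HTMLParser):
    """Collect table rows, repeating each cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being built
        self._span = 0      # colspan of the open cell (0 means no open cell)
        self._text = []     # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span:
            text = "".join(self._text).strip()
            # One copy of the cell text per spanned column.
            self._row.extend([text] * self._span)
            self._span = 0
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup loosely modelled on the table above.
SAMPLE_HTML = """
<table>
  <tr><td colspan="3"></td><td colspan="3">Three months ended June 30,</td></tr>
  <tr><td colspan="3">(in millions)</td><td colspan="3">2023</td></tr>
  <tr><td colspan="3">Equity</td><td>$</td><td>317</td><td></td></tr>
  <tr><td colspan="3">Debt</td><td colspan="2">704</td><td></td></tr>
</table>
"""

parser = ColspanExpander()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

If the real markup is shaped like this, the triplicated labels and the `704 | 704 | nan` pattern would follow directly from the spans, not from a parsing error.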

Expected Behavior

  • The columns would be a MultiIndex built from the table's header rows
  • Only one of columns 0, 1, and 2 would exist (and likewise for every other group of duplicated columns)
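As a rough illustration of the second point, duplicated and empty columns can be removed after parsing. This is a minimal sketch on a hand-built frame that mimics a few rows of the output above. The helper name `drop_redundant_columns` is hypothetical, and the approach only removes columns that are entirely empty or exactly identical to an earlier column.

```python
import numpy as np
import pandas as pd


def drop_redundant_columns(df):
    """Drop all-NaN columns, then keep the first of any identical columns."""
    df = df.dropna(axis=1, how="all")
    # Transposing lets DataFrame.duplicated compare whole columns.
    is_repeat = df.T.duplicated().to_numpy()
    return df.loc[:, ~is_repeat]


# Hand-built stand-in for part of the parsed table (not the real data file).
raw = pd.DataFrame(
    [
        ["(in millions)", "(in millions)", "(in millions)", "2023", "2023", "2023", np.nan],
        ["Equity", "Equity", "Equity", "$", "317", np.nan, np.nan],
        ["Debt", "Debt", "Debt", "704", "704", np.nan, np.nan],
    ]
)

clean = drop_redundant_columns(raw)
print(clean)
```

Columns holding only a currency symbol in some rows, or filled only in header rows, differ from their neighbours and therefore survive. They would still need table-specific cleanup.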

Just a thought: is there an out-of-the-box solution for standardizing these kinds of tables (something like pd.melt)?
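Absent a dedicated helper, one manual approach for the header part is to promote the header rows to column levels after parsing. The sketch below assumes the header row positions are already known. `promote_header_rows` is a hypothetical helper name, and the frame is a hand-built, already de-duplicated stand-in, not the real file.

```python
import numpy as np
import pandas as pd


def promote_header_rows(df, header_rows):
    """Turn the rows at the given positions into MultiIndex column levels."""
    levels = [
        ["" if pd.isna(value) else str(value) for value in df.iloc[pos]]
        for pos in header_rows
    ]
    # Remove the header rows from the body and renumber the remaining rows.
    body = df.drop(index=df.index[header_rows]).reset_index(drop=True)
    body.columns = pd.MultiIndex.from_arrays(levels)
    return body


# Hand-built stand-in: two header rows followed by two data rows.
raw = pd.DataFrame(
    [
        [np.nan, "Three months ended June 30,", "Three months ended June 30,"],
        ["(in millions)", "2023", "2022"],
        ["Debt", "704", "711"],
        ["Advisory", "492", "645"],
    ]
)

tidy = promote_header_rows(raw, [0, 1])
print(tidy)
```

`pd.read_html` also accepts a list of row positions for its `header` parameter to build MultiIndex columns directly. Whether that gives a usable result for this file is untested here.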

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.120+
Version          : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.23.5
pytz             : 2023.3.post1
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 3.0.6
pytest           : 7.4.3
hypothesis       : None
sphinx           : 5.0.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.3
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.9
jinja2           : 3.1.2
IPython          : 7.34.0
pandas_datareader: 0.10.0
bs4              : 4.11.2
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.6.0
gcsfs            : 2023.6.0
matplotlib       : 3.7.1
numba            : 0.58.1
numexpr          : 2.8.7
odfpy            : None
openpyxl         : 3.1.2
pandas_gbq       : 0.17.9
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.11.4
snappy           : None
sqlalchemy       : 2.0.23
tables           : 3.8.0
tabulate         : 0.9.0
xarray           : 2023.7.0
xlrd             : 2.0.1
xlwt             : None
zstandard        : None
tzdata           : None

Metadata

    Labels

    Bug
    IO HTML: read_html, to_html, Styler.apply, Styler.applymap
    Needs Info: Clarification about behavior needed to assess issue
    Needs Triage: Issue that has not been reviewed by a pandas team member
