Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# A simple notebook with the dataset that triggers the issue:
https://colab.research.google.com/drive/1CM4kmgcQ3mEGlX0lXYiONYk25B8YlWo8?usp=sharing
Issue Description
When I call `pd.read_html()` on the table in the notebook, the resulting DataFrame has many duplicated columns and is hard to make sense of.
For `JPM_0000019617-23-000432_TABLE154.html`, the output is:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---:|:------------------------------|:------------------------------|:------------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|----:|-----:|-----:|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|
| 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | nan | nan | nan | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | nan | nan | nan | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, |
| 2 | (in millions) | (in millions) | (in millions) | 2023 | 2023 | 2023 | 2022 | 2022 | 2022 | nan | nan | nan | 2023 | 2023 | 2023 | nan | nan | nan | 2022 | 2022 | 2022 |
| 3 | Underwriting | Underwriting | Underwriting | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | Equity | Equity | Equity | $ | 317 | nan | $ | 230 | nan | nan | nan | nan | $ | 550 | nan | nan | nan | nan | $ | 472 | nan |
| 5 | Debt | Debt | Debt | 704 | 704 | nan | 711 | 711 | nan | nan | nan | nan | 1376 | 1376 | nan | nan | nan | nan | 1685 | 1685 | nan |
| 6 | Total underwriting | Total underwriting | Total underwriting | 1021 | 1021 | nan | 941 | 941 | nan | nan | nan | nan | 1926 | 1926 | nan | nan | nan | nan | 2157 | 2157 | nan |
| 7 | Advisory | Advisory | Advisory | 492 | 492 | nan | 645 | 645 | nan | nan | nan | nan | 1236 | 1236 | nan | nan | nan | nan | 1437 | 1437 | nan |
| 8 | Total investment banking fees | Total investment banking fees | Total investment banking fees | $ | 1513 | nan | $ | 1586 | nan | nan | nan | nan | $ | 3162 | nan | nan | nan | nan | $ | 3594 | nan |
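A possible explanation for the repeated columns, assuming the source markup uses `colspan` (I have not confirmed this for the JPM file): `read_html` copies the text of a spanned cell into every column that cell covers. The sketch below uses a small, hypothetical HTML snippet (not the real filing) and a hypothetical helper `parse_demo_table` to show the effect. It needs `lxml` or `bs4` + `html5lib` installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the filing's markup, NOT the real JPM table:
# label cells span three columns, one value cell spans two.
HTML = """
<table>
  <tr><td colspan="3">(in millions)</td><td colspan="3">2023</td></tr>
  <tr><td colspan="3">Equity</td><td>$</td><td>317</td><td></td></tr>
  <tr><td colspan="3">Debt</td><td colspan="2">704</td><td></td></tr>
</table>
"""


def parse_demo_table(html: str = HTML) -> pd.DataFrame:
    """Parse the demo markup and return the first table found."""
    # Wrapping in StringIO avoids the literal-string deprecation
    # warning that newer pandas versions emit.
    return pd.read_html(StringIO(html))[0]


# Example call (requires an HTML parser backend to be installed):
# df = parse_demo_table()
# print(df)
```

With this markup the label text shows up in columns `0`, `1`, and `2`, which matches the pattern in the output above.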
Expected Behavior
- The columns would be multi-indexed based on the table headers.
- Only one of the columns `0`, `1`, and `2` would exist (and likewise for every other group of repeated columns).
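To make the expected behavior concrete, here is a rough post-processing sketch, not a pandas feature. It works on a simplified excerpt of the output above (two header rows, two data rows, one column per value). The helper names `collapse_repeated_columns` and `promote_header_rows` are hypothetical, and the number of header rows is assumed to be known in advance.

```python
import pandas as pd


def collapse_repeated_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep a column only if it differs from the previously kept one."""
    keep = []
    for col in df.columns:
        if keep and df[col].equals(df[keep[-1]]):
            continue  # same content as its left neighbour: skip it
        keep.append(col)
    return df[keep]


def promote_header_rows(df: pd.DataFrame, n_header_rows: int) -> pd.DataFrame:
    """Turn the first n_header_rows rows into MultiIndex column labels."""
    header = df.iloc[:n_header_rows].fillna("")
    body = df.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())
    return body


# Simplified excerpt of the issue's output (values copied from it).
three = "Three months ended June 30,"
six = "Six months ended June 30,"
raw = pd.DataFrame(
    [
        [None, None, None, three, three, six, six],
        ["(in millions)"] * 3 + ["2023", "2022", "2023", "2022"],
        ["Debt"] * 3 + ["704", "711", "1376", "1685"],
        ["Advisory"] * 3 + ["492", "645", "1236", "1437"],
    ]
)

tidy = promote_header_rows(collapse_repeated_columns(raw), n_header_rows=2)
print(tidy)
```

The real output also contains separate `$` columns and all-NaN spacer columns, which this sketch does not handle.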
Just a thought: is there an out-of-the-box way to standardize these kinds of tables (something like `pd.melt`)?
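For illustration, this is roughly the long format I have in mind, assuming the table has already been cleaned into a frame with two column levels. The level names `period` and `year` and the index name `line_item` are my own labels, not something `read_html` produces. The values are copied from the output above.

```python
import pandas as pd

# Assumed cleaned-up shape: line items as the index, (period, year) columns.
columns = pd.MultiIndex.from_tuples(
    [
        ("Three months ended June 30,", "2023"),
        ("Three months ended June 30,", "2022"),
        ("Six months ended June 30,", "2023"),
        ("Six months ended June 30,", "2022"),
    ],
    names=["period", "year"],
)
wide = pd.DataFrame(
    [[704, 711, 1376, 1685], [492, 645, 1236, 1437]],
    index=pd.Index(["Debt", "Advisory"], name="line_item"),
    columns=columns,
)

# melt turns each column level into its own variable column;
# ignore_index=False keeps the line items so they can become a column.
long = wide.melt(ignore_index=False).reset_index()
print(long)
```

Each row of `long` then holds one line item, one period, one year, and one value, which is easier to filter or aggregate than the wide layout.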
Installed Versions
INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.120+
Version : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.3
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.6
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader: 0.10.0
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
snappy : None
sqlalchemy : 2.0.23
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : None