Problem reading HTML tables with colspan #14267

Closed
adailsonfilho opened this issue Sep 21, 2016 · 2 comments
Labels
Duplicate Report (Duplicate issue or pull request); IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

Comments

@adailsonfilho

Hi, guys! First of all, thanks for making this open source! =D Keep up the good work!

SCRIPT

import pandas as pd

dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')

print(dfs[0])

In the output, wherever a cell has a colspan, only the first column is filled in. The remaining values in that row are shifted left by the colspan amount, which leaves the cells at the end of the row as "NaN".

#### Expected Output
On this web page, the "capital" and "largest city" cells are merged with a colspan when they hold the same value. In general, I think we would expect the value to be duplicated across every column the cell spans (the colspan amount) in the generated DataFrame.
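
To make the expectation concrete, here is a minimal sketch in plain Python (no pandas) that contrasts the two behaviours on one hand-written row. The helper names `shift_left` and `expand_colspan`, the `(text, colspan)` tuple representation, and the placeholder cell "value" are all hypothetical and exist only for illustration; none of this is how pandas represents cells internally.

```python
# One row as hypothetical (text, colspan) pairs; "value" is a placeholder cell.
row = [("Arizona", 1), ("Phoenix", 2), ("value", 1)]
n_columns = 4  # the header has four columns in this made-up example


def shift_left(cells, width):
    """Behaviour reported above: colspan is ignored and the tail is padded."""
    texts = [text for text, _ in cells]
    # None stands in for the NaN that shows up at the end of the row.
    return texts + [None] * (width - len(texts))


def expand_colspan(cells):
    """Expected behaviour: repeat each cell's text across every column it spans."""
    expanded = []
    for text, span in cells:
        expanded.extend([text] * span)
    return expanded


print(shift_left(row, n_columns))
print(expand_colspan(row))
```

With the expected behaviour, "Phoenix" fills both the capital and the largest-city position, and the trailing value stays in its own column instead of moving left.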

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 15 Stepping 13, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

@sinhrks added the IO HTML label Oct 4, 2016
@brianhuey
Contributor

brianhuey commented Jan 28, 2017

I believe @adailsonfilho means that there are colspan attributes on the columns labeled "Cities" and "Area in mi² (km²)[B][16]" in the first header row. The solution would be that, for any cell with a colspan value of n, the cell text is repeated across the next n-1 columns of the same row. This issue applies to any td or th tag, regardless of whether it is located in thead or tbody (see the cell with the text "Phoenix" in the third row of the table referenced above).
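
The rule described above could be sketched roughly as follows, using only the standard library's `html.parser`. This is an illustration under stated assumptions, not the pandas implementation: the class name `ColspanTableParser` and the sample markup are made up, a missing or non-numeric colspan is treated as 1, rowspan is ignored, and every td/th is assumed to have an explicit closing tag.

```python
from html.parser import HTMLParser


class ColspanTableParser(HTMLParser):
    """Collect table rows, repeating each cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of strings
        self._row = None   # row being built, or None outside <tr>
        self._span = 0     # colspan of the open cell; 0 when no cell is open
        self._text = []    # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            # td and th are handled identically, whether in thead or tbody.
            value = dict(attrs).get("colspan") or "1"
            # Assumption: anything that is not a positive integer counts as 1.
            self._span = max(int(value), 1) if value.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span:
            text = "".join(self._text).strip()
            # A cell with colspan n fills its own column plus the next n-1.
            self._row.extend([text] * self._span)
            self._span = 0
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample markup, not taken from the Wikipedia page.
SAMPLE = """
<table>
  <thead><tr><th>State</th><th colspan="2">Cities</th></tr></thead>
  <tbody>
    <tr><td>Arizona</td><td colspan="2">Phoenix</td></tr>
    <tr><td>State X</td><td>Capital X</td><td>City X</td></tr>
  </tbody>
</table>
"""

parser = ColspanTableParser()
parser.feed(SAMPLE)
for parsed_row in parser.rows:
    print(parsed_row)
```

Every row comes out with the same number of columns, so the rows could be passed to a DataFrame constructor without any values sliding left.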

@chris-b1
Contributor

closing in favor of #17054

@chris-b1 added the Duplicate Report label Jul 23, 2017
@chris-b1 added this to the No action milestone Jul 23, 2017
4 participants