Skip to content

BUG: thousands separator in read_html alters data even though converter is set for a specific column #39005

Open
@GGegenhuber

Description

@GGegenhuber
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

def convert_date_fix(date_str):
  return pd.to_datetime(date_str, format='%d0%m%Y')

html_table = """
<table class="gridview" style="width:600px;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="0">
  <tbody>
    <tr>
      <th scope="col" align="left">Portfolio</th>
      <th scope="col" align="right">Waehrung</th>
      <th scope="col" align="right">Volumen</th>
      <th scope="col" align="right">Datum</th>
    </tr>
    <tr>
        <td align="left">Aktie A</td>
        <td align="right">EUR</td>
        <td align="right">20.320.945,77</td>
        <td align="right">05.01.2021</td>
    </tr>
    <tr>
      <td align="left">Aktie B</td>
      <td align="right">EUR</td>
      <td align="right">4.133.996,41</td>
      <td align="right">05.01.2021</td>
    </tr>
    <tr>
        <td align="left">Aktie C</td>
        <td align="right">EUR</td>
        <td align="right">3.855.218,70</td>
        <td align="right">05.01.2021</td>
      </tr>
  </tbody>
</table>"""

#ex1: date-string is converted to int64 with . separator (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',')[0]
print(data_frame)
print(data_frame.dtypes)

#ex2: date-string is converted to object but string is altered same as above (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : str})[0]
print(data_frame)
print(data_frame.dtypes)

#ex3: ugly fix that reverts conversion and acutally turns the column into a date
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : convert_date_fix})[0]
print(data_frame)
print(data_frame.dtypes)

Problem description

I've set thousands and decimal separator parameters to fit the general number format that is used in Germany and also in my html-table example.

ex1: although the date column gets wrongfully interpreted as integer value, the behaviour is reasonable since no specific converters were defined for that column.

ex2 (the actual problem/bug): when setting a specific converter for the date column and thereby define it as string I'd expect the parser to keep the value untouched and return '05.01.2021', however it also interprets the string as integer and applies the corresponding conversion as above.

ex3: even when setting an own function as converter the original value is still altered and the integer conversion needs to be reverted in my helper function.

Expected Output

When a converter is defined for a specific column, the original value should be passed to the conversion function and conversions that come from thousands-separator for numeric values should not be applied.

Additional improvement/feature request

Adapt read_html to accept a date_parser as in read_csv which would make it easier to automatically parse dates in various formats. The read_html function currently has a boolean parse_dates param (same as read_csv) but in contrast does not support conversion of unconventional date formats.

see also #10684 or this stackoverflow post.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Austria.1252

pandas : 1.2.0
numpy : 1.20.0rc2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugIO HTMLread_html, to_html, Styler.apply, Styler.applymap

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions