Description
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
def convert_date_fix(date_str):
return pd.to_datetime(date_str, format='%d0%m%Y')
html_table = """
<table class="gridview" style="width:600px;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="0">
<tbody>
<tr>
<th scope="col" align="left">Portfolio</th>
<th scope="col" align="right">Waehrung</th>
<th scope="col" align="right">Volumen</th>
<th scope="col" align="right">Datum</th>
</tr>
<tr>
<td align="left">Aktie A</td>
<td align="right">EUR</td>
<td align="right">20.320.945,77</td>
<td align="right">05.01.2021</td>
</tr>
<tr>
<td align="left">Aktie B</td>
<td align="right">EUR</td>
<td align="right">4.133.996,41</td>
<td align="right">05.01.2021</td>
</tr>
<tr>
<td align="left">Aktie C</td>
<td align="right">EUR</td>
<td align="right">3.855.218,70</td>
<td align="right">05.01.2021</td>
</tr>
</tbody>
</table>"""
#ex1: date-string is converted to int64 with . separator (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',')[0]
print(data_frame)
print(data_frame.dtypes)
#ex2: date-string is converted to object but string is altered same as above (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : str})[0]
print(data_frame)
print(data_frame.dtypes)
#ex3: ugly fix that reverts conversion and acutally turns the column into a date
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : convert_date_fix})[0]
print(data_frame)
print(data_frame.dtypes)
Problem description
I've set thousands
and decimal
separator parameters to fit the general number format that is used in Germany and also in my html-table example.
ex1: although the date column gets wrongfully interpreted as integer value, the behaviour is reasonable since no specific converters
were defined for that column.
ex2 (the actual problem/bug): when setting a specific converter
for the date column and thereby define it as string I'd expect the parser to keep the value untouched and return '05.01.2021', however it also interprets the string as integer and applies the corresponding conversion as above.
ex3: even when setting an own function as converter
the original value is still altered and the integer conversion needs to be reverted in my helper function.
Expected Output
When a converter
is defined for a specific column, the original value should be passed to the conversion function and conversions that come from thousands-separator for numeric values should not be applied.
Additional improvement/feature request
Adapt read_html
to accept a date_parser
as in read_csv
which would make it easier to automatically parse dates in various formats. The read_html
function currently has a boolean parse_dates
param (same as read_csv
) but in contrast does not support conversion of unconventional date formats.
see also #10684 or this stackoverflow post.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Austria.1252
pandas : 1.2.0
numpy : 1.20.0rc2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None