Describe the bug, including details regarding any error messages, version, and platform.
Platform: Ubuntu 20.04
Version: pyarrow 11.0 and 12.0 (with pandas 1.4.1 and 2.0.1)
Hello, I've posted this issue on the pandas issue tracker as well, and they asked me to report it here too:
# Contents of bug.csv:
"""
user_id,value
1225717802.1679841607,33
"""
import pandas as pd
a = pd.read_csv("bug.csv", dtype={"user_id": str})
b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")
print(a.user_id.iloc[0]) # 1225717802.1679841607
print(b.user_id.iloc[0]) # 1225717802.1679842
assert a.user_id.dtype == b.user_id.dtype # <- both are strings
assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails
It seems that, under the hood, the pyarrow CSV reader handles columns explicitly requested as strings differently than expected: some automatic type conversion happens before the explicit string conversion takes place. In this case the value is first interpreted as a float, loses digits because of floating-point precision, and only then is converted back to a string.
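The suspected float round trip can be illustrated with plain Python, without pandas or pyarrow. This is only a sketch of the precision loss itself, assuming the value passes through a 64-bit float at some point; it does not show what pyarrow actually does internally.

```python
# The user_id value from bug.csv, kept as text.
raw = "1225717802.1679841607"

# Simulate the suspected intermediate step: parse as a 64-bit float,
# then convert back to a string.
as_float = float(raw)
round_tripped = str(as_float)

print(raw)            # the original 20 significant digits
print(round_tripped)  # the shortest string that still maps to the same float

# A double cannot hold 20 significant digits, so the text differs
# even though both strings parse to the same float.
print(round_tripped == raw)
print(float(round_tripped) == as_float)
```

The last two lines show why the problem is easy to miss: the float value is unchanged, but an identifier stored as text is no longer the same identifier.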
Component(s)
Python
westonpace changed the title from "PyArrow's csv reader yields different results than the default pandas" to "[C++] PyArrow's csv reader yields different results than the default pandas" on May 18, 2023
The behaviour you noticed does indeed come from the value being parsed as a float first and only afterwards cast to string. However, if you use pyarrow's CSV reader directly with the column_types argument, this is done properly.
So I assume this is actually a bug in pandas after all (in how pandas integrates with the pyarrow CSV reader, and how it translates its own arguments into the arguments passed to pyarrow). I am therefore closing this issue and will re-open the one on the pandas side (pandas-dev/pandas#53269).
jorisvandenbossche changed the title from "[C++] PyArrow's csv reader yields different results than the default pandas" to "[Python] PyArrow's csv reader yields different results than the default pandas" on May 23, 2023