[Python] PyArrow's csv reader yields different results than the default pandas #35661

Meehai · 2023-05-18T06:49:36Z

Describe the bug, including details regarding any error messages, version, and platform.

Platform: Ubuntu 20.04
Version: 11.0 and 12.0 (w/ pandas 1.4.1 and 2.0.1)

Hello, I've posted this issue on the pandas board as well, and they've asked me to put it here too:

"""
user_id,value
1225717802.1679841607,33
"""

import pandas as pd
a = pd.read_csv("bug.csv", dtype={"user_id": str})
b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")

print(a.user_id.iloc[0]) # 1225717802.1679841607
print(b.user_id.iloc[0]) # 1225717802.1679842

assert a.user_id.dtype == b.user_id.dtype # <- both are strings
assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails

It seems that, under the hood, pyarrow's read_csv handles columns explicitly requested as strings differently than expected: an automatic type conversion happens before the explicit string conversion. In this case the value is first parsed as a float, loses its trailing digits to floating-point precision, and only then is converted back to a string.
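The suspected round trip can be reproduced without any CSV reader. The following is a minimal sketch in plain Python, assuming the value is parsed into a 64-bit float before being turned back into text; it is not how either library is implemented, only an illustration of where the digits go. The variable names are made up for the example.

```python
# The raw text from the CSV cell, with 20 significant digits.
raw = "1225717802.1679841607"

# Parsing as a 64-bit float keeps only about 15 to 17 significant digits.
as_float = float(raw)

# Converting back to text yields the shortest string that round-trips
# to the same float, which is no longer the original text.
round_tripped = str(as_float)

print(raw)            # 1225717802.1679841607
print(round_tripped)  # 1225717802.1679842
```

The shortened string matches the value reported for the pyarrow engine above, which is consistent with a float conversion happening before the string cast.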

Component(s)

Python


jorisvandenbossche · 2023-05-23T16:07:37Z

The behaviour you notice does indeed come from the column being parsed as a float and only afterwards cast to string. However, if you use pyarrow's csv reader directly with the column_types argument, the column is read as a string correctly:

>>> import pyarrow as pa
>>> from pyarrow import csv
>>> csv.read_csv("bug.csv")
pyarrow.Table
user_id: double
value: int64
----
user_id: [[1225717802.1679842]]
value: [[33]]

>>> csv.read_csv("bug.csv", convert_options=csv.ConvertOptions(column_types={"user_id": pa.string()}))
pyarrow.Table
user_id: string
value: int64
----
user_id: [["1225717802.1679841607"]]
value: [[33]]

So I assume this is actually a bug in pandas after all (in how pandas integrates with the pyarrow csv reader, and how it translates its own arguments into the arguments it passes to pyarrow). I am therefore closing this issue and will re-open the one on the pandas side (pandas-dev/pandas#53269).

Meehai added the Type: bug label May 18, 2023

github-actions bot added the Component: Python label May 18, 2023

Meehai mentioned this issue May 18, 2023

BUG: pyarrow's csv reader yields different results than the deafult csv reader on strings that look like floating points pandas-dev/pandas#53269


westonpace changed the title ~~PyArrow's csv reader yields different results than the default pandas~~ [C++] PyArrow's csv reader yields different results than the default pandas May 18, 2023

jorisvandenbossche closed this as completed May 23, 2023

jorisvandenbossche reopened this May 23, 2023

jorisvandenbossche closed this as not planned May 23, 2023

jorisvandenbossche changed the title ~~[C++] PyArrow's csv reader yields different results than the default pandas~~ [Python] PyArrow's csv reader yields different results than the default pandas May 23, 2023

jorisvandenbossche added Type: usage Issue is a user question and removed Type: bug labels May 23, 2023
