[Python] PyArrow's csv reader yields different results than the default pandas #35661

Closed
Meehai opened this issue May 18, 2023 · 1 comment
Labels
Component: Python Type: usage Issue is a user question

@Meehai
Meehai commented May 18, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Platform: Ubuntu 20.04
Version: 11.0 and 12.0 (w/ pandas 1.4.1 and 2.0.1)

Hello, I've posted this issue on the pandas board as well, and they've asked me to put it here too:

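import pandas as pd

# Create bug.csv with the two lines that trigger the problem.
with open("bug.csv", "w") as f:
    f.write("user_id,value\n1225717802.1679841607,33\n")

a = pd.read_csv("bug.csv", dtype={"user_id": str})  # default C engine
b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")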

print(a.user_id.iloc[0]) # 1225717802.1679841607
print(b.user_id.iloc[0]) # 1225717802.1679842

assert a.user_id.dtype == b.user_id.dtype # <- both are strings
assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails

It seems that, under the hood, pyarrow's read_csv does not handle a column explicitly requested as a string the way I would expect: an automatic type conversion appears to happen before the explicit string conversion. Here the value is first interpreted as a float, loses digits because of limited float precision, and only then is converted back to a string.
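The suspected parse-then-cast sequence can be imitated in plain Python, without pandas or pyarrow. This is only a sketch of the effect described above, assuming the inferred column type is a 64-bit float; it is not the code path either library actually runs.

```python
# The user_id text exactly as it appears in bug.csv.
raw = "1225717802.1679841607"

# Step 1: type inference parses the field as a 64-bit float.
as_float = float(raw)

# Step 2: the float is cast to a string afterwards, so only the digits
# that the float can represent are available to print.
round_tripped = str(as_float)

print(raw)
print(round_tripped)         # shorter than raw (the issue reports 1225717802.1679842)
print(raw == round_tripped)  # False: the trailing digits were lost in step 1
```

A 64-bit float holds at most 17 significant decimal digits, while the original field has 20, so no later cast to string can recover the missing ones.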

Component(s)

Python

@westonpace westonpace changed the title PyArrow's csv reader yields different results than the default pandas [C++] PyArrow's csv reader yields different results than the default pandas May 18, 2023
@jorisvandenbossche
Member

The behaviour you see does indeed come from the value being parsed as a float first and only afterwards cast to a string. However, if you use pyarrow's csv reader directly and pass the column_types argument, the column is read correctly:

>>> import pyarrow as pa
>>> from pyarrow import csv
>>> csv.read_csv("bug.csv")
pyarrow.Table
user_id: double
value: int64
----
user_id: [[1225717802.1679842]]
value: [[33]]

>>> csv.read_csv("bug.csv", convert_options=csv.ConvertOptions(column_types={"user_id": pa.string()}))
pyarrow.Table
user_id: string
value: int64
----
user_id: [["1225717802.1679841607"]]
value: [[33]]

So I assume this is actually a bug in pandas after all (in how pandas integrates with the pyarrow csv reader, and how it translates its own arguments into the arguments passed to pyarrow). I am therefore closing this issue and will re-open the one on the pandas side (pandas-dev/pandas#53269).
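Until the pandas side is resolved, one possible workaround is to call pyarrow's csv reader directly with column_types, as in the session above. The sketch below wraps that call in a function; the helper names read_user_ids_as_text and write_sample_csv are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def write_sample_csv(directory):
    """Write the two-line bug.csv from the report and return its path."""
    path = os.path.join(directory, "bug.csv")
    with open(path, "w") as f:
        f.write("user_id,value\n1225717802.1679841607,33\n")
    return path


def read_user_ids_as_text(path):
    """Read the file with pyarrow directly, declaring user_id as a string."""
    # Imported here so that pyarrow is only needed when the function is called.
    import pyarrow as pa
    from pyarrow import csv

    # column_types tells the reader the target type up front, so the field
    # is never parsed as a float in the first place.
    options = csv.ConvertOptions(column_types={"user_id": pa.string()})
    table = csv.read_csv(path, convert_options=options)
    return table.column("user_id").to_pylist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(read_user_ids_as_text(write_sample_csv(tmp)))
```

If a DataFrame is needed, the resulting table can be converted with table.to_pandas() instead of extracting a single column.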

@jorisvandenbossche jorisvandenbossche closed this as not planned May 23, 2023
@jorisvandenbossche jorisvandenbossche changed the title [C++] PyArrow's csv reader yields different results than the default pandas [Python] PyArrow's csv reader yields different results than the default pandas May 23, 2023
@jorisvandenbossche jorisvandenbossche added Type: usage Issue is a user question and removed Type: bug labels May 23, 2023