Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow numbers with thousand seperators to be parsed as Integer in read_csv #18465

Open
david-waterworth opened this issue Aug 29, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@david-waterworth
Copy link

Description

I'm running into issues reading tsv files that have been exported as utf-16 that include numeric columns formatted to contain thousand separators (i.e. comma's). When polars attempts to infer the schema, the type will flip/flop between str and int64 depending on whether the column contains a value > 999 or not. I posted on SO but all the replies were more refined versions of my work-around, i.e. to replace the thousands separator with a zero-length string before converting to integer.

The issue here is I have to do this on any numeric column that contains a value > 999, which is data dependent. I found a work-around which is to simply use replace on all numeric columns even if they're inferred as int rather than string, i.e.

from io import BytesIO
import polars as pl

data = BytesIO(
"""
Id\tA\tB
1\t537\t2,288
2\t325\t1,047
3\t98\t194
""".encode("utf-16")
)

df = pl.read_csv(data, encoding="utf-16", separator="\t")
# B is pl.String()
df = df.with_columns(pl.col('A', 'B').str.replace_all(",", "").cast(pl.Int64))
# Now A and B are both Int64

This seems like something that could break in future releases - does pl.col('A', 'B').str select both A and B and then filter by type (pl.String) before calling replace? otherwise it seems like a side-effect?

In any case it's awkward, if I specify that a column in Int64 (i.e. using schema= or .cast(pl.Int64)) and it's in a format that can be coerced to an Int from a string, ideally Polars should do this - but when there's a thousands separator it doesn't. Or at least there should be an explcit str.convert(pl.Int64) that works in this case?

@david-waterworth david-waterworth added the enhancement New feature or an improvement of an existing feature label Aug 29, 2024
@cmdlineluser
Copy link
Contributor

Yeah, pandas has thousands= for that.

>>> pl.from_pandas(pd.read_csv(data, encoding="utf-16", sep="\t", thousands=","))
shape: (3, 3)
┌─────┬─────┬──────┐
│ IdAB    │
│ ---------  │
│ i64i64i64  │
╞═════╪═════╪══════╡
│ 15372288 │
│ 23251047 │
│ 398194  │
└─────┴─────┴──────┘

It seems .str.replace_all is actually casting from i64 to str

>>> df.select(pl.col("A").str.replace_all(",", ""))
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ 537 │
│ 325 │
│ 98  │
└─────┘

Which feels like it should raise an InvalidOperationError instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants