Allow numbers with thousands separators to be parsed as Integer in read_csv
#18465
Labels
enhancement
read_csv
Description
I'm running into issues reading TSV files that have been exported as UTF-16 and that include numeric columns formatted with thousands separators (i.e. commas). When Polars attempts to infer the schema, the type flip-flops between `str` and `int64` depending on whether the column contains a value > 999 or not. I posted on SO, but all the replies were more refined versions of my workaround, i.e. to replace the thousands separator with a zero-length string before converting to integer. The issue here is that I have to do this for any numeric column that contains a value > 999, which is data dependent. I found a workaround, which is to simply use `replace` on all numeric columns, even those inferred as int rather than string.

This seems like something that could break in future releases - does `pl.col('A', 'B').str` select both A and B and then filter by type (`pl.String`) before calling `replace`? Otherwise it seems like a side effect?

In any case it's awkward: if I specify that a column is Int64 (i.e. using
`schema=` or `.cast(pl.Int64)`) and its values are in a format that can be coerced from string to Int, ideally Polars should do this - but when there's a thousands separator it doesn't. Or at least there should be an explicit `str.convert(pl.Int64)` that works in this case?