Reading CSV files with variable number of columns not supported #1505
Comments
You can use
|
Many thanks! That's a very useful tool I wasn't aware of. |
When the CSV file has no header, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum line length among the lines we scan for dtype inference instead. |
@ritchie46 Where is that code located? For #1492 it would probably also be better if the column names could be retrieved, as the code I have now to fix it in Python only works in specific conditions (when the column names are given as input or are autogenerated, but not in other cases). |
Here it is: polars/polars/polars-io/src/csv_core/utils.rs Line 141 in 3d99b45
I think only the
Only when the column names are overwritten and there is no header should we modify it, I think. In the other cases the dtypes dict should be correct, right? So I believe we have all the information needed to overwrite the new_names with the auto-generated ones. |
Not when the user provides, |
I am not sure I understand the issue here. I see that |
I think I already fixed this issue. Edit: not entirely certain anymore |
Okay, I will be happy to get the commit that you think might have fixed the issue. |
I normally fixed it here: ee26601 |
Infer the number of columns of a header-less csv from the same group of rows which are used to infer the types. The old logic is used if `infer_schema_length` is set to 0. Closes pola-rs#1505
Infer the number of columns of a header-less csv from the same group of rows which are used to infer the types. The old logic, which counts only the columns in the first row, is used if `infer_schema_length` is set to 0. Closes pola-rs#1505
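For readers following along, the behaviour described in the commit message can be sketched in plain Python. This is only an illustration using the standard `csv` module, not the Rust code in polars: the helper name `infer_column_names` and the default of 100 rows are assumptions made for this sketch.

```python
import csv
import io


def infer_column_names(text, infer_schema_length=100):
    """Hypothetical helper: auto-generate names for a header-less CSV.

    Scans up to `infer_schema_length` rows and uses the widest one.
    With `infer_schema_length=0` it falls back to the old logic and
    counts only the columns in the first row.
    """
    reader = csv.reader(io.StringIO(text))
    rows_to_scan = max(infer_schema_length, 1)  # 0 means "first row only"
    width = 0
    for i, row in enumerate(reader):
        if i >= rows_to_scan:
            break
        width = max(width, len(row))
    return [f"column_{i + 1}" for i in range(width)]


# The test.csv dataset from the issue description.
sample = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
print(infer_column_names(sample))                         # six names
print(infer_column_names(sample, infer_schema_length=0))  # three names
```

With the dataset from the issue, scanning all three rows yields `column_1` through `column_6`, while the first-row-only fallback yields just three names, which matches the originally reported behaviour.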
Any updates on this? Still doesn't work using |
Is this issue resolved? Can I take this and open a PR? |
@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it would be closed, so I don't see why not. One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python. |
Thanks, I will take this issue and open a PR. |
Are you using Python or Rust?
Python
Which feature gates did you use?
This can be ignored by Python users.
What version of polars are you using?
0.9.12
What operating system are you using polars on?
macOS
Describe your bug.
When reading a CSV file with a variable number of columns, polars assumes all rows have the number of columns inferred from the first row (?) and skips parsing any subsequent columns. Providing the columns to be parsed explicitly via the columns parameter results in an error:
RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))
What are the steps to reproduce the behavior?
Dataset (test.csv):
a,b,c
a,b,c,d,e,f
g,h,i,j,k
Example 1 (no error, but reads only 3 columns instead of 6)
Example 2 (results in error)
What is the actual behavior?
Columns beyond the ones inferred from the first data row are not parsed.
What is the expected behavior?
All columns are parsed but are set to NaN/None for rows that don't have data for these columns.
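To make the expected result concrete, here is a minimal sketch of it using only Python's standard `csv` module. It is not polars code: the function name `read_ragged_csv` is hypothetical, and the `column_N` naming simply mirrors the auto-generated names that appear in the error message above.

```python
import csv
import io


def read_ragged_csv(text):
    """Hypothetical helper: read a header-less CSV with ragged rows.

    Returns a dict mapping auto-generated column names to lists of
    values, with None wherever a row has no data for that column.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(row) for row in rows), default=0)
    names = [f"column_{i + 1}" for i in range(width)]
    columns = {name: [] for name in names}
    for row in rows:
        # Pad short rows so every column gets exactly one value per row.
        padded = row + [None] * (width - len(row))
        for name, value in zip(names, padded):
            columns[name].append(value)
    return columns


# The test.csv dataset from the steps to reproduce.
table = read_ragged_csv("a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n")
print(table["column_4"])  # [None, 'd', 'j']
print(table["column_6"])  # [None, 'f', None]
```

All six columns are present, and the rows that are shorter than the widest row are filled with `None` rather than being truncated.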