feat(io): complete read_csv #183
Conversation
Thanks @yingmanwumen, I may need more time to review this PR.

@yingmanwumen I think the way we are comparing floating numbers in Python is not good enough. The …
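The comment is truncated here, but a common way to make float comparisons robust in tests is relative-tolerance comparison, e.g. `math.isclose` from the stdlib (pytest's `approx` behaves similarly). A minimal sketch, not ulist's actual test code; the values are taken from the `float32` precision example discussed in this PR:

```python
import math

# Exact equality is brittle once float32 precision is involved
a = 3.141590118408203   # 3.14159 after a round-trip through float32
b = 3.14159

print(a == b)                            # False
print(math.isclose(a, b, rel_tol=1e-6))  # True
```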
```python
        })
    ],
)
def test_constructors(
```
Perhaps another name, such as `test_inputs`?
```python
@pytest.mark.parametrize(
    "test_method, args, kwargs, expected_value",
    [
        (ul.read_csv, (), {
```
I am wondering if the error messages would be raised as expected if the schema dtype does not match the actual dtype. So shall we add some more tests like this?
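A dtype-mismatch test could be sketched roughly like this; `parse_column` is a hypothetical stand-in for the parsing step, not ulist's actual API:

```python
def parse_column(values, dtype):
    # Hypothetical stand-in for a CSV column parser that enforces a schema dtype
    casters = {"int": int, "float": float}
    try:
        return [casters[dtype](v) for v in values]
    except ValueError as exc:
        raise ValueError(f"cannot parse column as {dtype}: {exc}") from exc

# A schema dtype that does not match the actual data should raise a clear error
try:
    parse_column(["1", "2", "not-a-number"], "int")
except ValueError as e:
    print("raised:", e)
```

In a pytest suite the `try`/`except` would become a `pytest.raises(ValueError)` block asserting on the message.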
Other cases could be:
- The `schema.len()` is greater than the number of columns in the `.csv` file;
- The `schema.len()` is less than the number of columns in the `.csv` file;
- The column name does not match.
> Other cases could be:
> * The `schema.len()` is greater than the number of columns in the `.csv` file;
> * The `schema.len()` is less than the number of columns in the `.csv` file;
> * The column name does not match.
Hello,
the strategy I chose is:
- If a field in `schema` doesn't exist in the `.csv` file, then `ulist` will return an empty list for it, such as `{..., "bar": [], ...}`;
- If a field in the `.csv` file doesn't exist in `schema`, then `ulist` will ignore it.

So currently, the cases above cannot cause an exception.
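The strategy above can be sketched as a toy model in plain Python with the stdlib `csv` module (this is an illustration, not ulist's Rust implementation; `read_csv_like` and the sample data are made up):

```python
import csv
import io

CSV_TEXT = "foo,extra\n1,x\n2,y\n"

def read_csv_like(text, schema):
    # Toy model of the described strategy:
    # - a schema field missing from the CSV yields an empty list;
    # - a CSV column missing from the schema is ignored.
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        name: [row[name] for row in rows] if rows and name in rows[0] else []
        for name in schema
    }

print(read_csv_like(CSV_TEXT, {"foo": "int", "bar": "int"}))
# {'foo': ['1', '2'], 'bar': []}  -- "bar" is empty, "extra" is ignored
```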
For 2., I agree that the result of `read_csv` could be a subset of the content of the `.csv` file, so could you also add another test case, where the schema has fewer columns than the csv file?
For 1., let's do some research on mainstream data analysis and data processing libs, or even databases, to see what kind of behavior would be better.

1. Pandas, the most popular `DataFrame` lib: the `read_csv` function will always read **all the columns** from the csv file, no matter how many col names are in the schema. See the [Doc](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

   Suppose the `tmp.csv` file contains only columns `a` and `b`:

   ```python
   >>> import pandas as pd
   >>> pd.read_csv("tmp.csv")
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int, 'd': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'c': int, 'd': int})
      a  b
   0  1  2
   ```

2. SQL, the most used language for databases: it does a very strict check of column names, and will raise an error if the name does not exist in the database. For example, if there is a table named `tmp` which contains columns `a` and `b`, the following script will just not work:

   ```sql
   SELECT CAST(tmp.c AS INT) as c from tmp
   ```

If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?
Thanks so much~
And please feel free to investigate any other popular projects and share your ideas here. The reason why we have to be cautious is that in real industry usage, the number of columns could be 10, 20, 100, or even thousands, and it's very common that developers accidentally type incorrect col names in the schema. So whether to simply ignore the typo, raise a warning, or raise an error is an important topic to discuss.
> For 2., I agree that the result of `read_csv` could be a subset of the content of the `.csv` file, so could you also add another test case, where the schema has fewer columns than the csv file?
Sure, the test case is appended in the latest commit b86fc43:
```python
...
(ul.read_csv, (), {  # schema.len() < field.len()
    "path": "./test_csv/04_test_nan.csv",
    "schema": {"int": "int",
               "bool": "bool"}
}, {
    "int": [None, 2, 3, 4],
    "bool": [True, False, True, None]
}),
...
```
> For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better. […] If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?
> Thanks so much~
^_^ I am glad to do this, but maybe a little later. I never thought about this before, honestly. I am so inspired by your scientific attitude.
ulist/python/ulist/ulist.pyi (Outdated)

```diff
- def read_csv() -> list: ...
+ def read_csv(path: str, schema: Sequence[Tuple[str, str]]) -> list: ...
```
To be more specific, the return type is `List[LIST_RS]`?
**tushushu** left a comment:
@yingmanwumen Thanks so much for the PR, I left some comments there. Happy Dragon Boat Festival~
Let me merge this PR and we can improve the benchmark in the future. @yingmanwumen
Hello,
`read_csv()` #151 is completed. I wrote a few simple tests in the directory `tests/test_csv`. There is a precision issue with `float32`; for example, … would be parsed to

```python
{
    "float32": 3.141590118408203,
    "float64": 3.14159
}
```

on my machine, and I don't know if the result would be different on other machines.
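For what it's worth, that value is exactly what you get when `3.14159` is stored as an IEEE-754 single-precision float, so the result should be the same on any machine with standard `float32`. A quick check with the stdlib `struct` module:

```python
import struct

def to_float32(x: float) -> float:
    # Round-trip a Python float (float64) through IEEE-754 single precision
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(3.14159))  # 3.141590118408203
```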