general principles of data manipulation for discussion #2509
Comments
It seems that what you propose is better suited to an extension package than to DataFrames.jl. For things related to your request we now have #2508 and #2417 open. They are meant to provide primitives on top of which the functionality you want can be built. If you have any low-level API that would complement these proposals, we can discuss it here. Otherwise I think we can close this issue, as it is better placed in packages that offer a high-level API for data manipulation.
Yeah, I think if these two issues cover this, then it's OK to close this.
Can we have a wrapper function that automatically preserves column names when we pass ordinary functions such as mean/mode/median to a dataframe, e.g. dfops(df, sum) -> df? Or is this already addressed by combine/select? This would be helpful in pipe operations, so that the output always has dataframe structure.
I am not sure what you mean exactly, but it seems you want |
oh yeah, that is handy. so if i can have a nice column filter, i can use this for summary |
Btw, there is mapcols but no maprows. If I do map(fn, eachrow(df)) or map(fn, eachcol(df)), it doesn't return a dataframe. It would be nice to have closed operations, where you map over a dataframe and the result is a dataframe.
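As a rough sketch of what such a row map could look like, assuming DataFrames.jl 1.x: `maprows` below is a hypothetical helper, not part of the package, and it assumes the mapped function returns a NamedTuple for every row. The second form uses the existing `select` with `ByRow` and `AsTable`, which should give the same result.

```julia
using DataFrames

# Hypothetical helper (NOT part of DataFrames.jl): apply `f` to every row and
# collect the returned NamedTuples into a new data frame.
maprows(f, df::AbstractDataFrame) = DataFrame(map(f, eachrow(df)))

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# Each row maps to a NamedTuple, so the output columns are named explicitly.
out = maprows(r -> (a = r.a, total = r.a + r.b), df)

# Same idea with existing primitives: pass each row as a NamedTuple and
# expand the returned NamedTuple into columns.
out2 = select(df, AsTable(:) => ByRow(r -> (a = r.a, total = r.a + r.b)) => AsTable)
```

Either way the result is a data frame, so it can be fed straight into the next step of a pipeline.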
Basically, row and column filters that return a dataframe, plus a generic map that returns a dataframe, should cover a lot of cases. You can have row-filter |> column-filter |> map-by-row-or-col |> col-filter |> summarize. We may not need macros in many cases if we have these operations.
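A minimal sketch of such a pipeline with functions that already exist in DataFrames.jl 1.x. The stage names `rowfilter`, `colfilter`, and `summarize` are made up for illustration; each one takes a data frame and returns a data frame.

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [10.0, 20.0, 30.0, 40.0],
               g = ["a", "b", "a", "b"])

# Illustrative stage names; every stage is DataFrame -> DataFrame.
rowfilter(d) = filter(:x => >(1.0), d)        # keep rows where x > 1
colfilter(d) = select(d, names(d, Real))      # keep numeric columns only
# renamecols = false keeps "x" and "y" instead of "x_mean" and "y_mean"
summarize(d) = combine(d, names(d) .=> mean; renamecols = false)

result = df |> rowfilter |> colfilter |> summarize
```

Here `result` is a one-row data frame whose columns are still called `x` and `y`, so further stages can be chained without any renaming step.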
it is:
we might have |
OK, it's just natural to expect that if there is mapcols, there is a maprows as its corresponding row-transform function.
I propose that all operations on a df return a dataframe, so that it's trivial to join or concatenate sub-results. If you operate on an integer, you should get an integer back; if you operate on a dataframe, you should get a dataframe back. I just realized I meant the closure property rather than closed operations. It's a bit like type stability: any operation on a dataframe should produce a dataframe, so that the succeeding filters in the pipeline can expect dataframe input and dataframe output as a consistent data interchange format.
Typically, we use dataframes because we like their support for different column types. However, typical data processing consists of filtering the rows and columns that satisfy certain constraints and then applying transformations, which may not preserve column names or dataframe structure.
Statistical work is a common case: one filters certain columns and applies stat/math operations, which forces a conversion to matrix form. The matrix does not preserve column names, so extra steps are needed to plug them back into a dataframe. If one is not careful, the slicing operations can leave the column names misaligned or out of sync when going from the matrix back to the dataframe.
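To make the contrast concrete, here is a small sketch assuming DataFrames.jl 1.x and a toy two-column frame. It centers each column once through a matrix, where the names have to be reattached by hand, and once with `mapcols`, where they stay attached.

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0], y = [10.0, 20.0, 30.0])

# Matrix detour: column names are dropped and must be reattached manually,
# in the same order that was used to build the matrix.
m = Matrix(df)
centered = m .- mean(m; dims = 1)
via_matrix = DataFrame(centered, names(df))

# Staying in the data frame: names travel with the columns.
direct = mapcols(c -> c .- mean(c), df)
```

Both routes give the same values here; the difference is that the second one leaves no room for the names and the data to drift apart.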
If we follow the Unix pipe principle, the input and output of every filter must be a dataframe. Unix uses grep to filter rows, cut to filter columns, and tr/sed/awk to transform the filtered rows and columns. For dataframes, we want filtering and math/stat operations to be closed over dataframes (meaning their output should be a dataframe that preserves column names).
Here are some typical column-oriented workflows. Assume df has date, numeric, and categorical columns, and has so many columns that enumerating them is tedious.
In a more complex workflow, we can filter out NA rows, filter out columns that are more than 50% NA, impute the remaining df, select the numeric columns and transform them, select the categorical columns and transform them, select the date columns and transform them, and concatenate the results in one line:
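A minimal sketch of such a workflow, assuming DataFrames.jl 1.x. All helper names (`keep_dense`, `impute_mean`, `numeric`, `categorical`, `dates`, `scale`, `upper`, `weekday`) and the toy data are hypothetical, and the order of the cleaning steps is only one reading of the description above: sparse columns are dropped first and the remaining missing values are imputed with the column mean.

```julia
using DataFrames, Statistics, Dates

df = DataFrame(
    when   = [Date(2021, 1, 1), Date(2021, 1, 2), Date(2021, 1, 3), Date(2021, 1, 4)],
    x      = [1.0, missing, 3.0, 5.0],
    sparse = [missing, missing, missing, 1.0],
    label  = ["a", "b", "a", "b"],
)

# Cleaning stages (hypothetical names), each DataFrame -> DataFrame.
keep_dense(d) = select(d, [n for n in names(d) if mean(ismissing, d[!, n]) <= 0.5])
impute_mean(d) = transform(d,
    names(d, Union{Missing, Real}) .=> (c -> coalesce.(c, mean(skipmissing(c))));
    renamecols = false)

# Column filters by element type.
numeric(d)     = select(d, names(d, Real))
categorical(d) = select(d, names(d, AbstractString))
dates(d)       = select(d, names(d, Date))

# Per-type transformations that keep the column names.
scale(d)   = mapcols(c -> c ./ maximum(c), d)
upper(d)   = mapcols(c -> uppercase.(c), d)
weekday(d) = mapcols(c -> dayofweek.(c), d)

clean = df |> keep_dense |> impute_mean
result = hcat(clean |> dates |> weekday, clean |> numeric |> scale, clean |> categorical |> upper)
```

Because every stage returns a data frame, the three typed sub-results can be concatenated with a single `hcat` call and no temporary name bookkeeping.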
Since each transformation outputs a dataframe, you can extract each part, transform it, and concatenate the results in one line. The operations are also easier to follow horizontally than vertically, because you can read them from left to right without creating temporary variables, which are prone to bugs and logical errors.