general principles of data manipulation for discussion #2509

Closed
@ppalmes

Description

Typically, we use dataframes because we like their support for different column types. However, typical data-processing operations filter rows and columns satisfying certain constraints and then apply transformations, which may not preserve column names or the dataframe structure.

Typical operations are statistical: one filters certain columns and applies stat/math operations, which forces the data into matrix form. The matrix does not preserve column names, so extra steps are needed to plug the results back into the dataframe. If one is not careful, the column names from the matrix may not align with the dataframe, or may fall out of sync, because of the slicing operations.
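The thread is about Julia dataframes, but the pitfall is language-agnostic; here is a minimal pandas sketch of the matrix round-trip, with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "label": ["x", "y"]})

# Dropping to a matrix for a stat operation loses the column names ...
num = df[["a", "b"]].to_numpy()
scaled = (num - num.mean(axis=0)) / num.std(axis=0)

# ... so they must be re-attached by hand; if the slice above changes,
# nothing stops `columns=` from silently going out of sync.
df_scaled = pd.DataFrame(scaled, columns=["a", "b"], index=df.index)
```
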

If we follow the Unix pipe principles, the input and output of any filter must be a dataframe. Unix uses grep to filter rows, cut to filter columns, and tr/sed/awk to transform the filtered rows/columns. With dataframes, we want the filtering and math/stat operations to be closure operations (meaning their output should be a dataframe that preserves column names).
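For concreteness, a pandas analogue of the grep/cut/sed trio, where every step consumes and produces a dataframe (the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0], "tag": ["a", "b", "a"]})

rows = df.query("x > 1")                # grep-like: filter rows, returns a dataframe
cols = df.select_dtypes("number")       # cut-like: filter columns, returns a dataframe
out  = cols.transform(lambda s: s * 2)  # sed/awk-like: transform, column names preserved
```

Because each step is closed over dataframes, they chain with `.pipe` (or method chaining) just as Unix filters chain with `|`.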

Here are some typical column-oriented workflows. Assume df has date, numeric, and categorical columns, and spans so many columns that enumerating them is tedious.

df |> filter-date-cols |> extract-day/hours/dayofweek
df |> filter-categorical-cols |> hot-encode
df |> filter-numeric-cols |> log/sqrt/scale or pca/ica/svd
df |> filter-numeric-cols |> summarize(mean, median, other stats)
df |> filter-numeric-cols |> filter-cols-with-missing |> summarize(freq of missing)
df |> filter-cols-with-missing |> summarize(freq)
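Sketching a few of these workflows in pandas for illustration (the proposal itself targets Julia dataframes; column names and data here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01 08:00", "2021-01-02 13:30"]),
    "amount": [10.0, 100.0],
    "kind": ["a", "b"],
})

# filter-date-cols |> extract-dayofweek
days = df.select_dtypes("datetime").apply(lambda s: s.dt.dayofweek)

# filter-categorical-cols |> hot-encode
hot = pd.get_dummies(df.select_dtypes("object"))

# filter-numeric-cols |> log
logs = np.log(df.select_dtypes("number"))

# filter-numeric-cols |> summarize(mean, median)
stats = df.select_dtypes("number").agg(["mean", "median"])

# filter-cols-with-missing |> summarize(freq) — kept as a dataframe
na_freq = df.isna().mean().to_frame("na_freq")
```

Every intermediate here is still a dataframe with column names intact, which is the closure property argued for above.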

In a more complex workflow, we can filter out NA rows, filter out columns with more than 50% NAs, impute the remaining dataframe, filter the numeric columns and transform them, filter the categorical columns and transform them, filter the date columns and transform them, and concatenate the results in one line:

df |> row-NA-rm |> col-NA-rm |>
    ( (filter-numeric |> scale) +  
       (filter-date |> extract-hour) + 
       (filter-cat |> hotbit-encode)
     ) |> CSV.write("training.csv")

Since each transformation outputs a dataframe, you can extract each part, transform it, and concatenate the results in one line. It also becomes easier to read the operations horizontally rather than vertically, because you can read them from left to right without creating temporary variables, which are prone to bugs and logical errors.
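Under the same assumptions as the earlier sketches (pandas rather than Julia, and a hypothetical helper name), the one-line workflow above could look like:

```python
import pandas as pd

def concat_transforms(df: pd.DataFrame) -> pd.DataFrame:
    # row-NA-rm |> col-NA-rm: drop all-NA rows, then columns with > 50% NAs.
    clean = df.dropna(axis=0, how="all").loc[:, df.isna().mean() <= 0.5]
    # Per-type branches, each still a dataframe, concatenated side by side.
    numeric = clean.select_dtypes("number").apply(lambda s: (s - s.mean()) / s.std())
    dates = clean.select_dtypes("datetime").apply(lambda s: s.dt.hour)
    cats = pd.get_dummies(clean.select_dtypes("object"))
    return pd.concat([numeric, dates, cats], axis=1)
```

With this helper, the pipeline reads `df.pipe(concat_transforms).to_csv("training.csv")`, mirroring the `|> CSV.write("training.csv")` ending above.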
