general principles of data manipulation for discussion #2509

Closed
@ppalmes

Description

Typically, we use dataframes because we like their support for different column types. However, typical data-processing operations filter rows and columns satisfying certain constraints and then apply transformations, which may not preserve column names or the dataframe structure.

Typical operations are statistical: one filters certain columns and applies stat/math operations, which forces the data into matrix form. The matrix does not preserve column names, so extra steps are needed to plug the results back into the dataframe. If one is not careful, the column names from the matrix may not align with the dataframe, or may fall out of sync, because of the slicing operations.
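The thread is about Julia dataframes, but the pitfall is language-agnostic; here is a minimal pandas sketch of the matrix round-trip, with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "label": ["x", "y"]})

# Dropping to a matrix for a stat operation loses the column names ...
num = df[["a", "b"]].to_numpy()
scaled = (num - num.mean(axis=0)) / num.std(axis=0)

# ... so they must be re-attached by hand; if the slice above changes,
# nothing stops `columns=` from silently going out of sync.
df_scaled = pd.DataFrame(scaled, columns=["a", "b"], index=df.index)
```
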

If we follow the Unix pipe principles, the input and output of any filter must be a dataframe. Unix uses grep to filter rows, cut to filter columns, and tr/sed/awk to transform the filtered rows/columns. With dataframes, we want the filtering and math/stat operations to be closure operations (meaning their output should be a dataframe that preserves column names).
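For concreteness, a pandas analogue of the grep/cut/sed trio, where every step consumes and produces a dataframe (the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0], "tag": ["a", "b", "a"]})

rows = df.query("x > 1")                # grep-like: filter rows, returns a dataframe
cols = df.select_dtypes("number")       # cut-like: filter columns, returns a dataframe
out  = cols.transform(lambda s: s * 2)  # sed/awk-like: transform, column names preserved
```

Because each step is closed over dataframes, they chain with `.pipe` (or method chaining) just as Unix filters chain with `|`.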

Here are some typical column-oriented workflows. Assume df has date, numeric, and categorical columns, and spans so many columns that enumerating them is tedious.

df |> filter-date-cols |> extract-day/hours/dayofweek
df |> filter-categorical-cols |> hot-encode
df |> filter-numeric-cols |> log/sqrt/scale or pca/ica/svd
df |> filter-numeric-cols |> summarize(mean, median, other stats)
df |> filter-numeric-cols |> filter-cols-with-missing |> summarize(freq of missing)
df |> filter-cols-with-missing |> summarize(freq)
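Sketching a few of these workflows in pandas for illustration (the proposal itself targets Julia dataframes; column names and data here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01 08:00", "2021-01-02 13:30"]),
    "amount": [10.0, 100.0],
    "kind": ["a", "b"],
})

# filter-date-cols |> extract-dayofweek
days = df.select_dtypes("datetime").apply(lambda s: s.dt.dayofweek)

# filter-categorical-cols |> hot-encode
hot = pd.get_dummies(df.select_dtypes("object"))

# filter-numeric-cols |> log
logs = np.log(df.select_dtypes("number"))

# filter-numeric-cols |> summarize(mean, median)
stats = df.select_dtypes("number").agg(["mean", "median"])

# filter-cols-with-missing |> summarize(freq) — kept as a dataframe
na_freq = df.isna().mean().to_frame("na_freq")
```

Every intermediate here is still a dataframe with column names intact, which is the closure property argued for above.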

In a more complex workflow, we can filter out NA rows, filter out columns with more than 50% NAs, impute the remaining dataframe, filter the numeric columns and transform them, filter the categorical columns and transform them, filter the date columns and transform them, and concatenate the results in one line:

df |> row-NA-rm |> col-NA-rm |>
    ( (filter-numeric |> scale) +  
       (filter-date |> extract-hour) + 
       (filter-cat |> hotbit-encode)
     ) |> CSV.write("training.csv")

Since each transformation outputs a dataframe, you can extract each part, transform it, and concatenate the results in one line. It also becomes easier to read the operations horizontally rather than vertically, because you can read them from left to right without creating temporary variables, which are prone to bugs and logical errors.
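Under the same assumptions as the earlier sketches (pandas rather than Julia, and a hypothetical helper name), the one-line workflow above could look like:

```python
import pandas as pd

def concat_transforms(df: pd.DataFrame) -> pd.DataFrame:
    # row-NA-rm |> col-NA-rm: drop all-NA rows, then columns with > 50% NAs.
    clean = df.dropna(axis=0, how="all").loc[:, df.isna().mean() <= 0.5]
    # Per-type branches, each still a dataframe, concatenated side by side.
    numeric = clean.select_dtypes("number").apply(lambda s: (s - s.mean()) / s.std())
    dates = clean.select_dtypes("datetime").apply(lambda s: s.dt.hour)
    cats = pd.get_dummies(clean.select_dtypes("object"))
    return pd.concat([numeric, dates, cats], axis=1)
```

With this helper, the pipeline reads `df.pipe(concat_transforms).to_csv("training.csv")`, mirroring the `|> CSV.write("training.csv")` ending above.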
