Support Functors as Functions in columns transformation #2984

Open
@jeremiedb

This issue relates to the transformation dispatch mechanism, which doesn't recognize functors as Functions, as discussed on Discourse.

I have a use case where I use functors as pre-trained feature transformations. In that context, defining those structs as subtypes of Function doesn't seem like a natural design choice.

Here’s a functor that applies learned normalization:

using DataFrames
using Statistics: mean, std

struct Normalizer
    μ
    σ
end

Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

function (m::Normalizer)(x::Real)
    return (x - m.μ) / m.σ
end

function (m::Normalizer)(x::AbstractVector)
    return (x .- m.μ) ./ m.σ
end

df = DataFrame(:v1 => rand(5), :v2 => rand(5))
feat_names = names(df)
norms = map((feat) -> Normalizer(df[:, feat]), feat_names)

The following doesn’t work:

transform(df, feat_names .=> norms .=> feat_names)
ERROR: LoadError: ArgumentError: Unrecognized column selector: "v1" => (Normalizer(0.5407170762469404, 0.1599492895436335) => "v1")

However, somewhat surprisingly, using ByRow does work:

transform(df, feat_names .=> ByRow.(norms) .=> feat_names)
5×2 DataFrame
 Row │ v1          v2
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

So to use the vectorized form, it seems the functors have to be wrapped in anonymous functions first:

norms_f = map(f -> (x) -> f(x), norms)
transform(df, feat_names .=> norms_f .=> feat_names)
5×2 DataFrame
 Row │ v1          v2
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

I can see that this remapping is a fairly simple way to work around the functor limitation. Yet isn't it counterintuitive that a functor works under ByRow but not in the vectorized case? Although dispatch happens differently under ByRow, from a user's perspective, having transform recognize functors as Functions would be the most natural way to handle them, in my opinion.
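The type distinction that seems to be at play can be sketched without DataFrames. This assumes that the selector dispatch only accepts values that are `Function` or `Base.Callable`, which is my reading of the error and is not confirmed here. The sketch uses only Base and Statistics, and `wrapped` is a name introduced for illustration:

```julia
using Statistics: mean, std

struct Normalizer
    μ
    σ
end

Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

# Vectorized call overload, same as in the example above
(m::Normalizer)(x::AbstractVector) = (x .- m.μ) ./ m.σ

x = [1.0, 2.0, 3.0]
n = Normalizer(x)

# A callable struct is not a subtype of Function unless declared as one
n isa Function          # false
n isa Base.Callable     # false (Base.Callable is Union{Function, Type})

# Wrapping it in a closure yields a Function with identical behavior
wrapped = v -> n(v)
wrapped isa Function    # true
wrapped(x) == n(x)      # true
```

If that assumption holds, it would explain why both the closure remapping and the ByRow wrapper get through while the bare functor does not.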
