Support Functors as Functions in columns transformation #2984

Open
jeremiedb opened this issue Jan 7, 2022 · 6 comments

@jeremiedb

This issue relates to the transformation dispatch mechanism, which doesn't recognize Functors as Functions, as discussed on Discourse.

I have a use case where I use Functors as pre-trained features transformations. In such context, defining those structs as sub-types of Function doesn’t seem a natural choice as a system.

Here’s a functor that applies learned normalization:

using DataFrames
using Statistics: mean, std

struct Normalizer
    μ
    σ
end

Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

function (m::Normalizer)(x::Real)
    return (x - m.μ) / m.σ
end

function (m::Normalizer)(x::AbstractVector)
    return (x .- m.μ) ./ m.σ
end

df = DataFrame(:v1 => rand(5), :v2 => rand(5))
feat_names = names(df)
norms = map((feat) -> Normalizer(df[:, feat]), feat_names)

The following doesn’t work:

transform(df, feat_names .=> norms .=> feat_names)
ERROR: LoadError: ArgumentError: Unrecognized column selector: "v1" => (Normalizer(0.5407170762469404, 0.1599492895436335) => "v1")

However, somewhat surprisingly, using ByRow does work:

transform(df, feat_names .=> ByRow.(norms) .=> feat_names)
5×2 DataFrame
 Row │ v1          v2
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

So to use the vectorized form, it seems like a mapping of the Functors into Functions is required:

norms_f = map(f -> (x) -> f(x), norms)
transform(df, feat_names .=> norms_f .=> feat_names)
5×2 DataFrame
 Row │ v1          v2
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

I can see that there’s a not-too-complicated way to circumvent the functor limitation through that remapping. Yet, isn’t it counterintuitive that the Functor works with ByRow but not in the vectorized case? Although dispatch happens differently under ByRow, from a user perspective, recognizing Functors as Functions in transform would be the most natural way to handle them, in my opinion.

bkamins added this to the 1.x milestone Jan 7, 2022
@bkamins
Member

bkamins commented Jan 7, 2022

The challenge is that we already have quite a complex system of rules for how these transformations are interpreted; see:

julia> using DataFrames

julia> methods(DataFrames.normalize_selection)
# 14 methods for generic function "normalize_selection":
[1] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{AbstractString, Signed, Symbol, Unsigned}, renamecols::Bool)
[2] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), <:AbstractString}, renamecols::Bool)
[3] normalize_selection(idx::DataFrames.AbstractIndex, sel::Colon, renamecols::Bool)
[4] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), Symbol}, renamecols::Bool)
[5] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Pair{<:Union{Function, Type}, <:Union{AbstractString, Symbol}}}, renamecols::Bool)
[6] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:AbstractString}, renamecols::Bool)
[7] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, Symbol}, renamecols::Bool)
[8] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{Function, Type}}, renamecols::Bool)
[9] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{AbstractVector{Symbol}, AbstractVector{<:AbstractString}}}, renamecols::Bool)
[10] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Pair{<:Union{Function, Type}, <:Union{AbstractVector{Symbol}, AbstractString, DataType, Function, Symbol, AbstractVector{<:AbstractString}}}}, renamecols::Bool)
[11] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Union{Function, Type}}, renamecols::Bool)
[12] normalize_selection(idx::DataFrames.AbstractIndex, sel::typeof(nrow), renamecols::Bool)
[13] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{Function, Type}, renamecols::Bool)
[14] normalize_selection(idx::DataFrames.AbstractIndex, sel, renamecols::Bool)

and it is quite tricky to mess with them. I will think of what can be done here.

@nalimilan - do you have any opinion here?

@nalimilan
Member

I'm also hesitant in general to accept objects of any type as it can create ambiguities, but I have to admit not supporting non-Function functors is a bit annoying. In theory, we could consider that any type which isn't known to be an index is a function or functor, right? The main risk would be if some types can be both, but that's not too likely hopefully.

> I have a use case where I use Functors as pre-trained features transformations. In such context, defining those structs as sub-types of Function doesn’t seem a natural choice as a system.

@jeremiedb "Natural" is very hard to define. Any particular reason why you wouldn't want your functors to inherit from Function? The reasons I can see are 1) you cannot inherit from two different types, and 2) by default, the compiler only specializes on Function arguments when they are called (i.e. accessing fields is not enough), though you can force specialization by having a type parameter.
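
Point 2) can be sketched as below. This reads "having a type parameter" as a method type parameter, which is an assumption about what the comment means; it is the pattern the Julia manual's performance tips describe for Function arguments. Scale, apply_passthrough and apply_specialized are hypothetical names used only for illustration.

```julia
# A functor that subtypes Function, so the compiler's specialization
# heuristic for Function arguments applies to it.
struct Scale{T<:Real} <: Function
    σ::T
end

# Make instances callable on a single number.
(s::Scale)(x::Real) = x / s.σ

# `f` is only passed through to `map`, never called here, so Julia
# may choose not to specialize this method on typeof(f).
apply_passthrough(f, xs) = map(f, xs)

# The method type parameter `F` forces specialization on typeof(f).
apply_specialized(f::F, xs) where {F} = map(f, xs)

xs = [2.0, 4.0, 6.0]
s = Scale(2.0)

a = apply_passthrough(s, xs)
b = apply_specialized(s, xs)
```

Both calls return the same values; the difference is only in how the compiler may treat the method, which matters for performance rather than correctness.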

@jeremiedb
Author

> "Natural" is very hard to define.

Sorry for the vague wording. I had 1) in mind, that is having a type hierarchy of transformation functions such as:

abstract type Projector end

struct Normalizer <: Projector
    μ
    σ
end

struct Quantilizer <: Projector
    quantiles
end

@bkamins
Member

bkamins commented Jan 8, 2022

Yes, but I assume that @nalimilan wants to understand why not have Projector <: Function?

@jeremiedb
Author

Oh, I just didn't realize it could make sense! But you're right: by doing abstract type Projector <: Function end, it works.
Defining the Functors as subtypes of Function is a minimal modification, so it seems like a legitimate trick; perhaps it just needs some disclaimer somewhere.
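
For reference, the change described in this comment can be sketched as follows, reusing the Normalizer definition from the original post. This is a minimal sketch: the Float64 field types and the data values are assumptions made for illustration and are not taken from the thread.

```julia
using DataFrames
using Statistics: mean, std

# Subtyping Function signals through the type hierarchy that instances
# are callable, which is what the transformation dispatch checks.
abstract type Projector <: Function end

struct Normalizer <: Projector
    μ::Float64
    σ::Float64
end

# Fit the normalizer on a column.
Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

# Vectorized call: normalize a whole column at once.
(m::Normalizer)(x::AbstractVector) = (x .- m.μ) ./ m.σ

# Illustrative data (made up for this sketch).
df = DataFrame(:v1 => [1.0, 2.0, 3.0, 4.0], :v2 => [10.0, 20.0, 30.0, 40.0])
feat_names = names(df)
norms = map(feat -> Normalizer(df[:, feat]), feat_names)

# The functors can now be passed directly, without wrapping them
# in anonymous functions.
out = transform(df, feat_names .=> norms .=> feat_names)
```

Each output column should then have mean close to 0 and standard deviation close to 1, since every Normalizer is applied to the column it was fitted on.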

@bkamins
Member

bkamins commented Jan 8, 2022

> so it seems like a legitimate trick

For me (and I guess also @nalimilan) this is natural. Then you, through type hierarchy, signal that your object is callable.

Note that this is not a unique feature of DataFrames.jl. Actually, 143 methods in Base Julia rely on the fact that some object is callable, e.g. (to quote some common ones) replace! and findfirst (and similar).
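
The same behaviour can be observed with findfirst from Base alone. The sketch below uses two hypothetical predicate functors, PlainThreshold and FnThreshold, which differ only in whether they subtype Function; it assumes the predicate form of findfirst is declared for Function arguments, as in current Julia 1.x releases.

```julia
# Callable struct that does NOT subtype Function.
struct PlainThreshold
    t::Float64
end
(p::PlainThreshold)(x) = x > p.t

# Same functor, but declared callable through the type hierarchy.
struct FnThreshold <: Function
    t::Float64
end
(p::FnThreshold)(x) = x > p.t

xs = [0.1, 0.4, 0.7, 0.9]

# Accepted: the predicate method of findfirst takes a Function.
idx = findfirst(FnThreshold(0.5), xs)

# No findfirst method matches a plain callable struct.
plain_ok = applicable(findfirst, PlainThreshold(0.5), xs)

# Wrapping in an anonymous function is the usual workaround.
plain = PlainThreshold(0.5)
idx_wrapped = findfirst(x -> plain(x), xs)
```

As with transform, the plain functor only works once it is wrapped in an anonymous function or its type is made a subtype of Function.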
