Penalize solutions without the full set of predictors #1029

albertbuchard · 2025-09-08T21:00:08Z

albertbuchard
Sep 8, 2025

Hi,

I’m trying to implement an efficient way to penalize solutions according to the number of predictors they leave out. The goal is to encourage solutions that make use of all predictors. My buddy Chat and I came up with the following approach:

# Skip penalty if no real predictors
apply_penalty = penalize_absent_features and (X.shape[1] > 0)

if apply_penalty:
    coeff = f"{penalty_coeff:.6g}" if isinstance(penalty_coeff, (int, float)) else str(penalty_coeff)

    penalty_code = lambda total_vars, coeff: fr"""
feature_absent_penalty(ex, dataset, options) = begin
    # Base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return Inf
    end
    base = sum((pred .- dataset.y).^2) / dataset.n

    # Count distinct variable leaves
    used = Set{{Int}}()
    function walk(n)
        if n.degree == 0
            if !n.constant
                push!(used, Int(n.feature))
            end
        elseif n.degree == 1
            walk(n.l)
        else
            walk(n.l); walk(n.r)
        end
    end
    walk(ex.tree)

    missing = max(0, {total_vars} - length(used))
    return base + {coeff} * missing
end
"""

This works, but I lose about an order of magnitude in iteration speed when running on the full dataset (from ~1e6 it/s down to ~1e5 it/s).

I don’t know the internals of the codebase, but I looked through DynamicExpressions.jl and couldn’t find any precomputed property that tracks the number of variables actually used in the final expression. Is there such a magic feature, function or property that I could use instead of traversing the tree at every evaluation?

I’d be very open to suggestions or pointers—thanks!

MilesCranmer · 2025-09-08T21:38:23Z

MilesCranmer
Sep 8, 2025
Maintainer

This is likely the top FAQ at this point; I really need to make an FAQ page :)

Check out some of the discussion here: #273 (comment)

I think there's a discussion somewhere here where I give a fast loss function for this exact purpose, but I can't find it at the moment. If you Google around and find it, can you link it here?

1 reply

albertbuchard Sep 9, 2025
Author

Thank you for getting back to me so quickly!

Is this the one you had in mind?
#594 (comment)

So the idea is: loop over all nodes in the tree, extract node.feature, collect them into a used_features set, and then compute

missing = all_features - used_features

Would that be faster than the walk I’m currently doing? It should be O(N) as well, I think.
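
For reference, the single-pass idea described above can be sketched in Python on a toy tree. The `Node` class here is a hypothetical stand-in, not the DynamicExpressions.jl type; only the field names (`degree`, `constant`, `feature`, `l`, `r`) are borrowed from the Julia snippet earlier in this thread. Both this loop and a recursive walk visit each node exactly once, so any difference in speed would have to come from constant factors rather than from the asymptotic cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an expression-tree node (not the real Julia type)."""
    degree: int                  # 0 = leaf, 1 = unary operator, 2 = binary operator
    constant: bool = False       # only meaningful for leaves
    feature: int = 0             # 1-based feature index, only meaningful for variable leaves
    l: Optional["Node"] = None   # left (or only) child
    r: Optional["Node"] = None   # right child, binary operators only


def used_features(root: Node) -> set:
    """Collect the distinct feature indices in one O(N) pass, using an explicit stack."""
    used = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.degree == 0:
            if not node.constant:
                used.add(node.feature)
        else:
            stack.append(node.l)
            if node.degree == 2:
                stack.append(node.r)
    return used


def absent_feature_penalty(root: Node, total_vars: int, coeff: float) -> float:
    """Penalty term only: coeff times the number of features that never appear."""
    return coeff * max(0, total_vars - len(used_features(root)))


# Toy expression (x1 + 2.0) * cos(x1), with three features available in total.
x1 = Node(degree=0, feature=1)
two = Node(degree=0, constant=True)
tree = Node(degree=2, l=Node(degree=2, l=x1, r=two), r=Node(degree=1, l=x1))
```

With this toy tree only feature 1 is used, so two of the three features are missing and the penalty is `2 * coeff`.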

MilesCranmer · 2025-09-08T21:48:47Z

MilesCranmer
Sep 8, 2025
Maintainer

Actually, I think your loss function looks okay. I think it's slow because of the recursive closure, which forces the compiler into dynamic dispatch (it can't compile the function to specialized code, so you get Python-level speeds).

I would do

function feature_absent_penalty(ex, dataset::Dataset{T,L}, options) where {T,L}
    # Base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum(i -> (pred[i] - dataset.y[i])^2, eachindex(pred)) / dataset.n

    # Count distinct variable leaves
    total_vars = 8 # TODO: set to the actual number of features
    used = sizehint!(Set{Int}(), total_vars)
    foreach(ex) do node  # faster version of 'for node in ex'
        if node.degree == 0 && !node.constant
            push!(used, node.feature)
        end
    end

    miss = max(0, total_vars - length(used))
    return L(base + coeff * miss)
end

9 replies

MilesCranmer Sep 9, 2025
Maintainer

Oh. You should do ex.tree first. I didn't realise you were using loss_function_expression

albertbuchard Sep 9, 2025
Author

> Oh. You should do ex.tree first. I didn't realise you were using loss_function_expression

Awesome! That is a 30% speedup :)

MilesCranmer Sep 10, 2025
Maintainer

For more speedup I wonder if this might also help:

function feature_absent_penalty(ex, dataset::Dataset{T,L}, options) where {T,L}
    # Base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum(i -> @inbounds((pred[i] - dataset.y[i])^2), eachindex(pred)) / dataset.n

    # Count distinct variable leaves
    total_vars = 8 # TODO: set to the actual number of features
    used = zeros(Bool, total_vars)
    foreach(ex.tree) do node
        if node.degree == 0 && !node.constant
            @inbounds used[node.feature] = true
        end
    end

    miss = total_vars - count(used)
    return L(base + coeff * miss)
end
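
The flag-array idea in the snippet above can be illustrated with a small Python sketch. `leaf_features` is a hypothetical flattened list of the feature indices found at the variable leaves, not something DynamicExpressions.jl provides. Two details are worth spelling out. First, the Julia version writes with `@inbounds`, so a `total_vars` smaller than the largest feature index would write out of bounds; the sketch makes that range check explicit. Second, the number of set flags can never exceed `total_vars`, so the `max(0, ...)` guard is not needed in this variant.

```python
def count_missing(leaf_features, total_vars):
    """Count features that never appear, using a fixed-size flag list instead of a set.

    leaf_features: iterable of 1-based feature indices, one per variable leaf
                   (hypothetical input, used here only for illustration).
    total_vars:    number of features in the dataset.
    """
    used = [False] * total_vars
    for feature in leaf_features:
        if not 1 <= feature <= total_vars:
            raise ValueError(f"feature index {feature} is outside 1..{total_vars}")
        used[feature - 1] = True  # shift the 1-based Julia index to a 0-based Python index
    return total_vars - sum(used)
```

For example, `count_missing([1, 1, 3], 4)` sees features 1 and 3, so features 2 and 4 are counted as missing.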

albertbuchard Sep 10, 2025
Author

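Thanks!
V2 is actually slightly faster than V3. In terms of speed: V2 > V3 >> V1, where V1 is my original recursive walk, V2 is the Set-based foreach, and V3 is the Bool-array version.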

    if apply_penalty:
        coeff = f"{penalty_coeff:.6g}" if isinstance(penalty_coeff, (int, float)) else str(penalty_coeff)

        penalty_code = lambda total_vars, coeff: fr"""
function feature_absent_penalty(ex, dataset::Dataset{{T,L}}, options) where {{T,L}}
    # base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum((pred .- dataset.y).^2) / dataset.n

    # count distinct variable leaves
    used = Set{{Int}}()
    function walk(n)
        if n.degree == 0
            if !n.constant
                push!(used, Int(n.feature))
            end
        elseif n.degree == 1
            walk(n.l)
        else
            walk(n.l); walk(n.r)
        end
    end
    walk(ex.tree)

    missing = max(0, {total_vars} - length(used))
    return L(base + {coeff} * missing)
end
"""
        penalty_code_v2 = lambda total_vars, coeff: fr"""
function feature_absent_penalty(ex, dataset::Dataset{{T,L}}, options) where {{T,L}}
    # Base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum(i -> (pred[i] - dataset.y[i])^2, eachindex(pred)) / dataset.n

    # Count distinct variables
    total_vars = {total_vars}
    used = sizehint!(Set{{Int}}(), total_vars)
    foreach(ex.tree) do node  # faster version of 'for node in ex.tree'
        if node.degree == 0 && !node.constant
            push!(used, node.feature)
        end
    end

    miss = max(0, total_vars - length(used))
    return L(base + {coeff} * miss)
end
"""

        penalty_code_v3 = lambda total_vars, coeff: fr"""
function feature_absent_penalty(ex, dataset::Dataset{{T,L}}, options) where {{T,L}}
    # Base MSE
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum(i -> @inbounds((pred[i] - dataset.y[i])^2), eachindex(pred)) / dataset.n

    # Count distinct variable leaves
    total_vars = {total_vars}
    used = zeros(Bool, total_vars)
    foreach(ex.tree) do node
        if node.degree == 0 && !node.constant
            @inbounds used[node.feature] = true
        end
    end

    miss = max(0, total_vars - count(used))
    return L(base + {coeff} * miss)
end"""
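
As a quick sanity check on templates like these, here is a minimal, self-contained sketch of how one of them renders. The helper names `format_coeff` and `render_penalty` are hypothetical, and the Julia body is the V2 variant from above. The doubled braces in the f-string collapse to single braces, so `Dataset{{T,L}}` becomes `Dataset{T,L}` in the generated Julia code. The last comment shows how the string might be passed on; it assumes the `loss_function_expression` option mentioned earlier in the thread and is left commented out because it requires PySR.

```python
def format_coeff(penalty_coeff):
    """Format a numeric coefficient compactly; pass anything else through as a string."""
    if isinstance(penalty_coeff, (int, float)):
        return f"{penalty_coeff:.6g}"
    return str(penalty_coeff)


def render_penalty(total_vars, penalty_coeff):
    """Render the Julia loss as a string (hypothetical helper, V2 body from this thread)."""
    coeff = format_coeff(penalty_coeff)
    return fr"""
function feature_absent_penalty(ex, dataset::Dataset{{T,L}}, options) where {{T,L}}
    pred, ok = eval_tree_array(ex, dataset.X, options)
    if !ok
        return L(Inf)
    end
    base = sum(i -> (pred[i] - dataset.y[i])^2, eachindex(pred)) / dataset.n

    total_vars = {total_vars}
    used = sizehint!(Set{{Int}}(), total_vars)
    foreach(ex.tree) do node
        if node.degree == 0 && !node.constant
            push!(used, node.feature)
        end
    end

    miss = max(0, total_vars - length(used))
    return L(base + {coeff} * miss)
end
"""


julia_code = render_penalty(total_vars=5, penalty_coeff=0.1)

# Assumed usage, not executed here:
# from pysr import PySRRegressor
# model = PySRRegressor(loss_function_expression=julia_code)
```

Printing `julia_code` before handing it over is a cheap way to confirm that the braces and the substituted values came out as intended.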

MilesCranmer Sep 10, 2025
Maintainer

Interesting!
