Redefining model parameter compute #187

Open
@nicrie

Right now, setting compute=True triggers a sequential computation of the following steps:

  1. the preprocessor scaler
  2. optional NaN checks
  3. SVD
  4. scores and components
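
For illustration, the sequential behaviour listed above can be sketched with plain `dask.array`. This is a minimal sketch under stated assumptions, not the library's actual implementation: the `fit` function, its `n_modes` argument, and the random test data are all hypothetical.

```python
import dask
import dask.array as da
import numpy as np


def fit(X, n_modes, compute):
    """Hypothetical fit routine; compute=True forces each step in turn."""
    # 1. preprocessor scaler
    mean, std = X.mean(axis=0), X.std(axis=0)
    if compute:
        mean, std = dask.compute(mean, std)
    Xs = (X - mean) / std

    # 2. optional NaN check (can only raise early when computed eagerly)
    has_nan = da.isnan(Xs).any()
    if compute and has_nan.compute():
        raise ValueError("input contains NaNs")

    # 3. SVD (dask's tall-and-skinny SVD; needs a single column chunk)
    U, s, Vt = da.linalg.svd(Xs)
    if compute:
        U, s, Vt = dask.compute(U, s, Vt)

    # 4. scores and components
    scores = U[:, :n_modes] * s[:n_modes]
    components = Vt[:n_modes]
    return scores, components


rng = np.random.default_rng(0)
X = da.from_array(rng.standard_normal((1000, 20)), chunks=(250, 20))

# Eager: three separate compute calls, each re-reading the scaled data.
eager_scores, eager_components = fit(X, n_modes=3, compute=True)

# Lazy: one graph, one compute call at the very end.
lazy_scores, lazy_components = dask.compute(*fit(X, n_modes=3, compute=False))
```

In the eager branch, the NaN check and the SVD each rebuild the scaled array from `X`, which is one plausible reason the fully lazy path can come out faster.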

I’m starting to doubt the usefulness of the compute parameter in its current form. When @slevang managed to implement fully lazy evaluation, I thought it made sense to keep this parameter, but now I'm not so sure. My experience has been that it’s almost always faster to let Dask handle everything and find the most efficient way to compute, rather than forcing intermediate steps with compute=True. I plan to run some benchmarks when I get a chance to confirm this.

The only real benefit I see for this parameter might be in cross-decomposition models (like CCA, MCA, etc.), where PCA preprocessing is done on the individual datasets to make subsequent steps more computationally feasible. Often, this results in a reduced PC space that fits into memory, so continuing with eager computation for the SVD, components, scores, etc., makes sense. Another advantage could be when defining the number of PCs based on explained variance rather than a fixed number, since this requires computing the individual PCA models upfront.
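
To see why an explained-variance threshold forces an upfront computation, here is a small NumPy-only sketch. The helper name `n_pcs_for_variance`, the 90% threshold, and the synthetic data are assumptions made for illustration. The number of PCs depends on the singular values themselves, so it cannot be known before the PCA has been evaluated.

```python
import numpy as np


def n_pcs_for_variance(X, threshold=0.9):
    """Smallest number of PCs whose cumulative explained variance
    reaches `threshold` (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values only
    explained = s**2 / np.sum(s**2)           # variance fraction per PC
    cumulative = np.cumsum(explained)
    # First index where the cumulative fraction reaches the threshold.
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(s))                     # guard against round-off


rng = np.random.default_rng(0)
# Synthetic data: two dominant directions, four weak ones.
scales = np.array([10.0, 5.0, 1.0, 0.1, 0.1, 0.1])
X = rng.standard_normal((200, 6)) * scales

n_pcs = n_pcs_for_variance(X, threshold=0.9)
```

With a lazy array, the output shape of the truncated PCs would be unknown until this value is available, which is what makes an eager step natural here.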

So, I’m thinking it might be more practical to redefine the compute parameter as something like compute_preprocessor. This would compute the following in one go, using Dask’s efficiency:

  1. the preprocessor scaler
  2. optional NaN checks
  3. PCA preprocessing

From there, we can continue with eager computation for the SVD, scores, and components.
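
A rough sketch of what this proposal could look like for an MCA-style model follows. It assumes two datasets, a fixed number of PCs, and a hypothetical `preprocess` helper; none of these names come from the existing code base. All preprocessing is handed to Dask in a single `dask.compute` call, and everything afterwards runs eagerly in NumPy on the reduced PC space.

```python
import dask
import dask.array as da
import numpy as np


def preprocess(X, n_pcs):
    """Build the lazy preprocessing graph for one dataset (hypothetical)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. preprocessor scaler
    has_nan = da.isnan(Xs).any()                # 2. optional NaN check
    U, s, _ = da.linalg.svd(Xs)                 # 3. PCA preprocessing
    pcs = U[:, :n_pcs] * s[:n_pcs]
    return pcs, has_nan


rng = np.random.default_rng(0)
X1 = da.from_array(rng.standard_normal((500, 30)), chunks=(125, 30))
X2 = da.from_array(rng.standard_normal((500, 40)), chunks=(125, 40))

lazy_pcs1, lazy_nan1 = preprocess(X1, n_pcs=5)
lazy_pcs2, lazy_nan2 = preprocess(X2, n_pcs=5)

# The "compute_preprocessor" step: one call, so Dask can schedule the
# preprocessing of both datasets together and share intermediates.
pcs1, nan1, pcs2, nan2 = dask.compute(lazy_pcs1, lazy_nan1, lazy_pcs2, lazy_nan2)
if nan1 or nan2:
    raise ValueError("input contains NaNs")

# From here on everything is small and in memory: eager NumPy.
cross_cov = pcs1.T @ pcs2 / (pcs1.shape[0] - 1)
U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
```

The cross-covariance matrix is only 5 x 5 in this example, so keeping the final SVD lazy would add scheduling overhead without any memory benefit.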

I’d love to hear your thoughts on this, @slevang, since you probably have much more experience with Dask’s magic! :)

Labels: design (Internal code structure)
