Scaling PySR to very large and high-dimensional datasets #774
-
Hi there! First of all, I love this effort! Very well done to all contributors. Now for the question: I want to experiment with symbolic regression, but in a fairly low signal-to-noise-ratio environment with millions of samples and a fairly high-dimensional feature space (hundreds of features). I've seen many discussions where ~1,000 samples seem to be the maximum, with batch processing suggested beyond that. With this in mind, is it even feasible to apply symbolic regression to my application, and if so, how? Looking forward to hearing your thoughts!
Replies: 1 comment 1 reply
-
For these settings I like to fit another machine learning model to the dataset first and treat its predictions as “denoised”, because it will effectively average out the noise. This is actually what the `denoise` and `Xresampled` options do! However, those options use a Gaussian process as the secondary machine learning model, and Gaussian processes take O(N^3) compute for a dataset with N points. So for this problem I would try something like a neural network or even XGBoost. Then, take a grid of input features, evaluate the model over that grid, and that becomes your “denoised” y vector that you can feed to PySR.

The high-dimensional feature space is a bit trickier, since it’s not a good setting for symbolic regression (an equation that uses all 100 features would be extremely complex). So you could either try the new TemplateExpressionSpec to impose some aggregation over the features if there is something that makes sense (see the examples page in the docs), or just use PCA to do a dimensionality reduction and then fit the principal components as input features.
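Here is a rough sketch of that denoising step, assuming NumPy arrays `X` (n_samples × n_features) and `y` as placeholders for your own data, and using XGBoost as the secondary model. In hundreds of dimensions a literal grid is infeasible, so this sketch evaluates the model on a random subsample of real inputs instead, which keeps the evaluation points on the data manifold:

```python
import numpy as np
from xgboost import XGBRegressor

# 1. Fit a flexible model on the full noisy dataset; it effectively
#    averages out the noise in its predictions.
denoiser = XGBRegressor(n_estimators=500, max_depth=6, subsample=0.8)
denoiser.fit(X, y)

# 2. Evaluate the model on a modest subsample of the inputs
#    (a stand-in for the "grid" when the feature space is large).
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=2000, replace=False)
X_small = X[idx]
y_denoised = denoiser.predict(X_small)  # the "denoised" y vector for PySR
```

(The built-in route would be passing `denoise=True` together with `Xresampled` to `PySRRegressor`, but as noted above that uses a Gaussian process and won't scale to millions of samples.)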
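And a minimal sketch of the PCA route feeding into PySR, continuing from the denoised subsample above. The number of components, the operator set, and the iteration count are placeholders you would tune for your problem:

```python
from sklearn.decomposition import PCA
from pysr import PySRRegressor

# Reduce the hundreds of raw features to a handful of principal components.
pca = PCA(n_components=5)
Z_small = pca.fit_transform(X_small)

# Run symbolic regression on the low-dimensional, denoised data.
model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "log"],
)
model.fit(Z_small, y_denoised)
print(model.get_best())
```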