Let's consider Linear Regression to illustrate a few shortcomings of traditional (in this case linear) ML methods. While the number of terms in the equation is derived from the dimension of our input data, the structure of the equation itself is fixed and known in advance.
In Python we would end up with a decision function along these lines:
```python
X = ...  # get data to predict on (a single feature vector)
W = [1.0000000000000002, 1.9999999999999991]  # learned coefficients
b = 3.0000000000000018                        # learned bias
prediction = b + sum(x_i * w_i for x_i, w_i in zip(X, W))
```
In this scenario we learned the coefficients (a hypothetical fitting sketch follows the list below), but the overall structure never changes:
- we multiply every input feature by its corresponding coefficient
- we aggregate the resulting values with a sum
- we add the bias
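As a purely hypothetical illustration (the actual training data is not shown here), coefficients like `W` and `b` above could have been obtained by fitting on data generated as `y = 3 + x1 + 2*x2`; this is an assumption made only for the sketch:

```python
# Hypothetical example: recover coefficients close to W = [1, 2] and b = 3
# from synthetic data generated as y = 3 + 1*x1 + 2*x2 (an assumption, not
# the original training set of this post).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = 3.0 + 1.0 * X_train[:, 0] + 2.0 * X_train[:, 1]

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)       # ~[1.0, 2.0], up to floating-point error
print(model.intercept_)  # ~3.0
```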
Of course, we have access to a variety of more complex, non-linear models out there, but they are usually harder to explain ...
Symbolic Regression aims at learning not only the (potential) coefficients but, first and foremost, the structure of the model. The resulting, more flexible model is a polynomial, built by composition (illustrated just after the list) from:
- our input variables, eg. `Age` and `Height`
- a set of operators at hand, eg. addition, subtraction, multiplication and division
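For instance, with sympy (which we rely on later in this post), such a candidate model is just an expression composed from the variables and operators above; the variable names mirror the example in the text:

```python
# A candidate expression built by composing the input variables with the
# operators listed above. The learned structure is an explicit formula.
import sympy

Age, Height = sympy.symbols("Age Height")
candidate = (Age + Height) * Age - Height / (Age + 1)

print(candidate)                   # the formula itself, readable as-is
print(sympy.count_ops(candidate))  # a rough measure of the expression size
```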
Symbolic Regression runs Genetic Programming in order to produce the N best polynomials for your data.
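At a high level, the loop looks like the following sketch; the `init`, `evaluate`, `crossover` and `mutate` callables are illustrative placeholders, not this project's actual API:

```python
# Minimal sketch of a genetic programming loop. The four operators
# (init, evaluate, crossover, mutate) are passed in as plain callables;
# their concrete implementations are project-specific.
import random

def genetic_programming(init, evaluate, crossover, mutate,
                        population_size=1000, rounds=500, n_best=10):
    population = [init() for _ in range(population_size)]
    for _ in range(rounds):
        # lower score = better fit on the training data
        ranked = sorted(population, key=evaluate)
        parents = ranked[: population_size // 10]  # keep the fittest 10%
        population = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(population_size)
        ]
    return sorted(population, key=evaluate)[:n_best]  # the N best polynomials
```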
This approach comes with several benefits:
- built-in explainability along with non-linearity: if you understand all the operators you provide the algorithm with, you understand every potential polynomial derived from these operators
- built-in feature selection & model-size reduction: when passing existing traits to a new offspring, we take the polynomial's complexity into consideration (equivalent to the model size in a Minimum Description Length setting). We then build a skyline based on both performance and complexity, and only consider models on this skyline (see the sketch after this list). This process efficiently avoids "bloating" (overly complex individuals will not mate in the next round), controls overfitting (the simpler the polynomial, the more likely it generalizes to new data) and embeds explainability in the learning process
- easier transfer learning: it's fairly easy to extract the list of learned polynomials from a Symbolic Regression model and instantiate a new one with this list, to solve a new but similar problem
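Here is a minimal sketch of such a skyline, i.e. a Pareto front over error and complexity; how error and complexity are actually scored is left out and assumed to happen upstream:

```python
# Minimal sketch of the "skyline" (Pareto front) over (error, complexity):
# keep only candidates not dominated by another candidate that is both
# at least as accurate and at least as simple.
def skyline(candidates):
    # candidates: list of (error, complexity, expr) tuples
    front = []
    for error, complexity, expr in candidates:
        dominated = any(
            e <= error and c <= complexity and (e < error or c < complexity)
            for e, c, _ in candidates
        )
        if not dominated:
            front.append((error, complexity, expr))
    return front

# skyline([(0.10, 5, "a"), (0.12, 3, "b"), (0.15, 8, "c")])
# keeps "a" and "b"; "c" is dominated by "a" (worse error, higher complexity)
```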
At each round of the genetic programming process, we need to evaluate the entire population of polynomials on the training data. With a typical population size of 1000 and 500 rounds until what we assume to be convergence, that is 500,000 full-dataset evaluations; this quickly becomes very expensive to run on an off-the-shelf laptop, even for a medium-sized dataset.
Taking a step back, we can do better. At each round, individuals are likely to share sub-expressions (they may share a common parent, and potentially grand-parents ...).
In sympy this is known as Common Subexpression Elimination, but it's essentially a graph optimization technique. By extracting the sub-expressions common to all our expressions, we can precompute their results once and inline them back into the original expressions! This saves an enormous amount of CPU and RAM!
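Here is what this looks like on a toy "population" of expressions with sympy's `cse` function; the exact auxiliary symbols in the output may differ:

```python
# Common Subexpression Elimination with sympy: shared sub-expressions across
# the population are factored out once, so they only need to be computed
# a single time.
import sympy

x, y = sympy.symbols("x y")
population = [
    (x + y) ** 2 + sympy.sin(x + y),
    (x + y) ** 2 * y,
    sympy.sin(x + y) / y,
]

replacements, reduced = sympy.cse(population)
print(replacements)  # e.g. [(x0, x + y), (x1, x0**2), (x2, sin(x0))]
print(reduced)       # e.g. [x1 + x2, x1*y, x2/y]
```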
The polars engine not only dispatches computation to multiple threads, but also applies graph optimization before running your expressions. The only extra step we need is to convert a sympy expression into a polars expression (see core.py).
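The actual conversion lives in core.py; the following is only a minimal sketch of how such a translation could work, covering symbols, numbers, addition, multiplication and powers (subtraction and division are represented through these in sympy):

```python
# Illustrative sketch of translating a sympy expression tree into a polars
# expression (not the project's actual core.py).
from functools import reduce
import operator

import polars as pl
import sympy

def sympy_to_polars(expr: sympy.Expr) -> pl.Expr:
    """Recursively map sympy nodes to polars expressions."""
    if isinstance(expr, sympy.Symbol):
        return pl.col(expr.name)
    if isinstance(expr, sympy.Number):
        return pl.lit(float(expr))
    args = [sympy_to_polars(arg) for arg in expr.args]
    if isinstance(expr, sympy.Add):
        return reduce(operator.add, args)
    if isinstance(expr, sympy.Mul):
        return reduce(operator.mul, args)
    if isinstance(expr, sympy.Pow):
        return args[0] ** args[1]
    raise NotImplementedError(f"unsupported sympy node: {type(expr)}")

# Usage: evaluate a candidate polynomial on a whole DataFrame at once
x, y = sympy.symbols("x y")
df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
print(df.select(sympy_to_polars(x * y + 2 * x).alias("prediction")))
```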
- add unary operators like sin and cos [X]
- try an online version with RiverML [ ]
- Pareto front optimisation [X]
- multi-objective optimisation (eg. return vs risk) ? [ ]
- SymbolicRegression for symbolic regression on timeseries [ ]
- convert a scikit-learn tree structure into a sympy expression (that can be translated into a polars expression later on) [ ]