Skip to content

psaegert/flash-ansr

Repository files navigation

⚡ANSR:
Flash Amortized Neural Symbolic Regression

pytest quality checks CodeQL Advanced

⚡ANSR Training on Fully Procedurally Generated Data Inspired by NeSymReS (Biggio et al. 2021)

Introduction

Abstract

Symbolic Regression has been approached with many different methods and paradigms. The overwhelming success of transformer-based language models in recent years has since motivated researchers to solve Symbolic Regression with large-scale pre-training of data-conditioned "equation generators" at competitive levels. However, as most traditional methods, the majority of these Amortized Neural Symbolic Regression methods rely on SymPy to simplify and compile randomly generated training equations, a choice that inevitably brings tradeoffs and requires workarounds to efficiently work at scale. I show that replacing SymPy with a novel token-based simplification algorithm with hand-crafted transformation rules enables training on fully-procedurally generated and higher-quality synthetic data, and thus develop ⚡ANSR. On various test sets, my method perfectly recovers $+80$% more equations numerically than the NeSymReS baseline while being 84 times faster natively, and yields comparable recovery rates to PySR in a quarter of its time. I provide an in-depth performance analysis of my method on stricter and more meaningful metrics than previous work. ⚡ANSR is open-source and available on GitHub and Huggingface, and allows for straight-forward replicability on consumer-grade hardware.

Main Results

Model Comparison. Up to 3 variables. Default Model Configurations (32 threads / beams).
Bootstrapped Median, 5p, 95p and AR-p (Noreen 1989) values (n=1000).
N = 5000 (⚡ v7.0), 1000 (PySR, NeSymReS 100M).
AMD 9950X (16C32T), RTX 4090.

Table of Contents

Requirements

Hardware

  • 32 GB Memory
  • CUDA-enabled GPU
  • 12 GB VRAM
  • 64 GB Storage (subject to change)

Software

Getting Started

1. Clone the repository

git clone https://github.com/psaegert/flash-ansr
cd flash-ansr

2. Install the package

Create a virtual environment (optional):

conda:

conda create -n ansr python=3.11 ipykernel ipywidgets
conda activate ansr

Then, install the package via

pip install -e .
pip install -e ./nsrops

Usage

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Import flash_ansr
from flash_ansr import FlashANSR, GenerationConfig, install_model, get_path

# Specify the model
# Here: https://huggingface.co/psaegert/flash-ansr-v7.0
MODEL = "psaegert/flash-ansr-v7.0"

# Download the latest snapshot of the model
# By default, the model is downloaded to the directory `./models/` in the package root
install_model(MODEL)

# Load the model
ansr = FlashANSR.load(
    directory=get_path('models', MODEL),
    generation_config=GenerationConfig(method='beam_search', beam_width=32),  # optional
    n_restarts=32,  # optional
).to(device)

# Define data
X = ...
y = ...

# Fit the model to the data
ansr.fit(X, y, verbose=True)

# Show the best expression
print(ansr.get_expression())

# Predict with the best expression
y_pred = ansr.predict(X)

Training

Express

Use, copy or modify a config in ./configs:

./configs
├── my_config
│   ├── dataset_train.yaml          # Link to skeleton pool and padding for training
│   ├── dataset_val.yaml            # Link to skeleton pool and padding for validation
│   ├── evaluation.yaml             # Evaluation settings
│   ├── expression_space.yaml       # Operators and variables
│   ├── nsr.yaml                    # Model settings and link to expression space
│   ├── skeleton_pool_train.yaml    # Sampling and holdout settings for training
│   ├── skeleton_pool_val.yaml      # Sampling and holdout settings for validation
│   └── train.yaml                  # Data and schedule for training

Run the training and evaluation pipeline with

./scripts/run.sh my_config

For more information see below.

Manual

0. Prerequisites

Test data structured as follows:

./data/ansr-data/test_set
├── feynman
│   └── FeynmanEquations.csv
├── nguyen
│   └── nguyen.csv
└── soose_nc
    └── nc.csv

The test data can be cloned from the Hugging Face data repository:

git clone https://huggingface.co/psaegert/ansr-data data/ansr-data

1. Import test data

External datasets must be imported into the ANSR format:

flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/soose_nc/nc.csv" -p "soose" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/soose_nc/skeleton_pool" -v
flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/feynman/FeynmanEquations.csv" -p "feynman" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/feynman/skeleton_pool" -v
flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/nguyen/nguyen.csv" -p "nguyen" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/nguyen/skeleton_pool" -v

with

  • -i the input file

  • -p the name of the parser implemented in ./src/flash_ansr/compat/convert_data.py

  • -e the expression space

  • -b the config of a base skeleton pool to add the data to

  • -o the output directory for the resulting skeleton pool

  • -v verbose output

This will create and save a skeleton pool with the parsed imported skeletons in the specified directory:

./data/ansr-data/test_set/<test_set>
└── skeleton_pool
    ├── expression_space.yaml
    ├── skeleton_pool.yaml
    └── skeletons.pkl

2. Generate validation data

Validation data is generated by randomly sampling according to the settings in the skeleton pool config:

flash_ansr generate-skeleton-pool -c {{ROOT}}/configs/${CONFIG}/skeleton_pool_val.yaml -o {{ROOT}}/data/ansr-data/${CONFIG}/skeleton_pool_val -s 5000 -v

with

  • -c the skeleton pool config
  • -o the output directory to save the skeleton pool
  • -s the number of unique skeletons to sample
  • -v verbose output

3. Train the model

flash_ansr train -c {{ROOT}}/configs/${CONFIG}/train.yaml -o {{ROOT}}/models/ansr-models/${CONFIG} -v -ci 100000 -vi 10000

with

  • -c the training config
  • -o the output directory to save the model and checkpoints
  • -v verbose output
  • -ci the interval to save checkpoints
  • -vi the interval for validation

4. Evaluate the model

flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/soose_nc/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/soose_nc.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/feynman/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/feynman.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/nguyen/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/nguyen.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/configs/${CONFIG}/dataset_val.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/val.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/pool_15/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/pool_15.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/configs/${CONFIG}/dataset_train.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/train.pickle -v

with

  • -c the evaluation config
  • -m the model to evaluate
  • -d the dataset to evaluate on
  • -n the number of samples to evaluate
  • -o the output file for results
  • -v verbose output

4.1 Evaluate NeSymReS

  1. Clone NeuralSymbolicRegressionThatScales to a directory of your choice.
  2. Download the 100M model as described here
  3. Move the 100M model into flash-ansr/models/nesymres/
  4. Create a Python 3.10 (!) environment and install flash-ansr as in the previous steps.
  5. Install NeSymReS in the same environment:
cd NeuralSymbolicRegressionThatScales
pip install -e src/
pip install lightning
  1. Navigate back to this repository and run the evaluation
cd flash-ansr
./scripts/evaluate_nesymres <test_set>

4.2 Evaluate PySR

  1. Install PySR in the same environment as flash-ansr.
  2. Run the evaluation
./scripts/evaluate_pysr <test_set>

Development

Setup

To set up the development environment, run the following commands:

pip install -e .[dev]
pip install -e ./nsrops
pre-commit install

Tests

Test the package with ./scripts/pytest.sh. Run pylint with ./scripts/pylint.sh.

Citation

@software{flash-ansr2024,
    author = {Paul Saegert},
    title = {Flash Amortized Neural Symbolic Regression},
    year = 2024,
    publisher = {GitHub},
    version = {0.3.0},
    url = {https://github.com/psaegert/flash-ansr}
}

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •