This repository provides a reference implementation for the paper PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models.
The architecture and training code is an improved version of the original implementation for the ICLR 2026 paper Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data.
PluRel is a framework for synthesizing diverse multi-tabular relational databases using Structural Causal Models (SCMs). This repository provides:
- Scalable generation of synthetic relational data (from scratch or SQL schemas) compatible with relbench.
- High-performance context sampling via a Rust-based sampler (rustler).
- Pretraining of relational transformers on synthetic data.
To use PluRel as a library:
pip install plurel

Requires Python 3.12+. This installs the synthetic database generator only; the Rust context sampler and training scripts under rt/ are part of the development setup below.
Note: The published PyPI package tracks the v1.0.0 release. The latest features and performance improvements live on main and may not yet be in a tagged release; install from source if you need them.
For development, testing, or running the pretraining scripts, set up the full environment with pixi.
# setup pixi environment
$ pixi install
# Compile and install the Rust sampler
$ cd rustler && pixi run maturin develop --uv --release && cd ..
# Run tests
$ pixi run pytest
# Lint and format code
$ pixi run ruff check .
$ pixi run ruff format .
# Install pre-commit hooks
$ pixi run pre-commit install
# Link the relbench cache directory into ~/scratch
$ mkdir -p ~/scratch
$ ln -s ~/.cache/relbench ~/scratch/relbench

- The SyntheticDataset class can be used to create relbench-compatible dataset objects.
- It only requires a seed and a Config object that contains database-, SCM-, and DAG-level parameters for sampling. See the example below.
from plurel import SyntheticDataset, Config
# create relbench compatible dataset
dataset = SyntheticDataset(seed=0, config=Config())
# create database which can be cached via relbench APIs
db = dataset.make_db()

The Config class controls all aspects of synthetic database generation through three parameter groups:
| Parameters | Description |
|---|---|
| DatabaseParams | Table layout (BarabasiAlbert, ReverseRandomTree, WattsStrogatz), number of tables, row counts, column counts, and timestamp ranges. |
| SCMParams | SCM graph layouts, column types, MLP initialization, activation functions, noise distributions, and time-series trend/cycle parameters. |
| DAGParams | DAG-specific parameters such as edge dropout, in-degree limits, and rewiring probabilities for different graph types. |
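To make the role of the SCM-related parameters more concrete, here is a minimal, generic sketch of sampling one row from a structural causal model over a small DAG. It is not PluRel's implementation: the function name `sample_scm_row`, the toy graph, the edge weights, the `tanh` activation, and the Gaussian noise are all assumptions chosen for illustration.

```python
import math
import random


def sample_scm_row(parents, weights, rng, noise_std=0.1):
    """Sample one row from a toy SCM (illustrative only, not PluRel's code).

    parents: dict mapping each column to the list of its parent columns,
             with keys in topological order (parents before children).
    weights: dict mapping (parent, child) pairs to an edge weight.
    """
    row = {}
    for col, pars in parents.items():
        noise = rng.gauss(0.0, noise_std)
        if not pars:
            # Root columns have no causes, so they are pure noise.
            row[col] = noise
        else:
            # Child columns are a nonlinear function of their parents plus noise.
            pre_activation = sum(weights[(p, col)] * row[p] for p in pars)
            row[col] = math.tanh(pre_activation) + noise
    return row


# Hypothetical toy DAG: a -> c <- b
parents = {"a": [], "b": [], "c": ["a", "b"]}
weights = {("a", "c"): 1.5, ("b", "c"): -0.7}

rng = random.Random(0)
rows = [sample_scm_row(parents, weights, rng) for _ in range(5)]
```

In this sketch, the graph shape, the activation, and the noise scale play roughly the roles that the table attributes to the DAG layout, activation function, and noise distribution settings.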
# Assumption: Choices is exported from the top-level plurel package; adjust the import if it lives elsewhere.
from plurel import Choices, Config, DatabaseParams, SCMParams
config = Config(
    database_params=DatabaseParams(num_tables_choices=Choices(kind="range", value=[5, 10])),
    schema_file="path/to/schema.sql",  # optional: generate from SQL schema
    cache_dir="~/.cache/relbench",  # optional: cache generated databases
)

We also provide a multiprocessing-based script to generate databases in parallel.
$ pixi run python scripts/synthetic_gen.py \
--seed_offset 0 \
--num_dbs 1000 \
--num_proc 16 \
--preprocess

| Argument | Description |
|---|---|
| --seed_offset | Seed offset for database generation. DBs will be named rel-synthetic-<seed>. |
| --num_dbs | Number of databases to generate. |
| --num_proc | Number of parallel processes (default: number of CPU cores). |
| --preprocess | Run preprocessing and embedding steps. Omit to skip. |
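The way these arguments fit together can be sketched as follows. This is a hedged illustration, not the contents of scripts/synthetic_gen.py: the helpers `db_name`, `generate_one`, and `generate_all` are hypothetical, and the mapping "database i uses seed `seed_offset + i`" is an assumption. Only the rel-synthetic-<seed> naming comes from the table above.

```python
from multiprocessing import Pool


def db_name(seed):
    # Naming convention from the argument table: rel-synthetic-<seed>.
    return f"rel-synthetic-{seed}"


def generate_one(seed):
    # Placeholder for the per-database work (hypothetical). A real worker
    # would build and cache the database for this seed; here we only
    # return the name it would be stored under.
    return db_name(seed)


def generate_all(seed_offset, num_dbs, num_proc):
    # Assumption: database i is generated with seed = seed_offset + i.
    seeds = range(seed_offset, seed_offset + num_dbs)
    with Pool(processes=num_proc) as pool:
        # pool.map distributes seeds across workers and keeps input order.
        return pool.map(generate_one, seeds)


if __name__ == "__main__":
    names = generate_all(seed_offset=0, num_dbs=4, num_proc=2)
    print(names)
```

Under this assumed mapping, running the script twice with non-overlapping seed ranges (for example, offsets 0 and 1000 with 1000 databases each) would produce disjoint sets of database names.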
Note: Check out the notebooks in examples/ for synthesizing databases from SQL schemas.
The preprocessed synthetic data is available on the Hugging Face Hub at kvignesh1420/plurel.
- Install the Hugging Face CLI (if not present)
pixi add huggingface_hub
- Create the destination
mkdir -p ~/scratch/pre
- Download the repository contents into ~/scratch/pre
pixi run hf download kvignesh1420/plurel \
--repo-type dataset \
--local-dir ~/scratch/pre

The preprocessed relbench data is available on the Hugging Face Hub at hvag976/relational-transformer.
pixi run hf download hvag976/relational-transformer \
--repo-type dataset \
--local-dir ~/scratch/pre

The synthetic pretrained model checkpoints are hosted on the Hugging Face Hub at kvignesh1420/relational-transformer-plurel.
$ mkdir -p ~/scratch/rt_hf_ckpts
$ pixi run hf download kvignesh1420/relational-transformer-plurel \
--repo-type model \
--local-dir ~/scratch/rt_hf_ckpts

One of the downloaded checkpoints will be listed as:
$ ls ~/scratch/rt_hf_ckpts
# model pretrained on a dataset of size 4B tokens curated from 1024 synthetic RDBs
synthetic-pretrain_rdb_1024_size_4b.pt

- Baseline (real-world) pretraining on relbench datasets with a randomly initialized relational-transformer (RT) model.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/baseline_pretrain.py
- Synthetic pretraining on a varying number of databases and dataset sizes with a randomly initialized RT model.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/synthetic_pretrain.py
- Continued pretraining on relbench datasets using the synthetic pretrained models. For faster experimentation, the models downloaded from Hugging Face (stored in ~/scratch/rt_hf_ckpts) can be passed to the load_ckpt_path argument in the training script.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/cntd_pretrain.py

If you find this work useful, please cite our paper:
@article{kothapalli2026plurel,
title={{PluRel:} Synthetic Data unlocks Scaling Laws for Relational Foundation Models},
author={Kothapalli, Vignesh and Ranjan, Rishabh and Hudovernik, Valter and Dwivedi, Vijay Prakash and Hoffart, Johannes and Guestrin, Carlos and Leskovec, Jure},
journal={arXiv preprint arXiv:2602.04029},
year={2026}
}

If you use the architecture, training loop, or sampler code, please also cite the Relational Transformer paper:
@inproceedings{ranjan2026relationaltransformer,
title={{Relational Transformer:} Toward Zero-Shot Foundation Models for Relational Data},
author={Rishabh Ranjan and Valter Hudovernik and Mark Znidar and Charilaos Kanatsoulis and Roshan Upendra and Mahmoud Mohammadi and Joe Meyer and Tom Palczewski and Carlos Guestrin and Jure Leskovec},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
