Daan Roos*, Oscar Davis*, Floor Eijkelboom*,
Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, Jan-Willem van de Meent
Official implementation of the text experiments. 🚀
This repository contains all the code for the text experiments from the Categorical Flow Maps paper. The main module is located in semicat/models/semicat.py 🧠. It is general and can be used with many other data types. Text-specific code is in semicat/models/textsemicat.py 📝.
- Install the dependencies:
    mamba env create -f environment.yaml
- Activate the environment:
    mamba activate semicat
- Create a .env file containing the directory that will cache the processed LM1B data:
    DATASET_CACHE_DIR=/the/dir/for/lm1b
- Run the experiment you want! 💥 For example,
    python -m semicat.train experiment=lm1b_dit trainer=gpu
  For wandb logging, add logger=wandb as an argument.
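
If a run fails because the cache directory cannot be found, it can help to check what the .env file actually resolves to. The snippet below is a minimal, standard-library sketch for that purpose. It is not the repository's own loading code (which may use a different mechanism), and the helper names `read_env_file` and `resolve_cache_dir` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_cache_dir(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    from_env = os.environ.get("DATASET_CACHE_DIR")
    if from_env:
        return Path(from_env)
    values = read_env_file(env_path) if Path(env_path).exists() else {}
    if "DATASET_CACHE_DIR" not in values:
        raise KeyError("DATASET_CACHE_DIR is not set in the environment or in .env")
    return Path(values["DATASET_CACHE_DIR"])
```

Calling `resolve_cache_dir()` from the repository root should return the directory written in .env, unless DATASET_CACHE_DIR is already exported in the shell, in which case the exported value wins in this sketch.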
To download the Text8 dataset, follow the steps in github.com/andrew-cr/discrete_flow_models and place the data in ./data/text8.
LM1B is automatically downloaded into DATASET_CACHE_DIR and then preprocessed (sequence-packed, etc.). You can also run python -m semicat.data.lm1b separately to set up the data before launching your runs.
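
For readers unfamiliar with sequence packing, here is a minimal illustrative sketch of the general idea: tokenized sentences are concatenated with a separator token and cut into fixed-length blocks, so no positions are spent on padding. This is an assumption about the common technique, not the code in semicat.data.lm1b. The function name `pack_sequences` and the parameters `block_size` and `eos_id` are hypothetical, and the repository may handle separators, leftovers, or masks differently.

```python
def pack_sequences(token_seqs, block_size, eos_id):
    """Concatenate sequences (each followed by eos_id) and cut fixed-size blocks.

    The trailing partial block, if any, is dropped in this sketch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    buffer = []
    blocks = []
    for seq in token_seqs:
        buffer.extend(seq)
        buffer.append(eos_id)  # mark the sentence boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks


packed = pack_sequences([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

In the example, the second block spans two sentences, which is the point of packing: every block is full, at the cost of blocks that may cross sentence boundaries.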
To cite the paper or the code, please use the following:
@misc{roos2026categoricalflowmaps,
  title={Categorical Flow Maps},
  author={Daan Roos and Oscar Davis and Floor Eijkelboom and Michael Bronstein and Max Welling and İsmail İlkan Ceylan and Luca Ambrogioni and Jan-Willem van de Meent},
  year={2026},
  eprint={2602.12233},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.12233},
}