Semantic regexes are an automated interpretability method that describes LLM features using a structured language. Semantic regexes provide accurate, concise, and consistent feature descriptions that help humans build mental models of feature activations.
This repo contains a package to generate your own semantic regexes, pre-computed semantic regexes and evaluation scores, and an interactive viewer to explore results.
This code accompanies the research paper:
Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
Angie Boggust, Donghao Ren, Yannick Assogba, Dominik Moritz, Arvind Satyanarayan, Fred Hohman
arXiv, 2025.
Paper, GitHub, Python package, Viewer
experiments: the code to replicate the experimental results from the paper.
semantic-regex: a lightweight Python package to generate semantic regexes.
viewer: a web-based viewer to browse experimental results from the paper.
When making contributions, refer to the CONTRIBUTING guidelines and read the CODE OF CONDUCT.
To cite our paper, please use:
@article{boggust2025semantic,
title={{Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language}},
author={Boggust, Angie and Ren, Donghao and Assogba, Yannick and Moritz, Dominik and Satyanarayan, Arvind and Hohman, Fred},
journal={arXiv preprint arXiv:2510.06378},
year={2025}
}
This code is released under the LICENSE terms.