apple/ml-semantic-regex

Semantic Regexes

Semantic regexes are an automated interpretability method that describes LLM features using a structured language. Semantic regexes provide accurate, concise, and consistent feature descriptions that help humans build mental models of feature activations.

This repo contains a package to generate your own semantic regexes, pre-computed semantic regexes and evaluation scores, and an interactive viewer to explore results.

This code accompanies the research paper:

Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
Angie Boggust, Donghao Ren, Yannick Assogba, Dominik Moritz, Arvind Satyanarayan, Fred Hohman
arXiv, 2025.
Paper, GitHub, Python package, Viewer

Repo Structure

  • experiments: the code to replicate the experimental results from the paper.
  • semantic-regex: a lightweight Python package to generate semantic regexes.
  • viewer: a web-based viewer to browse experimental results from the paper.

Contributing

When making contributions, refer to the CONTRIBUTING guidelines and read the CODE OF CONDUCT.

BibTeX

To cite our paper, please use:

@article{boggust2025semantic,
    title={{Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language}},
    author={Boggust, Angie and Ren, Donghao and Assogba, Yannick and Moritz, Dominik and Satyanarayan, Arvind and Hohman, Fred},
    journal={arXiv preprint arXiv:2510.06378},
    year={2025}
}

License

This code is released under the LICENSE terms.
