I am currently working on LLM Evaluations at Apollo Research.
Past OSS contributions:
- I contributed to the mechanistic interpretability library TransformerLens. Most notably, I added support for BERT to the library.
- I worked on MazeDataset, a library for generating, filtering, solving, visualizing, and processing mazes for training ML systems.
Research:
- I co-authored a NeurIPS workshop paper, "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", in which we used language models to automate the generation of narrative-based jailbreaks for GPT-4 and other SOTA models.