Control GenAI interactions with power, precision, and consistency using LLM-native Conversation Design paradigms
-
Updated
Apr 8, 2025 - Python
Control GenAI interactions with power, precision, and consistency using LLM-native Conversation Design paradigms
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Code accompanying the paper Pretraining Language Models with Human Preferences
[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".
[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
Official Implementation of Nabla-GFlowNet (ICLR 2025)
Scan your AI/ML models for problems before you put them into production.
Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023
a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation
An implementation of iterated distillation and amplification
The open-source repository for PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment.
Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLM-s with simplified observation format. The benchmark themes include multi-objective homeostasis, (multi-objective) diminishing returns, complementary goods, sustainability, multi-agent resource sharing.
sinewCHAT uses instanced chatbots to emulate neural nodes to enrich and generate positive weighted responses.
A persona chat based on the VIA Character Strengths. Reads emotional tone and summons appropriate virtue to respond.
This project is about AI Alignment where I is source data from history of AI incidents occurred and learn about it to provide a solution to mitigate any future occurrences again
GAA is a modification of the RLHF PPO loop that addresses the 'negative side effects from misspecified reward functions' problem
Code for our May 2024 AI security evaluation research sprint project
bbBOT is a felixble persona based branching binary sentiment chatbot.
Simplified, modern implementation of Rating and Preference-based Reinforcement Learning.
Add a description, image, and links to the ai-alignment topic page so that developers can more easily learn about it.
To associate your repository with the ai-alignment topic, visit your repo's landing page and select "manage topics."