Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
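To make the "constrained" part concrete, here is a minimal sketch of the Lagrangian-relaxed objective that constrained RLHF methods of this kind optimize: maximize reward while holding an expected safety cost below a budget. The function and argument names are illustrative assumptions, not this repository's API.

```python
import torch

def safe_rlhf_policy_loss(log_probs: torch.Tensor,
                          reward_advantages: torch.Tensor,
                          cost_advantages: torch.Tensor,
                          lam: float) -> torch.Tensor:
    """Policy-gradient surrogate that trades reward against safety cost."""
    # Reward advantages push helpfulness; cost advantages (scaled by the
    # Lagrange multiplier lam) penalize responses the cost model flags as unsafe.
    combined = reward_advantages - lam * cost_advantages
    return -(log_probs * combined).mean()

def update_lagrange_multiplier(lam: float,
                               mean_cost: float,
                               cost_budget: float,
                               lr: float = 0.05) -> float:
    """Dual ascent: increase lam whenever the safety budget is exceeded."""
    return max(lam + lr * (mean_cost - cost_budget), 0.0)
```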
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
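As a rough illustration of the idea behind UQ-based hallucination detection (not UQLM's actual API), one common signal is response consistency: sample several answers and treat low mutual agreement as high uncertainty. The `sample_fn` callable below is a hypothetical stand-in for any LLM sampler.

```python
# Consistency-based uncertainty signal for hallucination detection (generic sketch).
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(sample_fn, prompt: str, k: int = 5) -> float:
    """Average pairwise agreement over k sampled answers (1.0 = fully consistent)."""
    answers = [sample_fn(prompt) for _ in range(k)]
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Usage: flag answers whose consistency falls below a chosen threshold,
# e.g. consistency_score(my_llm_sample, question) < 0.5
```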
Deliver safe & effective language models
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
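A conceptual sketch of the modular-assembly idea, measuring how often an injected instruction overrides the intended task; the dataclass fields and `model_fn` callable are illustrative, not PromptInject's actual interface.

```python
# Modular adversarial-prompt evaluation (conceptual sketch, not the framework's API).
from dataclasses import dataclass

@dataclass
class PromptCase:
    base_instruction: str   # the task the model is supposed to perform
    attack_suffix: str      # adversarial text appended to the user input
    success_marker: str     # string whose presence indicates the attack worked

def attack_success_rate(model_fn, cases):
    """Fraction of assembled prompts whose output contains the success marker."""
    hits = 0
    for case in cases:
        prompt = f"{case.base_instruction}\n{case.attack_suffix}"
        if case.success_marker.lower() in model_fn(prompt).lower():
            hits += 1
    return hits / len(cases) if cases else 0.0
```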
Aligning AI With Shared Human Values (ICLR 2021)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
RuLES: a benchmark for evaluating rule-following in language models
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Code accompanying the paper Pretraining Language Models with Human Preferences
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An attack designed to induce hallucinations in LLMs
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
The Open Source Firewall for LLMs. A self-hosted gateway to secure and control AI applications with powerful guardrails.
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.
A project that adds scalable, state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code. Inference stays efficient (no added inference time), and detection requires no drop in classification accuracy, no hyperparameter tuning, and no additional data.
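For context on the task itself, here is the standard maximum-softmax-probability (MSP) baseline for OOD detection; it is shown only to illustrate what an OOD score looks like and is not this repository's method.

```python
# MSP baseline for out-of-distribution detection (generic illustration).
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Higher score = more in-distribution; threshold low scores as OOD."""
    logits = model(x)
    return F.softmax(logits, dim=-1).max(dim=-1).values

# Usage sketch: is_ood = msp_score(classifier, batch) < threshold
```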