Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Deliver safe & effective language models
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Aligning AI With Shared Human Values (ICLR 2021)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An attack to induce hallucinations in LLMs
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
The Open Source Firewall for LLMs. A self-hosted gateway to secure and control AI applications with powerful guardrails.
Dialectical reasoning architecture for LLMs (Thesis → Antithesis → Synthesis)
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.