[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
[CoRL'23] Adversarial Training for Safe End-to-End Driving
[ACL 2025 Findings] Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
AI-Generated Video Detection via Perceptual Straightening (NeurIPS 2025)
Can Large Language Models Solve Security Challenges? We test LLMs' ability to interact with and break out of shell environments using the OverTheWire wargames environment, showing the models' surprising ability to perform action-oriented cyberexploits in shell environments.
A high-performance string formatter written in Rust. This project detects and blocks LLM prompt injection and jailbreak attacks. It also features a customizable rule-based system and defends against obfuscated prompt attacks.
[NeurIPS 2024] SACPO (Stepwise Alignment for Constrained Policy Optimization)
An official repository for the Capability-Based Scaling Laws for LLM Red-Teaming paper.
A benchmark for evaluating hallucinations in large visual language models
Safe Option Critic: Learning Safe Options in the A2OC Architecture
Explore techniques to use small models as jailbreaking judges
Finetuning of Mistral Nemo 13B on the WildJailbreak dataset to produce a red-teaming model
AI Safety Evaluation Library
Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world. This project aims to replicate a paper published in 2025 (arXiv:2508.12920).
Secure AI agent runtime with kernel-hard sandboxing, real-time PII masking, and cryptographic audit trails. Production-ready, open source (GPL-3). Supports OpenAI, Anthropic, xAI, Google, Mistral.
Learned Semantic Decoder for Language Models. It's the little model that sits under a big model's hat to explain what it's thinking, just like the little cats from Cat in the Hat! VOOM > FOOM
The Reference Implementation for EU AI Act (Article 10). Cryptographic semantic binding to ensure deterministic integrity for High-Risk AI. (NEN/ISO JTC 25 Aligned)
a Python library for peer-to-peer communication over the Yggdrasil network