[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
[CoRL'23] Adversarial Training for Safe End-to-End Driving
[ACL 2025 Findings] Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
AI-Generated Video Detection via Perceptual Straightening (NeurIPS 2025)
Can Large Language Models Solve Security Challenges? We test LLMs' ability to interact with and break out of shell environments using the OverTheWire wargames environment, showing the models' surprising ability to perform action-oriented cyberexploits in shell environments.
A high-performance string formatter written in Rust. This project detects and blocks LLM prompt injection and jailbreak attacks. It also features a customizable rule-based system and defends against obfuscated prompt attacks.
[NeurIPS 2024] SACPO (Stepwise Alignment for Constrained Policy Optimization)
An official repository for the Capability-Based Scaling Laws for LLM Red-Teaming paper.
A benchmark for evaluating hallucinations in large visual language models
Safe Option Critic: Learning Safe Options in the A2OC Architecture
Explore techniques to use small models as jailbreaking judges
Finetuning of Mistral Nemo 13B on the WildJailbreak dataset to produce a red-teaming model
AI Safety Evaluation Library
Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world. This project aims to replicate a paper published in 2025 (arXiv:2508.12920).
Secure AI agent runtime with kernel-hard sandboxing, real-time PII masking, and cryptographic audit trails. Production-ready, open source (GPL-3). Supports OpenAI, Anthropic, xAI, Google, Mistral.
Learned Semantic Decoder for Language Models. It's the little model that sits under a big model's hat to explain what it's thinking, just like the little cats from Cat in the Hat! VOOM > FOOM
The Reference Implementation for EU AI Act (Article 10). Cryptographic semantic binding to ensure deterministic integrity for High-Risk AI. (NEN/ISO JTC 25 Aligned)
a Python library for peer-to-peer communication over the Yggdrasil network