Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Deliver safe & effective language models
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Aligning AI With Shared Human Values (ICLR 2021)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An attack to induce hallucinations in LLMs
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
The Open Source Firewall for LLMs. A self-hosted gateway to secure and control AI applications with powerful guardrails.
Dialectical reasoning architecture for LLMs (Thesis → Antithesis → Synthesis)
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.