ai-alignment

Star

Here are 21 public repositories matching this topic...

emcie-co / parlant

Star

Control GenAI interactions with power, precision, and consistency using LLM-native Conversation Design paradigms

python gemini openai customer-service customer-success ai-agents ai-alignment llm genai llama3

Updated Apr 8, 2025
Python

agencyenterprise / PromptInject

Star

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

tomekkorbak / pretraining-with-human-feedback

Star

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

tsinghua-fib-lab / AAAI2025_MIA-Tuner

Star

[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".

ai-alignment membership-inference-attack large-language-models pretraining-data-detection

Updated Mar 17, 2025
Python

UCSC-VLAA / Sight-Beyond-Text

Star

[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"

alignment vlm ai-alignment vision-language vicuna llm mllm llava llama2

Updated Sep 15, 2023
Python

lzzcd001 / nabla-gfn

Star

Official Implementation of Nabla-GFlowNet (ICLR 2025)

generative-model finetuning ai-alignment diffusion-models gflownet

Updated Apr 8, 2025
Python

IQTLabs / daisybell

Star

Scan your AI/ML models for problems before you put them into production.

cybersecurity ai-safety bias-correction bias-detection ai-alignment model-poison ai-assurance

Updated Mar 31, 2025
Python

phelps-sg / llm-cooperation

Sponsor

Star

Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023

economics ai-safety gametheory experimental-economics behavioral-economics prisoners-dilemma ai-alignment experimental-psychology social-dilemmas gpt-3 gpt-4 llm principal-agent-problem

Updated Dec 11, 2024
Python

ai-fail-safe / safe-reward

Star

a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation

failsafe ai-safety anomaly-detection ai-alignment fail-safe

Updated Nov 8, 2022
Python

rmoehn / amplification

Star

An implementation of iterated distillation and amplification

transformer ida supervised-learning ai-safety ai-alignment

Updated Jun 22, 2022
Python

RamyaLab / pluralistic-alignment

Star

The open-source repository for PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment.

ai-alignment rlhf pluralistic-alignment

Updated Mar 1, 2025
Python

levitation-opensource / bioblue

Star

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLM-s with simplified observation format. The benchmark themes include multi-objective homeostasis, (multi-objective) diminishing returns, complementary goods, sustainability, multi-agent resource sharing.

python benchmarking sustainability multi-agent multi-objective ai-safety homeostasis ai-alignment llm-benchmarking diminishing-returns complementary-goods

Updated Apr 6, 2025
Python

EveryOneIsGross / sinewCHAT

Star

sinewCHAT uses instanced chatbots to emulate neural nodes to enrich and generate positive weighted responses.

ml ai-alignment rrn openai-api

Updated Jul 16, 2023
Python

EveryOneIsGross / areteCHAT

Star

A persona chat based on the VIA Character Strengths. Reads emotional tone and summons appropriate virtue to respond.

sentiment-analysis chatbot via chatbot-framework ai-alignment virtues

Updated Jul 14, 2023
Python

SylvesterDuah / The_Guardian_of_AI_Alignment

Star

This project is about AI Alignment where I is source data from history of AI incidents occurred and learn about it to provide a solution to mitigate any future occurrences again

artificial-intelligence safety artificial-neural-networks human-pose-estimation cyber-threat-intelligence ai-alignment ai-cyber-security