
vLLM notes

This benchmark prompts a guardrail model to classify a given input by generating either the "safe" or the "unsafe" keyword (optionally followed by a list of violated policies). We want to benchmark models with the vLLM inference engine instead of the default Transformers backend. Because the benchmark extracts the logits of the safe and unsafe keywords from the first token generated by the LLM, a naive implementation would have to transfer logits for the full vocabulary over HTTP from the vLLM server to our client. This is prohibitively expensive (vLLM becomes about 6x slower than Transformers), so we only transfer the top-k logits (e.g., k = 10). Under the assumption that the model always assigns large logits to the safe/unsafe keywords (it has been trained to do so), and given that the GuardBench evaluation pipeline only uses the relative ratio of the safe/unsafe logits to compute the F1 and Recall metrics, the scores obtained with top-k logits are guaranteed to match those obtained with full-vocabulary logits. This has been verified across all 40 benchmarks in the GuardBench repository.

Example commands for meta-llama/Llama-Guard-4-12B

  1. Set up the environment: uv pip install -r llama4guard_vllm_requirements.txt
  2. Serve the model: vllm serve meta-llama/Llama-Guard-4-12B -tp 1 --api-key EMPTY --logprobs-mode processed_logits --max-logprobs 10 --max-model-len 131072
  3. Evaluate the model: python vllm_server_mm_eval.py --model meta-llama/Llama-Guard-4-12B --datasets all --output_dir output_dir/Llama-Guard-4-12B --top_logprobs 10
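
The client side can be sketched roughly as follows: query the OpenAI-compatible vLLM server for the top-k log-probabilities of the first generated token (which, with --logprobs-mode processed_logits, are the processed logits) and compute the unsafe probability from the safe/unsafe entries. The exact token strings, the fallback when a keyword is missing from the top-k, and the helper name unsafe_probability are illustrative assumptions; the actual implementation lives in vllm_server_mm_eval.py.

import math

from openai import OpenAI

# Assumes a server started as in step 2 above
# (--api-key EMPTY --logprobs-mode processed_logits --max-logprobs 10).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def unsafe_probability(conversation: list[dict[str, str]], top_logprobs: int = 10) -> float:
    """Estimate P(unsafe) from the top-k logits of the first generated token."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-Guard-4-12B",
        messages=conversation,
        max_tokens=1,               # only the first generated token matters
        logprobs=True,
        top_logprobs=top_logprobs,  # must not exceed --max-logprobs on the server
        temperature=0.0,
    )
    # With --logprobs-mode processed_logits, the reported "logprobs" are the
    # processed logits of the top-k candidate tokens.
    top = response.choices[0].logprobs.content[0].top_logprobs
    logits = {entry.token.strip().lower(): entry.logprob for entry in top}
    safe = logits.get("safe", float("-inf"))
    unsafe = logits.get("unsafe", float("-inf"))
    if safe == float("-inf") and unsafe == float("-inf"):
        return 0.5  # neither keyword in the top-k; treat as undecided
    # Softmax over the two keywords: this ratio is identical to the one
    # obtained from full-vocabulary logits whenever both keywords rank in the top-k.
    m = max(safe, unsafe)
    return math.exp(unsafe - m) / (math.exp(safe - m) + math.exp(unsafe - m))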


GuardBench

🔥 News

  • [October 9, 2025] GuardBench now supports four additional datasets: JBB Behaviors, NicheHazardQA, HarmEval, and TechHazardQA. It also now lets you choose which metrics to display at the end of the evaluation (see the example below). Supported metrics are: precision (Precision), recall (Recall), f1 (F1), mcc (Matthews Correlation Coefficient), auprc (AUPRC), sensitivity (Sensitivity), specificity (Specificity), g_mean (G-Mean), fpr (False Positive Rate), fnr (False Negative Rate).
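
For example, assuming a user-defined moderate function as described in the Usage section below, the reported metrics are selected via the metrics argument of benchmark; any of the keys listed above can be passed:

from guardbench import benchmark

benchmark(
    moderate=moderate,  # user-defined moderation function (see Usage)
    model_name="My Guardrail Model",
    datasets="all",
    metrics=["f1", "mcc", "auprc", "fpr"],
)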

⚡️ Introduction

GuardBench is a Python library for evaluating guardrail models, i.e., LLMs fine-tuned to detect unsafe content in human-AI interactions. GuardBench provides a common interface to 40 evaluation datasets, which are downloaded and converted into a standardized format for improved usability. It also lets you quickly compare results and export LaTeX tables for scientific publications. GuardBench's benchmarking pipeline can also be applied to custom datasets.

GuardBench was presented at EMNLP 2024. The related paper is available here.

GuardBench has a public leaderboard available on HuggingFace.

You can find the list of supported datasets here. A few of them require authorization; please read this.

If you use GuardBench to evaluate guardrail models for your scientific publications, please consider citing our work.

✨ Features

🔌 Requirements

python>=3.10

💾 Installation

pip install guardbench

💡 Usage

from guardbench import benchmark

def moderate(
    conversations: list[list[dict[str, str]]],  # MANDATORY!
    # additional `kwargs` as needed
) -> list[float]:
    # do moderation here and return a list of floats
    # (one "unsafe" probability per input conversation)
    ...

benchmark(
    moderate=moderate,  # User-defined moderation function
    model_name="My Guardrail Model",
    batch_size=1,              # Default value
    datasets="all",            # Default value
    metrics=["f1", "recall"],  # Default value
    # Note: you can pass additional `kwargs` for `moderate`
)
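
As a minimal sketch of what moderate might look like with the default Transformers backend, the following scores a guard-style model by comparing the logits of the "safe" and "unsafe" tokens at the first generated position. The model name, the assumption that both keywords map to single vocabulary tokens, and the use of the model's chat template are illustrative choices, not part of GuardBench's API.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choice: any model trained to answer "safe"/"unsafe" first works.
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumption: "safe" and "unsafe" are single tokens for this tokenizer.
SAFE_ID = tokenizer.convert_tokens_to_ids("safe")
UNSAFE_ID = tokenizer.convert_tokens_to_ids("unsafe")

@torch.inference_mode()
def moderate(conversations: list[list[dict[str, str]]]) -> list[float]:
    scores = []
    for conversation in conversations:
        input_ids = tokenizer.apply_chat_template(
            conversation, return_tensors="pt"
        ).to(model.device)
        # Logits the model assigns to the first token it would generate.
        next_token_logits = model(input_ids).logits[0, -1]
        pair = torch.softmax(next_token_logits[[SAFE_ID, UNSAFE_ID]], dim=-1)
        scores.append(pair[1].item())  # probability of "unsafe"
    return scores

This function can then be passed to benchmark exactly as in the snippet above.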

📖 Examples

📚 Documentation

Browse the documentation for more details.

🏆 Leaderboard

You can find GuardBench's leaderboard here. If you want to submit your results, please contact us.

👨‍💻 Authors

  • Elias Bassani (European Commission - Joint Research Centre)

🎓 Citation

@inproceedings{guardbench,
    title = "{G}uard{B}ench: A Large-Scale Benchmark for Guardrail Models",
    author = "Bassani, Elias  and
      Sanchez, Ignacio",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1022",
    doi = "10.18653/v1/2024.emnlp-main.1022",
    pages = "18393--18409",
}

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

📄 License

GuardBench is provided as open-source software licensed under EUPL v1.2.
