This benchmark prompts a model to classify a given input by producing either the "safe" or the "unsafe" keyword (optionally followed by the list of violated policies). We want to benchmark models with the vLLM inference engine instead of the default Transformers backend. Because the benchmark extracts the logits of the safe and unsafe keywords from the first token generated by the LLM, a naive implementation would transfer logits over the full vocabulary via HTTP from the vLLM server to our client. This is prohibitively expensive (vLLM becomes about 6x slower than Transformers), so we only transfer the top-k logits (e.g. k = 10). Assuming the model always assigns large logits to the safe/unsafe keywords (it has been trained to do so), and given that the GuardBench evaluation pipeline only uses the relative ratio of the safe and unsafe logits to compute the F1 and Recall metrics, the scores obtained with top-k logits are guaranteed to match those obtained with full-vocabulary logits. This has been verified across all 40 benchmarks in the GuardBench repository.
- Set up the environment:
```
uv pip install -r llama4guard_vllm_requirements.txt
```
- Serve the model with:
```
vllm serve meta-llama/Llama-Guard-4-12B -tp 1 --api-key EMPTY --logprobs-mode processed_logits --max-logprobs 10 --max-model-len 131072
```
- Evaluate the model with:
```
python vllm_server_mm_eval.py --model meta-llama/Llama-Guard-4-12B --datasets all --output_dir output_dir/Llama-Guard-4-12B --top_logprobs 10
```
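To illustrate the top-k approach, here is a minimal client-side sketch (not the actual `vllm_server_mm_eval.py` implementation) that queries the server started above and converts the top-k logprobs of the first generated token into an unsafe probability. The literal `safe`/`unsafe` token strings and the assumption that the model assigns a large logit to at least one of them are simplifications:

```python
import math

from openai import OpenAI

# Connect to the vLLM server started above (OpenAI-compatible endpoint).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def unsafe_probability(conversation: list[dict[str, str]]) -> float:
    response = client.chat.completions.create(
        model="meta-llama/Llama-Guard-4-12B",
        messages=conversation,
        max_tokens=1,       # only the first generated token is needed
        logprobs=True,
        top_logprobs=10,    # must not exceed --max-logprobs on the server
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    # With --logprobs-mode processed_logits, each `logprob` field holds the
    # raw logit of the corresponding token.
    logits = {entry.token.strip().lower(): entry.logprob for entry in top}
    safe = logits.get("safe", float("-inf"))
    unsafe = logits.get("unsafe", float("-inf"))
    # Softmax over the two keywords: only their relative ratio matters.
    return math.exp(unsafe) / (math.exp(unsafe) + math.exp(safe))
```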
- [October 9, 2025] GuardBench now supports four additional datasets: JBB Behaviors, NicheHazardQA, HarmEval, and TechHazardQA. It also allows choosing the metrics to report at the end of the evaluation. Supported metrics are: `precision` (Precision), `recall` (Recall), `f1` (F1), `mcc` (Matthews Correlation Coefficient), `auprc` (AUPRC), `sensitivity` (Sensitivity), `specificity` (Specificity), `g_mean` (G-Mean), `fpr` (False Positive Rate), and `fnr` (False Negative Rate).
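For example, the metric keys above can be passed via the `metrics` argument of the `benchmark` function (shown in the usage example further below). A minimal sketch with a trivial placeholder moderation function:

```python
from guardbench import benchmark

def moderate(conversations, **kwargs):
    # Trivial placeholder: flag every conversation as maximally unsafe.
    return [1.0 for _ in conversations]

# Report MCC, AUPRC, and G-Mean in addition to the default F1 and Recall.
benchmark(
    moderate=moderate,
    model_name="Always-Unsafe Baseline",
    metrics=["f1", "recall", "mcc", "auprc", "g_mean"],
)
```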
GuardBench is a Python library for the evaluation of guardrail models, i.e., LLMs fine-tuned to detect unsafe content in human-AI interactions.
GuardBench provides a common interface to 40 evaluation datasets, which are downloaded and converted into a standardized format for improved usability.
It also lets you quickly compare results and export LaTeX tables for scientific publications.
GuardBench's benchmarking pipeline can also be leveraged on custom datasets.
GuardBench was featured in EMNLP 2024.
The related paper is available here.
GuardBench has a public leaderboard available on HuggingFace.
You can find the list of supported datasets here. A few of them require authorization. Please read this.
If you use GuardBench to evaluate guardrail models for your scientific publications, please consider citing our work.
- 40 datasets for guardrail model evaluation.
- Automated evaluation pipeline.
- User-friendly.
- Extendable.
- Reproducible and sharable evaluation.
- Exportable evaluation reports.
GuardBench requires Python >= 3.10 and can be installed with:
```
pip install guardbench
```
Minimal usage example:
```python
from guardbench import benchmark

def moderate(
    conversations: list[list[dict[str, str]]],  # MANDATORY!
    # additional `kwargs` as needed
) -> list[float]:
    # do moderation
    # return a list of floats (unsafe probabilities)
    ...  # placeholder so the stub is valid Python

benchmark(
    moderate=moderate,  # User-defined moderation function
    model_name="My Guardrail Model",
    batch_size=1,  # Default value
    datasets="all",  # Default value
    metrics=["f1", "recall"],  # Default value
    # Note: you can pass additional `kwargs` for `moderate`
)
```
- Follow our tutorial on benchmarking Llama Guard with GuardBench.
- More examples are available in the `scripts` folder.
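As an additional illustration (not taken from the tutorial), here is a minimal sketch of a `moderate` function that wraps an off-the-shelf Hugging Face text classifier. The model name, its label scheme, and the decision to score only the last message are illustrative assumptions:

```python
from transformers import pipeline
from guardbench import benchmark

# Illustrative choice of model: any text classifier that outputs a
# toxicity/unsafety score could be plugged in here.
classifier = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def moderate(conversations: list[list[dict[str, str]]]) -> list[float]:
    # Score only the last message of each conversation (a simplification;
    # a real guardrail model would usually consider the full exchange).
    last_messages = [conversation[-1]["content"] for conversation in conversations]
    unsafe_probabilities = []
    for scores in classifier(last_messages, truncation=True):
        # Use the "toxic" label score as the unsafe probability.
        toxic = next((s["score"] for s in scores if s["label"] == "toxic"), 0.0)
        unsafe_probabilities.append(toxic)
    return unsafe_probabilities

benchmark(moderate=moderate, model_name="Toxic-BERT (illustrative)")
```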
Browse the documentation for more details about:
- The datasets and how to obtain them.
- The data format used by GuardBench.
- How to use the `Report` class to compare models and export results as LaTeX tables.
- How to leverage GuardBench's benchmarking pipeline on custom datasets.
You can find GuardBench's leaderboard here. If you want to submit your results, please contact us.
- Elias Bassani (European Commission - Joint Research Centre)
```bibtex
@inproceedings{guardbench,
title = "{G}uard{B}ench: A Large-Scale Benchmark for Guardrail Models",
author = "Bassani, Elias and
Sanchez, Ignacio",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1022",
doi = "10.18653/v1/2024.emnlp-main.1022",
pages = "18393--18409",
}
```
Would you like to see other features implemented? Please open a feature request.
GuardBench is provided as open-source software licensed under EUPL v1.2.