- Overview
- Features
- Installation
- Quick Start
- Datasets
- Methodologies
- Configuration
- Usage Examples
- Results
- Web Platform
- Repository Structure
- Citation
- Contributing
FactCheck is a comprehensive benchmark designed to evaluate the capabilities of Large Language Models (LLMs) in Knowledge Graph (KG) fact verification. This repository implements multiple verification methodologies and provides extensive evaluation across three real-world KG datasets comprising 13,530 facts in total.
- RQ1: How effective are LLMs in fact-checking KGs using only their internal knowledge?
- RQ2: Can LLMs effectively fact-check KGs using external evidence through RAG?
- RQ3: Do multi-model consensus approaches improve KG fact verification accuracy?
- ✅ Open-source LLMs can effectively verify KG facts (up to 0.90 balanced accuracy)
- ✅ RAG integration improves accuracy but increases computational cost (~10×)
- ✅ Multi-model consensus consistently outperforms individual models (+4.5% improvement)
- 🚧 For ablation study results, see Ablation Study Results.
- Multiple LLM Support: Both open-source (Gemma2, Qwen2.5, Llama3.1, Mistral) and commercial (GPT-4o mini) models
- Diverse Methodologies: Direct Knowledge Assessment (DKA), Guided Iterative Verification (GIV), RAG, and Multi-model Consensus
- Real-world Datasets: FactBench, YAGO, and DBpedia with 13,530 total facts
- RAG Dataset: 2+ million documents specifically curated for KG fact verification
- Mock API: Simulated API for testing and development -- refer to FactCheck MockAPI.
- Interactive Platform: Web-based exploration tool for verification analysis
- Comprehensive Evaluation: Balanced accuracy, F1-macro, efficiency metrics, and cost analysis
Prompt templates for each methodology are available in the prompts directory.
- Python 3.8+
- Ollama (for open-source models)
- Azure OpenAI API access (for commercial models)
- Clone the repository
git clone https://github.com/FactCheck-AI/factcheck-benchmark
cd factcheck-benchmark
- Install dependencies
pip install -r requirements.txt
- Install Ollama (for open-source models)
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama service
ollama serve
- Download required models
ollama pull gemma2:9b
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull mistral:7b
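Before running experiments, you can optionally verify that the required models are present. The snippet below is a convenience sketch, not part of the benchmark code: it queries Ollama's local HTTP API, which lists installed models at `/api/tags` on port 11434 by default; the file name `check_models.py` is purely illustrative.

# check_models.py -- optional sanity check, not part of the benchmark code
import requests

REQUIRED = {"gemma2:9b", "qwen2.5:7b", "llama3.1:8b", "mistral:7b"}

# Ollama lists locally available models at /api/tags
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
installed = {m["name"] for m in resp.json().get("models", [])}

missing = REQUIRED - installed
if missing:
    print("Missing models, run `ollama pull` for:", ", ".join(sorted(missing)))
else:
    print("All required models are installed.")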
python main.py # Run with default configuration in `config.yml`
Dataset | Facts | Predicates | Gold Accuracy | Description |
---|---|---|---|---|
FactBench | 2,800 | 10 | 0.54 | Systematically generated with balanced true/false distribution |
YAGO | 1,386 | 16 | 0.99 | High-quality facts with extreme class imbalance |
DBpedia | 9,344 | 1,092 | 0.85 | Diverse schema with extensive predicate coverage |
- 130,820 questions generated from KG facts
- 2,090,305 documents from Google SERP
- 87.4% text coverage rate
- Similarity scores for question relevance ranking
Basic fact verification using only the LLM's internal knowledge, without external guidance.
method:
name: "DKA"
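To make DKA concrete, here is a minimal sketch that asks a local Ollama model to judge a single triple using only its internal knowledge. The prompt wording, the example triple, and the output handling are illustrative assumptions; the benchmark's actual templates live in the prompts/ directory and methods/dka.py.

# Illustrative DKA-style call via Ollama's /api/generate endpoint.
# Prompt wording and parsing are assumptions; see prompts/ for the real templates.
import requests

triple = ("Barack Obama", "birthPlace", "Honolulu")
prompt = (
    "Is the following knowledge graph fact true or false? "
    "Answer with a single word (True/False).\n"
    f"Fact: <{triple[0]}, {triple[1]}, {triple[2]}>"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma2:9b", "prompt": prompt, "stream": False},
    timeout=120,
)
answer = resp.json()["response"].strip()
print("Model verdict:", answer)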
Enhanced verification with structured guidelines and examples.
GIV-Z (Zero-shot):
method:
name: "GIV-Z"
GIV-F (Few-shot):
method:
name: "GIV-F"
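GIV-Z and GIV-F differ only in whether labeled examples are prepended to the structured guidelines. The sketch below shows how a GIV-F style prompt could be assembled; the guideline text and few-shot examples are placeholders, not the repository's actual templates (see prompts/).

# Illustrative GIV-F prompt assembly; guidelines and examples are placeholders.
GUIDELINES = (
    "You are verifying knowledge graph facts. Follow these steps: "
    "(1) identify the subject and object entities, (2) recall what you know "
    "about the predicate, (3) answer True or False."
)

FEW_SHOT_EXAMPLES = [
    ("<Albert Einstein, birthPlace, Ulm>", "True"),
    ("<Albert Einstein, birthPlace, Paris>", "False"),
]

def build_giv_f_prompt(fact: str) -> str:
    examples = "\n".join(f"Fact: {f}\nAnswer: {a}" for f, a in FEW_SHOT_EXAMPLES)
    return f"{GUIDELINES}\n\n{examples}\n\nFact: {fact}\nAnswer:"

print(build_giv_f_prompt("<Marie Curie, birthPlace, Warsaw>"))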
Verification using external evidence from web search results.
method:
name: "RAG"
rag:
embedding_model: 'bge-small-en-v1.5'
chunking_strategy: 'sliding_window'
window_size: 3
similarity_cutoff: 0.3
top_k: 6
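To illustrate how these retrieval settings fit together, the sketch below implements sliding-window chunking, bge-small-en-v1.5 embeddings via the sentence-transformers package, and top-k selection with a similarity cutoff. The documents and helper names are assumptions made for illustration; methods/rag.py contains the actual pipeline.

# Illustrative retrieval step matching the RAG settings above
# (sliding-window chunks, bge-small-en-v1.5 embeddings, top_k with a cutoff).
# This is a sketch; see methods/rag.py for the actual implementation.
from sentence_transformers import SentenceTransformer, util

def sliding_window_chunks(sentences, window_size=3, stride=1):
    """Group consecutive sentences into overlapping chunks."""
    return [
        " ".join(sentences[i : i + window_size])
        for i in range(0, max(len(sentences) - window_size + 1, 1), stride)
    ]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

question = "Where was Barack Obama born?"
sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "He served as the 44th president of the United States.",
    "Honolulu is the capital of Hawaii.",
    "Obama studied law at Harvard University.",
]

chunks = sliding_window_chunks(sentences, window_size=3)
q_emb = model.encode(question, convert_to_tensor=True)
c_emb = model.encode(chunks, convert_to_tensor=True)

scores = util.cos_sim(q_emb, c_emb)[0]
ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)

# Keep at most top_k chunks above the similarity cutoff
top_k, cutoff = 6, 0.3
evidence = [(c, s) for c, s in ranked[:top_k] if s >= cutoff]
for chunk, score in evidence:
    print(f"{score:.2f}  {chunk}")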
Combines predictions from multiple models using majority voting with tie-breaking.
majority_vote:
mode: 'commercial' # Options: commercial, open_source
final_tie_breaker: 'most_consistent' # Options: least_consistent, most_consistent, Null (for commercial)
num_votes: 3 # Number of votes for each model
llms:
- "mistral:7B"
- "qwen2.5:7B"
- "llama3.1:7B"
- "gemma2:9B"
higher_parameter_model:
qwen2.5:7b: 'qwen2.5:7b'
mistral:7b: 'mistral:7b'
llama3.1:7b: 'llama3.1:latest'
gemma2:9b: 'gemma2:9b'
commercial_model:
- "gpt-4o-mini"
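The core of the consensus step is majority voting with a consistency-based tie-breaker. The function below is a simplified sketch of that logic; consensus.py is the authoritative implementation and covers the commercial mode and other edge cases.

# Simplified majority vote with tie-breaking; consensus.py is authoritative.
from collections import Counter

def majority_vote(predictions, consistency=None, tie_breaker="most_consistent"):
    """predictions: {model_name: "True"/"False"}; consistency: {model_name: float}."""
    counts = Counter(predictions.values())
    top = counts.most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]                      # clear majority
    # Tie: defer to the model with the highest (or lowest) consistency score
    if consistency and tie_breaker in ("most_consistent", "least_consistent"):
        pick = max if tie_breaker == "most_consistent" else min
        best_model = pick(consistency, key=consistency.get)
        return predictions[best_model]
    return top[0][0]                          # fallback

votes = {"gemma2:9b": "True", "qwen2.5:7b": "False",
         "mistral:7b": "True", "llama3.1:8b": "False"}
print(majority_vote(votes, consistency={"gemma2:9b": 0.95, "qwen2.5:7b": 0.90,
                                        "mistral:7b": 0.80, "llama3.1:8b": 0.85}))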
# Dataset configuration
dataset:
name: "FactBench" # Options: DBpedia, YAGO, FactBench
# Method configuration
method:
name: "DKA" # Options: DKA, GIV-Z, GIV-F, RAG
# LLM configuration
llm:
mode: "open_source" # Options: commercial, open_source
model: "gemma2:9B"
parameters:
temperature: 0.75
top_p: 0.9
max_tokens: 512
# Evaluation configuration
evaluation:
metrics:
accuracy: 'balanced' # Options: balanced, normal
f1_score: "macro" # Options: micro, macro, weighted
# Knowledge Graph configuration
knowledge_graph:
kg_ids: ['correct_death_00106', 'correct_death_00040']
# Output configuration
output:
directory: "./results"
For commercial models, configure Azure OpenAI:
OpenAI:
azure_endpoint: "https://your-resource.openai.azure.com/"
api_key: "your-api-key"
api_version: "2024-02-15-preview"
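As a hedged sketch of how these settings come together at run time, the snippet below loads config.yml with PyYAML and, in commercial mode, builds an Azure OpenAI client from the OpenAI section. It assumes the keys shown above all live in the same config.yml; config.py performs the real validation, and the prompt text is illustrative.

# Illustrative only: load config.yml and, in commercial mode, build an
# Azure OpenAI client. config.py performs the real validation.
import yaml
from openai import AzureOpenAI

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print("Dataset:", cfg["dataset"]["name"], "| Method:", cfg["method"]["name"])

if cfg["llm"]["mode"] == "commercial":
    client = AzureOpenAI(
        azure_endpoint=cfg["OpenAI"]["azure_endpoint"],
        api_key=cfg["OpenAI"]["api_key"],
        api_version=cfg["OpenAI"]["api_version"],
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure deployment name
        messages=[{"role": "user", "content": "Is <Rome, capitalOf, Italy> true or false?"}],
        temperature=cfg["llm"]["parameters"]["temperature"],
        max_tokens=cfg["llm"]["parameters"]["max_tokens"],
    )
    print(reply.choices[0].message.content)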
python main.py
python evaluation.py --file results/factbench_results.json
For a full evaluation, use the `--full` flag to include all metrics:
python evaluation.py --file results/factbench_results.json --full
This module is interactive. You can run it as follows:
python consensus.py --dataset FactBench
If you don't specify files, it will prompt you to select which files to use for the consensus. Example output:
Found 3 files for FactBench:
1. FactBench_open-source_gemma2:9b_rag_20250527-103716.json
Model: open-source_gemma2:9b, Method: rag
Facts: 2, Success: 100.0%
2. FactBench_open-source_qwen2.5:7B_rag_20250527-103404.json
Model: open-source_qwen2.5:7B, Method: rag
Facts: 2, Success: 100.0%
3. FactBench_open-source_qwen2.5:7B_rag_20250527-103603.json
Model: open-source_qwen2.5:7B, Method: rag
Facts: 2, Success: 100.0%
How many files do you want to select? (1-3):
Alternatively, you can specify the files to use for the consensus directly:
python consensus.py --files results/factbench_open-source_gemma2:9b_rag_20250527-103716.json results/factbench_open-source_qwen2.5:7B_rag_20250527-103404.json
- TODO: Add parallel processing for the consensus step.
Method | FactBench BAcc | YAGO BAcc | DBpedia BAcc | Avg Time/Fact |
---|---|---|---|---|
DKA | 0.72 | 0.53 | 0.64 | ~0.3s |
GIV-F | 0.74 | 0.58 | 0.65 | ~0.8s |
RAG | 0.90 | 0.56 | 0.67 | ~2.3s |
Consensus | 0.90 | 0.64 | 0.68 | ~1.5s |
- Model Rankings: Gemma2 > Qwen2.5 > Mistral > Llama3.1 > GPT-4o mini
- Dataset Difficulty: FactBench (easiest) > DBpedia > YAGO (hardest due to class imbalance)
- Cost-Performance Trade-off: RAG provides best accuracy but 10× computational cost
- Consensus Benefits: 1-5% improvement over individual models
Explore verification results interactively at: https://factcheck.dei.unipd.it/
- Fact Search: Find specific KG triples and their verification results
- Step-by-step Analysis: Inspect RAG pipeline components
- Model Comparison: Compare reasoning patterns across different LLMs
- Error Analysis: Categorized failure analysis with systematic insights
- User Feedback: Collaborative annotation and feedback system
factcheck-benchmark/
├── config.yml # Main configuration file
├── main.py # Entry point for experiments
├── config.py # Configuration validation and management
├── data_loader.py # Dataset loading and preprocessing
├── llm_client.py # LLM client implementations
├── evaluate.py # Evaluation metrics and analysis
├── requirements.txt # Python dependencies
├── prompts/ # Prompt templates for each methodology
├── consensus.py # Multi-model consensus implementation
├── rag_dataset/ # Filtered RAG dataset; see the Mock API for the complete dataset
├── methods/
│ ├── dka.py # Direct Knowledge Assessment
│ ├── giv.py # Guided Iterative Verification
│ └── rag.py # Retrieval-Augmented Generation
├── dataset/
│ ├── FactBench/
│ ├── YAGO/
│ └── DBpedia/
├── results/ # Output directory for results
└── README.md # This file
- `config.py`: Comprehensive configuration validation with support for multiple LLM providers
- `evaluate.py`: Scikit-learn based evaluation with balanced accuracy and F1-macro metrics
- `methods/`: Implementation of all verification methodologies
- `prompts/`: Prompt templates for each methodology
# Balanced Accuracy (addresses class imbalance)
BAcc = (Sensitivity + Specificity) / 2
# F1-Macro Score (unweighted average across classes)
F1_macro = (1/N) * Σ(2 * Precision_i * Recall_i / (Precision_i + Recall_i))
# Consistency (model agreement)
Consistency = |{f ∈ F | response(m,f) = majorityVote(f)}| / |F|
# Efficiency
Time_per_fact = average_response_time_excluding_outliers
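The first two metrics map directly onto scikit-learn, which evaluate.py builds on. A minimal illustration with dummy labels:

# Minimal illustration of the accuracy metrics with scikit-learn (dummy labels).
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # gold labels (1 = fact is true)
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]   # model verdicts

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1-macro:", f1_score(y_true, y_pred, average="macro"))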
- Ollama Connection Error
# Ensure Ollama is running
ollama serve
# Check available models
ollama list
- Memory Issues
# Reduce batch size or use smaller models
config["llm"]["model"] = "gemma2:2b" # Instead of 9b
If you use this benchmark in your research, please cite:
@article{shami2025factcheck,
title={Knowledge Graph Validation via Large Language Models},
author={Shami, Farzad and Marchesin, Stefano and Silvello, Gianmaria},
journal={},
volume={14},
number={1},
pages={XXX-XXX},
year={2025},
publisher={}
}
We welcome contributions! Please see our Contributing Guidelines for details.
- New Datasets: Integration of additional KG datasets
- Model Support: Adding support for new LLM architectures
- Evaluation Metrics: Implementation of additional evaluation measures
- Optimization: Performance improvements and efficiency enhancements (a particularly welcome area of contribution)
This project is licensed under the MIT License - see the LICENSE file for details.
- Paper: 2025 Proceedings
- Dataset: Hugging Face Repository
- Web Platform: https://factcheck.dei.unipd.it/
- Issues: GitHub Issues
- Farzad Shami - University of Padua - farzad.shami@studenti.unipd.it
- Stefano Marchesin - University of Padua - stefano.marchesin@unipd.it
- Gianmaria Silvello - University of Padua - gianmaria.silvello@unipd.it
This work is partially supported by the HEREDITARY Project (EU Horizon Europe Grant Agreement No. 101137074).