
FactCheck is a general-purpose benchmark designed to verify facts in knowledge graphs (KGs) using LLMs, including both open-source and commercial models.


FactCheck: Knowledge Graph Validation via Large Language Models

Python 3.8+ · License: MIT · Paper

🎯 Overview

FactCheck is a comprehensive benchmark designed to evaluate the capabilities of Large Language Models (LLMs) in Knowledge Graph (KG) fact verification. This repository implements multiple verification methodologies and provides extensive evaluation across three real-world KG datasets comprising 13,530 facts.

Key Research Questions

  1. RQ1: How effective are LLMs in fact-checking KGs using only their internal knowledge?
  2. RQ2: Can LLMs effectively fact-check KGs using external evidence through RAG?
  3. RQ3: Do multi-model consensus approaches improve KG fact verification accuracy?

Main Findings

  • ✅ Open-source LLMs can effectively verify KG facts (up to 0.90 balanced accuracy)
  • ✅ RAG integration improves accuracy but increases computational cost (~10×)
  • ✅ Multi-model consensus consistently outperforms individual models (+4.5% improvement)
  • 🚧 For ablation study results, see Ablation Study Results.

🚀 Features

  • Multiple LLM Support: Both open-source (Gemma2, Qwen2.5, Llama3.1, Mistral) and commercial (GPT-4o mini) models
  • Diverse Methodologies: Direct Knowledge Assessment (DKA), Guided Iterative Verification (GIV), RAG, and Multi-model Consensus
  • Real-world Datasets: FactBench, YAGO, and DBpedia with 13,530 total facts
  • RAG Dataset: 2+ million documents specifically curated for KG fact verification
  • Mock API: Simulated API for testing and development -- refer to FactCheck MockAPI.
  • Interactive Platform: Web-based exploration tool for verification analysis
  • Comprehensive Evaluation: Balanced accuracy, F1-macro, efficiency metrics, and cost analysis

📑 Prompt templates

Prompt templates for each methodology are available in the prompts directory.


📦 Installation

Prerequisites

  • Python 3.8+
  • Ollama (for open-source models)
  • Azure OpenAI API access (for commercial models)

Setup

  1. Clone the repository
git clone https://github.com/FactCheck-AI/factcheck-benchmark
cd factcheck-benchmark
  2. Install dependencies
pip install -r requirements.txt
  3. Install Ollama (for open-source models)
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve
  4. Download required models
ollama pull gemma2:9b
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull mistral:7b

🚀 Quick Start

Basic Fact Verification

python main.py # Run with default configuration in `config.yml`

📊 Datasets

Supported Datasets

| Dataset   | Facts | Predicates | Gold Accuracy | Description                                                     |
|-----------|-------|------------|---------------|-----------------------------------------------------------------|
| FactBench | 2,800 | 10         | 0.54          | Systematically generated with balanced true/false distribution  |
| YAGO      | 1,386 | 16         | 0.99          | High-quality facts with extreme class imbalance                 |
| DBpedia   | 9,344 | 1,092      | 0.85          | Diverse schema with extensive predicate coverage                |

RAG Dataset

  • 130,820 questions generated from KG facts
  • 2,090,305 documents from Google SERP
  • 87.4% text coverage rate
  • Similarity scores for question relevance ranking
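
The snippet below is a minimal, hypothetical sketch of how these documents could be filtered by their question-relevance similarity before indexing; the field names (question, documents, similarity) are assumptions, not the dataset's documented schema.

import json

# Hypothetical sketch: keep only SERP documents whose question-relevance
# similarity score passes a threshold before they are indexed for retrieval.
# Field names are assumed, not taken from the actual dataset schema.
def load_relevant_documents(path, min_similarity=0.3):
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    relevant = []
    for record in records:
        for doc in record.get("documents", []):
            if doc.get("similarity", 0.0) >= min_similarity:
                relevant.append(doc)
    return relevant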

🔬 Methodologies

1. Direct Knowledge Assessment (DKA)

Basic fact verification using only the LLM's internal knowledge, without external guidance.

method:
  name: "DKA"

2. Guided Iterative Verification (GIV)

Enhanced verification with structured guidelines and examples.

GIV-Z (Zero-shot):

method:
  name: "GIV-Z"

GIV-F (Few-shot):

method:
  name: "GIV-F"

3. Retrieval-Augmented Generation (RAG)

Verification using external evidence from web search results.

method:
  name: "RAG"

rag:
  embedding_model: 'bge-small-en-v1.5'
  chunking_strategy: 'sliding_window'
  window_size: 3
  similarity_cutoff: 0.3
  top_k: 6
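
The configuration above reads as: split each retrieved document into overlapping windows of three sentences, embed the question and the chunks with bge-small-en-v1.5, discard chunks below a cosine similarity of 0.3, and keep the top six. The sketch below illustrates that retrieval step with sentence-transformers; it is an assumption about the pipeline, not the repository's rag.py.

from sentence_transformers import SentenceTransformer, util

# Illustrative retrieval step matching the config above (assumed implementation,
# not the repository's rag.py): sliding-window chunking, bge embeddings,
# cosine-similarity cutoff, top-k selection.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def sliding_window_chunks(sentences, window_size=3):
    # Overlapping windows of `window_size` consecutive sentences
    return [" ".join(sentences[i:i + window_size])
            for i in range(max(len(sentences) - window_size + 1, 1))]

def retrieve(question, sentences, window_size=3, similarity_cutoff=0.3, top_k=6):
    chunks = sliding_window_chunks(sentences, window_size)
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [(chunk, score) for chunk, score in ranked if score >= similarity_cutoff][:top_k]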

4. Multi-model Consensus

Combines predictions from multiple models using majority voting with tie-breaking.

majority_vote:
  mode: 'commercial'  # Options: commercial, open_source
  final_tie_breaker: 'most_consistent' # Options: least_consistent, most_consistent, Null (for commercial)
  num_votes: 3 # Number of votes for each model
  llms:
    - "mistral:7B"
    - "qwen2.5:7B"
    - "llama3.1:7B"
    - "gemma2:9B"
  higher_parameter_model:
    qwen2.5:7b: 'qwen2.5:7b'
    mistral:7b: 'mistral:7b'
    llama3.1:8b: 'llama3.1:latest'
    gemma2:9b: 'gemma2:9b'
  commercial_model:
    - "gpt-4o-mini"

⚙️ Configuration

Example Configuration (config.yml)

# Dataset configuration
dataset:
  name: "FactBench"  # Options: DBpedia, YAGO, FactBench

# Method configuration
method:
  name: "DKA"  # Options: DKA, GIV-Z, GIV-F, RAG

# LLM configuration
llm:
  mode: "open_source"  # Options: commercial, open_source
  model: "gemma2:9B"
  parameters:
    temperature: 0.75
    top_p: 0.9
    max_tokens: 512

# Evaluation configuration
evaluation:
  metrics:
    accuracy: 'balanced'  # Options: balanced, normal
    f1_score: "macro"     # Options: micro, macro, weighted

# Knowledge Graph configuration
knowledge_graph:
  kg_ids: ['correct_death_00106', 'correct_death_00040']

# Output configuration
output:
  directory: "./results"

Azure OpenAI Configuration

For commercial models, configure Azure OpenAI:

OpenAI:
  azure_endpoint: "https://your-resource.openai.azure.com/"
  api_key: "your-api-key"
  api_version: "2024-02-15-preview"
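
With these credentials, a commercial-model call goes through the official openai package's AzureOpenAI client; the snippet below is a minimal sketch, not the repository's llm_client.py (the model argument is the Azure deployment name).

from openai import AzureOpenAI

# Minimal sketch of a commercial-model call (llm_client.py wraps this in the repository).
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-15-preview",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # Azure deployment name
    messages=[{"role": "user",
               "content": "Is the fact (Albert Einstein, birthPlace, Ulm) true or false?"}],
    temperature=0.75,
    max_tokens=512,
)
print(response.choices[0].message.content)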

💻 Usage Examples

1. Single Model Evaluation

python main.py

2. Batch Evaluation

python evaluation.py --file results/factbench_results.json

For a full evaluation, use the --full flag to include all metrics.

python evaluation.py --file results/factbench_results.json --full

3. Majority Vote Consensus

This module is interactive. You can run it as follows:

python consensus.py --dataset FactBench

If you don't specify files, it will prompt you to select which files to use for the consensus. Example output:

Found 3 files for FactBench:
   1. FactBench_open-source_gemma2:9b_rag_20250527-103716.json
      Model: open-source_gemma2:9b, Method: rag
      Facts: 2, Success: 100.0%
   2. FactBench_open-source_qwen2.5:7B_rag_20250527-103404.json
      Model: open-source_qwen2.5:7B, Method: rag
      Facts: 2, Success: 100.0%
   3. FactBench_open-source_qwen2.5:7B_rag_20250527-103603.json
      Model: open-source_qwen2.5:7B, Method: rag
      Facts: 2, Success: 100.0%

How many files do you want to select? (1-3): 

Alternatively, you can specify the files to use for the consensus directly:

python consensus.py --files results/factbench_open-source_gemma2:9b_rag_20250527-103716.json results/factbench_open-source_qwen2.5:7B_rag_20250527-103404.json

🚧 Todo:

  • Add parallel processing for the consensus

📈 Results

Performance Summary

| Method    | FactBench BAcc | YAGO BAcc | DBpedia BAcc | Avg Time/Fact |
|-----------|----------------|-----------|--------------|---------------|
| DKA       | 0.72           | 0.53      | 0.64         | ~0.3s         |
| GIV-F     | 0.74           | 0.58      | 0.65         | ~0.8s         |
| RAG       | 0.90           | 0.56      | 0.67         | ~2.3s         |
| Consensus | 0.90           | 0.64      | 0.68         | ~1.5s         |

Key Insights

  1. Model Rankings: Gemma2 > Qwen2.5 > Mistral > Llama3.1 > GPT-4o mini
  2. Dataset Difficulty: FactBench (easiest) > DBpedia > YAGO (hardest due to class imbalance)
  3. Cost-Performance Trade-off: RAG provides best accuracy but 10× computational cost
  4. Consensus Benefits: 1-5% improvement over individual models

🌐 Web Platform

Explore verification results interactively at: https://factcheck.dei.unipd.it/

Features:

  • Fact Search: Find specific KG triples and their verification results
  • Step-by-step Analysis: Inspect RAG pipeline components
  • Model Comparison: Compare reasoning patterns across different LLMs
  • Error Analysis: Categorized failure analysis with systematic insights
  • User Feedback: Collaborative annotation and feedback system

📁 Repository Structure

factcheck-benchmark/
├── config.yml                 # Main configuration file
├── main.py                    # Entry point for experiments
├── config.py                  # Configuration validation and management
├── data_loader.py             # Dataset loading and preprocessing
├── llm_client.py              # LLM client implementations
├── evaluate.py                # Evaluation metrics and analysis
├── requirements.txt           # Python dependencies
├── prompts/                   # Prompt templates for each methodology
├── consensus.py               # Multi-model consensus implementation
├── rag_dataset/               # Filtered RAG dataset; see the Mock API for the complete dataset
├── methods/
│   ├── dka.py                 # Direct Knowledge Assessment
│   ├── giv.py                 # Guided Iterative Verification
│   └── rag.py                 # Retrieval-Augmented Generation
├── dataset/
│   ├── FactBench/
│   ├── YAGO/
│   └── DBpedia/
├── results/                   # Output directory for results
└── README.md                  # This file

Key Files

  • config.py: Comprehensive configuration validation with support for multiple LLM providers
  • evaluate.py: Scikit-learn based evaluation with balanced accuracy and F1-macro metrics
  • methods/: Implementation of all verification methodologies
  • prompts/: Contains prompt templates for each methodology

📊 Evaluation Metrics

Implemented Metrics

# Balanced Accuracy (addresses class imbalance)
BAcc = (Sensitivity + Specificity) / 2

# F1-Macro Score (unweighted average across classes)  
F1_macro = (1/N) * Σ(2 * Precision_i * Recall_i / (Precision_i + Recall_i))

# Consistency (model agreement)
Consistency = |{f ∈ F : response(m, f) = majorityVote(f)}| / |F|

# Efficiency
Time_per_fact = average_response_time_excluding_outliers
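
Both headline metrics map directly onto scikit-learn, which evaluate.py builds on; a minimal sketch:

from sklearn.metrics import balanced_accuracy_score, f1_score

# Minimal sketch of the metric computation (evaluate.py is the full implementation).
y_true = ["TRUE", "TRUE", "FALSE", "FALSE", "TRUE"]
y_pred = ["TRUE", "FALSE", "FALSE", "FALSE", "TRUE"]

bacc = balanced_accuracy_score(y_true, y_pred)        # (sensitivity + specificity) / 2
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"BAcc={bacc:.3f}  F1-macro={f1_macro:.3f}")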

🛠️ Troubleshooting

Common Issues

  1. Ollama Connection Error
# Ensure Ollama is running
ollama serve

# Check available models
ollama list
  2. Memory Issues
# Reduce batch size or use smaller models
config["llm"]["model"] = "gemma2:2b"  # Instead of 9b

📚 Citation

If you use this benchmark in your research, please cite:

@article{shami2025factcheck,
  title={Knowledge Graph Validation via Large Language Models},
  author={Shami, Farzad and Marchesin, Stefano and Silvello, Gianmaria},
  journal={},
  volume={14},
  number={1},
  pages={XXX-XXX},
  year={2025},
  publisher={}
}

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Areas for Contribution

  • New Datasets: Integration of additional KG datasets
  • Model Support: Adding support for new LLM architectures
  • Evaluation Metrics: Implementation of additional evaluation measures
  • Optimization: Performance improvements and efficiency enhancements (a particularly important area)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

Farzad Shami, Stefano Marchesin, Gianmaria Silvello

This work is partially supported by the HEREDITARY Project (EU Horizon Europe Grant Agreement No. 101137074).
