FactCheck: Knowledge Graph Validation via Large Language Models

Python 3.8+ · License: MIT · Paper

🎯 Overview

FactCheck is a comprehensive benchmark designed to evaluate the capabilities of Large Language Models (LLMs) in Knowledge Graph (KG) fact verification. This repository implements multiple verification methodologies and provides extensive evaluation across three real-world KG datasets comprising 13,530 facts in total.

Key Research Questions

FactCheck addresses three fundamental research questions in the domain of LLM-based knowledge graph fact verification:

  1. RQ1: Internal Knowledge Assessment: How effective are Large Language Models at knowledge graph fact-checking when relying solely on their internal knowledge representations acquired during pre-training?

  2. RQ2: External Evidence Integration: Does augmenting LLMs with external evidence through Retrieval-Augmented Generation (RAG) methodologies improve their capability to verify knowledge graph facts, and at what computational cost?

  3. RQ3: Multi-Model Consensus: Does aggregating predictions from multiple LLMs through consensus mechanisms lead to more reliable and robust verification of knowledge graph facts compared to individual model performance?

Main Findings

Our evaluation across three real-world KG datasets (FactBench, YAGO, DBpedia) reveals several critical insights:

Performance Capabilities

  • Promising but Limited: While LLMs demonstrate promising fact verification capabilities, they remain insufficient for reliable deployment in real-world KG validation scenarios
  • Model Hierarchy: Open-source models achieve competitive performance, with Gemma2 consistently outperforming others (balanced accuracy up to 0.90 on FactBench with RAG)
  • Dataset Sensitivity: Performance varies significantly across datasets due to class imbalance and schema complexity

External Evidence Integration

  • Inconsistent Improvements: RAG integration yields fluctuating performance gains and does not consistently improve over the leaner internal-knowledge approaches
  • Computational Trade-offs: External evidence integration increases computational overhead by approximately 10× processing time (0.3s → 2.0s+ per fact)
  • Context Dependency: RAG effectiveness is highly dependent on dataset characteristics and retrieval quality

Multi-Model Consensus

  • Modest Gains: Consensus strategies provide an average 4.5% balanced-accuracy improvement over individual models in knowledge-constrained scenarios
  • Limited Consistency: "Wisdom of the crowd" approaches fail to consistently outperform individual models across all experimental conditions
  • Resource Implications: Consensus methods require 1-1.5× additional computational resources while providing marginal accuracy improvements

Methodological Insights

  • Structured Prompting: Few-shot guided iterative verification (GIV-F) consistently outperforms zero-shot approaches
  • Evidence Quality: External evidence effectiveness is constrained by retrieval noise and source reliability
  • Class Imbalance Challenges: Extreme class imbalance (as in YAGO with 99.2% correct facts) presents significant verification challenges

Research Implications

These findings underscore the urgent need for systematic benchmarking and highlight the complexity of automated KG fact verification, emphasizing that current LLM-based approaches require substantial advancement before practical deployment.


🚀 Features

  • Multiple LLM Support: Both open-source (Gemma2, Qwen2.5, Llama3.1, Mistral) and commercial (GPT-4o mini) models
  • Diverse Methodologies: Direct Knowledge Assessment (DKA), Guided Iterative Verification (GIV), RAG, and Multi-model Consensus
  • Real-world Datasets: FactBench, YAGO, and DBpedia with 13,530 total facts
  • RAG Dataset: 2+ million documents specifically curated for KG fact verification
  • Mock API: Simulated API for testing and development -- refer to FactCheck MockAPI.
  • Interactive Platform: Web-based exploration tool for verification analysis
  • Comprehensive Evaluation: Balanced accuracy, F1-macro, efficiency metrics, and cost analysis

📑 Prompt templates

Prompt templates for each methodology are available in the prompts directory.


📦 Installation

Prerequisites

  • Python 3.8+
  • Ollama (for open-source models)
  • Azure OpenAI API access (for commercial models)

Setup

  1. Clone the repository

git clone https://github.com/FactCheck-AI/factcheck-benchmark
cd factcheck-benchmark

  2. Install dependencies

pip install -r requirements.txt

  3. Install Ollama (for open-source models)

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

  4. Download required models

ollama pull gemma2:9b
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull mistral:7b

🚀 Quick Start

Basic Fact Verification

python main.py # Run with default configuration in `config.yml`

📊 Datasets

Supported Datasets

Dataset     Facts   Predicates   Gold Accuracy   Description
FactBench   2,800   10           0.54            Systematically generated with balanced true/false distribution
YAGO        1,386   16           0.99            High-quality facts with extreme class imbalance
DBpedia     9,344   1,092        0.85            Diverse schema with extensive predicate coverage

RAG Dataset

  • 130,820 questions generated from KG facts
  • 2,090,305 documents from Google SERP
  • 87.4% text coverage rate
  • Similarity scores for question relevance ranking
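
As an orientation, the sketch below shows how retrieved documents could be filtered by their similarity score before being passed to a model. The file path and the "question"/"documents"/"similarity" field names are illustrative assumptions, not the actual schema shipped in rag_dataset/.

import json

# Hypothetical loader: keeps only documents whose similarity to the generated
# question meets a cutoff. Field names are assumptions, not the real schema.
def load_relevant_documents(path, cutoff=0.3):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    filtered = []
    for record in records:
        docs = [d for d in record.get("documents", [])
                if d.get("similarity", 0.0) >= cutoff]
        filtered.append({"question": record.get("question"), "documents": docs})
    return filtered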

🔬 Methodologies

1. Direct Knowledge Assessment (DKA)

Basic fact verification using only the LLM's internal knowledge, without external guidance.

method:
  name: "DKA"

2. Guided Iterative Verification (GIV)

Enhanced verification with structured guidelines and examples.

GIV-Z (Zero-shot):

method:
  name: "GIV-Z"

GIV-F (Few-shot):

method:
  name: "GIV-F"

3. Retrieval-Augmented Generation (RAG)

Verification using external evidence from web search results.

method:
  name: "RAG"

rag:
  embedding_model: 'bge-small-en-v1.5'
  chunking_strategy: 'sliding_window'
  window_size: 3
  similarity_cutoff: 0.3
  top_k: 6
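
The retrieval step implied by these settings can be sketched as follows: sentences are grouped into sliding windows, embedded with bge-small-en-v1.5, and the top-k chunks above the similarity cutoff are kept. Function names here are illustrative and do not mirror methods/rag.py.

from sentence_transformers import SentenceTransformer, util

# Illustrative retrieval sketch using the settings above (window_size=3,
# similarity_cutoff=0.3, top_k=6); not the exact code in methods/rag.py.
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def sliding_window_chunks(sentences, window_size=3):
    # Overlapping windows of consecutive sentences, stride 1.
    if len(sentences) <= window_size:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + window_size])
            for i in range(len(sentences) - window_size + 1)]

def retrieve(question, sentences, top_k=6, similarity_cutoff=0.3):
    chunks = sliding_window_chunks(sentences)
    scores = util.cos_sim(encoder.encode(question, convert_to_tensor=True),
                          encoder.encode(chunks, convert_to_tensor=True))[0]
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [(chunk, score) for chunk, score in ranked[:top_k] if score >= similarity_cutoff]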

4. Multi-model Consensus

Combines predictions from multiple models using majority voting with tie-breaking.

majority_vote:
  mode: 'commercial'  # Options: commercial, open_source
  final_tie_breaker: 'most_consistent' # Options: least_consistent, most_consistent, Null (for commercial)
  num_votes: 3 # Number of votes for each model
  llms:
    - "mistral:7B"
    - "qwen2.5:7B"
    - "llama3.1:7B"
    - "gemma2:9B"
  higher_parameter_model:
    qwen2.5:7b: 'qwen2.5:7b'
    mistral:7b: 'mistral:7b'
    llama3.1:7b: 'llama3.1:latest'
    gemma2:9b: 'gemma2:9b'
  commercial_model:
    - "gpt-4o-mini"

⚙️ Configuration

Example Configuration (config.yml)

# Dataset configuration
dataset:
  name: "FactBench"  # Options: DBpedia, YAGO, FactBench

# Method configuration
method:
  name: "DKA"  # Options: DKA, GIV-Z, GIV-F, RAG

# LLM configuration
llm:
  mode: "open_source"  # Options: commercial, open_source
  model: "gemma2:9B"
  parameters:
    temperature: 0.75
    top_p: 0.9
    max_tokens: 512

# Evaluation configuration
evaluation:
  metrics:
    accuracy: 'balanced'  # Options: balanced, normal
    f1_score: "macro"     # Options: micro, macro, weighted

# Knowledge Graph configuration
knowledge_graph:
  kg_ids: ['correct_death_00106', 'correct_death_00040']

# Output configuration
output:
  directory: "./results"

Azure OpenAI Configuration

For commercial models, configure Azure OpenAI:

OpenAI:
  azure_endpoint: "https://your-resource.openai.azure.com/"
  api_key: "your-api-key"
  api_version: "2024-02-15-preview"
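
With those settings, a call through the official openai Python package (v1+) looks roughly like the sketch below; the deployment name and prompt are illustrative, and llm_client.py is the authoritative implementation.

from openai import AzureOpenAI

# Illustrative Azure OpenAI call; endpoint, key, and deployment name are placeholders.
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-02-15-preview",
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # name of your Azure deployment
    messages=[{"role": "user",
               "content": "Is the fact (Rome, capital, Italy) correct? Answer True or False."}],
)
print(completion.choices[0].message.content)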

💻 Usage Examples

1. Single Model Evaluation

python main.py

2. Batch Evaluation

python evaluation.py --file results/factbench_results.json

For a full evaluation, use the --full flag to include all metrics.

python evaluation.py --file results/factbench_results.json --full

3. Majority Vote Consensus

This module is interactive. You can run it as follows:

python consensus.py --dataset FactBench

If you don't specify files, it will prompt you to choose which result files to use for the consensus. Example output:

Found 3 files for FactBench:
   1. FactBench_open-source_gemma2:9b_rag_20250527-103716.json
      Model: open-source_gemma2:9b, Method: rag
      Facts: 2, Success: 100.0%
   2. FactBench_open-source_qwen2.5:7B_rag_20250527-103404.json
      Model: open-source_qwen2.5:7B, Method: rag
      Facts: 2, Success: 100.0%
   3. FactBench_open-source_qwen2.5:7B_rag_20250527-103603.json
      Model: open-source_qwen2.5:7B, Method: rag
      Facts: 2, Success: 100.0%

How many files do you want to select? (1-3): 

Alternatively, you can specify the files to use for the consensus directly:

python consensus.py --files results/factbench_open-source_gemma2:9b_rag_20250527-103716.json results/factbench_open-source_qwen2.5:7B_rag_20250527-103404.json

🚧 Todo:

  • Add parallel processing for the consensus

📈 Results

Performance Summary

Method      FactBench BAcc   YAGO BAcc   DBpedia BAcc   Avg Time/Fact
DKA         0.72             0.53        0.64           ~0.3s
GIV-F       0.74             0.58        0.65           ~0.8s
RAG         0.90             0.56        0.67           ~2.3s
Consensus   0.90             0.64        0.68           ~1.5s

Key Insights

  1. Model Rankings: Gemma2 > Qwen2.5 > Mistral > Llama3.1 > GPT-4o mini
  2. Dataset Difficulty: FactBench (easiest) > DBpedia > YAGO (hardest due to class imbalance)
  3. Cost-Performance Trade-off: RAG provides best accuracy but 10× computational cost
  4. Consensus Benefits: 1-5% improvement over individual models

🌐 Web Platform

Explore verification results interactively at: https://factcheck.dei.unipd.it/

Features:

  • Fact Search: Find specific KG triples and their verification results
  • Step-by-step Analysis: Inspect RAG pipeline components
  • Model Comparison: Compare reasoning patterns across different LLMs
  • Error Analysis: Categorized failure analysis with systematic insights
  • User Feedback: Collaborative annotation and feedback system

📁 Repository Structure

factcheck-benchmark/
├── config.yml                 # Main configuration file
├── main.py                    # Entry point for experiments
├── config.py                  # Configuration validation and management
├── data_loader.py             # Dataset loading and preprocessing
├── llm_client.py              # LLM client implementations
├── evaluate.py                # Evaluation metrics and analysis
├── requirements.txt           # Python dependencies
├── prompts/                   # Prompt templates for each methodology
├── consensus.py               # Multi-model consensus implementation
├── rag_dataset/               # Filtered RAG dataset; see the Mock API for the complete dataset
├── methods/
│   ├── dka.py                 # Direct Knowledge Assessment
│   ├── giv.py                 # Guided Iterative Verification
│   └── rag.py                 # Retrieval-Augmented Generation
├── dataset/
│   ├── FactBench/
│   ├── YAGO/
│   └── DBpedia/
├── results/                   # Output directory for results
└── README.md                  # This file

Key Files

  • config.py: Comprehensive configuration validation with support for multiple LLM providers
  • evaluate.py: Scikit-learn based evaluation with balanced accuracy and F1-macro metrics
  • methods/: Implementation of all verification methodologies
  • prompts/: Contains prompt templates for each methodology

📊 Evaluation Metrics

Implemented Metrics

# Balanced Accuracy (addresses class imbalance)
BAcc = (Sensitivity + Specificity) / 2

# F1-Macro Score (unweighted average across classes)  
F1_macro = (1/N) * Σ(2 * Precision_i * Recall_i / (Precision_i + Recall_i))

# Consistency (model agreement)
Consistency = |{f ∈ F | response(m, f) = majorityVote(f)}| / |F|

# Efficiency
Time_per_fact = average_response_time_excluding_outliers
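
These metrics map directly onto scikit-learn, which the evaluation code is built on; below is a minimal sketch with made-up labels.

from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy labels purely for illustration (1 = correct fact, 0 = incorrect fact).
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

bacc = balanced_accuracy_score(y_true, y_pred)          # (sensitivity + specificity) / 2
f1_macro = f1_score(y_true, y_pred, average="macro")    # unweighted mean of per-class F1
print(f"BAcc={bacc:.3f}, F1-macro={f1_macro:.3f}")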

🛠️ Troubleshooting

Common Issues

  1. Ollama Connection Error

# Ensure Ollama is running
ollama serve

# Check available models
ollama list

  2. Memory Issues

# Reduce batch size or use smaller models
config["llm"]["model"] = "gemma2:2b"  # Instead of 9b

🤝 Contributing

We welcome contributions!

Areas for Contribution

  • New Datasets: Integration of additional KG datasets
  • Model Support: Adding support for new LLM architectures
  • Evaluation Metrics: Implementation of additional evaluation measures
  • Optimization: Performance and efficiency improvements

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors


This work is partially supported by the HEREDITARY Project (EU Horizon Europe Grant Agreement No. GA 101137074).
