ZPD-Numbers-2025-2026/Inter-observer-reliability

Inter-Observer Reliability Analysis

Statistical validation of labeling consistency across three independent raters for a handwritten digit classification dataset.

ZPD-Numbers -- 2025/2026


About

This repository contains the inter-observer reliability analysis for a handwritten digit classification task. Three raters -- Michal, Olivier, and Vincenzo -- independently assigned each of 800 images to one of 10 categories (digits 0 through 9). The goal is to verify statistically that the human annotations are consistent before using them as ground truth.

Methodology

The analysis applies standard inter-rater reliability metrics:

Metric          Scope                         Result
Fleiss' Kappa   All 3 raters simultaneously   0.9981 (almost perfect)
Cohen's Kappa   Michal vs Olivier             0.9972
Cohen's Kappa   Michal vs Vincenzo            0.9972
Cohen's Kappa   Olivier vs Vincenzo           1.0000 (perfect)
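
To make the two metrics concrete, here is a minimal pure-Python sketch of both formulas. It is not the notebook's code: the dependency list suggests the notebook relies on library routines such as scikit-learn's `cohen_kappa_score` and statsmodels' `fleiss_kappa`, and the function names below are hypothetical. The labels are synthetic and assume 80 images per digit with two 8/9 swaps by one rater; the real class distribution is not stated here.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one tuple of labels per item,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()  # how often each category was assigned overall
    p_bar = 0.0         # mean per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Synthetic stand-in data (assumption): 800 items, 80 per digit.
labels_b = [digit for digit in range(10) for _ in range(80)]
labels_c = list(labels_b)
labels_a = list(labels_b)
labels_a[8 * 80] = 9  # an item the other two raters labeled 8
labels_a[9 * 80] = 8  # an item the other two raters labeled 9

print(round(cohen_kappa(labels_a, labels_b), 4))
print(round(cohen_kappa(labels_b, labels_c), 4))
print(round(fleiss_kappa(list(zip(labels_a, labels_b, labels_c))), 4))
```

Under the balanced-class assumption, chance agreement is 0.1 for both metrics, so two disagreements in 800 images leave Cohen's Kappa at (0.9975 - 0.1) / 0.9 and Fleiss' Kappa at (0.99833 - 0.1) / 0.9, which round to the values in the table.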

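Out of 800 images, only 2 showed any disagreement between raters, both involving confusion between digits 8 and 9. Because Olivier and Vincenzo agreed on every image (Cohen's Kappa of 1.0000), Michal's label was the differing one in both cases, so no image had all three raters disagree.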
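
A short sketch of how such disagreements could be located and checked for a two-to-one majority. The input layout (rater name mapped to a dictionary of image id and label), the function name, and the image ids are all hypothetical; the column structure of the CSV files is not described here.

```python
from collections import Counter


def find_disagreements(labels_by_rater):
    """List images whose labels differ between raters.

    `labels_by_rater` maps rater name -> {image_id: label} (hypothetical layout).
    Returns (image_id, {rater: label}, majority_label_or_None) tuples.
    """
    raters = list(labels_by_rater)
    image_ids = sorted(labels_by_rater[raters[0]])
    result = []
    for image_id in image_ids:
        votes = {r: labels_by_rater[r][image_id] for r in raters}
        counts = Counter(votes.values())
        if len(counts) == 1:
            continue  # unanimous label, nothing to report
        label, freq = counts.most_common(1)[0]
        # A majority exists only if more than half the raters share a label.
        majority = label if freq > len(raters) / 2 else None
        result.append((image_id, votes, majority))
    return result


# Made-up example: one image with an 8/9 confusion by a single rater.
example = {
    "rater_a": {"img_001": 3, "img_002": 9, "img_003": 8},
    "rater_b": {"img_001": 3, "img_002": 8, "img_003": 8},
    "rater_c": {"img_001": 3, "img_002": 8, "img_003": 8},
}
disagreements = find_disagreements(example)
for image_id, votes, majority in disagreements:
    print(image_id, votes, "majority:", majority)
```

With three raters, a `None` majority would mean all three chose different labels, which is the case that did not occur in this dataset.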

Repository Contents

inter_observer_reliability.ipynb   Main analysis notebook
inter_observer_reliability.html    Rendered notebook (viewable in browser)
__merged_michal_.csv               Labels from Rater 1
__merged_olivier_.csv              Labels from Rater 2
__merged_vincenzo_.csv             Labels from Rater 3

How to Run

pip install pandas numpy scikit-learn statsmodels jupyter
jupyter notebook inter_observer_reliability.ipynb

Interpretation

A Fleiss' Kappa of 0.9981 falls into the "almost perfect" agreement range (0.81 to 1.00) on the Landis and Koch scale. This indicates that the labeling process is highly reliable and that the resulting annotations can be used with confidence as ground truth for downstream tasks.

Kappa Range     Interpretation
< 0.00          Poor
0.00 -- 0.20    Slight
0.21 -- 0.40    Fair
0.41 -- 0.60    Moderate
0.61 -- 0.80    Substantial
0.81 -- 1.00    Almost Perfect
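
The scale above can be expressed as a small lookup. This is an illustrative sketch with a hypothetical function name; it assumes each band's upper bound is inclusive, so values that fall between the printed two-decimal bounds (for example 0.205) go to the higher band.

```python
def landis_koch_label(kappa):
    """Map a kappa value to its Landis and Koch interpretation."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1.0")
    if kappa < 0.0:
        return "Poor"
    # (inclusive upper bound, label), checked in ascending order.
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


print(landis_koch_label(0.9981))
```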

ZPD-Numbers-2025-2026 | Michal Tarnowski, Olivier, Vincenzo
