Statistical validation of labeling consistency across three independent raters for a handwritten digit classification dataset.
ZPD-Numbers -- 2025/2026
This repository contains the inter-observer reliability analysis for a handwritten digit classification task. Three raters -- Michal, Olivier, and Vincenzo -- independently labeled 800 images into 10 categories (digits 0 through 9). The goal is to statistically verify that the human annotations are consistent and trustworthy before using them as ground truth.
The analysis applies standard inter-rater reliability metrics:
| Metric | Scope | Result |
|---|---|---|
| Fleiss' Kappa | All 3 raters simultaneously | 0.9981 (almost perfect) |
| Cohen's Kappa | Michal vs Olivier | 0.9972 |
| Cohen's Kappa | Michal vs Vincenzo | 0.9972 |
| Cohen's Kappa | Olivier vs Vincenzo | 1.0000 (perfect) |
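
To make the arithmetic behind the table visible, here is a minimal sketch that computes both statistics from their definitions using only the Python standard library. It is not the notebook's code: the notebook presumably relies on scikit-learn and statsmodels (both appear in the install command), and the labels below are synthetic. The balanced 80-images-per-digit split and the exact positions of the two 8/9 swaps are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one tuple of labels per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_observed - p_expected) / (1 - p_expected)


# Synthetic labels (NOT the repository's data): 800 items, 80 per digit.
olivier = [d for d in range(10) for _ in range(80)]
vincenzo = list(olivier)
michal = list(olivier)
# Assumed disagreement pattern: one rater swaps 8 and 9 on two images.
michal[8 * 80] = 9  # an image the others labeled 8
michal[9 * 80] = 8  # an image the others labeled 9

raters = {"Michal": michal, "Olivier": olivier, "Vincenzo": vincenzo}
pairwise = {
    (r1, r2): cohen_kappa(raters[r1], raters[r2])
    for r1, r2 in combinations(raters, 2)
}
overall = fleiss_kappa(list(zip(michal, olivier, vincenzo)))

for pair, value in pairwise.items():
    print(pair, round(value, 4))
print("Fleiss:", round(overall, 4))
```

Under these assumptions the sketch yields values that round to the figures in the table (0.9972, 0.9972, 1.0000 pairwise and 0.9981 overall). The real label distribution may differ slightly from the balanced one assumed here.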
Out of 800 images, only 2 had any disagreement between raters -- both involving confusion between digits 8 and 9. No image had all three raters disagree.
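
A disagreement count like this can be obtained by comparing the raters' labels item by item. The sketch below is one possible way to do it. The helper name `find_disagreements` and the toy labels are hypothetical and do not come from the notebook.

```python
def find_disagreements(labels_by_rater):
    """Return (index, labels) for every item the raters did not label identically."""
    names = list(labels_by_rater)
    rows = zip(*(labels_by_rater[name] for name in names))
    return [
        (idx, dict(zip(names, row)))
        for idx, row in enumerate(rows)
        if len(set(row)) > 1
    ]


# Hypothetical toy labels, not the repository's data.
toy = {
    "Michal": [3, 8, 9, 1, 8],
    "Olivier": [3, 8, 8, 1, 9],
    "Vincenzo": [3, 8, 8, 1, 9],
}
disputed = find_disagreements(toy)
# Items where all three raters chose different labels.
three_way = [idx for idx, row in disputed if len(set(row.values())) == 3]

print(disputed)
print(three_way)
```

An empty `three_way` list corresponds to the statement that no image had all three raters disagree.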
| File | Description |
|---|---|
| `inter_observer_reliability.ipynb` | Main analysis notebook |
| `inter_observer_reliability.html` | Rendered notebook (viewable in browser) |
| `__merged_michal_.csv` | Labels from Rater 1 |
| `__merged_olivier_.csv` | Labels from Rater 2 |
| `__merged_vincenzo_.csv` | Labels from Rater 3 |
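
Before any agreement statistic is computed, the three label files need to be aligned so that each row refers to the same image. The sketch below shows one way to do that with the standard library. The column names `image` and `label` are assumptions, since the CSV layout is not documented here, and the in-memory strings stand in for the real files.

```python
import csv
import io


def read_labels(handle, id_col="image", label_col="label"):
    """Map image id -> integer label from one rater's CSV (column names assumed)."""
    return {row[id_col]: int(row[label_col]) for row in csv.DictReader(handle)}


def align(raters):
    """Keep only images labeled by every rater, in a stable sorted order."""
    shared = set.intersection(*(set(labels) for labels in raters.values()))
    ids = sorted(shared)
    return ids, {name: [labels[i] for i in ids] for name, labels in raters.items()}


# Stand-ins for the three CSV files; real code would use open(path, newline="").
fake_files = {
    "Michal": "image,label\nimg_001.png,8\nimg_002.png,3\n",
    "Olivier": "image,label\nimg_002.png,3\nimg_001.png,9\n",
    "Vincenzo": "image,label\nimg_001.png,9\nimg_002.png,3\n",
}
raters = {name: read_labels(io.StringIO(text)) for name, text in fake_files.items()}
ids, aligned = align(raters)

print(ids)
print(aligned)
```

Aligning on an image identifier rather than on row order guards against files that list the same images in a different sequence.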
To reproduce the analysis, install the dependencies and open the notebook:

pip install pandas numpy scikit-learn statsmodels jupyter
jupyter notebook inter_observer_reliability.ipynb

A Fleiss' Kappa of 0.9981 falls into the "almost perfect" agreement range (0.81 to 1.00) on the Landis and Koch scale. This indicates that the labeling process is highly reliable and that the resulting annotations can be used as ground truth for downstream tasks.
| Kappa Range | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00 -- 0.20 | Slight |
| 0.21 -- 0.40 | Fair |
| 0.41 -- 0.60 | Moderate |
| 0.61 -- 0.80 | Substantial |
| 0.81 -- 1.00 | Almost Perfect |
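
The scale above can be turned into a small lookup, sketched below. The function name `landis_koch` is hypothetical. Because the table leaves gaps between bands (for example between 0.20 and 0.21), this sketch assumes each band extends up to and including its upper bound.

```python
def landis_koch(kappa):
    """Map a kappa value to its Landis and Koch label."""
    if kappa < 0.0:
        return "Poor"
    # Upper bound of each band, in ascending order.
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(landis_koch(0.9981))
```

For the Fleiss' Kappa reported in this analysis, the lookup returns "Almost Perfect".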
ZPD-Numbers-2025-2026 | Michal Tarnowski, Olivier, Vincenzo