📄 Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches
Authors: Margarita Bugueño, Gerard de Melo
arXiv preprint: arXiv:2508.00864
This repository contains the code, data, and supplemental materials for our paper: "Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches", submitted to arXiv on 2025-07-18.
In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and F1 score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
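To make the graph construction described above more concrete, here is a minimal sketch of turning a sentence-level attention matrix into a filtered, weighted edge list. It is not the code in `src/`: the function name `attention_to_graph`, the `num_std` parameter, and the "mean plus one standard deviation" threshold are assumptions chosen for illustration, and the filtering rule used in the paper may differ.

```python
import numpy as np


def attention_to_graph(attn, num_std=1.0):
    """Build a weighted edge list from a (sentences x sentences) attention matrix.

    Assumption: an edge i -> j is kept when attn[i, j] is at least
    mean + num_std * std of all off-diagonal attention weights.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]

    # Ignore the diagonal so a sentence attending to itself creates no edge.
    off_diag = ~np.eye(n, dtype=bool)
    weights = attn[off_diag]

    # Statistical filter: keep only unusually strong sentence pairs.
    threshold = weights.mean() + num_std * weights.std()
    keep = off_diag & (attn >= threshold)

    src, dst = np.nonzero(keep)
    edges = [(int(i), int(j), float(attn[i, j])) for i, j in zip(src, dst)]
    return edges, float(threshold)


# Toy attention matrix for a four-sentence document (made-up values).
toy_attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
]

edges, threshold = attention_to_graph(toy_attention)
for source, target, weight in edges:
    print(f"sentence {source} -> sentence {target} (weight {weight:.2f})")
```

With this toy matrix, only the two strongest off-diagonal pairs survive the filter, so the resulting graph is much sparser than the full attention matrix. The edge list could then be passed to a GNN library of choice as edge indices and edge weights.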
├── config/
│ ├── GNN_config_heuristic_graphs/ # Config files including parameters for training GATs on heuristic-based graphs
│ ├── GNN_config_learned_graphs/ # Config files including parameters for training GATs on learned graphs
│ └── MHAClassifier_config/ # Config files including parameters for training our attention-based model
├── data/ # Raw and processed datasets
├── GNN_Results_Classifier/ # Results obtained from our runs
│ ├── Attention/ # Results from learned graphs
│ └── Heuristic_uni/ # Results from heuristic-based graphs
├── imgs/ # Figures with example adjacency matrices for each dataset
├── notebooks/ # Jupyter notebooks for data preparation (optional, as processed data is provided in this repository)
├── src/ # Source code for training and evaluation
│ ├── data/ # Data loaders and utils
│ ├── graphs/ # Graph-based architectures
│ ├── models/ # Core training models
│ └── pipeline/ # Connector for text-graph models
├── paper.pdf # Preprint version of the paper
├── README.md
├── requirements.txt # Python dependencies
├── train_GNN.py # GNN training script
└── train_MHAClassifier.py # MHA-based classifier training script

Due to file size restrictions, only the preprocessed versions of the Hyperpartisan News Detection (HND) and BBC News datasets are included in the data/ folder. The arXiv dataset can be downloaded from: Drive.
python train_MHAClassifier.py --config config/MHAClassifier_config/your_config_file.yaml
python train_GNN.py --config config/GNN_config_learned_graphs/your_config_file.yaml
python train_GNN.py --config config/GNN_config_heuristic_graphs/your_config_file.yaml