GitHub - Buguemar/AttnGraphs: Rethinking Graph-Based Document Classification.

📄 Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

Authors: Margarita Bugueño, Gerard de Melo
arXiv preprint: arXiv:2508.00864

📝 Abstract

This repository contains the code, data, and supplemental materials for our paper: "Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches", submitted to arXiv on 2025-07-18.

In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and F1 score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
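
To make the graph construction described above more concrete, here is a minimal sketch of how a sentence-level attention matrix could be turned into a filtered, weighted graph. It is not the code from src/. The function name attention_to_graph, the num_std parameter, and the specific rule "keep an edge if its weight exceeds mean + num_std * std of the off-diagonal weights" are all assumptions made for illustration; the filtering criterion actually used in the paper may differ.

```python
import numpy as np


def attention_to_graph(attn, num_std=1.0):
    """Convert a sentence-to-sentence attention matrix into a weighted edge list.

    attn:    (n, n) array; attn[i, j] is the attention sentence i pays to sentence j.
    num_std: hypothetical knob; an edge is kept if its weight is strictly greater
             than mean + num_std * std of all off-diagonal weights (assumed rule).
    Returns (edge_index, edge_weight, threshold), where edge_index has shape (2, E).
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2 or attn.shape[0] != attn.shape[1]:
        raise ValueError("attn must be a square matrix")
    n = attn.shape[0]
    if n < 2:
        # A single sentence has no sentence pairs, so the graph has no edges.
        return np.zeros((2, 0), dtype=int), np.zeros(0), float("nan")

    # Ignore self-attention (the diagonal) when computing the statistics.
    off_diag = ~np.eye(n, dtype=bool)
    weights = attn[off_diag]
    threshold = weights.mean() + num_std * weights.std()

    # Keep only sentence pairs whose attention weight clears the threshold.
    keep = off_diag & (attn > threshold)
    src, dst = np.nonzero(keep)
    edge_index = np.stack([src, dst])
    edge_weight = attn[src, dst]
    return edge_index, edge_weight, threshold


# Toy document with three sentences (made-up attention values).
toy_attn = np.array([
    [0.70, 0.10, 0.20],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
])
edge_index, edge_weight, threshold = attention_to_graph(toy_attn)
```

The (2, E) edge_index layout is chosen because it matches the edge format commonly expected by graph neural network libraries, so the result could be passed to a GAT with edge_weight as the edge attribute. Raising num_std yields sparser graphs; lowering it keeps more sentence pairs.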

📁 Repository Structure

├── config/
│   ├── GNN_config_heuristic_graphs/        # Config files with parameters for training GATs on heuristic-based graphs
│   ├── GNN_config_learned_graphs/          # Config files with parameters for training GATs on learned graphs
│   └── MHAClassifier_config/               # Config files with parameters for training our attention-based model
├── data/                                   # Raw and processed datasets
├── GNN_Results_Classifier/                 # Results obtained from our runs
│   ├── Attention/                          # Results from learned graphs
│   └── Heuristic_uni/                      # Results from heuristic-based graphs
├── imgs/                                   # Figures with adjacency matrix examples for each dataset
├── notebooks/                              # Jupyter notebooks for data preparation (optional, as processed data is provided in this repository)
├── src/                                    # Source code for training and evaluation
│   ├── data/                               # Data loaders and utils
│   ├── graphs/                             # Graph-based architectures
│   ├── models/                             # Core training models
│   └── pipeline/                           # Connector for text-graph models
├── paper.pdf                               # Preprint version of the paper
├── README.md
├── requirements.txt                        # Python dependencies
├── train_GNN.py                            # GNN training script
└── train_MHAClassifier.py                  # MHA-based classifier training script

📊 Datasets

Due to file size restrictions, only the preprocessed versions of the Hyperpartisan News Detection (HND) and BBC News datasets are included in the data/ folder. The arXiv dataset must be downloaded separately from: Drive.

🚀 Running the Code

python train_MHAClassifier.py --config config/MHAClassifier_config/your_config_file.yaml

python train_GNN.py --config config/GNN_config_learned_graphs/your_config_file.yaml

python train_GNN.py --config config/GNN_config_heuristic_graphs/your_config_file.yaml
