This repository contains code for testing various models that weakly label a dataset, and for the labeling process itself. This work is part of the 2025 Field-of-Research Classification (FoRC) Shared Task, which is co-located with SNLP 2025.
The FoRC shared task addresses the automatic classification of scientific research papers by their field of research. We focus on computational linguistics papers taken from the ACL Anthology, assigning each paper at least one of 181 hierarchically organized labels. The 2025 iteration adds a weakly labeled dataset of over 40,000 papers to the manually labeled dataset of 1,500 papers used in last year's iteration. More details about the shared task can be found here.
This code can:
- Convert data from the ACL Anthology corpus into the same format as the FoRC4CL dataset.
- Pre- and postprocess FoRC4CL-format datasets.
- Train and analyse simple ML models on the FoRC4CL train/test split.
- Weakly label the ACL Anthology corpus using the simple models.
- Train and score transformer models on the FoRC4CL train/test split.
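
As a rough illustration of the weak-labeling step, the sketch below turns per-label model scores into a label set for one paper. It is not the repository's implementation: the function name `weak_labels`, the `threshold` parameter, the fallback to the single best label, and the example scores are all assumptions made for illustration.

```python
from typing import Dict, List


def weak_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Pick the labels whose score reaches the threshold (hypothetical helper).

    Falls back to the single best-scoring label so that every paper
    ends up with at least one label.
    """
    if not scores:
        raise ValueError("scores must contain at least one label")
    chosen = sorted(label for label, score in scores.items() if score >= threshold)
    if not chosen:
        # No label is confident enough: keep only the top-scoring one.
        chosen = [max(scores, key=scores.get)]
    return chosen


# Hypothetical scores for one paper, not taken from the dataset.
paper_scores = {"Machine Translation": 0.81, "Parsing": 0.12, "Summarization": 0.55}
print(weak_labels(paper_scores))
```

A fixed threshold keeps the sketch short. In practice the cut-off would likely need tuning on the manually labeled split, and the label hierarchy is ignored here.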
Results will be published soon in an overview paper.
Contributions are welcome! Please check our CodaBench if you'd like to submit a solution!
For any questions, please contact maria.francis@dfki.de or open an issue in this repository.