Figure 1. The architecture of TextSSL.
We use the same benchmark datasets as Yao, Mao, and Luo 2019. For the MR, Ohsumed, and 20NG datasets, we follow the same train/test splits and data preprocessing as Kim 2014 and Yao, Mao, and Luo 2019. We thank the authors for their work.
The R8 and R52 datasets are only available in a preprocessed version that lacks punctuation and has no explicit sample names. Because we construct graphs from documents with sentence segmentation information, we re-extract the data from the original Reuters-21578 dataset.
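To illustrate why punctuation matters here, the following is a minimal sketch of punctuation-based sentence segmentation. It is not the repository's extraction code: the helper name `split_sentences`, the regular expression, and the sample text are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Split raw text into sentences on '.', '!' or '?' followed by whitespace.

    Hypothetical helper for illustration; not part of the TextSSL repository.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# The same made-up text, with and without punctuation.
raw = "Oil prices rose sharply. Analysts expect further gains! Will OPEC respond?"
stripped = "oil prices rose sharply analysts expect further gains will opec respond"

print(split_sentences(raw))       # sentence boundaries are recoverable
print(split_sentences(stripped))  # no boundaries left to split on
```

Once punctuation has been removed, the text comes back as a single undivided string, which is why a preprocessed version without punctuation cannot supply sentence-level structure.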
You can download the dataset here:
- re-extract R8 and R52 datasets.
python re-extract_data/mk_R8_R52.py --name R8
- remove words.
python remove_words.py --name R8
To run the code, change Your_path=/data/project/yinhuapark/ssl/ to your own path.
- create co-occurrence pairs for each document.
python ssl_make_graphs/create_cooc_document.py --name R8
- construct graphs for each document in an InMemoryDataset.
python ssl_make_graphs/PygDocsGraphDataset.py --name R8
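As a rough illustration of what a co-occurrence pair is, the sketch below counts unordered word pairs that fall inside the same sliding window. It is not the logic of `create_cooc_document.py`: the function name `cooccurrence_pairs`, the window size, and the counting scheme are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(tokens, window_size=3):
    """Count unordered word pairs that appear together in a sliding window.

    Hypothetical helper: the window size and counting scheme are assumptions
    for illustration, not the repository's actual settings.
    """
    counts = Counter()
    if len(tokens) <= window_size:
        windows = [tokens]
    else:
        windows = [tokens[i:i + window_size]
                   for i in range(len(tokens) - window_size + 1)]
    for window in windows:
        # sorted(set(...)) yields each distinct pair once per window,
        # in a canonical (alphabetical) order.
        for a, b in combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts

doc = ["graph", "neural", "networks", "classify", "documents"]
pairs = cooccurrence_pairs(doc, window_size=3)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Pairs counted this way could serve as candidate edges between word nodes in a document graph. Words that never share a window, such as "graph" and "documents" here, produce no pair.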
- train the model.
python ssl_graphmodels/pyg_models/train_docs.py --name R8
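The commands above can be chained in a small wrapper. The sketch below is a hypothetical convenience script, not part of the repository: it assumes it is run from the repository root and that `Your_path` has already been changed. The script paths and the `--name` flag are taken from this README, while `pipeline_commands`, `run_pipeline`, and `dry_run` are names introduced here.

```python
import subprocess
import sys

# Script paths as listed in this README, in the order they are run.
STEPS = [
    "re-extract_data/mk_R8_R52.py",
    "remove_words.py",
    "ssl_make_graphs/create_cooc_document.py",
    "ssl_make_graphs/PygDocsGraphDataset.py",
    "ssl_graphmodels/pyg_models/train_docs.py",
]

def pipeline_commands(name):
    """Build one command per step for the given dataset name (e.g. "R8")."""
    return [[sys.executable, script, "--name", name] for script in STEPS]

def run_pipeline(name, dry_run=True):
    """Run each step in order, stopping at the first failure.

    Hypothetical wrapper. With dry_run=True the commands are only printed.
    """
    for cmd in pipeline_commands(name):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)

# Print the commands for R8 without executing them.
run_pipeline("R8", dry_run=True)
```

Calling `run_pipeline("R8", dry_run=False)` would execute the steps. The re-extraction step is described above only for R8 and R52, so it would presumably be dropped from `STEPS` for the other datasets.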
If you find our paper and repo useful, please cite our paper:
@inproceedings{piao2022sparse,
title={Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification},
author={Piao, Yinhua and Lee, Sangseon and Lee, Dohoon and Kim, Sun},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={36},
number={10},
pages={11165--11173},
year={2022}
}