This is the official implementation for the NeurIPS 2023 paper Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes. We develop an approach that can discover meaningful topics from only a few documents; the core idea is to adaptively generate word embeddings semantically tailored to the given task by fully exploiting the contextual syntactic information.
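As a rough, purely illustrative sketch of this idea (not the paper's actual architecture), the few documents of a task could be used to build a word co-occurrence graph whose neighbourhood structure refines pre-trained word embeddings. All function and variable names below are hypothetical and are not part of this repository.

```python
# Hypothetical sketch of context-guided embedding adaptation (illustrative only).
import numpy as np

def adapt_embeddings(bow, base_emb, alpha=0.5):
    """Adapt word embeddings to a small task.

    bow      : (num_docs, vocab_size) NumPy word-count matrix of the few given documents
    base_emb : (vocab_size, emb_dim) pre-trained (e.g., GloVe) word embeddings
    alpha    : how strongly the task context overrides the original embeddings
    """
    # Document-level co-occurrence graph: words appearing in the same documents are linked.
    cooc = (bow.T > 0).astype(float) @ (bow > 0).astype(float)
    np.fill_diagonal(cooc, 0.0)
    # Row-normalise so each word averages over its contextual neighbours.
    deg = cooc.sum(axis=1, keepdims=True) + 1e-8
    neighbour_emb = (cooc / deg) @ base_emb
    # Blend the global embedding with its task-specific contextual summary.
    return (1 - alpha) * base_emb + alpha * neighbour_emb
```

The adapted embeddings would then be fed to the topic model in place of the fixed pre-trained ones.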
The following lists the statistics of the datasets we used.
Dataset | Source link | N (#docs) | V (#words) | L (#labels) |
---|---|---|---|---|
20Newsgroups | 20NG | 11288 | 5968 | 20 |
Yahoo! Answers | Yahoo | 27069 | 7507 | 10 |
DBpedia | DB14 | 30183 | 6274 | 14 |
Web of Science | WOS | 11921 | 4923 | 7 |
We curated the vocabulary for each dataset by removing words with very low and very high frequencies, as well as a list of commonly used stop words. After that, we filtered out documents containing fewer than 50 vocabulary terms to obtain the final usable portion of each original dataset. The pre-processed versions of all four datasets can be downloaded from
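For illustration, the curation step described above could be sketched as follows. The frequency thresholds (min_df, max_df) are assumptions rather than the values used in the paper, since only the 50-term document filter is specified, and this helper is not part of the repository.

```python
# Minimal sketch of the described preprocessing (assumed thresholds, not the paper's exact values).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(raw_docs, min_df=20, max_df=0.7):
    # Drop very rare and very frequent words plus common English stop words.
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df, stop_words="english")
    bow = vectorizer.fit_transform(raw_docs)                    # (num_docs, vocab_size) counts
    # Keep only documents that contain at least 50 distinct vocabulary terms.
    keep = np.asarray((bow > 0).sum(axis=1)).ravel() >= 50
    return bow[keep], vectorizer.get_feature_names_out()
```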
Since we adopt an episodic training strategy to learn our model, we need to sample a batch of tasks from the original corpus to construct the training, validation, and test sets separately. To do this, unzip the downloaded pre-processed datasets, put the data folder under the root directory, and then execute the following command.
cd utils
python process_to_task.py
Note that for different datasets, please modify the arguments dataset_name and data_path accordingly.
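The snippet below is only a schematic illustration of what an episodic task might look like (a handful of documents assumed to be drawn from a single category); it does not reproduce the actual logic of process_to_task.py.

```python
# Illustrative sketch of episodic task sampling (not the actual logic of process_to_task.py).
import numpy as np

def sample_task(bow, labels, docs_per_task=10, seed=0):
    rng = np.random.default_rng(seed)
    label = rng.choice(np.unique(labels))             # pick one category
    candidates = np.flatnonzero(labels == label)      # documents of that category
    chosen = rng.choice(candidates, size=docs_per_task, replace=False)
    return bow[chosen], label                         # a small "corpus" forming one task
```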
To train a Meta-CETM with the best predictive performance from scratch, run the following command
python run_meta_cetm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train
To train an ETM using the model-agnostic meta-learning (MAML) strategy, run the following command
python run_etm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train --maml_train True
In the same vein, to train a ProdLDA from scratch using MAML, you can run the command
python run_avitm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --docs_per_task 10 --num_topics 20 --mode train --maml_train True
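For readers unfamiliar with MAML, the sketch below shows a minimal first-order MAML-style meta-update for a topic model whose forward pass is assumed to return its training loss. The names and loop structure are placeholders and do not reflect the implementation in run_etm.py or run_avitm.py.

```python
# Minimal first-order MAML-style meta-update sketch (placeholder names, not the repo's training loop).
import copy
import torch

def maml_outer_step(model, tasks, meta_opt, inner_lr=1e-2, inner_steps=1):
    """One meta-update over a batch of tasks, using the first-order approximation."""
    meta_opt.zero_grad()
    for support, query in tasks:                      # each task: (support docs, query docs)
        learner = copy.deepcopy(model)                # task-specific copy of the topic model
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                  # inner loop: adapt on the support set
            inner_opt.zero_grad()
            learner(support).backward()               # forward() assumed to return a scalar loss
            inner_opt.step()
        learner.zero_grad()
        learner(query).backward()                     # loss of the adapted model on the query set
        # First-order approximation: copy the adapted model's gradients back to the meta-model.
        for p, lp in zip(model.parameters(), learner.parameters()):
            if lp.grad is not None:
                p.grad = lp.grad.clone() if p.grad is None else p.grad + lp.grad
    meta_opt.step()                                   # apply the accumulated meta-gradient
```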
@inproceedings{NEURIPS2023_fce17645,
author = {Xu, Yishi and Sun, Jianqiao and Su, Yudi and Liu, Xinyang and Duan, Zhibin and Chen, Bo and Zhou, Mingyuan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {79959--79979},
publisher = {Curran Associates, Inc.},
title = {Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/fce176458ff542940fa3ed16e6f9c852-Paper-Conference.pdf},
volume = {36},
year = {2023}
}