Code repository for findings of EMNLP 2021 paper "Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph." [ACL] [arxiv]
We tested the code on:
- python 3.6
- pytorch 1.9.0
- pytorch-geometric 1.7.2 (https://github.com/pyg-team/pytorch_geometric)
- fastai 1.0.61 (need to be v1: https://github.com/fastai/fastai1)
- embeddings (https://github.com/vzhong/embeddings)
- word2word (https://github.com/kakaobrain/word2word)
other requirements:
- numpy
- pandas
- scikit-learn
- gensim
- tqdm
- nltk
- pythainlp 2.3.1 (for Thai language tokenizer)
- 
Extract data/text_cls.zipfile for datasets.
- 
Run the code in srcfolder using the command for training and evaluating DHGNeten.- 
For Bosnian setting: 
 python main.py bosnian --rnn_layers 1 --directed 0 --add_from_dict 30000 --name [output_model_name].
- 
For other settings ( bengali,malayalam,tamil,thai_t,thai_w):
 python main.py [setting_name] --name [output_model_name].
- 
For DHGNetmulti, add a command option --langs ar,en,es,fa,fr,zh.
 
- 
Note that the code will automatically download source word-embeddings (default fasttext) which may take time and disk space.
Optionally, you can download dump files that contain all related source word-embeddings for the aforementioned settings in https://1drv.ms/u/s!AkynV6rCKmmXkNBYwRchAWfurRkBrQ?e=NnOszA and put the files in folder data/word_emb/fasttext_wiki.
Then run the code with an additional command option --use_temp_only 1
** To run using only English as source, you can download only en.db_temp.pkl.
If you find the code helpful, please cite our work:
@inproceedings{chairatanakul-etal-2021-cross-lingual,
    title = "Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph",
    author = "Chairatanakul, Nuttapong  and
      Sriwatanasakdi, Noppayut  and
      Charoenphakdee, Nontawat  and
      Liu, Xin  and
      Murata, Tsuyoshi",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.130",
    pages = "1504--1517",
}
