TMUP is an evaluation corpus for Japanese paraphrase identification. It consists of 655 sentence pairs in total.
- 363 paraphrase sentence pairs
- 292 non-paraphrase sentence pairs
To acquire both paraphrase and non-paraphrase instances, we
- generated sentence pairs by translating English source sentences into Japanese with Google PBMT and NMT, to acquire paraphrases
- extracted sentence pairs from Japanese Wikipedia to acquire non-paraphrases
To acquire both trivial and non-trivial instances, we
- calculated the word overlap rate (Jaccard coefficient) of each sentence pair and sampled candidates uniformly across the range of overlap rates
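The word overlap rate can be illustrated with a minimal Python sketch. It assumes each sentence has already been split into words (Japanese text needs a morphological analyzer for this, which is not shown). The function names, the ten equal-width bins, and the per-bin sample size are assumptions made for illustration only; they are not taken from the paper.

```python
import random
from collections import defaultdict


def jaccard(tokens_a, tokens_b):
    """Word overlap rate: |A ∩ B| / |A ∪ B| over the sets of word types."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty sentences share nothing; define the score as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


def sample_uniformly_by_overlap(pairs, per_bin, n_bins=10, seed=0):
    """Pick up to `per_bin` pairs from each equal-width Jaccard bin.

    The binning scheme is an assumption for illustration, not the
    procedure documented by the corpus authors.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for tokens_a, tokens_b in pairs:
        score = jaccard(tokens_a, tokens_b)
        # A score of exactly 1.0 goes into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((tokens_a, tokens_b))
    sampled = []
    for index in sorted(bins):
        candidates = bins[index]
        sampled.extend(rng.sample(candidates, min(per_bin, len(candidates))))
    return sampled
```

Sampling the same number of pairs from each bin keeps high-overlap (trivial) and low-overlap (non-trivial) candidates equally represented.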
Two annotators judged whether the candidates are paraphrases.
For more details, please refer to the paper cited below.
label <TAB> sentence_A_ja <TAB> sentence_B_ja <TAB> source_sentence_en (if applicable)
- 1: Paraphrase
- 0: Non-paraphrase
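A file in the format above could be read with a short Python sketch like the following. It assumes the file is UTF-8 encoded and has no header row. The names `PairRecord`, `parse_rows`, and `load_corpus` are hypothetical, and the file path is whatever the distributed file is called.

```python
import csv
from typing import NamedTuple, Optional


class PairRecord(NamedTuple):
    label: int                 # 1: paraphrase, 0: non-paraphrase
    sentence_a: str
    sentence_b: str
    source_en: Optional[str]   # English source sentence, if present


def parse_rows(lines):
    """Parse tab-separated lines into PairRecord objects."""
    # QUOTE_NONE treats quotation marks inside sentences as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        source = row[3] if len(row) > 3 and row[3] else None
        records.append(PairRecord(int(row[0]), row[1], row[2], source))
    return records


def load_corpus(path):
    """Read the corpus file at `path` (assumed UTF-8, no header row)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_rows(handle)
```

If the file matches the description above, counting `record.label` over the result of `load_corpus` should give 363 pairs labeled 1 and 292 pairs labeled 0.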
If you make use of this corpus, please cite the following publication:
Yui Suzuki, Tomoyuki Kajiwara and Mamoru Komachi. Building a Non-Trivial Paraphrase Corpus using Multiple Machine Translation Systems. In Proceedings of ACL 2017 Student Research Workshop, Vancouver, Canada. July 2017 (to appear).
@inproceedings{,
author = {Suzuki, Yui and Kajiwara, Tomoyuki and Komachi, Mamoru},
title = {Building a Non-Trivial Paraphrase Corpus
using Multiple Machine Translation Systems},
booktitle = {Proceedings of ACL 2017 Student Research Workshop},
month = {July},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {(to appear)},
url = {http://www.aclweb.org/anthology/}
}
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Copyright (c) 2017 TMU-NLP
For inquiries and feedback, please contact the authors below:
- Yui Suzuki <suzuki-yui at ed.tmu.ac.jp>
- Tomoyuki Kajiwara <kajiwara-tomoyuki at ed.tmu.ac.jp>
- Mamoru Komachi <komachi at tmu.ac.jp>