This is the dataset repository of IndoToD, presented at SEALP 2023, co-located with AACL 2023, where our paper received the Best Paper award 🏆 [ACL Anthology].
The code is written in PyTorch. If you use the source code or datasets included in this repository in your work, please cite the following paper:
@inproceedings{kautsar2023indotod,
title={IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems},
author={Kautsar, Muhammad and Nurdini, Rahmah and Cahyawijaya, Samuel and Winata, Genta and Purwarianti, Ayu},
booktitle={Proceedings of the First Workshop in South East Asian Language Processing},
pages={85--99},
year={2023}
}
We introduce IndoToD, a high-quality, bilingual, multi-domain task-oriented dialogue (ToD) dataset collection for Indonesian and English. It comprises two datasets, IndoCamRest and IndoSMD.
Overall, it covers four domains, and we use delexicalization to reduce the amount of annotation required. To ensure high-quality data collection, we hired native speakers to manually annotate the dialogues. The annotations build on two existing English ToD datasets, CamRest and SMD. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems, as well as for exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.
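To make the idea of delexicalization concrete, here is a minimal sketch in Python. It is an illustration only, not the pipeline used to build IndoToD: the function names, the `[slot]` placeholder format, and the example slot values are all assumptions. The point is that once entity values are replaced by placeholders, many utterances collapse into the same template, so a template only needs to be annotated once and can then be refilled with values.

```python
import re


def delexicalize(utterance, slot_values):
    """Replace known slot values with placeholders such as [food].

    `slot_values` maps a slot name to a list of surface values.
    Both the mapping and the placeholder format are hypothetical.
    """
    # Sort longer values first so multi-word values are matched
    # before any shorter value they contain.
    pairs = sorted(
        ((value, slot) for slot, values in slot_values.items() for value in values),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for value, slot in pairs:
        pattern = r"\b" + re.escape(value) + r"\b"
        utterance = re.sub(pattern, f"[{slot}]", utterance, flags=re.IGNORECASE)
    return utterance


def relexicalize(template, assignment):
    """Fill placeholders in a template with concrete values."""
    for slot, value in assignment.items():
        template = template.replace(f"[{slot}]", value)
    return template


# Hypothetical example values, chosen only for illustration.
slots = {"food": ["italian"], "area": ["north"]}
english = "I want an Italian restaurant in the north part of town."
template_en = delexicalize(english, slots)

# A template annotated once in Indonesian can be refilled with any values.
template_id = "Saya ingin restoran [food] di bagian [area] kota."
indonesian = relexicalize(template_id, {"food": "italia", "area": "utara"})
```

In this sketch, `template_en` contains `[food]` and `[area]` in place of the original values, and `indonesian` is the Indonesian template with the placeholders filled back in.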
IndoCamRest is a task-oriented dialogue dataset translated from the Cambridge Restaurant 676 (CamRest) dataset.
IndoSMD is a task-oriented dialogue dataset translated from the In-Car Assistant (SMD) dataset.
We set up a benchmark for both Indonesian and English ToD to evaluate the performance of current ToD systems in monolingual, cross-lingual, and bilingual tasks.
The datasets are released under CC-BY-SA 4.0, and the code is licensed under Apache 2.0.