English-Tamil parallel corpus prepared by the National Languages Processing Center, University of Moratuwa. The data has been cleaned and sentence-aligned.

#En-Ta Glossary Line Count: 22477

#En-Ta Corpus Line Count: 8950

#Source: Data extracted from publicly available government resources such as annual reports, procurement reports, circulars and websites.

#Processing: Each Word/PDF file was converted to a text file, and Unicode errors were fixed using a custom tool. The Tamil and English files were then manually sentence-aligned, and all spelling and grammatical errors were corrected manually.
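
Because the files are sentence-aligned, a consumer would typically read the two sides in lockstep. The following is a minimal sketch, not part of the dataset's tooling. It assumes each side is a UTF-8 text file with one sentence per line and that line i of the English file corresponds to line i of the Tamil file. The function name `load_parallel` and any file paths passed to it are hypothetical.

```python
def read_lines(path):
    """Read a UTF-8 text file, returning one stripped string per line."""
    # Iterating the file splits only on newline characters, so other
    # Unicode separators inside a sentence cannot shift the alignment.
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def load_parallel(en_path, ta_path):
    """Return (english, tamil) sentence pairs from two line-aligned files.

    Assumption (not confirmed by the README): one sentence per line,
    with matching line numbers on both sides.
    """
    en_lines = read_lines(en_path)
    ta_lines = read_lines(ta_path)
    if len(en_lines) != len(ta_lines):
        # A count mismatch means the two files are no longer aligned.
        raise ValueError(
            f"line count mismatch: {len(en_lines)} English vs "
            f"{len(ta_lines)} Tamil"
        )
    return list(zip(en_lines, ta_lines))
```

Calling `load_parallel` with the paths of the English and Tamil corpus files should yield a list whose length matches the corpus line count stated above, if the files follow the assumed layout. A `ValueError` signals that the two sides have different numbers of lines.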

#If you use this dataset, kindly cite the following publication: Fernando, A., Ranathunga, S., & Dias, G. (2020). Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation. arXiv preprint arXiv:2011.02821.