English-Tamil parallel corpus prepared by the National Languages Processing Center, University of Moratuwa. The data has been cleaned and then aligned.

#En-Ta Glossary Line Count: 22477

#En-Ta Corpus Line Count: 8950

#Source: Data extracted from publicly available government resources such as annual reports, procurement reports, circulars and websites.

#Processing: Each Word/PDF file was converted to a text file, and Unicode errors were fixed using a custom tool. The Tamil and English files were then manually sentence-aligned, and all spelling and grammatical errors were manually fixed.
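
Since the files are sentence-aligned, line i on the English side should correspond to line i on the Tamil side. The sketch below shows one way to read such a pair and check that the line counts match. It assumes two plain-text UTF-8 files with one sentence per line; the function name `read_parallel` and any file paths passed to it are hypothetical and are not part of this dataset.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    # Iterating over the file splits only on real line endings, so unusual
    # Unicode separators inside a sentence do not shift the alignment.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(en_path, ta_path):
    """Pair line i of the English file with line i of the Tamil file.

    Assumption: both files hold one sentence per line, in the same order.
    """
    en_lines = read_lines(en_path)
    ta_lines = read_lines(ta_path)
    if len(en_lines) != len(ta_lines):
        raise ValueError(
            f"Line counts differ: {len(en_lines)} English vs {len(ta_lines)} Tamil"
        )
    return list(zip(en_lines, ta_lines))
```

Calling `read_parallel` with the paths of the English and Tamil corpus files would return a list of `(english, tamil)` tuples. A `ValueError` signals that the two files are not the same length and so cannot be aligned line by line.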

#If you use this dataset, kindly cite the following publication: Fernando, A., Ranathunga, S., & Dias, G. (2020). Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation. arXiv preprint arXiv:2011.02821.
