Agisight/TyvaData


This project contains translation data for Russian-to-Tuvan and Tuvan-to-Russian translation.

About the data:

This data was collected via the www.tyvan.ru platform by linguists, scientists, journalists, volunteers, and other contributors.

Folder Data:

The 50K file is split into training, validation, and test data.

The validation and test sentences are placed at the end of the file.
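Because the held-out sentences sit at the end of the file, the split can be recovered by slicing from the tail. The sketch below is only an illustration: the function name `split_tail`, the split sizes, and the assumption that validation rows come before test rows are all hypothetical and should be checked against the actual file.

```python
def split_tail(rows, n_valid, n_test):
    """Split rows into (train, valid, test), taking the held-out sets from the end.

    Assumption (not confirmed by the README): validation rows come first,
    followed by test rows, and both sizes are known in advance.
    """
    if n_valid < 0 or n_test < 0 or n_valid + n_test > len(rows):
        raise ValueError("split sizes do not fit the number of rows")
    n_train = len(rows) - n_valid - n_test
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test


# Toy usage with ten stand-in rows; the sizes here are made up.
rows = list(range(10))
train, valid, test = split_tail(rows, n_valid=2, n_test=3)
print(len(train), len(valid), len(test))
```

Slicing by position keeps the original order intact, which matters if the published split is meant to be reproduced exactly rather than reshuffled.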

Folder For Yandex:

The datasets contain 306615 translations.

Dataset Structure

The dataset contains Tyvan-Russian pairs.

Each data row has the following fields:

  • tyv: str: text in Tuvan
  • ru: str: text in Russian (translation)
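The two-field row layout above can be read with the standard library alone. This is a minimal sketch that assumes the data is stored as CSV with a header row naming the `tyv` and `ru` columns; the README does not state the file format, and `Pair`, `read_pairs`, and the two sample rows are illustrative rather than taken from the dataset.

```python
import csv
import io
from typing import NamedTuple


class Pair(NamedTuple):
    tyv: str  # text in Tuvan
    ru: str   # Russian translation


def read_pairs(handle):
    """Read Tyvan-Russian pairs from a CSV handle, skipping incomplete rows."""
    pairs = []
    for row in csv.DictReader(handle):
        tyv = (row.get("tyv") or "").strip()
        ru = (row.get("ru") or "").strip()
        if tyv and ru:
            pairs.append(Pair(tyv, ru))
    return pairs


# Illustrative in-memory sample (assumed CSV layout, not real dataset rows).
SAMPLE = "tyv,ru\nЭкии,Здравствуйте\nЧеттирдим,Спасибо\n"
pairs = read_pairs(io.StringIO(SAMPLE))
print(len(pairs))
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, since both languages use Cyrillic script.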

Dataset Details

Dataset Description

  • Curated by: Ali Kuzhuget (tech and data), Ondar Choygan (data), and contributors
  • Language(s) (NLP): Tyvan (Tuvan), Russian
  • License: CC BY 4.0

Below is brief information about the languages:

| Language | Language code on the website | ISO 639-3 | Glottolog |
|---|---|---|---|
| Tyvan | tyv | tyv | tuvi1240 |
| Russian | rus | rus | russ1263 |

Dataset Sources

The dataset was downloaded from www.tyvan.ru.

Uses

The dataset is intended to help humans and machines learn the low-resource Tyvan (Tuvan) language and Russian.

Dataset Creation

The dataset was curated as a source of training data for machine translation and other NLP tools. It consists of donated and professional translations from books and websites. The translations were downloaded from the www.tyvan.ru website and refined by Ali Kuzhuget. No additional filtering or postprocessing has been applied.
