Skip to content

Language model evaluations for 859 languages

Notifications You must be signed in to change notification settings

lgessler/pronto

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

71 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PrOnto

PrOnto is a resource intended for pretrained language model evaluation that is extensible to any language with a New Testament translation. No manual annotation required!

PrOnto v0.1 comes with 1051 datasets from all of ebible.org's permissively-licensed Bible translations, which span 859 distinct languages. Each dataset is created by aligning verses from the target language's translation with the New Testament verses included in English OntoNotes, and then projecting annotations from English to the target language. Currently, each dataset includes data for 5 different sequence classification tasks.

In addition to these premade datasets, you may also provide your own Bible translation and run our dataset generation code in order to generate a dataset for your own target language. No annotation required! See BUILD for details.

For a full description of this resource, see our paper.

Download

Please see our registry for a list of Bibles included in the PrOnto release. As of version 0.1, this includes 1051 Bible translations in 859 distinct languages.

Usage

You may use the datasets however you like, though we do include a sequence classification module which is ready to use with the data for our 5 tasks. You may invoke it with python -m pronto.eval.sequence_classifier. See here for sensible defaults on each task for its options.

Development

For information on how to build a PrOnto dataset, see BUILD.

About

Language model evaluations for 859 languages

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published