Skip to content

Latest commit

 

History

History
31 lines (20 loc) · 1.22 KB

README.md

File metadata and controls

31 lines (20 loc) · 1.22 KB

Automatic sentence pair tagger

PEP8

A simple app that leverages an SQLite database and synonym searches to locate semantic pair candidates.

It supports reading verified sentence pairs from an existing .csv file.

Notes:

Inverse sentence pairs are not stored in the DB; s1-s2 is considered to be equivalent to s2-s1. Same-sentence pairs are also excluded, since they require no evaluation. The initial number of unrated pairs is equal to:

item_count! / (2!(item_count - 2)!)

For example, a corpus consisting of 3875 unique sentences should result in

3875! / (2 x 3873!) = (3875 x 3874) / 2 = 7505875

unrated pairs.

If 250 rows are present in the verified sentence file, the number of unrated pairs after the call to the initialize() function should be 7505625.

TODO:

Implement the generation of two .csv files:

  • a list of automatically rated sentence pairs that can then be manually verified
  • a list of all the sentence pairs in the corpus, which should include both same-sentence pairs (s1-s1) and inverse pairs (s1-s2 and s2-s1)