Skip to content

A collection of Semantic Textual Similarity (STS) models and a framework for their evaluation on STS corpora with fine-grained similarity scores

License

Notifications You must be signed in to change notification settings

vukbatanovic/STSFineGrain

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

STSFineGrain - a collection of STS models and a framework for their evaluation on STS corpora with fine-grained similarity scores

This package contains Java implementations of three baseline unsupervised STS models and four bag-of-words supervised STS models. STSFineGrain includes the following models:

Unsupervised models

  1. Word overlap
  2. Mean of word2vec word vectors
  3. Mixture of models 1 and 2

Supervised models

  1. Islam and Inkpen
  2. LInSTSS
  3. POST STSS
  4. POS-TF STSS

Please see the References section for papers describing all of the aforementioned models. Note that POST STSS and POS-TF STSS rely on a language-specific POS weighting scheme. The STSFineGrain package currently supports applying these models to texts in Serbian and English. Other implemented models do not have such language-related restrictions.

All models expect the input text to be formatted in UTF-8. The term frequency calculation output is also encoded in UTF-8, while model evaluation outputs are ANSI-encoded.

Fine-grained gold standard similarity scores are required in the evaluation of all models and the training of supervised ones.

Command-line interface

The supplied STSFineGrain.jar file makes it possible to use the STSFineGrain framework from the command line. The framework is invoked using the following general command form:

java -jar STSFineGrain.jar ActionID ActionSpecificArguments

ActionID can be:

  • 0 - specifies that the action to be taken is the calculation of term frequency values.
  • 1 - specifies that the action to be taken is the evaluation of an STS model.

Term frequency calculation

If ActionID is 0, the command should the following form:

java -jar STSFineGrain.jar 0 InputCorporaPaths OutputTFPath
  • InputCorporaPaths can contain arbitrarily many arguments - one path for each corpus file whose contents should be included in TF calculation.
  • OutputTFPath can contain only one argument - the path to the file to which the calculated TF values will be written.

STS model evaluation

If ActionID is 1, the command should have the following form:

java -jar STSFineGrain.jar 1 STSModelIndexNo EvaluationModeIndexNo LanguageCode STSCorpusRawTextsPath STSCorpusScoresPath Word2VecVectorsPath TermFrequenciesPath STSCorpusMSDorPOSPath
  • STSModelIndexNo is an integer in the 1-7 range that specifies the STS model to be used. The mapping between index numbers and STS models is as follows:
    • 1 - Word overlap
    • 2 - Mean of word2vec word vectors
    • 3 - Mixture of models 1 and 2
    • 4 - Islam and Inkpen method
    • 5 - LInSTSS
    • 6 - POST STSS
    • 7 - POS-TF STSS
  • EvaluationModeIndexNo can be either 1 or 2:
    • 1 - Evaluation is performed on the entire dataset specified through arguments STSCorpusRawTextPath and STSCorpusScoresPath. This is only suitable for unsupervised models.
    • 2 - Evaluation is performed using cross-validation on the dataset specified through arguments STSCorpusRawTextPath and STSCorpusScoresPath. This is suitable for both supervised and unsupervised models. The standard number of CV folds is 10, but this setting can be changed in the code of the Evaluator class.
  • LanguageCode - the two-letter ISO code specifying the language of the texts used. Currently, Serbian ("SR") and English ("EN") are supported.
  • STSCorpusRawTextsPath - the path to the raw text file of the STS corpus to be used. Each line of the text file should contain the sentences of a pair, separated with a tab.
  • STSCorpusScoresPath - the path to the score file of the STS corpus to be used. Each line of the file should contain the fine-grained similarity score of the corresponding line pair in the file specified by the STSCorpusRawTextsPath argument. The score file can also contain other information, but the score has to be the first item in a row, separated from other data with a tab.
  • Word2VecVectorsPath - the path to the word2vec vector file that is used by all implemented STS models except the word overlap method. The vector file must be saved in the original C word2vec tool format. For the word overlap method, this argument is ignored.
  • TermFrequenciesPath - the path to the term frequencies file created by this program. Term frequencies are only used for the LInSTSS and the POS-TF STSS models - for other models, this argument is ignored.
  • STSCorpusMSDorPOSPath - the path to the STS corpus MSD/POS tags file. Each line of the text file should contain the tags of a pair of sentences, and sentences in a pair should be separated with a tab. The number of tags in a sentence should be identical to the number of tokens (not counting punctuation, which is eliminated in the text-cleaning phase). POS/MSD tags are only used for the POST STSS and the POS-TF STSS models - for other models, this argument is ignored.

References

If you wish to use this package in your paper or project, please include a reference to the following paper in which it was presented:

Fine-grained Semantic Textual Similarity for Serbian, Vuk Batanović, Miloš Cvetanović, Boško Nikolić, in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1370-1378, Miyazaki, Japan (2018).

Be sure to also cite the original paper of each STS model you use:

Additional Documentation

Some non-trivial parts of the source code contain comments and some documentation in English. If you have any questions about the models' functioning, please review the source code, and the papers listed above. If no answer can be found, feel free to contact me at: vuk.batanovic / at / ic.etf.bg.ac.rs

License

See the license file for licensing information.