A python tool for evaluating the quality of sentence embeddings.

pltrdy/SentEval

 
 


SentEval

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks (more details here). Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

Dependencies

This code is written in python. The dependencies are:

Tasks

SentEval allows you to evaluate your sentence embeddings as features for the following tasks:

  • Binary classification: MR (movie review), CR (product review), SUBJ (subjectivity status), MPQA (opinion-polarity), SST (Stanford sentiment analysis)
  • Multi-class classification: TREC (question-type classification), SST (fine-grained Stanford sentiment analysis)
  • Entailment (NLI): SNLI (caption-based NLI), MultiNLI (Multi-genre NLI), SICK (Sentences Involving Compositional Knowledge, entailment)
  • Semantic Textual Similarity: STS12, STS13 (-SMT), STS14, STS15, STS16
  • Semantic Relatedness: STSBenchmark, SICK
  • Paraphrase detection: MRPC (Microsoft Research Paraphrase Corpus)
  • Caption-Image retrieval: COCO dataset (with ResNet-101 2048d image embeddings)

more details on the tasks

Download datasets

To get all the transfer tasks datasets, run (in data/):

./get_transfer_data_ptb.bash

This will automatically download and preprocess the datasets, and put them in data/senteval_data (warning: macOS users may have to use p7zip instead of unzip).

WARNING: Extracting the MRPC MSI file requires the "cabextract" command-line tool (e.g. apt-get/yum install cabextract).

Example (average GloVe): examples/bow.py

examples/bow.py

In examples/bow.py, we evaluate the quality of the average(GloVe) embeddings.

To get GloVe embeddings [2GB], run (in examples/):

./get_glove.bash

To reproduce the results for avg(GloVe) vectors, run (in examples/):

python bow.py

As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.

examples/infersent.py

To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):

curl -Lo examples/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
curl -Lo examples/infersent.snli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.snli.pickle

How SentEval works

To evaluate your own sentence embedding method, you will need to implement two functions:

  1. prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
  2. batcher (transforms a batch of text sentences into sentence embeddings)

1.) prepare(params, samples) (optional)

batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.

prepare(params, samples)
  • samples: all the text sentences of a task
  • params: senteval parameters (note that "prepare" outputs are stored in params).
  • output: None. Any "output" computed in this function is stored in "params" and can be further used by batcher.

Example: in bow.py, prepare is used to build the vocabulary of words and construct the params.word_vec dictionary of word vectors.
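As a rough illustration of the contract described above, here is a minimal sketch of a prepare function. It assumes each sentence arrives as a list of tokens, and it uses a hypothetical load_word_vectors helper that returns random vectors in place of real pretrained embeddings; the params.wvec_dim field and the SimpleNamespace stand-in for the params object are also assumptions made only for this demo, not the code in bow.py.

```python
from types import SimpleNamespace

import numpy as np


def load_word_vectors(vocab, dim, seed=0):
    """Hypothetical stand-in for reading pretrained vectors (e.g. GloVe) from disk."""
    rng = np.random.RandomState(seed)
    return {word: rng.randn(dim) for word in sorted(vocab)}


def prepare(params, samples):
    """See every sentence of a task once and cache what batcher will need."""
    # Assumption: each sample is a list of tokens.
    vocab = {word for sentence in samples for word in sentence}
    params.word_vec = load_word_vectors(vocab, dim=params.wvec_dim)
    # Nothing is returned: the result lives on params for batcher to read.


# Stand-in for the params object, for demonstration only.
params = SimpleNamespace(wvec_dim=4)
samples = [["a", "good", "movie"], ["a", "bad", "movie"]]
result = prepare(params, samples)
print(sorted(params.word_vec))
```

The point of the design is that anything expensive (vocabulary construction, loading vectors) happens once per task here rather than once per batch.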

2.) batcher(params, batch)

batcher(params, batch)
  • batch: numpy array of text sentences (of size params.batch_size)
  • params: senteval parameters (note that "prepare" outputs are stored in params).
  • output: numpy array of sentence embeddings (of size params.batch_size)

Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.
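The averaging described above could look roughly like the following sketch. It assumes tokenized sentences, a params.word_vec dictionary mapping words to numpy vectors, and a params.wvec_dim field (a name chosen for this demo); the zero-vector fallback for sentences with no known word is likewise an assumption, and the SimpleNamespace object only stands in for the real params.

```python
from types import SimpleNamespace

import numpy as np


def batcher(params, batch):
    """Return one embedding per sentence: the mean of its known word vectors."""
    embeddings = []
    for sentence in batch:
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if not vectors:
            # Assumption: fall back to a zero vector when no word is known.
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    # One row per sentence, in the same order as the input batch.
    return np.vstack(embeddings)


# Toy word vectors, for demonstration only.
params = SimpleNamespace(
    wvec_dim=2,
    word_vec={"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])},
)
batch = [["good", "movie"], ["unknown"]]
embeddings = batcher(params, batch)
print(embeddings.shape)
```

To plug in your own encoder, replace the body of the loop with a call to your model, keeping the one-row-per-sentence output shape.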

3.) evaluation on transfer tasks

After having implemented the batcher and prepare functions for your own sentence encoder,

  1. to perform the actual evaluation, first import senteval and define a SentEval object:
import senteval
se = senteval.SentEval(params, batcher, prepare)

(to import senteval, you can add the senteval path to your PYTHONPATH, use sys.path.insert, or run "pip install git+https://github.com/facebookresearch/SentEval")

  2. define the set of transfer tasks on which you want SentEval to perform evaluation and run the evaluation:
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)

The current list of available tasks is:

['CR', 'MR', 'MPQA', 'SUBJ', 'SST', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16']

Note that the tasks of image-caption retrieval, SICKRelatedness, STSBenchmark and SNLI require pytorch and the use of a GPU. For the other tasks, setting usepytorch to False will make them run on the CPU (with sklearn), which can be faster for small embeddings but slower for large embeddings.
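A small guard around the calls shown above can catch typos in task names and the pytorch restriction before a long run starts. The sketch below is only one way to do it: check_tasks and run_evaluation are hypothetical helper names, not part of SentEval, and the only library calls used are senteval.SentEval(params, batcher, prepare) and se.eval(...) as given in this README.

```python
AVAILABLE_TASKS = [
    "CR", "MR", "MPQA", "SUBJ", "SST", "TREC", "MRPC", "SNLI",
    "SICKEntailment", "SICKRelatedness", "STSBenchmark",
    "ImageCaptionRetrieval", "STS12", "STS13", "STS14", "STS15", "STS16",
]

# Tasks that, per the note above, require pytorch and a GPU.
PYTORCH_ONLY_TASKS = {"ImageCaptionRetrieval", "SICKRelatedness", "STSBenchmark", "SNLI"}


def check_tasks(tasks, usepytorch=True):
    """Hypothetical helper: reject unknown tasks and tasks that need pytorch."""
    unknown = [t for t in tasks if t not in AVAILABLE_TASKS]
    if unknown:
        raise ValueError("Unknown transfer tasks: %s" % ", ".join(unknown))
    blocked = [t for t in tasks if t in PYTORCH_ONLY_TASKS and not usepytorch]
    if blocked:
        raise ValueError("Tasks requiring pytorch: %s" % ", ".join(blocked))
    return list(tasks)


def run_evaluation(params, batcher, prepare, tasks, usepytorch=True):
    """Hypothetical wrapper around the two SentEval calls from this README."""
    import senteval  # imported lazily so check_tasks works without it installed

    se = senteval.SentEval(params, batcher, prepare)
    return se.eval(check_tasks(tasks, usepytorch))


print(check_tasks(["MR", "SICKEntailment", "STS14"], usepytorch=False))
```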

SentEval parameters

SentEval has several parameters (only task_path is required):

  • task_path (str): path to data, generated by data/get_transfer_data_ptb.bash
  • seed (int): random seed for reproducibility (default: 1111)
  • usepytorch (bool): use pytorch or scikit-learn (when possible) for logistic regression (default: True). Note that scikit-learn is quite fast for small dimensions. Use pytorch for SNLI.
  • classifier (str): if usepytorch, choose between 'LogReg' and 'MLP' (tanh) (default: 'LogReg')
  • nhid (int): if usepytorch and classifier=='MLP' choose nb hidden units (default: 0)
  • batch_size (int): size of minibatch of text sentences provided to "batcher" (sentences are sorted by length). Note that this is not the batch_size used by pytorch logistic regression, which is fixed.
  • kfold (int): k in the k-fold cross-validation. Set to 10 to be comparable to published results (default: 5)
  • ... and any parameter you want to have access to in "batcher" or "prepare" functions.
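For illustration, the parameters listed above can be collected as in the sketch below. make_params is a hypothetical helper, the defaults are copied from the list above (batch_size is left out because no default is stated), and my_encoder_name is an invented example of a custom parameter. How the result must be wrapped before being handed to senteval.SentEval (plain dict or an object with attribute access) depends on the SentEval version, so treat examples/bow.py as the reference for that step.

```python
# Defaults as documented in the parameter list above.
DEFAULTS = {
    "seed": 1111,
    "usepytorch": True,
    "classifier": "LogReg",
    "nhid": 0,
    "kfold": 5,
}


def make_params(task_path, **overrides):
    """Hypothetical helper: merge documented defaults with user settings."""
    params = dict(DEFAULTS, task_path=task_path)
    params.update(overrides)
    if params["classifier"] not in ("LogReg", "MLP"):
        raise ValueError("classifier must be 'LogReg' or 'MLP'")
    return params


# kfold=10 for comparability with published results; the extra key is a
# custom parameter made available to batcher and prepare.
params = make_params("data/senteval_data", kfold=10, my_encoder_name="demo")
print(params)
```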

References

Please cite [1] and [2] if using this code for evaluating sentence embedding methods.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@article{conneau2017supervised,
  title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}

Learning Visually Grounded Sentence Representations

[2] D. Kiela, A. Conneau, A. Jabri, M. Nickel, Learning Visually Grounded Sentence Representations

@article{kiela2017learning,
  title={Learning Visually Grounded Sentence Representations},
  author={Kiela, Douwe and Conneau, Alexis and Jabri, Allan and Nickel, Maximilian},
  journal={arXiv preprint arXiv:1707.06320},
  year={2017}
}

Contact: aconneau@fb.com, dkiela@fb.com
