This repo contains various ways to calculate the similarity between source and target sentences. You can choose which pre-trained model to use, such as ELMo, BERT, or Universal Sentence Encoder (USE).
You can also choose the method used to compute the similarity:
1. Cosine similarity
2. Manhattan distance
3. Euclidean distance
4. Angular distance
5. Inner product
6. TS-SS score
7. Pairwise-cosine similarity
8. Pairwise-cosine similarity + IDF
You can experiment with (number of models) x (number of methods) combinations!
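To make the vector-based methods in the list concrete, here is a minimal sketch of five of them on plain Python lists. It is not the repository's implementation: the function names are hypothetical, and the angular distance is assumed to be the angle between the vectors normalized by pi, which is one common definition. TS-SS and the pairwise methods are left out because they need more machinery.

```python
import math

def inner_product(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean (L2) length of a vector.
    return math.sqrt(inner_product(a, a))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; undefined for zero vectors.
    return inner_product(a, b) / (norm(a) * norm(b))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line (L2) distance between a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angular_distance(a, b):
    # Assumption: angle between a and b divided by pi, giving a value in [0, 1].
    # Clamping guards against rounding pushing the cosine outside [-1, 1].
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos) / math.pi
```

Note that cosine similarity and inner product are similarities (higher means closer), while the Manhattan, Euclidean, and angular values are distances (lower means closer), so scores from different methods are not directly comparable.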
- This project was developed in a conda environment
- After cloning this repository, you can install all the dependencies listed in requirements.txt with bash install.sh
conda create -n sensim python=3.7
conda activate sensim
git clone https://github.com/Huffon/sentence-similarity.git
cd sentence-similarity
bash install.sh
- To test your own sentences, you should fill out corpus.txt with sentences as below:
I ate an apple.
I went to the Apple.
I ate an orange.
...
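As a rough illustration of how a file like this could be split into a source sentence and target sentences, here is a small sketch. It assumes the first non-empty line is the source and every later line is a target. That layout is a guess, not something this README states, and `parse_corpus` is a hypothetical helper rather than a function from this repository.

```python
def parse_corpus(text):
    # Keep non-empty lines, stripped of surrounding whitespace.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if len(sentences) < 2:
        raise ValueError("corpus needs at least one source and one target sentence")
    # Assumption: first sentence is the source, the rest are targets.
    return sentences[0], sentences[1:]
```

It could be used as `source, targets = parse_corpus(open("corpus.txt", encoding="utf-8").read())`.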
- Then, choose the model and method to be used to calculate the similarity between source and target sentences
python sensim.py
--model MODEL_NAME [use, bert, elmo]
--method METHOD_NAME [cosine, manhattan, euclidean, inner,
ts-ss, angular, pairwise, pairwise-idf]
--verbose LOG_OPTION (bool)
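To try every combination rather than one at a time, the command above can be generated for each model and method pair. The sketch below only uses the `--model` and `--method` values listed in the usage. It assumes `--verbose` is optional and that the script is run from the repository root, and `build_commands` and `run_all` are hypothetical helper names. Some combinations may not be supported by the script, so failures are reported rather than raised.

```python
import itertools
import subprocess
import sys

# Values taken from the usage text above.
MODELS = ["use", "bert", "elmo"]
METHODS = ["cosine", "manhattan", "euclidean", "inner",
           "ts-ss", "angular", "pairwise", "pairwise-idf"]

def build_commands(models=MODELS, methods=METHODS):
    # One argument list per (model, method) pair.
    return [
        [sys.executable, "sensim.py", "--model", model, "--method", method]
        for model, method in itertools.product(models, methods)
    ]

def run_all(commands):
    # Run each command in turn and report non-zero exit codes.
    for cmd in commands:
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            print("  failed with exit code", result.returncode)
```

Calling `run_all(build_commands())` would launch all 24 runs in sequence; passing shorter lists to `build_commands` restricts the sweep.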
- This section shows an example result of sentence-similarity
- As you know, there is no silver bullet that can calculate perfect similarity between sentences
- You should conduct various experiments with your dataset
- Caution: TS-SS score might not fit the sentence similarity task, since this method was originally devised to calculate the similarity between long documents
- Result:
- Universal Sentence Encoder
- Deep contextualized word representations
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- BERTScore: Evaluating Text Generation with BERT
- A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering
- TF-hub's Universal Sentence Encoder
- Allen NLP's ELMo
- Sentence Transformers
- BERTScore
- Vector Similarity