This repository houses the codebase for replicating the experiments detailed in Jorge Martinez-Gil's paper on Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation. Discover more insights and applications through our arXiv preprint and an accessible Medium article.
Word sense disambiguation (WSD) plays a pivotal role in Natural Language Processing (NLP). It involves deciphering the intended meaning of a word in a multi-sense context, which is crucial for enhancing the performance of applications like machine translation and information retrieval.
Our repository offers an innovative unsupervised approach to WSD using context-aware semantic similarity:
- Preprocessing: Clean and prepare your text data.
- Context Extraction: Identify the context surrounding the ambiguous word.
- Semantic Similarity: Utilize pre-trained sentence embeddings and cosine similarity to evaluate semantic parallels.
- Sense Selection: Choose the sense with the highest similarity score.
Included are the necessary code, pre-trained embeddings, and test data for thorough evaluation.
pip install -r requirements.txt
The CoarseWSD-20 dataset, a well-known resource for coarse-grained WSD, forms the backbone of our experiments. It encompasses 20 commonly ambiguous words.
Follow these steps to apply our method:
- Clone this repository.
- Install dependencies (refer to the installation section).
- Download and position pre-trained word embeddings in the data directory.
- Execute the script of your choice and observe the results in your console.
Evaluate our approach using the provided test data.
Unsupervised Word Sense Disambiguation (UWSD):
python uwsd_bert.py
- BERTpython uwsd_elmo.py
- ELMopython uwsd_use.py
- Universal Sentence Encoder (USE)python uwsd_wmd.py
- Word Mover's Distance (WMD)
Context-Aware Semantic Similarity (CASS):
python cass-wordnet+bert.py
- CASS using WordNet and BERTpython cass-word2vec+bert.py
- CASS using word2vec and BERTpython cass-webscrapping+bert.py
- CASS using webscraping and BERT
Example Scenario UWSD:
- Typed object-oriented programming languages, such as java and c++ , often do not support first-class methods
--> options (island, programming language)
- uwsd_bert: programming language
- uwsd_elmo: programming language
- uwsd_use: programming language
- uwsd_wmd: programming language
- ChatGPT-4: programming language
Example Scenario CASS:
- Vienna is a nice city situated in the center of the European continent.
- cass-wordnet+bert: middle
- cass-word2vec+bert: hub
- cass-webscrapping+bert: mid
- ChatGPT-4: middle
The summary of the results in terms of the CoarseWSD-20 dataset disambiguation is:
Strategy | Hits | Accuracy |
---|---|---|
UWSD+BERT | 7,927 | 77.74% |
MFS-Baseline | 7,487 | 73.43% |
UWSD+USE | 7,335 | 71.94% |
UWSD+ELMo | 7,010 | 68.75% |
UWSD+WMD | 5,868 | 57.55% |
RO-Baseline | 4,459 | 43.73% |
If you utilize our work, kindly cite us:
@inproceedings{martinez2023b,
author = {Jorge Martinez-Gil},
title = {Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation},
journal = {CoRR},
volume = {abs/2305.03520},
year = {2023},
url = {https://arxiv.org/abs/2305.03520},
doi = {https://doi.org/10.48550/arXiv.2305.03520},
eprinttype = {arXiv},
eprint = {2305.03520}
}
Released under the MIT License. View License.