
Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation

This repository contains the code needed to replicate the experiments from Jorge Martinez-Gil's paper, Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation. Further background is available in the arXiv preprint and an accompanying Medium article.


🌍 Overview

Word sense disambiguation (WSD) is a core task in Natural Language Processing (NLP): determining which meaning of a polysemous word is intended in a given context. Accurate WSD improves downstream applications such as machine translation and information retrieval.

This repository implements an unsupervised approach to WSD based on context-aware semantic similarity (a minimal code sketch of the pipeline follows the list):

  1. Preprocessing: Clean and prepare the text data.
  2. Context Extraction: Identify the context surrounding the ambiguous word.
  3. Semantic Similarity: Use pre-trained sentence embeddings and cosine similarity to score how well each candidate sense matches the context.
  4. Sense Selection: Choose the sense with the highest similarity score.

Included are the necessary code, pre-trained embeddings, and test data for thorough evaluation.
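The sketch below walks through the pipeline end to end. It is a minimal approximation, not the repository's exact code: the sentence-transformers package, the encoder name (all-MiniLM-L6-v2), and the disambiguate helper are assumptions for illustration.

# Minimal sketch of the four-step pipeline: embed the context and each
# candidate sense gloss, then pick the sense with the highest cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained sentence encoder works

def disambiguate(context: str, sense_glosses: dict) -> str:
    """Return the sense whose gloss is most similar to the context."""
    ctx_vec = model.encode(context)
    best_sense, best_score = None, -1.0
    for sense, gloss in sense_glosses.items():
        gloss_vec = model.encode(gloss)
        # cosine similarity between context and gloss embeddings
        score = np.dot(ctx_vec, gloss_vec) / (
            np.linalg.norm(ctx_vec) * np.linalg.norm(gloss_vec)
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate(
    "Typed object-oriented programming languages, such as java and c++, "
    "often do not support first-class methods",
    {"island": "a large island in Indonesia",
     "programming language": "a high-level object-oriented programming language"},
))  # -> "programming language"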

🛠️ Installation

pip install -r requirements.txt

📊 Dataset

The CoarseWSD-20 dataset, a well-known resource for coarse-grained WSD, forms the backbone of our experiments. It includes 20 commonly ambiguous words.

🚀 Usage Guide

Follow these steps to apply our method:

  1. Clone this repository.
  2. Install dependencies (refer to the installation section).
  3. Download and position pre-trained word embeddings in the data directory.
  4. Execute the script of your choice and review the results printed to the console.
  5. For the best performance, use the GPU-enabled scripts where available.
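For example, a typical session might look like the following (the clone URL follows the repository path above; the script is one of those listed in the Evaluation section):

git clone https://github.com/jorge-martinez-gil/uwsd.git
cd uwsd
pip install -r requirements.txt
python uwsd_bert.py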

📝 Evaluation

Evaluate our approach using the provided test data.

Unsupervised Word Sense Disambiguation (UWSD):

  • python uwsd_bert.py - BERT
  • python uwsd_elmo.py - ELMo
  • python uwsd_use.py - Universal Sentence Encoder (USE)
  • python uwsd_wmd.py - Word Mover's Distance (WMD)

Context-Aware Semantic Similarity (CASS):

  • python cass-wordnet+bert.py - CASS using WordNet and BERT
  • python cass-word2vec+bert.py - CASS using word2vec and BERT
  • python cass-webscrapping+bert.py - CASS using web scraping and BERT

Example Scenario UWSD:

  • Sentence: Typed object-oriented programming languages, such as java and c++, often do not support first-class methods. Candidate senses: island, programming language.
    • uwsd_bert: programming language
    • uwsd_elmo: programming language
    • uwsd_use: programming language
    • uwsd_wmd: programming language
    • ChatGPT-4: programming language

Example Scenario CASS:

  • Vienna is a nice city situated in the center of the European continent. (target word: center)
    • cass-wordnet+bert: middle
    • cass-word2vec+bert: hub
    • cass-webscrapping+bert: mid
    • ChatGPT-4: middle
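One plausible reading of the WordNet + BERT variant is sketched below: pull synonym candidates for the target word from WordNet, substitute each into the sentence, and keep the candidate whose embedding stays closest to the original. The nltk and sentence-transformers usage and the cass_wordnet helper are assumptions; the repository's cass-wordnet+bert.py may differ in detail.

# Hedged sketch of CASS with WordNet candidates and BERT-style embeddings.
# Requires: pip install nltk sentence-transformers, plus nltk.download("wordnet").
import numpy as np
from nltk.corpus import wordnet as wn
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cass_wordnet(sentence: str, target: str) -> str:
    """Return the WordNet synonym of `target` that best preserves the sentence meaning."""
    candidates = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(target)
        for lemma in synset.lemmas()
        if lemma.name().lower() != target.lower()
    }
    original_vec = model.encode(sentence)
    def score(candidate: str) -> float:
        # cosine similarity between the original sentence and the substituted variant
        variant_vec = model.encode(sentence.replace(target, candidate))
        return float(np.dot(original_vec, variant_vec) /
                     (np.linalg.norm(original_vec) * np.linalg.norm(variant_vec)))
    return max(candidates, key=score)

print(cass_wordnet(
    "Vienna is a nice city situated in the center of the European continent.",
    "center"))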

📈 Performance Results

Disambiguation results on the CoarseWSD-20 dataset (hits out of roughly 10,196 test instances, as the hits-to-accuracy ratios imply):

Strategy       Hits    Accuracy
UWSD+BERT      7,927   77.74%
MFS-Baseline   7,487   73.43%
UWSD+USE       7,335   71.94%
UWSD+ELMo      7,010   68.75%
UWSD+WMD       5,868   57.55%
RO-Baseline    4,459   43.73%

📚 Citation

If you use this work, please cite:

@article{martinez2023b,
  author    = {Jorge Martinez-Gil},
  title     = {Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation},
  journal   = {CoRR},
  volume    = {abs/2305.03520},
  year      = {2023},
  url       = {https://arxiv.org/abs/2305.03520},
  doi       = {10.48550/arXiv.2305.03520},
  eprinttype = {arXiv},
  eprint    = {2305.03520}
}

📖 Research that has cited this work

  1. Pantip Multi-turn Datasets Generating from Thai Large Social Platform Forum Using Sentence Similarity Techniques
    • Authors: A. Sae-Oueng, K. Kerdthaisong, …
    • Conference: Joint Symposium on …, 2024 (IEEE)
    • Abstract: Fine-tuning Large Language Models (LLMs) for specific domains is crucial, but the lack of open Thai dialogue data presents a major challenge. To address it, the study proposes generating multi-turn dialogue datasets from Pantip, a large Thai social platform forum, using sentence similarity techniques.

📄 License

Released under the MIT License; see the LICENSE file for details.