Nazanin Shafiabadi, nazanin.shafiabadi@etu.u-paris.fr
Liam Duignan, liam.duignan@etu.u-paris.fr
The Word Embedding Retrofitting Program implements the retrofitting algorithm proposed by Faruqui et al. (2015). It is applied as a post-processing step that enriches pre-trained word embeddings with knowledge from semantic lexicons extracted from WordNet and PPDB (the Paraphrase Database). As demonstrated in <title of the report>, the embeddings generated by this implementation generally outperform both the original embeddings and those generated by Faruqui et al. (TO BE DETERMINED). The tool works on English or French word vectors obtained from any vector training model.
- Support for English and French languages
- Choice of lexicon source: WordNet or PPDB
- Customizable number of iterations for retrofitting
- Output saved to a specified file for further analysis
- Python 3.6 or above
- NLTK (Natural Language Toolkit) library
- WordNet database (included in NLTK)
- PPDB database (available for download separately)
- Operating System: Windows, macOS, or Linux
- Ensure you have Python 3.6 or above installed on your system. You can download Python from the official Python website (https://www.python.org) and follow the installation instructions for your operating system.
- Install the NLTK library by executing the following command in your terminal or command-line interface:
pip install nltk
- Download the WordNet resources by running the following Python script:
import nltk
nltk.download('wordnet')
- Download the PPDB resources by visiting the PPDB website (http://paraphrase.org/#/download) and following the instructions for downloading the appropriate version for your language.
- Clone or download the Word Embedding Retrofitting Program repository from GitHub to your local machine.
- Place the PPDB resources in the designated directory within the program repository.
- You are now ready to use the Word Embedding Retrofitting Program!
- Word embeddings file in .gz, .txt, or generic file format, with one word per line followed by its space-delimited vector representation
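To make the expected input format concrete, here is a minimal sketch of how such a file could be read. It is not the program's own loader: the function name `load_embeddings` is hypothetical, and the sketch assumes UTF-8 text, no header line, and that compressed files can be recognised by a `.gz` suffix.

```python
import gzip
from pathlib import Path


def load_embeddings(path):
    """Read 'word v1 v2 ...' lines into a dict mapping word -> list of floats.

    Hypothetical helper for illustration; not part of the program itself.
    """
    path = Path(path)
    # Assumption: gzip-compressed files are identified by their .gz suffix;
    # anything else is treated as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    vectors = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            # First field is the word, the rest are the vector components.
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

A file in word2vec text format that starts with a "vocabulary-size dimension" header line would need that first line skipped before using a reader like this one.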
The Word Embedding Retrofitting Program consists of two main files:
- shafiabadi-duignan-retrofit.py: Run this file using the Python interpreter, from a text editor or Python IDE of your choice, providing the required arguments. This starts the retrofitting process; the retrofitted word embeddings are saved to the specified output file for further analysis.
- lexicon.py: This file is used internally by the program and is not meant to be run independently (it produces no output).
python shafiabadi-duignan-retrofit.py <embeddings_file_path> <language> <lexicon> <iterations> <output_file_path>
Example:
python shafiabadi-duignan-retrofit.py sample_vec.txt eng wn 10 retrofitted_vec.txt
<embeddings_file_path>: the path to the pre-trained word embeddings you wish to retrofit
<language>: either "eng" (for English) or "fra" (for French)
<lexicon>: supported values (case-insensitive) include:
  - 'wordnet' or 'wn': Retrieves only the synonymy relations from the WordNet database.
  - 'wordnet+' or 'wn+': Retrieves synonymy, hypernymy, and hyponymy relations from the WordNet database.
  - 'ppdb': Retrieves paraphrase relations from the Paraphrase Database.
<iterations>: an integer specifying the number of iterations for which the optimization is performed. Usually n = 10 gives reasonable results.
<output_file_path>: the file in which the resulting retrofitted embeddings are saved
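To illustrate what the <iterations> argument controls, the following is a rough sketch of the iterative update described by Faruqui et al. (2015). It is not taken from this program's source. It assumes the weights used in that paper's experiments (alpha_i = 1 and beta_ij = 1 / degree(i)), under which each word's new vector is the average of its original vector and the mean of its lexicon neighbours' current vectors. The function name `retrofit` and the dict-based data layout are assumptions made for illustration.

```python
def retrofit(vectors, lexicon, iterations=10):
    """Sketch of the retrofitting update (illustrative, not the program's code).

    vectors: dict mapping word -> list of floats (the original embeddings)
    lexicon: dict mapping word -> iterable of related words
    """
    # Start from a copy so the original vectors stay available as the anchor.
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, original in vectors.items():
            # Assumption: neighbours without a vector, and self-links, are ignored.
            neighbours = [n for n in lexicon.get(word, ()) if n in new and n != word]
            if not neighbours:
                continue  # words with no usable neighbours keep their vector
            dim = len(original)
            # Mean of the neighbours' current (already partly updated) vectors.
            mean = [sum(new[n][k] for n in neighbours) / len(neighbours)
                    for k in range(dim)]
            # With alpha = 1 and beta = 1/degree, the update is a plain average.
            new[word] = [(original[k] + mean[k]) / 2 for k in range(dim)]
    return new
```

For example, with vectors `{"a": [0.0, 0.0], "b": [2.0, 2.0]}` and lexicon `{"a": ["b"]}`, a single iteration moves "a" halfway towards "b" and leaves "b" unchanged, since "b" has no lexicon entry of its own.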
@InProceedings{faruqui:2015:NAACL,
author = {Faruqui, Manaal and Dodge, Jesse and Jauhar, Sujay K. and Dyer, Chris and Hovy, Eduard and Smith, Noah A.},
title = {Retrofitting Word Vectors to Semantic Lexicons},
booktitle = {Proceedings of NAACL},
year = {2015},
}