Word Embedding Retrofitting Program

Contributors

Nazanin Shafiabadi, nazanin.shafiabadi@etu.u-paris.fr

Liam Duignan, liam.duignan@etu.u-paris.fr

Description

The Word Embedding Retrofitting Program implements the retrofitting algorithm proposed by Faruqui et al. (2015). It is applied as a post-processing step that enriches pre-trained word embeddings with knowledge from semantic lexicons extracted from WordNet and PPDB (the Paraphrase Database). As demonstrated in <title of the report>, the embeddings generated by this implementation generally outperform both the original embeddings and those generated by Faruqui et al. (TO BE DETERMINED). The tool can be applied to English or French word vectors obtained from any vector training model.
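To illustrate the idea, here is a minimal sketch of the iterative update from Faruqui et al. (2015), not the code shipped in this repository. It assumes the weights used in the paper's experiments (weight 1 on the original vector, 1/degree on each lexicon neighbour); the function name `retrofit` and the plain-dict data layout are hypothetical choices made for this example.

```python
def retrofit(vectors, lexicon, iterations=10):
    """Return retrofitted copies of `vectors` (dict: word -> list of floats).

    `lexicon` maps a word to an iterable of related words. Words without
    any in-vocabulary neighbour keep their original vector.
    """
    # Start from the original embeddings; the input dict is never modified.
    new_vectors = {word: list(vec) for word, vec in vectors.items()}

    for _ in range(iterations):
        for word, related in lexicon.items():
            if word not in vectors:
                continue
            # Keep only neighbours that have an embedding (assumption:
            # self-links and duplicate entries are ignored).
            neighbours = [n for n in dict.fromkeys(related)
                          if n in vectors and n != word]
            if not neighbours:
                continue
            k = len(neighbours)
            original = vectors[word]
            # With alpha = 1 and beta = 1/k the update reduces to
            # (sum of neighbour vectors + k * original vector) / (2 * k).
            new_vectors[word] = [
                (sum(new_vectors[n][d] for n in neighbours) + k * original[d])
                / (2 * k)
                for d in range(len(original))
            ]
    return new_vectors


toy_vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [5.0, 5.0]}
toy_lexicon = {"happy": ["glad"], "glad": ["happy"]}
print(retrofit(toy_vectors, toy_lexicon, iterations=1))
```

In this toy run, "happy" and "glad" move toward each other while "table", which has no lexicon entry, is left untouched.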

Features

  • Support for English and French languages
  • Choice of lexicon source: WordNet or PPDB
  • Customizable number of iterations for retrofitting
  • Output saved to a specified file for further analysis

Requirements

  • Python 3.6 or above
  • NLTK (Natural Language Toolkit) library
  • WordNet database (included in NLTK)
  • PPDB database (available for download separately)
  • Operating System: Windows, macOS, or Linux

Installation

  1. Ensure you have Python 3.6 or above installed on your system. You can download Python from the official Python website (https://www.python.org) and follow the installation instructions for your operating system.

  2. Install the NLTK library by executing the following command in your terminal or command-line interface:

     pip install nltk

  3. Download the WordNet resources by running the following Python script:

     import nltk
     nltk.download('wordnet')

  4. Download the PPDB resources by visiting the PPDB website (http://paraphrase.org/#/download) and following the instructions for downloading the appropriate version for your language.

  5. Clone or download the Word Embedding Retrofitting Program repository from GitHub to your local machine.

  6. Place the PPDB resources in the designated directory within the program repository.

  7. You are now ready to use the Word Embedding Retrofitting Program!

Data you need

  • A word embeddings file in .gz, .txt, or generic file format. Each line should contain one word followed by its vector representation, space-delimited.
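As an illustration of the expected input format, the following sketch reads such a file into a dictionary. It is not the loader used by the program: the name `load_embeddings` is hypothetical, and the sketch assumes UTF-8 text, no header line, and gzip compression whenever the path ends in .gz.

```python
import gzip


def load_embeddings(path):
    """Read 'word v1 v2 ... vn' lines into a dict: word -> list of floats."""
    # Assumption: a .gz suffix means gzip-compressed text; anything else is
    # opened as a plain text file.
    opener = gzip.open if str(path).endswith(".gz") else open
    vectors = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank lines and lines with no vector values
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors
```

For example, `load_embeddings("sample_vec.txt")` would return a dictionary keyed by word, which is a convenient shape for looking up a word's lexicon neighbours during retrofitting.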

Usage

The Word Embedding Retrofitting Program consists of two main files:

  1. shafiabadi-duignan-retrofit.py: the entry point. Run it with the Python interpreter, providing the required arguments, to start the retrofitting process. The retrofitted word embeddings are saved to the specified output file for further analysis.

  2. lexicon.py: used internally by the program. It is not meant to be run on its own and produces no output if it is.

Running the program

python shafiabadi-duignan-retrofit.py <embeddings_file_path> <language> <lexicon> <iterations> <output_file_path>

Example: python shafiabadi-duignan-retrofit.py sample_vec.txt eng wn 10 retrofitted_vec.txt

Arguments description

<embeddings_file_path>: the path to the pre-trained word embeddings you wish to retrofit

<language>: either "eng" (for English) or "fra" (for French)

<lexicon>: Supported values (case insensitive) include:

'wordnet' or 'wn': Retrieves only the synonymy relations from the WordNet database.

'wordnet+' or 'wn+': Retrieves synonymy, hypernymy, and hyponymy relations from the WordNet database.

'ppdb': Retrieves paraphrase relations from the Paraphrase Database.
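The mapping from lexicon values to relation types can be summarised in code. This is only a sketch of the behaviour described above, not the contents of lexicon.py; the names `RELATIONS_BY_LEXICON` and `relations_for` and the relation labels are hypothetical.

```python
# Hypothetical lookup table: accepted <lexicon> values -> relations retrieved.
RELATIONS_BY_LEXICON = {
    "wordnet": ("synonyms",),
    "wn": ("synonyms",),
    "wordnet+": ("synonyms", "hypernyms", "hyponyms"),
    "wn+": ("synonyms", "hypernyms", "hyponyms"),
    "ppdb": ("paraphrases",),
}


def relations_for(lexicon):
    """Return the relation types for a <lexicon> value, ignoring case."""
    key = lexicon.strip().lower()
    if key not in RELATIONS_BY_LEXICON:
        supported = ", ".join(sorted(RELATIONS_BY_LEXICON))
        raise ValueError(f"unsupported lexicon {lexicon!r}; expected one of: {supported}")
    return RELATIONS_BY_LEXICON[key]


print(relations_for("WN+"))
```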

<iterations>: an integer specifying the number of optimization iterations to perform. n = 10 usually gives reasonable results.

<output_file_path>: the path of the file to which the retrofitted embeddings are written
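The five positional arguments could be parsed along the following lines. This is a sketch using the standard `argparse` module; how the actual script reads its arguments is not documented here, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for the five positional arguments described above."""
    parser = argparse.ArgumentParser(
        prog="shafiabadi-duignan-retrofit.py",
        description="Retrofit pre-trained word embeddings to a semantic lexicon.",
    )
    parser.add_argument("embeddings_file_path")
    parser.add_argument("language", choices=["eng", "fra"])
    # type=str.lower runs before the choices check, so 'WN' is accepted as 'wn'.
    parser.add_argument(
        "lexicon",
        type=str.lower,
        choices=["wordnet", "wn", "wordnet+", "wn+", "ppdb"],
    )
    parser.add_argument("iterations", type=int)
    parser.add_argument("output_file_path")
    return parser


# Mirrors the example command line shown earlier.
args = build_parser().parse_args(
    ["sample_vec.txt", "eng", "wn", "10", "retrofitted_vec.txt"]
)
print(args.language, args.lexicon, args.iterations)
```

Declaring the accepted values as `choices` makes the parser reject an unsupported language or lexicon with a usage message before any embeddings are loaded.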

Reference

@InProceedings{faruqui:2015:NAACL,
  author    = {Faruqui, Manaal and Dodge, Jesse and Jauhar, Sujay K.  and  Dyer, Chris and Hovy, Eduard and Smith, Noah A.},
  title     = {Retrofitting Word Vectors to Semantic Lexicons},
  booktitle = {Proceedings of NAACL},
  year      = {2015},
}