This is a fork whose goal is to provide an easy way to evaluate Thai word embeddings on new word similarity datasets. The accompanying publication, which describes the new Thai datasets, appeared in IEEE Access (see the citation below). The Thai datasets are translations of popular existing datasets: WordSim-353, SimLex-999, and the dataset from SemEval-2017 Task 2. The task is word similarity, which is often used for intrinsic evaluation of word embedding models. In the fork we added Spearman's rho as an additional evaluation measure, and added the option to tokenize out-of-vocabulary words with the deepcut library.
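As a rough illustration of these two additions, here is a minimal sketch of similarity evaluation with Spearman's rho and deepcut-based OOV handling; it assumes a gensim KeyedVectors model, and `model`, `pairs`, and `gold_scores` are hypothetical placeholders (the fork's actual implementation lives in the `web` package and `examples/call_thai.sh`):

```python
import numpy as np
import deepcut  # Thai tokenizer, used here for out-of-vocabulary words
from scipy.stats import spearmanr

def get_vector(model, word):
    # Look up the word directly; if it is out of vocabulary,
    # tokenize it with deepcut and average the known token vectors.
    if word in model:
        return model[word]
    tokens = [t for t in deepcut.tokenize(word) if t in model]
    if not tokens:
        return None
    return np.mean([model[t] for t in tokens], axis=0)

def spearman_score(model, pairs, gold_scores):
    # Spearman's rho between human judgements and cosine similarities.
    predicted = []
    for w1, w2 in pairs:
        v1, v2 = get_vector(model, w1), get_vector(model, w2)
        if v1 is None or v2 is None:
            predicted.append(0.0)  # simple fallback for fully unknown words
        else:
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            predicted.append(cos)
    return spearmanr(predicted, gold_scores).correlation
```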
First, please follow the installation guide from the original repo, which is duplicated below. Then execute the following commands to evaluate your own Thai word embedding file:
cd examples
bash call_thai.sh <path_to_your_embedding_file>
The datasets were created by King Mongkut's Institute of Technology Ladkrabang (KMITL), Thailand (Dr. Ponrudee Netisopakul) together with ITMO University, St. Petersburg, Russia (Dr. Gerhard Wohlgenannt, Aleksei Pulich). Please cite our work:
P. Netisopakul, G. Wohlgenannt and A. Pulich, "Word Similarity Datasets for Thai: Construction and Evaluation," IEEE Access, vol. 7, pp. 142907-142915, 2019.
@article{DBLP:journals/access/NetisopakulWP19,
author = {Ponrudee Netisopakul and Gerhard Wohlgenannt and Aleksei Pulich},
title = {Word Similarity Datasets for Thai: Construction and Evaluation},
journal = {{IEEE} Access},
volume = {7},
pages = {142907--142915},
year = {2019},
url = {https://doi.org/10.1109/ACCESS.2019.2944151},
doi = {10.1109/ACCESS.2019.2944151},
}
A preprint is available at `https://arxiv.org/abs/1904.04307`.
In the work mentioned above, we evaluate the following embeddings (a loading sketch follows the list):
- Thai2vec (Pretrained language model based on Thai Wikipedia): https://github.com/cstorm125/thai2fit
- ft-wiki (Skip-Gram model trained on Wikipedia using fastText): https://github.com/kobkrit/nlp_thai_resources#pre-trained-word-vectors
- Kyu-ft and Kyu-w2v (Pre-trained word vectors of 30+ languages): https://github.com/Kyubyong/wordvectors
- fastText (Word vectors for 157 languages): https://fasttext.cc/docs/en/crawl-vectors.html
- BPEmb Thai models (Pre-trained subword embeddings in 275 languages): https://github.com/bheinzerling/bpemb
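For illustration, any of these models can be loaded for evaluation once it is in word2vec text format; a minimal gensim sketch (the file name `cc.th.300.vec` refers to the fastText Thai vectors linked above, and the query word is just an example):

```python
from gensim.models import KeyedVectors

# Load pretrained Thai vectors in word2vec text format,
# e.g. the fastText Thai file from the link above.
model = KeyedVectors.load_word2vec_format("cc.th.300.vec", binary=False)

# Nearest neighbours of a sample word ("water")
print(model.most_similar("น้ำ", topn=5))
```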
Below is the description of the original repository by kudkudak, which includes general information, installation instructions, etc.
The Word Embedding Benchmarks (web) package focuses on providing methods for easily evaluating and reporting results on common benchmarks (analogy, similarity and categorization).
The research goal of the package is to help drive research in word embeddings by making reproducible results easily accessible (there are many contradictory results in the literature right now). This should also help answer the question of whether we should devise new methods for evaluating word embeddings.
To evaluate your embedding (converted to word2vec format or a Python dict pickle) on all fast-running benchmarks, execute ./scripts/eval_on_all.py <path-to-file>. The original repository also reports results for the embeddings available in the package.
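Besides the script, the package exposes a Python API; the following sketch follows the pattern of the repository's examples (fetcher and function names as provided by the `web` package):

```python
from web.datasets.similarity import fetch_SimLex999
from web.embeddings import fetch_GloVe
from web.evaluate import evaluate_similarity

# Fetch a pretrained embedding (the first run downloads it, which may take a while)
w = fetch_GloVe(corpus="wiki-6B", dim=300)

# Fetch a similarity dataset: X holds word pairs, y the human scores
data = fetch_SimLex999()

# Spearman correlation between embedding and human similarity scores
print(evaluate_similarity(w, data.X, data.y))
```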
Warnings and Disclaimers:
- The analogy test does not internally normalize word embeddings (a normalization sketch follows this list).
- The package is currently under development, and an official release is expected within the next few months. The main issue you might encounter at the moment is rather long embedding loading times (especially if you use fetchers).
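Because of the first point, you may want unit-length vectors before running analogy benchmarks; a minimal numpy sketch (the `vectors` matrix is a placeholder for your embedding matrix):

```python
import numpy as np

# Placeholder embedding matrix: one row per word
vectors = np.random.rand(1000, 300).astype(np.float32)

# Row-normalize to unit L2 length so dot products equal cosine similarity
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / np.maximum(norms, 1e-8)  # guard against zero rows
```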
Please also refer to our recent publication on evaluation methods: https://arxiv.org/abs/1702.02170.
Features:
- scikit-learn API and conventions
- 18 popular datasets
- 11 word embeddings (word2vec, HPCA, morphoRNNLM, GloVe, LexVec, ConceptNet, HDC/PDC and others)
- methods to solve analogy, similarity and categorization tasks
Included datasets (a fetch example follows the list):
- TR9856
- WordRep
- Google Analogy
- MSR Analogy
- SemEval2012
- AP
- BLESS
- Battig
- ESSLI (2b, 2a, 1c)
- WS353
- MTurk
- RG65
- RW
- SimLex999
- MEN
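Each dataset has a corresponding `fetch_*` function; as in the repository's examples, the returned object is an sklearn-style Bunch with word pairs in `X` and human scores in `y`:

```python
from web.datasets.similarity import fetch_WS353

# X is an array of word pairs, y the corresponding human similarity scores
data = fetch_WS353()
w1, w2 = data.X[0]
print("Pair ('{}', '{}') has human score {}".format(w1, w2, data.y[0]))
```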
Note: the embeddings are currently not hosted on a proper server; if the download is too slow, consider downloading the embeddings manually from the original sources referenced in the docstrings.
Please see the requirements.txt and pip_requirements.txt files.
This package uses setuptools. You can install it by running:
python setup.py install
If you have problems during this installation, you may first need to install the dependencies:
pip install -r requirements.txt
If you already have the dependencies listed in requirements.txt installed, you can install the package in your home directory with:
python setup.py install --user
To install for all users on Unix/Linux:
python setup.py build
sudo python setup.py install
You can also install it in development mode with:
python setup.py develop
See the examples folder.
The code is licensed under MIT; however, the embeddings distributed within the package might be under a different license. If you are unsure, please reach out to the authors (references are included in the docstrings).