This repository contains code for semantic change analysis of words that combines word embeddings with word frequencies. For results on a synthetic corpus, the Google Books Ngram Corpus, and Twitter data, see:
Adrian Englhardt, Jens Willkomm, Martin Schäler and Klemens Böhm, "Improving Semantic Change Analysis by Combining Word Embeddings and Word Frequencies", International Journal on Digital Libraries (IJDL), 19 Mar 2019.
For a download of the generated word embeddings and results, see the companion website.
The code is licensed under the MIT License and the data set under a Creative Commons Attribution 4.0 International License. If you use this code or data set in your scientific work, please cite the companion paper.
To install the package, run:

    pip install -r requirements.txt
    python setup.py install
Run the tests with:

    python setup.py test

or run `tox` to test against all supported Python versions (2.7, 3.5 and 3.6).
This repository provides the following components for semantic change analysis:
- Training word embeddings: Given a configuration file, `scaf/jobs/training.py` trains a word embedding model from a corpus in the Google Books Ngram format (see the training sketch below).
- Evaluating word embeddings: Test a word embedding model with word sense and analogy tests (`scaf/jobs/embedding_evaluation.py`; see the evaluation sketch below).
- Building time series: Combine word embedding similarities with word frequencies into a two-dimensional time series (`scaf/jobs/build_timeseries.py`; see the time series sketch below).
- Change detection: Given a configuration file, `scaf/jobs/change_detection.py` runs the change detection for the synthetic corpora (see the change detection sketch below).
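The actual training pipeline and its configuration format live in `scaf/jobs/training.py`. As a rough, self-contained sketch of the underlying step only, here is minimal word2vec training with gensim; the toy corpus, parameters, and output file name are placeholders, not the repository's configuration:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for one time slice; the repository
# instead reads corpora in the Google Books Ngram format.
sentences = [
    ["semantic", "change", "analysis"],
    ["word", "embeddings", "and", "word", "frequencies"],
]

# Train a small word2vec model (gensim 4.x parameter names; older
# versions use `size` and `iter` instead of `vector_size` and `epochs`).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=10)
model.save("embedding_slice.model")
```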
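In the same spirit, a minimal sketch of the analogy-test idea behind `scaf/jobs/embedding_evaluation.py`, using the Google analogy test set bundled with gensim; the repository's job may use different test files, and meaningful scores require a model trained on a real corpus rather than the toy one above:

```python
from gensim.models import Word2Vec
from gensim.test.utils import datapath

model = Word2Vec.load("embedding_slice.model")

# Accuracy on the Google analogy questions shipped with gensim;
# out-of-vocabulary questions are skipped.
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print("analogy accuracy:", score)
```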
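The core of the time series construction can be illustrated as follows. Everything here is an assumption made for the sketch, not the repository's API: the embeddings of the individual time slices are taken to be already aligned into a common vector space, and frequencies are relative frequencies per slice:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_timeseries(word, slice_vectors, slice_frequencies):
    """For each time slice, pair the word's cosine similarity to the
    previous slice with its relative frequency, giving a two-dimensional
    time series.

    slice_vectors: list of dict-like word -> vector maps, one per slice.
    slice_frequencies: list of dict-like word -> relative frequency maps.
    """
    series = []
    for prev, curr, freq in zip(slice_vectors, slice_vectors[1:], slice_frequencies[1:]):
        series.append((cosine(prev[word], curr[word]), freq[word]))
    return series
```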
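Finally, a deliberately simple stand-in for the change detection step: a z-score threshold on one dimension of the series. The repository's `scaf/jobs/change_detection.py` configures its own detection method; the function and threshold below are illustrative only:

```python
import numpy as np

def detect_changes(values, threshold=2.0):
    """Return indices where the series deviates strongly from its mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.where(np.abs(z) > threshold)[0]

# Example: a stable similarity series with one abrupt drop.
sims = [0.95, 0.94, 0.96, 0.95, 0.40, 0.94, 0.95]
print(detect_changes(sims))  # -> [4]
```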
For a full example, from training the word embeddings through the change detection, see the notebook in `example/example.ipynb`.
For questions and comments, please contact Adrian Englhardt.