By Zachary Flamholz, Andrew Crane-Droesch, Lyle Ungar, Gary Weissman
Pre-trained word embeddings built from the text of published clinical case reports. See the pre-print for a detailed description of the methods used to build and test the word embeddings.
Model | Dimension | Open Access Case Reports | Open Access All Manuscripts |
---|---|---|---|
word2vec | 100 | Download - 269 MB | Download - 2.7 GB |
word2vec | 300 | Download - 716 MB | Download - 7.8 GB |
word2vec | 600 | Download - 1.4 GB | |
fastText | 100 | Download - 798 MB | Download - 4.7 GB |
fastText | 300 | Download - 2.3 GB | Download - 13.8 GB |
fastText | 600 | Download - 4.6 GB | |
GloVe | 100 | Download - 157 MB | Download - 1.3 GB |
GloVe | 300 | Download - 445 MB | Download - 3.8 GB |
GloVe | 600 | Download - 862 MB | Download - 7.4 GB |
The word embeddings are compatible with the gensim Python package format.
First download and extract the files from each archive.
tar -xvf w2v_100d_oa_all.tar.gz
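If the `tar` command is not available, the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` and the `embeddings` destination folder are hypothetical, chosen only for illustration.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names


# Example call (archive name taken from the tar command above;
# the destination folder is an assumption):
# extract_archive("w2v_100d_oa_all.tar.gz", "embeddings")
```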
Then load the embeddings into Python.
from gensim.models import FastText, Word2Vec, KeyedVectors # KeyedVectors are used to load the GloVe models
# Load the model
model = Word2Vec.load('w2v_oa_all_100d.bin')
# Return the 100-dimensional vector representation of each word
# (index lookup, which works in both gensim 3 and gensim 4)
model.wv['diabetes']
model.wv['cardiac_arrest']
model.wv['lymphangioleiomyomatosis']
# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
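For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand and returns `None` when a word is missing from the vocabulary instead of raising an error. The helper names and the toy 3-dimensional vectors are made up for illustration and are not taken from the released embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_or_none(vectors, word_a, word_b):
    """Similarity of two words, or None if either is missing from `vectors`."""
    if word_a not in vectors or word_b not in vectors:
        return None
    return cosine_similarity(vectors[word_a], vectors[word_b])


# Toy vectors, invented for illustration only
toy_vectors = {
    "copd": [1.0, 2.0, 0.0],
    "asthma": [2.0, 4.0, 0.0],
    "fracture": [0.0, 0.0, 3.0],
}

print(similarity_or_none(toy_vectors, "copd", "asthma"))
print(similarity_or_none(toy_vectors, "copd", "not_a_word"))
```

Because `similarity_or_none` only relies on membership tests and index lookup, passing `model.wv` in place of `toy_vectors` should also work, though that has not been checked against these files.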