This repository contains the code to build subword embeddings from the arXiv dataset of 1.7M+ scholarly papers.
[Download the arXiv dataset], decompress archive.zip, and place the file arxiv-metadata-oai-snapshot.json into the data/ directory.
Install required Python modules:
pip3 install -r requirements.txt
Follow the instructions to build and install the SentencePiece command line tools from C++ source.
Follow the instructions to build and install GloVe.
We follow the idea of pre-trained subword embeddings from Heinzerling and Strube (2018).
# Extract the textual content from the arXiv dataset
# this creates a one-sentence-per-line raw corpus file
# 12,807,583 lines
python3 src/extract.py data/arxiv-metadata-oai-snapshot.json \
data/arxiv-metadata-oai-snapshot.txt
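For reference, a minimal sketch of what this extraction step might look like, assuming the snapshot is a JSON Lines file and that titles and abstracts are the fields kept; src/extract.py is the actual implementation and may differ:

# Sketch of the extraction step (src/extract.py is the reference implementation).
# Assumes one JSON record per line with "title" and "abstract" fields, and that a
# simple regex split is enough to produce one sentence per output line.
import json
import re
import sys

def sentences(text):
    # Naive splitter: break on '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def extract(json_path, txt_path):
    with open(json_path, encoding='utf-8') as fin, \
         open(txt_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            record = json.loads(line)
            for field in ('title', 'abstract'):
                text = ' '.join(record.get(field, '').split())
                for sentence in sentences(text):
                    fout.write(sentence + '\n')

if __name__ == '__main__':
    extract(sys.argv[1], sys.argv[2])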
# Train a SentencePiece model from the corpus file
spm_train --input=data/arxiv-metadata-oai-snapshot.txt \
--model_prefix=data/arxiv-metadata-oai-snapshot \
--vocab_size=10000
# Encode the corpus file using the SentencePiece model
spm_encode --model=data/arxiv-metadata-oai-snapshot.model \
--output_format=piece \
< data/arxiv-metadata-oai-snapshot.txt \
> data/arxiv-metadata-oai-snapshot.piece
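As a quick sanity check, the trained model can also be loaded from Python with the sentencepiece package; the sample sentence below is only illustrative:

# Load the trained SentencePiece model and encode one sentence into pieces,
# the Python equivalent of spm_encode --output_format=piece for a single line.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='data/arxiv-metadata-oai-snapshot.model')
pieces = sp.encode('We study the convergence of stochastic gradient descent.', out_type=str)
print(' '.join(pieces))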
# Train the subword GloVe vectors
# script adapted from https://github.com/stanfordnlp/GloVe/blob/master/demo.sh
./src/train-glove.sh
Pre-trained models are available in the data/ directory.
data/arxiv-metadata-oai-snapshot.model is the SentencePiece model.
data/arxiv-metadata-oai-snapshot.vocab is the SentencePiece vocabulary file.
data/vectors.txt and data/vectors.bin are the learned GloVe vectors (50 dimensions).
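A minimal sketch of how these files might be used together, in the spirit of subword embeddings (Heinzerling and Strube, 2018): tokenize a word into pieces with the SentencePiece model, look the pieces up in vectors.txt, and average them. Averaging is an assumption made here for illustration, not something the repository prescribes:

# Build a vector for an arbitrary word by averaging the GloVe vectors of its pieces.
# Assumes vectors.txt uses the standard GloVe text format: "piece v1 v2 ... v50" per line.
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='data/arxiv-metadata-oai-snapshot.model')

vectors = {}
with open('data/vectors.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def embed(word):
    # Average the vectors of the word's SentencePiece pieces that appear in vectors.txt.
    pieces = [p for p in sp.encode(word, out_type=str) if p in vectors]
    if not pieces:
        return np.zeros(50, dtype=np.float32)
    return np.mean([vectors[p] for p in pieces], axis=0)

print(embed('electrodynamics')[:5])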