This repository provides the code and static word embeddings (SWEs) proposed in our paper "Static Word Embeddings for Sentence Semantic Representation" (EMNLP 2025 Main).
English and cross-lingual (English-{German/Japanese/Chinese}) SWEs are stored in the "embeddings" folder. The code and SWE models, except for the English-Japanese one ("swe_mgte256_enja.txt"), are released under the Apache License 2.0. The English-Japanese model follows the JParaCrawl license (placed at "embeddings/LICENSE_swe_mgte256_enja.txt").
Refer to "example.py" for how to use English SWEs, and "example_xling.py" for cross-lignual ones. As denoted in the paper title, these embeddings are more effective for encoding sentences than long text like paragraphs/documents.
First, prepare the "word2sent.pkl" file, which pickles a Python dictionary whose keys are the words in a (pre-defined) vocabulary and whose values are lists of N unlabelled sentences (not passages or documents) containing the key word (e.g. "good": ["I have good news.", "He is a good student.", "That sounds good.", ...]). In our paper, we employ CC-100, split the text in each line into sentences using BlingFire, and then sample N=100 sentences for each word in the 150k vocabulary.
Note that some pieces of code are hard-coded for the BERT-style tokenisation that specifies the subword boundary with "##". Modify relevant parts if necessary.
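A minimal sketch of how "word2sent.pkl" could be built is shown below. The vocabulary file, corpus file, and whitespace tokenisation are illustrative assumptions (in the paper, CC-100 is used and sentences are split with BlingFire).

import pickle
import random
from collections import defaultdict

N = 100
vocab = set(line.strip() for line in open("vocab.txt", encoding="utf-8"))  # hypothetical vocabulary file, one word per line

word2sent = defaultdict(list)
with open("sentences.txt", encoding="utf-8") as f:  # hypothetical corpus file, one sentence per line
    for sent in f:
        sent = sent.strip()
        for w in set(sent.lower().split()):
            if w in vocab:
                word2sent[w].append(sent)

# Keep at most N sentences per word.
word2sent = {w: random.sample(s, min(N, len(s))) for w, s in word2sent.items()}

with open("word2sent.pkl", "wb") as f:
    pickle.dump(word2sent, f)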
model="Alibaba-NLP/gte-base-en-v1.5"
word2sent=path_to_word2sent.pkl
output_folder=output_folder_path
nsent=100
CUDA_VISIBLE_DEVICES=0 python extract_embs.py -prompt "" -output_folder ${output_folder} -model ${model} -word2sent ${word2sent} -nsent ${nsent}
model="Alibaba-NLP/gte-base-en-v1.5"
word2sent=path_to_word2sent.pkl
vec_path=output_folder_path/vec.txt
output_folder=output_pca_folder_path
python apply_pca.py -d_remove 7 -embd 256 -word2sent ${word2sent} -vec_path ${vec_path} -model ${model} -output_folder ${output_folder}
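Conceptually, this step post-processes the extracted vectors with PCA: removing the top principal components (-d_remove 7) and keeping a 256-dimensional representation (-embd 256). The following is a rough sketch of that idea using scikit-learn; it is not the repository's "apply_pca.py", and the exact centering and ordering of operations may differ.

import numpy as np
from sklearn.decomposition import PCA

def postprocess(X, d_remove=7, embd=256):
    # Centre the vectors, project out the top d_remove principal components,
    # then keep an embd-dimensional PCA projection.
    X = X - X.mean(axis=0)
    top = PCA(n_components=d_remove).fit(X)
    X = X - X @ top.components_.T @ top.components_
    return PCA(n_components=embd).fit_transform(X)

X = np.random.randn(1000, 768).astype(np.float32)  # stand-in for the extracted word vectors
print(postprocess(X).shape)  # (1000, 256)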
model="Alibaba-NLP/gte-base-en-v1.5"
word2sent=path_to_word2sent.pkl
vec_path=output_pca_folder_path/vec.txt
output_folder=final_output_folder_path
CUDA_VISIBLE_DEVICES=0 python train.py -prompt "" -word2sent ${word2sent} -epoch 15 -bs 128 -model ${model} -vec_path ${vec_path} -output_folder ${output_folder}
As with the monolingual SWEs, prepare the "word2sent.pkl" file, which pickles a Python dictionary whose keys are the words in a pre-defined vocabulary and whose values are lists of N unlabelled sentences (not passages or documents) containing the key word. In our paper, we use CCMatrix and sample N=100 sentences for each word.
Note that some pieces of code are hard-coded for language pairs used in our paper (en-de, en-zh, en-ja); modify relevant parts if necessary.
model="Alibaba-NLP/gte-multilingual-base"
word2sent_en=path_to_english_word2sent
folder=output_english_folder_path
CUDA_VISIBLE_DEVICES=0 python extract_embs.py -prompt "" -folder ${folder} -model ${model} -word2sent ${word2sent_en} -nsent 100
word2sent_de=path_to_german_word2sent
folder=output_german_folder_path
CUDA_VISIBLE_DEVICES=0 python extract_embs.py -prompt "" -folder ${folder} -model ${model} -word2sent ${word2sent_de} -nsent 100
(If the input language is Japanese or Chinese, enable the "-subword" option.)
Generate English-German SWEs
langs="en de"
vec_path="output_english_folder_path/vec.txt output_german_folder_path/vec.txt"
model="Alibaba-NLP/gte-multilingual-base"
word2sent="${word2sent_en} ${word2sent_de}"
output_folder=output_pca_folder_path
python apply_pca_xling.py -d_remove 7 -embd 256 -langs ${langs} -word2sent ${word2sent} -vec_path ${vec_path} -model ${model} -output_folder ${output_folder}
You can also generate multilingual SWEs as follows (e.g. SWEs aligned across English, German, Chinese, and Japanese, which are evaluated in Tables 10 and 11 of the paper).
langs="en de zh ja"
vec_path="output_english_folder_path/vec.txt output_german_folder_path/vec.txt output_chinese_folder_path/vec.txt output_japanese_folder_path/vec.txt"
model="Alibaba-NLP/gte-multilingual-base"
word2sent="${word2sent_en} ${word2sent_de} ${word2sent_zh} ${word2sent_ja}"
output_folder=output_pca_folder_path
python apply_pca_xling.py -d_remove 7 -embd 256 -langs ${langs} -word2sent ${word2sent} -vec_path ${vec_path} -model ${model} -output_folder ${output_folder}
Prepare "en.txt" and "de.txt", where each line is a sentence that is parallel (translation) to each language (hence, both files must have the same numbner of lines). These files are used for contrastive learning. In our paper, we use CCMatrix as in Step 1.
vec_path=output_pca_folder_path/vec.txt
lang=ende
output_folder=final_output_folder_path
model="Alibaba-NLP/gte-multilingual-base"
parallel_sents="en.txt de.txt"
CUDA_VISIBLE_DEVICES=0 python train_xling.py -parallel_sents ${parallel_sents} -lang ${lang} -epoch 15 -bs 128 -model ${model} -vec_path ${vec_path} -output_folder ${output_folder}
Note: The code used in Step 3 is designed for training bilingual SWEs (as described in our paper), but it can easily be extended to multilingual training by feeding parallel sentences of multiple language pairs and jointly minimising the contrastive learning loss.
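As an illustration of what "jointly minimising the contrastive learning loss" could look like, below is a generic in-batch contrastive (InfoNCE-style) loss over parallel sentence embeddings; the actual loss and training loop in "train_xling.py" may differ.

import torch
import torch.nn.functional as F

def contrastive_loss(src, tgt, temperature=0.05):
    # In-batch contrastive (InfoNCE-style) loss over a batch of parallel
    # sentence embeddings: the i-th source should match the i-th target.
    src = F.normalize(src, dim=-1)
    tgt = F.normalize(tgt, dim=-1)
    logits = src @ tgt.T / temperature
    labels = torch.arange(src.size(0), device=src.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Joint multilingual training would sum this loss over several language pairs, e.g.
# loss = sum(contrastive_loss(emb[l1], emb[l2]) for (l1, l2) in language_pairs)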
If you use our code or models, please cite our paper as follows:
@inproceedings{wada-etal-2025-static,
title = "Static Word Embeddings for Sentence Semantic Representation",
author = "Wada, Takashi and
Hirakawa, Yuki and
Shimizu, Ryotaro and
Kawashima, Takahiro and
Saito, Yuki",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.316/",
pages = "6206--6222",
ISBN = "979-8-89176-332-6",
}