# Modeling Languages with Their Own Parameters
This repository contains all code, models, and documentation associated with the paper *Modeling Languages with Their Own Parameters*.
This project builds multilingual word embedding models using corpora from OpenSubtitles and Wikipedia. Unlike prior work, we optimize model parameters — including embedding dimension and window size — separately for each language, improving prediction of psycholinguistic norms.
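The heart of the pipeline is that per-language hyperparameter search. As a minimal sketch of what training one model per language looks like with Gensim 3.8.3's FastText wrapper (the parameter values and file paths below are illustrative placeholders, not the tuned settings from the paper):

```python
# Illustrative per-language training loop with Gensim 3.8.3's FastText.
# PARAMS values and corpus paths are placeholders, not the paper's tuned settings.
from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence

PARAMS = {
    "en": {"size": 300, "window": 5},  # hypothetical grid-search winners
    "nl": {"size": 50, "window": 1},
}

for lang, p in PARAMS.items():
    corpus = LineSentence("corpora/%s.txt" % lang)  # one tokenized sentence per line
    model = FastText(
        sentences=corpus,
        size=p["size"],      # embedding dimension (`size` in Gensim 3.x)
        window=p["window"],  # context window
        min_count=5,
        sg=1,                # skip-gram
        iter=5,              # training epochs (`iter` in Gensim 3.x)
    )
    model.save("models/%s.model" % lang)
```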
We provide:
- Optimized fastText embeddings
- Tools for training and evaluating embeddings
- A reproducible pipeline for multilingual modeling
- A Shiny app for interactive exploration (coming soon!)
## Key Findings

- Default fastText settings (300d, window=5) are not optimal across languages.
- Best-performing settings vary widely by task, corpus type, and language.
- Small models (e.g., 50d, window=1) often outperform larger ones on some tasks.
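One way such comparisons can be scored is by predicting a psycholinguistic norm from the embeddings and correlating predictions with human ratings. A minimal sketch, assuming scikit-learn as an extra dependency (it is not in the requirements below) and toy ratings in place of real norm data:

```python
# Toy evaluation sketch: predict a psycholinguistic norm from word vectors
# with cross-validated ridge regression. scikit-learn is an assumed extra
# dependency; the ratings here are made up for illustration.
import numpy as np
from gensim.models.fasttext import FastText
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

model = FastText.load("models/en.model")           # a model trained as above
norms = {"apple": 5.0, "idea": 1.4, "chair": 4.9}  # hypothetical ratings

words = [w for w in norms if w in model.wv.vocab]
X = np.array([model.wv[w] for w in words])  # predictors: word vectors
y = np.array([norms[w] for w in words])     # target: human ratings

pred = cross_val_predict(Ridge(), X, y, cv=3)
print("cross-validated correlation: %.3f" % np.corrcoef(y, pred)[0, 1])
```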
## Repository Structure

- `code/`: Code to build corpora, train models, evaluate performance, the manuscript, and the Shiny app
- `data/`: Output evaluation data from the modeling and examples of processing
- `presentation/`: Presentations from conferences on this project
## Requirements

- Python 3.10+
- fastText (via Gensim 3.8.3)
- R 4.4.2 for reproducible manuscript analysis
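Loading a trained model for exploration is then straightforward. A minimal usage sketch, assuming Gensim 3.8.3 and an illustrative model path:

```python
# Minimal usage sketch; the model path is illustrative.
from gensim.models.fasttext import FastText

model = FastText.load("models/en.model")

# Nearest neighbours in the embedding space.
print(model.wv.most_similar("happy", topn=5))

# fastText composes vectors from character n-grams, so even out-of-vocabulary
# words (here, a misspelling) still receive a vector.
print(model.wv["happinesss"][:5])
```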