wvec

A high-performance Word2Vec implementation combining Rust and Modern Fortran


English | Deutsch | Français | 繁體中文 | 한국어 | 日本語


Overview

wvec is a from-scratch implementation of the Word2Vec algorithm that leverages the unique strengths of two languages:

  • Rust handles text processing, tokenization, and the CLI — benefiting from memory safety, rich string handling, and zero-cost abstractions
  • Fortran powers the numerical core — delivering raw computational performance through BLAS integration and OpenMP parallelization

This hybrid architecture achieves both developer ergonomics and numerical performance, demonstrating modern systems programming across language boundaries.

graph LR
    subgraph "Rust Layer"
        A[Input Files] --> B[Text Processing]
        B --> C[BPE Tokenizer]
        C --> D[Token IDs]
    end

    subgraph "FFI Bridge"
        D --> E[C ABI]
    end

    subgraph "Fortran Layer"
        E --> F[Skip-gram Training]
        F --> G[BLAS Operations]
        G --> H[OpenMP Parallel]
        H --> I[Embeddings]
    end

    I --> J[Model File]

    style A fill:#f9f,stroke:#333
    style J fill:#9f9,stroke:#333

Key Features

Dual-Language Architecture

| Component          | Language | Why                                            |
|--------------------|----------|------------------------------------------------|
| Text I/O & Parsing | Rust     | Native UTF-8, pattern matching, memory safety  |
| BPE Tokenization   | Rust     | Hash maps, complex data structures             |
| CLI Interface      | Rust     | Argument parsing, error handling               |
| Matrix Operations  | Fortran  | BLAS integration, compiler optimization        |
| Parallel Training  | Fortran  | OpenMP, array-oriented operations              |

Zero External Dependencies (Rust Side)

The entire Rust codebase uses only the standard library — no external crates. This demonstrates:

  • Deep understanding of Rust's std capabilities
  • Reduced supply chain attack surface
  • No dependency version conflicts

Production-Ready Safety Features

  • Checkpointing: Save and resume training state
  • Thermal Monitoring: Prevent CPU overheating during long training runs
  • Graceful Shutdown: Handle interrupts without losing progress (a minimal handler sketch follows this list)
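
Because the Rust side uses no external crates, even signal handling has to go through a hand-declared binding to libc's signal(2). The sketch below is an assumption about how such a handler could look, not wvec's actual code; the handler name, the SIGINT constant (2 on Linux/macOS), and the training-loop shape are all illustrative.

use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative sketch only -- not wvec's actual shutdown code.
static SHUTDOWN: AtomicBool = AtomicBool::new(false);

extern "C" fn on_sigint(_signum: i32) {
    // Atomic store is safe to perform inside a signal handler.
    SHUTDOWN.store(true, Ordering::SeqCst);
}

extern "C" {
    // libc's signal(2), declared by hand since no external crates are used.
    // Returning the previous handler as usize is an ABI-level simplification.
    fn signal(signum: i32, handler: extern "C" fn(i32)) -> usize;
}

const SIGINT: i32 = 2; // POSIX value on Linux/macOS

fn train(epochs: u32) {
    unsafe { signal(SIGINT, on_sigint); }
    for _epoch in 0..epochs {
        if SHUTDOWN.load(Ordering::SeqCst) {
            // save a checkpoint here, then return cleanly
            return;
        }
        // ... run one training epoch ...
    }
}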

Architecture

flowchart TB
    subgraph Input["Input Layer"]
        TXT[".txt Files"]
        XML[".xml Wikipedia"]
        HTML[".html Pages"]
    end

    subgraph TextProcessing["Text Processing"]
        NORM["Normalization<br/>(case, punctuation)"]
        PRETOK["Pre-tokenization<br/>(language-aware)"]
    end

    subgraph BPE["BPE Tokenizer"]
        TRAIN["Training<br/>(pair merging)"]
        ENCODE["Encoding<br/>(merge rules)"]
        VOCAB["Vocabulary<br/>(binary I/O)"]
    end

    subgraph FFI["FFI Bridge"]
        TYPES["Type Mapping<br/>i32 ↔ c_int"]
        ARRAYS["Array Passing<br/>*const/*mut"]
        ERRORS["Error Codes<br/>status returns"]
    end

    subgraph Fortran["Fortran Core"]
        MODEL["Model State<br/>(singleton)"]
        SKIPGRAM["Skip-gram<br/>(neg. sampling)"]
        BLAS["BLAS<br/>(sdot, saxpy)"]
        OMP["OpenMP<br/>(Hogwild)"]
    end

    subgraph Safety["Safety Features"]
        CKPT["Checkpointing"]
        THERMAL["Thermal Monitor"]
        SHUTDOWN["Graceful Shutdown"]
    end

    Input --> TextProcessing
    TextProcessing --> BPE
    BPE --> FFI
    FFI --> Fortran
    Fortran --> Safety
    Safety --> OUTPUT[("Model File")]

    style Input fill:#e1f5fe
    style TextProcessing fill:#fff3e0
    style BPE fill:#f3e5f5
    style FFI fill:#e8f5e9
    style Fortran fill:#fce4ec
    style Safety fill:#fff8e1

Installation

Prerequisites

# Fedora/RHEL
sudo dnf install gcc-gfortran openblas-devel

# Ubuntu/Debian
sudo apt install gfortran libopenblas-dev

# macOS
brew install gcc openblas

Build

git clone https://github.com/hwang-fu/wvec.git
cd wvec
cargo build --release

Testing

# Run all tests (single-threaded mode required)
$ cargo test -- --test-threads=1

running 162 tests
test bpe::decode::tests::test_decode_basic ... ok
test bpe::encode::tests::test_encode_basic ... ok
...
test ffi::tests::test_checkpoint_save_load ... ok
test ffi::tests::test_train_corpus ... ok

test result: ok. 162 passed; 0 failed; 0 ignored

# Clean build artifacts
$ cargo clean
$ make -C fortran clean

Why --test-threads=1?

The Fortran numerical core uses a singleton pattern for embedding matrices (g_w_in, g_w_out). This design enables efficient OpenMP parallelization within a single training session, but means multiple Rust tests cannot safely call wvec_model_init() / wvec_model_free() concurrently.

Running tests single-threaded prevents race conditions on the shared Fortran state. This is the correct trade-off: we want intra-training parallelism (OpenMP threads inside Fortran), not inter-test parallelism (multiple Rust tests fighting over the same model).


Usage

Train Word Vectors

$ wvec train --input enwiki.txt --output model.bin --dim 128 --epochs 3
Training word vectors...
  Input: enwiki.txt
  Output: model.bin
  Dim: 128, Window: 5, Neg: 5, LR: 0.025, Epochs: 3

[1/5] Reading corpus...
  4,521,873 pre-tokens

[2/5] Preparing vocabulary...
  Training BPE (target size: 50000)
  Saved vocabulary to model.bin.vocab
  Vocabulary size: 32,847

[3/5] Encoding corpus...
  5,892,104 token IDs

[4/5] Building negative sampling table...
  Table size: 1000000

[5/5] Training...
  Epoch 1/3
  Epoch 2/3
  Epoch 3/3

Done! Model saved to model.bin

Query Trained Model

$ wvec similar --model model.bin --word "king" --topk 5
Similar to 'king':
  0.8234  queen
  0.7891  prince
  0.7654  monarch
  0.7432  throne
  0.7218  kingdom

$ wvec analogy --model model.bin --query "king - man + woman"
Analogy: king - man + woman
Results:
  0.7912  queen
  0.6843  princess
  0.6521  duchess
  0.6234  empress
  0.5987  monarch

$ wvec embed --model model.bin --word "hello"
Embedding for 'hello' (dim=128):
[
 -0.177975 -0.018467 -0.053807 -0.037715 -0.020454  0.104731  0.103260 -0.060821
  0.154096  0.125020  0.132915  0.018168 -0.081816  0.109417 -0.043241 -0.020463
  ...
]
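
All three query commands reduce to vector arithmetic plus a cosine-similarity ranking over the vocabulary. The sketch below shows the underlying math; the function names are illustrative, not wvec's actual API.

// Cosine similarity: the ranking score behind `similar` and `analogy`.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// "king - man + woman": combine embeddings element-wise, then rank every
// vocabulary word by cosine similarity against the resulting vector.
fn analogy_vector(king: &[f32], man: &[f32], woman: &[f32]) -> Vec<f32> {
    king.iter()
        .zip(man)
        .zip(woman)
        .map(|((k, m), w)| k - m + w)
        .collect()
}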

BPE Tokenizer Utilities

$ wvec bpe-train --input corpus.txt --output vocab.bin --vocab-size 10000
Training BPE tokenizer...
  Input: corpus.txt
  Output: vocab.bin
  Vocab size: 10000
  Collected 158,432 pre-tokens
  Vocabulary: 8,291 tokens
Done! Saved to vocab.bin

$ wvec bpe-encode --vocab vocab.bin --text "tokenization is fun"
Input: tokenization is fun
Normalized: tokenization is fun

Tokens:
  'token' -> [1842]
  'ization' -> [3291]
  'is' -> [67]
  'fun' -> [2104]

All token IDs: [1842, 3291, 67, 2104]
Total: 4 tokens

$ wvec info vocab.bin
BPE Vocabulary: vocab.bin
  Tokens: 8,291
  Merge rules: 4,287

  Sample tokens:
       0: [UNK]
       1: [PAD]
       2: [BOS]
       3: [EOS]
       4: e
       5: t
       6: a
       7: o
       8: n
       9: i

Algorithm Details

Skip-gram with Negative Sampling

The training objective maximizes:

$$\log \sigma(v_{w_O}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_i}^{\top} v_{w_I}) \right]$$

Where:

  • $v_{w_I}$ is the input (center) word embedding
  • $v_{w_O}$ is the output (context) word embedding
  • $k$ negative samples are drawn from the noise distribution $P_n(w) \propto U(w)^{3/4}$, where $U(w)$ is the unigram frequency (table construction is sketched below)
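
Building the sampling table amounts to filling a fixed-size array so that each token ID occupies a share proportional to its count raised to 3/4, as in the original word2vec InitUnigramTable. A minimal sketch, assuming a counts slice indexed by token ID (not necessarily wvec's internals):

// Unigram^(3/4) negative-sampling table. Assumes a non-empty vocabulary;
// `counts[i]` holds the corpus frequency of token i.
fn build_neg_table(counts: &[u64], table_size: usize) -> Vec<i32> {
    let pow: Vec<f64> = counts.iter().map(|&c| (c as f64).powf(0.75)).collect();
    let total: f64 = pow.iter().sum();

    let mut table = vec![0i32; table_size];
    let mut word = 0usize;
    let mut cum = pow[0] / total; // cumulative probability mass covered so far
    for slot in 0..table_size {
        table[slot] = word as i32;
        // move to the next word once its share of the table is filled
        if (slot as f64 + 1.0) / table_size as f64 > cum && word + 1 < counts.len() {
            word += 1;
            cum += pow[word] / total;
        }
    }
    table
}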

Gradient update (per word pair):

g = (label - sigmoid(score)) * learning_rate
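
As a simplified illustration of that update, the sketch below applies the gradient to one (center, context-or-negative) pair immediately. The reference word2vec implementation accumulates the input-side gradient across all negatives before applying it, and in wvec the actual work happens in Fortran via sdot/saxpy; this only spells out the arithmetic.

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

// `label` is 1.0 for the true context word and 0.0 for a negative sample.
fn update_pair(w_in: &mut [f32], w_out: &mut [f32], label: f32, lr: f32) {
    let score: f32 = w_in.iter().zip(w_out.iter()).map(|(a, b)| a * b).sum();
    let g = (label - sigmoid(score)) * lr; // the gradient scale shown above

    for (vi, vo) in w_in.iter_mut().zip(w_out.iter_mut()) {
        let (i0, o0) = (*vi, *vo);
        *vi += g * o0; // push input vector toward/away from output vector
        *vo += g * i0; // and vice versa
    }
}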

Byte Pair Encoding (BPE)

graph TD
    A["Input: 'lower'"] --> B["Characters: l o w e r"]
    B --> C["Count pairs: (l,o)=1, (o,w)=1, ..."]
    C --> D["Merge most frequent"]
    D --> E["New token: 'lo'"]
    E --> F["Repeat until vocab_size"]
    F --> G["Output: Vocabulary + Merge Rules"]
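
One iteration of that loop can be sketched in a few lines: count adjacent ID pairs, pick the most frequent, and rewrite the sequence with a fresh token ID. This illustrates the technique only; wvec's real trainer also respects pre-token boundaries.

use std::collections::HashMap;

// One BPE training step over a token-ID sequence. Returns the rewritten
// sequence and the merged pair, or None if no pair exists.
fn merge_step(seq: &[u32], next_id: u32) -> Option<(Vec<u32>, (u32, u32))> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in seq.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    let (&best, _) = counts.iter().max_by_key(|(_, &c)| c)?;

    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == best {
            out.push(next_id); // replace the pair with its merged token
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    Some((out, best))
}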

Hogwild Parallel Training

Multiple threads update shared embedding matrices without locks:

sequenceDiagram
    participant T1 as Thread 1
    participant T2 as Thread 2
    participant T3 as Thread 3
    participant M as Shared Embeddings

    par Parallel Updates
        T1->>M: Update word 42
        T2->>M: Update word 17
        T3->>M: Update word 89
    end

    Note over M: Sparse updates rarely collide<br/>Convergence still guaranteed
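
The same idea can be illustrated in Rust with relaxed atomics standing in for OpenMP's unsynchronized array access; this is purely illustrative, since wvec's Hogwild loop lives in Fortran. Note the read-modify-write race in hogwild_add: a concurrent update can be lost, which Hogwild accepts because gradient updates are sparse.

use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// f32 values stored as atomic bit patterns so threads can update a
// shared embedding table without locks.
fn hogwild_add(slot: &AtomicU32, delta: f32) {
    // deliberately racy: load and store are separate operations
    let old = f32::from_bits(slot.load(Ordering::Relaxed));
    slot.store((old + delta).to_bits(), Ordering::Relaxed);
}

fn main() {
    let embeddings: Vec<AtomicU32> =
        (0..1024).map(|_| AtomicU32::new(0f32.to_bits())).collect();
    thread::scope(|s| {
        for t in 0..4usize {
            let emb = &embeddings;
            s.spawn(move || {
                // sparse updates: threads mostly touch different indices
                hogwild_add(&emb[(t * 37) % emb.len()], 0.01);
            });
        }
    });
}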

File Formats

BPE Vocabulary (.bin)

+------------------------------------------------+
| HEADER                                         |
+------------+-------+---------------------------+
| magic      | 4B    | "BPE\0"                   |
| version    | u32   | Format version (1)        |
| vocab_size | u32   | Number of tokens          |
| pairs_count| u32   | Number of merge rules     |
+------------+-------+---------------------------+
| TOKENS (repeated vocab_size times)             |
+------------+-------+---------------------------+
| len        | u32   | Token byte length         |
| bytes      | [u8]  | UTF-8 encoded token       |
+------------+-------+---------------------------+
| MERGE PAIRS (repeated pairs_count times)       |
+------------+-------+---------------------------+
| left       | u32   | Left token ID             |
| right      | u32   | Right token ID            |
| merged_id  | u32   | Resulting token ID        |
+------------+-------+---------------------------+
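
A reader or writer for this layout needs nothing beyond std I/O. The sketch below serializes the header and one token record; little-endian byte order is an assumption here, since the table above does not pin it down.

use std::io::{self, Write};

// Serialize the header fields in the order given by the layout table.
fn write_header<W: Write>(w: &mut W, vocab_size: u32, pairs_count: u32) -> io::Result<()> {
    w.write_all(b"BPE\0")?;            // magic
    w.write_all(&1u32.to_le_bytes())?; // version
    w.write_all(&vocab_size.to_le_bytes())?;
    w.write_all(&pairs_count.to_le_bytes())
}

// One TOKENS record: u32 length prefix, then the UTF-8 bytes.
fn write_token<W: Write>(w: &mut W, token: &str) -> io::Result<()> {
    let bytes = token.as_bytes();
    w.write_all(&(bytes.len() as u32).to_le_bytes())?;
    w.write_all(bytes)
}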

Model Checkpoint (.bin)

+------------------------------------------------+
| HEADER (24 bytes)                              |
+------------+-------+---------------------------+
| magic      | 4B    | "WVCK"                    |
| version    | i32   | Format version (1)        |
| vocab_size | i32   | Vocabulary size           |
| dim        | i32   | Embedding dimension       |
| epoch      | i32   | Training epoch            |
| lr         | f32   | Learning rate             |
+------------+-------+---------------------------+
| EMBEDDINGS                                     |
+------------+-----------------------------------+
| W_in       | f32[dim * vocab_size], col-major  |
| W_out      | f32[dim * vocab_size], col-major  |
+------------+-----------------------------------+
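
Parsing the 24-byte header is similarly mechanical. A minimal sketch, again assuming little-endian fields:

use std::io::{self, Read};

struct CkptHeader { version: i32, vocab_size: i32, dim: i32, epoch: i32, lr: f32 }

// Read and validate the fixed-size header described in the table above.
fn read_header<R: Read>(r: &mut R) -> io::Result<CkptHeader> {
    let mut buf = [0u8; 24];
    r.read_exact(&mut buf)?;
    if &buf[0..4] != b"WVCK" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let int = |o: usize| i32::from_le_bytes(buf[o..o + 4].try_into().unwrap());
    Ok(CkptHeader {
        version: int(4),
        vocab_size: int(8),
        dim: int(12),
        epoch: int(16),
        lr: f32::from_le_bytes(buf[20..24].try_into().unwrap()),
    })
}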

Performance

Parallel Scaling

| Threads | Words/sec | Speedup |
|---------|-----------|---------|
| 1       | ~50K      | 1.0x    |
| 4       | ~180K     | 3.6x    |
| 8       | ~320K     | 6.4x    |
| 16      | ~500K     | 10.0x   |

Tested on an Intel Core Ultra 9 285H, 16 cores.

Memory Footprint

| Component                                           | Estimate |
|-----------------------------------------------------|----------|
| Embeddings (500K vocab × 256 dim × 2 matrices, f32) | ~1 GB    |
| Negative sampling table                             | ~4 MB    |
| BPE vocabulary                                      | ~50 MB   |
| Total                                               | ~1.1 GB  |

Project Structure

wvec/
├── src/
│   ├── bpe/           # Byte Pair Encoding tokenizer
│   │   ├── train.rs   # BPE training algorithm
│   │   ├── encode.rs  # Text -> token IDs
│   │   ├── decode.rs  # Token IDs -> text
│   │   ├── vocab.rs   # Vocabulary data structure
│   │   └── io.rs      # Binary serialization
│   ├── input/         # File format readers
│   │   ├── text.rs    # Plain text streaming
│   │   ├── xml.rs     # Wikipedia XML parser
│   │   └── html.rs    # HTML content extractor
│   ├── text/          # Text processing
│   │   ├── normalize.rs    # Unicode normalization
│   │   └── pretokenize.rs  # Language-aware splitting
│   ├── cmd/           # CLI command implementations
│   ├── ffi.rs         # Fortran FFI bindings
│   ├── cli.rs         # Argument parsing
│   └── main.rs        # Entry point
├── fortran/
│   ├── wvec_model.f90      # Embedding matrices
│   ├── wvec_train.f90      # Skip-gram + neg. sampling
│   ├── wvec_blas.f90       # BLAS interfaces
│   ├── wvec_checkpoint.f90 # Save/load state
│   ├── wvec_thermal.f90    # CPU temperature monitor
│   └── Makefile
├── build.rs           # Fortran compilation integration
└── Cargo.toml

Technical Highlights

FFI Design

The Rust-Fortran bridge uses the C ABI with careful type mapping:

// Rust side
extern "C" {
    fn wvec_train_corpus(
        token_ids: *const c_int,
        n_tokens: c_int,
        window: c_int,
        n_neg: c_int,
        neg_table: *const c_int,
        neg_table_size: c_int,
        lr: c_float,
    ) -> c_int;
}
! Fortran side
function wvec_train_corpus(...) result(status) bind(C, name="wvec_train_corpus")
    integer(c_int), intent(in), value :: n_tokens, window, n_neg
    integer(c_int), intent(in) :: token_ids(n_tokens)
    ! ... OpenMP parallel training
end function
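
A thin safe wrapper typically sits on top of such an extern block, keeping the unsafe surface small. The sketch below assumes the declaration above and treats a zero status as success, which is an assumption rather than documented behavior; the wrapper name is hypothetical.

// Hypothetical safe wrapper: on mainstream platforms c_int == i32 and
// c_float == f32, so plain slices can be passed through directly.
fn train_corpus(token_ids: &[i32], window: i32, n_neg: i32,
                neg_table: &[i32], lr: f32) -> Result<(), i32> {
    let status = unsafe {
        wvec_train_corpus(
            token_ids.as_ptr(),
            token_ids.len() as i32,
            window,
            n_neg,
            neg_table.as_ptr(),
            neg_table.len() as i32,
            lr,
        )
    };
    if status == 0 { Ok(()) } else { Err(status) } // 0 == success assumed
}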

Singleton Pattern for Fortran State

Fortran module variables with the save attribute provide global state that remains compatible with the C ABI:

module wvec_model
    use iso_c_binding, only: c_float
    implicit none

    real(c_float), allocatable, save :: g_w_in(:,:)   ! Input embeddings
    real(c_float), allocatable, save :: g_w_out(:,:)  ! Output embeddings
    logical, save :: g_initialized = .false.
end module

BLAS Integration

Direct calls to optimized BLAS routines for vector operations:

! Dot product: score = w_in . w_out
score = sdot(dim, g_w_in(1, center), 1, g_w_out(1, context), 1)

! Vector update: w_out += g * w_in
call saxpy(dim, g, g_w_in(1, center), 1, g_w_out(1, context), 1)

References

  1. Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space
  2. Mikolov, T., et al. (2013). Distributed Representations of Words and Phrases
  3. Sennrich, R., et al. (2016). Neural Machine Translation of Rare Words with Subword Units

License

MIT License - see LICENSE for details.


"Two languages, one purpose: turning words into geometry."
