abyesilyurt/minilm.c


minilm.c

Dependency-free MiniLM embeddings in C. In ~2000 lines of code you can load model weights, tokenize text, and get embeddings.

This toy project re-implements a distilled BERT (MiniLM) with:

  • A custom tensor library
  • A .tbf tensor file format
  • A WordPiece tokenizer

Quick background:

  • BERT: Bidirectional Encoder Representations from Transformers
  • Transformer: Neural net architecture built on attention
  • Attention: Figures out which words matter most in context

Quickstart

make run

The following files are expected at runtime (included in the assets directory):

bert_weights.tbf   # model weights in custom TBF format
vocab.txt          # tokenizer vocabulary

Example

#include <stdio.h>
#include <string.h>

#include "minilm.h"
#include "s8.h"
#include "nn.h"

int main(void) {
    const char *question = "what's the capital of germany?";
    minilm_t m;
    minilm_create(&m, "bert_weights.tbf", "vocab.txt");

    // candidates
    da_s8 choices = {0};
    da_s8_append(&choices, m_s8("paris"));
    da_s8_append(&choices, m_s8("london"));
    da_s8_append(&choices, m_s8("berlin"));
    da_s8_append(&choices, m_s8("madrid"));
    da_s8_append(&choices, m_s8("rome"));

    // embed
    da_tensor_t vecs = {0};
    for (size_t i = 0; i < choices.len; i++) {
        tensor_t v;
        minilm_embed(m, (char*)choices.data[i].data, choices.data[i].len, &v);
        da_tensor_t_append(&vecs, v);
    }
    tensor_t q; minilm_embed(m, (char*)question, strlen(question), &q);

    // nearest neighbor (L2)
    size_t best = nearest_index(vecs, q);
    printf("query : %s\nanswer: %s\n", question, choices.data[best].data);

    minilm_destroy(&m);
    return 0;
}

API (at a glance)

// lifecycle
int      minilm_create(minilm_t *m, const char *tbf_path, const char *vocab_txt_path);
void     minilm_destroy(minilm_t *m);

// inference
t_status minilm_embed(minilm_t m, char *str, size_t str_len, tensor_t *out);

// pipeline stages
t_status minilm_tokenize(minilm_t m, s8 str, da_u32 *ids);
t_status minilm_encode(minilm_t m, da_u32 ids, tensor_t *out);

  • minilm_embed = tokenize → encode → return embedding tensor_t.
  • Embedding size and architecture match MiniLM (hidden size 384, 6 layers), as reflected in the structs.

Model & Formats

  • Weights: expected in .tbf format named bert_weights.tbf. See scripts/dump_tbf1.py for an example.
  • Vocab: vocab.txt (one token per line, BERT-style).
