abyesilyurt/minilm.c


minilm.c

Dependency-free MiniLM embeddings in C. In ~2000 lines of code you can load model weights, tokenize text, and get embeddings.

This toy project re-implements a distilled BERT (MiniLM) with:

  • A custom tensor library
  • A .tbf tensor file format
  • A WordPiece tokenizer

Quick background:

  • BERT: Bidirectional Encoder Representations from Transformers
  • Transformer: Neural net architecture built on attention
  • Attention: Figures out which words matter most in context

Quickstart

make run

The following files are expected at runtime (included in the assets directory):

bert_weights.tbf   # model weights in custom TBF format
vocab.txt          # tokenizer vocabulary

Example

#include <stdio.h>
#include <string.h>

#include "minilm.h"
#include "s8.h"
#include "nn.h"

int main(void) {
    const char *question = "what's the capital of germany?";
    minilm_t m;
    minilm_create(&m, "bert_weights.tbf", "vocab.txt");

    // candidates
    da_s8 choices = {0};
    da_s8_append(&choices, m_s8("paris"));
    da_s8_append(&choices, m_s8("london"));
    da_s8_append(&choices, m_s8("berlin"));
    da_s8_append(&choices, m_s8("madrid"));
    da_s8_append(&choices, m_s8("rome"));

    // embed
    da_tensor_t vecs = {0};
    for (size_t i = 0; i < choices.len; i++) {
        tensor_t v;
        minilm_embed(m, (char*)choices.data[i].data, choices.data[i].len, &v);
        da_tensor_t_append(&vecs, v);
    }
    tensor_t q; minilm_embed(m, (char*)question, strlen(question), &q);

    // nearest neighbor (L2)
    size_t best = nearest_index(vecs, q);
    printf("query : %s\nanswer: %s\n", question, choices.data[best].data);

    minilm_destroy(&m);
    return 0;
}

API (at a glance)

// lifecycle
int      minilm_create(minilm_t *m, const char *tbf_path, const char *vocab_txt_path);
void     minilm_destroy(minilm_t *m);

// inference
t_status minilm_embed(minilm_t m, char *str, size_t str_len, tensor_t *out);

// pipeline stages
t_status minilm_tokenize(minilm_t m, s8 str, da_u32 *ids);
t_status minilm_encode(minilm_t m, da_u32 ids, tensor_t *out);

  • minilm_embed = tokenize → encode → return embedding tensor_t.
  • Embedding size and architecture match MiniLM (hidden size 384, 6 layers), as reflected in the structs.

Model & Formats

  • Weights: expected in .tbf format named bert_weights.tbf. See scripts/dump_tbf1.py for an example.
  • Vocab: vocab.txt (one token per line, BERT-style).
