A fast runtime lemmatizer for Indonesian using sentence embeddings and FAISS for similarity search.

gbyuvd/IndoFastLemma

IndoFastLemma

A fast runtime lemmatizer for Indonesian using sentence embeddings and FAISS for similarity search.

Features

  • Fast lemmatization via sentence embeddings (Sentence-BERT) + FAISS
  • FAISS-powered cosine similarity lookup
  • Trained on UD Indonesian treebanks (CSUI + GSD)
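The lookup idea behind these features can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: character-trigram vectors stand in for the Sentence-BERT embeddings, a brute-force scan stands in for the FAISS index, and the dictionary entries, the `embed` and `lookup_lemma` helpers, and the `min_similarity` threshold are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy dictionary of inflected form -> lemma (not the project's data).
INFLECTION_TO_LEMMA = {
    "bermain": "main",
    "membeli": "beli",
    "membaca": "baca",
    "makanan": "makan",
}


def embed(word: str) -> dict[str, float]:
    """Toy embedding: L2-normalized character-trigram counts.

    This is only a stand-in for a Sentence-BERT embedding.
    """
    padded = f"##{word.lower()}##"
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """For unit-length vectors, the dot product is the cosine similarity."""
    return sum(value * b.get(gram, 0.0) for gram, value in a.items())


# Precomputed "index": one unit vector per known inflected form.
# A vector index such as FAISS would replace this brute-force list.
INDEX = [(form, embed(form)) for form in INFLECTION_TO_LEMMA]


def lookup_lemma(token: str, min_similarity: float = 0.5) -> tuple[str, float]:
    """Return (lemma, similarity) for the most similar known form.

    Falls back to the token itself when nothing is similar enough.
    """
    query = embed(token)
    best_form, best_score = max(
        ((form, cosine(query, vec)) for form, vec in INDEX),
        key=lambda pair: pair[1],
    )
    if best_score < min_similarity:
        return token, best_score
    return INFLECTION_TO_LEMMA[best_form], best_score


if __name__ == "__main__":
    for token in ["membeli", "Bermain", "membelinya", "xyz"]:
        print(token, "->", lookup_lemma(token))
```

Normalizing the vectors up front means the similarity search only needs dot products, which is the usual way to get cosine similarity out of an inner-product index.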

Performance

  • Accuracy: 83.3% (16,544/19,858 tokens on id_pud-ud-test.conllu)
  • Speed: ~639 tokens/sec
  • Error breakdown: 72% morphological, 28% casing
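The figures above can be cross-checked with a few lines of arithmetic. The error counts per category are approximations derived from the rounded percentages, and the runtime estimate assumes the reported throughput applies to the same test set, which the source does not state.

```python
# Reported figures from the Performance section.
correct, total = 16_544, 19_858
tokens_per_second = 639

accuracy = correct / total
errors = total - correct

print(f"accuracy: {accuracy:.1%}")   # 83.3%
print(f"errors:   {errors}")         # 3314

# Approximate error counts, derived from the rounded 72% / 28% split.
morphological = round(0.72 * errors)
casing = round(0.28 * errors)
print(f"morphological: ~{morphological}, casing: ~{casing}")

# Assumption: throughput was measured on a comparable workload.
estimated_seconds = total / tokens_per_second
print(f"estimated time for the test set: ~{estimated_seconds:.1f} s")
```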

Usage
Please see demo.ipynb for a tutorial on building the dictionary and embeddings and on using them. Also extract dictionaryandindex.7z, which contains the prebuilt index and dictionary files listed under Requirements.

# Run this after the IndonesianLemmatizer class definition (see demo.ipynb).

lemmatizer = IndonesianLemmatizer()

test_sentences = [
    "Saya sedang belajar bahasa Indonesia",
    "Anak-anak sedang bermain di taman",
    "Mereka pergi ke pasar untuk membeli buah-buahan"
]

print("\nTesting the lemmatizer:")
for sentence in test_sentences:
    result = lemmatizer.lemmatize(sentence)
    print(f"Input: {sentence}")
    print(f"Output: {result}")
    print()

Requirements

  • faiss-cpu, sentence-transformers, tqdm, numpy
  • Prebuilt index/dict files (id_inflection.idx, id_inflection_dict.json.gz)
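A quick pre-flight check for these files might look like the sketch below. The helper names (`missing_files`, `load_gzipped_json`, `load_faiss_index`) are hypothetical, the layout of the JSON dictionary is not documented here so it is loaded without assuming a schema, and the files are assumed to sit in the working directory after extracting the archive.

```python
import gzip
import json
from pathlib import Path

# Install the packages first, e.g.:
#   pip install faiss-cpu sentence-transformers tqdm numpy

# File names from the Requirements list.
REQUIRED_FILES = ("id_inflection.idx", "id_inflection_dict.json.gz")


def missing_files(directory: str = ".") -> list[str]:
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def load_gzipped_json(path):
    """Load a gzip-compressed JSON file; the schema is left to the caller."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def load_faiss_index(path):
    """Read a serialized FAISS index (needs the faiss-cpu package)."""
    import faiss  # imported lazily so the other helpers work without it
    return faiss.read_index(str(path))


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        inflections = load_gzipped_json("id_inflection_dict.json.gz")
        print(f"Loaded dictionary with {len(inflections)} top-level entries")
```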

