Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Updated Aug 7, 2024 - Python
Fast bare-bones BPE for modern tokenizer training
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Simple-to-use scoring function for arbitrarily tokenized texts.
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Byte-Pair Encoding (BPE) subword tokenization algorithm implementations from scratch in Python
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [arXiv 2025]
This repository provides a clear, educational implementation of Byte Pair Encoding (BPE) tokenization in plain Python. The focus is on algorithmic understanding, not raw performance.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Byte Pair Encoding (BPE)
Natural Language EnCoder-Decoder: word, char, BPE, etc.
A lightweight, from-scratch implementation of Byte Pair Encoding (BPE) tokenization in Python.