BytePairEncoding.jl

Pure Julia implementation of the Byte Pair Encoding (BPE) method. Support openai-gpt2 byte-level bpe and openai tiktoken. BytePairEncoding.jl rely on TextEncodeBase.jl and support different tokenization method.

julia> using BytePairEncoding

julia> tkr = BytePairEncoding.load_tiktoken("cl100k_base")
BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns))

julia> tkr("hello world aaaaaaaaaaaa")
5-element Vector{String}:
 "hello"
 " world"
 " a"
 "aaaaaaaa"
 "aaa"

julia> tkr2 = BytePairEncoding.load_gpt2()
BPETokenizer(MatchTokenization(CodeNormalizer(BPETokenization(GPT2Tokenization, bpe = BPE(50000 merges)), codemap = CodeMap{UInt8 => UInt16}(3 code-ranges)), 1 patterns))

julia> tkr2("hello world aaaaaaaaaaaa")
6-element Vector{String}:
 "hello"
 "Ġworld"
 "Ġa"
 "aaaa"
 "aaaa"
 "aaa"

julia> enc = BytePairEncoding.load_tiktoken_encoder("cl100k_base")
┌ Warning: The maximum encoded value (`length(BPEEncoder.vocab)`) is larger than the number of possible tokens
│ because there are some "gaps" in the vocabulary. Be carefull if used to initialize embedding table.
└ @ BytePairEncoding
BPEEncoder(BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns)), Vocab(size = 100277))

julia> enc.encode("hello world aaaaaaaaaaaa") # === enc(...)
5-element Vector{Int64}:
 15340
  1918
   265
 70541
 33747

julia> enc.decode(enc("hello world aaaaaaaaaaaa"))
"hello world aaaaaaaaaaaa"

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
.github/workflows		.github/workflows
docs		docs
src		src
test		test
.appveyor.yml		.appveyor.yml
Artifacts.toml		Artifacts.toml
LICENSE		LICENSE
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BytePairEncoding.jl

About

Releases 11

Packages

Languages

License

chengchingwen/BytePairEncoding.jl

Folders and files

Latest commit

History

Repository files navigation

BytePairEncoding.jl

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 11

Packages 0

Languages

Packages