`sqlite3-ngram`

ngram is an n-gram tokenizer for SQLite3 FTS5. It tokenizes input text into n-grams in the computational-linguistics sense.

For the input text Hello 新世界:

ngram = 1

Hello, 新, 世, 界

ngram = 2

Hello, 新, 新世, 世界

ngram = 3

Hello, 新, 新世, 新世界

Tokenization is based on UTF-8 characters and on character-category boundaries.

The supported n-gram size is currently in the range [1, 4]. Larger sizes could be supported, but that is usually unnecessary.
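The token lists above can be reproduced with a short Python sketch. It is not the extension's actual implementation: the emission rule (each token ends at one character of a run and reaches back at most N characters) is inferred from the three examples, and the category rule (ASCII letters and digits form whole words, ASCII whitespace and punctuation are dropped, everything else is n-grammed) is an assumption made for illustration. The names `char_category` and `ngram_tokens` are hypothetical.

```python
import itertools


def char_category(ch: str) -> str:
    """Classify a character (assumed rule, for illustration only)."""
    if ch.isspace():
        return "space"
    if ch.isascii() and ch.isalnum():
        return "word"
    if ch.isascii():
        return "punct"
    return "other"


def ngram_tokens(text: str, n: int = 2) -> list[str]:
    """Split text at category boundaries and n-gram the non-ASCII runs."""
    if not 1 <= n <= 4:
        raise ValueError("n must be in the range [1, 4]")
    tokens = []
    for category, group in itertools.groupby(text, key=char_category):
        run = "".join(group)
        if category == "word":
            # An ASCII word is kept as a single token.
            tokens.append(run)
        elif category == "other":
            # One token per end position, reaching back at most n characters.
            for end in range(1, len(run) + 1):
                tokens.append(run[max(0, end - n):end])
    return tokens


for size in (1, 2, 3):
    print(size, ngram_tokens("Hello 新世界", size))
```

For `Hello 新世界` this prints the same token lists as the examples for N = 1, 2 and 3. Behaviour on other inputs (non-ASCII punctuation, mixed scripts) may differ from the real tokenizer.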

This tokenizer extension can be used as a generic fallback tokenizer for full-text search (FTS).

Build

# Tested under Podman; Docker should also work.
container/build.sh

Usage

-- First load the ngram extension
.load build/libngram.so
-- By default N = 2; valid N is in the range [1, 4]
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram');
-- Or set N explicitly (replace N with a number from 1 to 4)
CREATE VIRTUAL TABLE t2 USING fts5(x, tokenize = 'ngram gram N');

-- Or check sql/load-ext.sql for example usage
-- sqlite3 < sql/load-ext.sql
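
The same steps can be driven from Python's standard `sqlite3` module. The sketch below is a convenience wrapper, not part of this project: the helper names `ngram_table_sql` and `open_with_ngram` are hypothetical, and the default path `build/libngram.so` is taken from the `.load` line above. It also assumes a Python build whose `sqlite3` module supports extension loading, which is not the case everywhere.

```python
import sqlite3


def ngram_table_sql(table: str, column: str, n: int = 2) -> str:
    """Build the CREATE VIRTUAL TABLE statement for an ngram-tokenized table."""
    if not 1 <= n <= 4:
        raise ValueError("n must be in the range [1, 4]")
    # Identifiers are interpolated directly, so only pass trusted names.
    return (
        f"CREATE VIRTUAL TABLE {table} "
        f"USING fts5({column}, tokenize = 'ngram gram {n}')"
    )


def open_with_ngram(db_path: str = ":memory:",
                    ext_path: str = "build/libngram.so") -> sqlite3.Connection:
    """Open a connection and load the ngram extension (path is an assumption)."""
    conn = sqlite3.connect(db_path)
    conn.enable_load_extension(True)
    try:
        conn.load_extension(ext_path)
    finally:
        # Switch extension loading off again once the tokenizer is registered.
        conn.enable_load_extension(False)
    return conn
```

After building the extension, `open_with_ngram().execute(ngram_table_sql("t1", "x", 3))` would create a table equivalent to the `ngram gram N` statement above with N = 3.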

Advanced usage

You can combine this tokenizer with SQLite's built-in porter tokenizer:

.load build/libngram.so
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter ngram gram N');

In this case, if the word direct has been tokenized, then directed, directing, direction, directly... can all be coalesced into direct and therefore match.
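
The stemming effect can be tried without building the extension. The sketch below uses Python's standard `sqlite3` module and substitutes SQLite's built-in `unicode61` tokenizer for `ngram` as the tokenizer wrapped by `porter`, so it illustrates the porter wrapper only, not this project's n-gram output. It assumes the linked SQLite library was compiled with FTS5; `porter_matches` is a hypothetical helper name.

```python
import sqlite3


def porter_matches(query: str, words: list[str]) -> list[str]:
    """Index words with a porter-wrapped tokenizer and return those matching query."""
    conn = sqlite3.connect(":memory:")
    # 'unicode61' stands in for 'ngram gram N' so this runs without the extension.
    conn.execute(
        "CREATE VIRTUAL TABLE docs USING fts5(x, tokenize = 'porter unicode61')"
    )
    conn.executemany("INSERT INTO docs(x) VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT x FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(porter_matches("direct", ["directed", "directing", "direction", "unrelated"]))
```

With the extension loaded, replacing `porter unicode61` with `porter ngram gram N` should give the combination described above.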

Limitation

Currently only UTF-8 strings are supported for tokenization, although this is usually not a big concern.

Credits

This project was inspired by the following project:

wangfenjin/simple - a SQLite FTS5 extension that supports Chinese (Simplified and Traditional) and Pinyin

TODO

Implement ngram_highlight() function
Add more test cases
Enable build & test CI
