Database#createFTS5Tokenizer API #944

Open. Wants to merge 1 commit into base: master.

indutny-signal

FTS5's built-in tokenizers don't support word segmentation for CJK text, or for non-Latin locales in general. The easiest way to add that support is to use the Intl global object available in V8 to segment the UTF-8 string into words with ICU. This pull request adds an API that lets you plug Intl.Segmenter into FTS5 as a custom tokenizer, or alternatively implement your own tokenizer from scratch.
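To make the segmentation step concrete, here is a minimal sketch of splitting a string into word tokens with Intl.Segmenter. It is not the code from this PR and does not call the new API: segmentWords, the shape of the returned objects, and the lowercasing step are all assumptions made for illustration. Offsets are computed in UTF-8 bytes because FTS5 tokenizers report token positions as byte offsets into the input.

```javascript
// Hypothetical helper (not part of this PR): split text into word tokens
// and record each token's UTF-8 byte range.
function segmentWords(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const encoder = new TextEncoder();
  const tokens = [];
  let byteOffset = 0;

  // Segments are contiguous and cover the whole input, so a running byte
  // count gives the start offset of every segment.
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    const byteLength = encoder.encode(segment).length;
    // Skip whitespace and punctuation; keep only word-like segments.
    if (isWordLike) {
      tokens.push({
        token: segment.toLowerCase(), // assumed normalization step
        start: byteOffset,
        end: byteOffset + byteLength,
      });
    }
    byteOffset += byteLength;
  }
  return tokens;
}

// Usage: pass a locale that matches the text, e.g. 'ja' or 'zh' for CJK.
console.log(segmentWords('Hello, world'));
console.log(segmentWords('我喜欢猫', 'zh'));
```

How CJK text is split depends on the ICU dictionary data bundled with the runtime, so the exact tokens can differ between Node.js versions.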

valstu commented Sep 4, 2024

This would be a great addition. With this, one could easily implement something like a Snowball stemmer for FTS5 👍

2 participants