Better default minisearch tokenizer for Chinese documents #4049

Open
4 tasks done
leyen-me opened this issue Jul 16, 2024 · 6 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed), stale

Comments

@leyen-me

Describe the bug

Search does not find any content.

Reproduction

Chinese content

Expected behavior

image

System Info

no

Additional context

No response

Validations

@leyen-me leyen-me added the bug: pending triage Maybe a bug, waiting for confirmation label Jul 16, 2024
@brc-dd
Member

brc-dd commented Jul 16, 2024

Have you tried running vitepress build and then vitepress preview?

@leyen-me
Author

Have you tried running vitepress build and then vitepress preview?

I have tried building. You can check my production site: https://web.leyen.me

@brc-dd
Member

brc-dd commented Jul 16, 2024

lucaong/minisearch#201 (comment) -- This comment kind of works, but still needs improvement I guess.

There are also some other people doing this - https://github.com/search?q=vitepress+segmenter+language:JavaScript+OR+language:TypeScript+NOT+is:fork&type=code

Not sure but bm25 parameters might help too - https://github.com/search?q=vitepress+searchOptions+bm25+language:JavaScript+OR+language:TypeScript+NOT+is:fork&type=code (I haven't checked how they work yet.)
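For readers landing here, the segmenter approach mentioned above can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: tokenizeCjk and the search constant are hypothetical names, the "zh-CN" locale is an assumption, and it presumes a runtime that ships Intl.Segmenter. The object mirrors the shape VitePress's local search provider accepts under themeConfig.search, where miniSearch.options is passed through to MiniSearch.

```typescript
// Hypothetical helper (not part of VitePress or MiniSearch):
// split text into terms using the platform's word segmenter.
const segmenter = new Intl.Segmenter("zh-CN", { granularity: "word" });

function tokenizeCjk(text: string): string[] {
  const terms: string[] = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Skip whitespace and punctuation; keep only word-like segments.
    if (isWordLike) terms.push(segment.toLowerCase());
  }
  return terms;
}

// Assumed placement: this object would go under themeConfig.search
// in .vitepress/config.ts. MiniSearch's `tokenize` option is used for
// both indexing and querying unless searchOptions overrides it.
const search = {
  provider: "local" as const,
  options: {
    miniSearch: {
      options: { tokenize: tokenizeCjk },
    },
  },
};
```

How Chinese text is split depends on the dictionary bundled with the runtime, so results may differ between browsers and Node versions.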

@brc-dd brc-dd added bug Something isn't working help wanted Extra attention is needed and removed bug: pending triage Maybe a bug, waiting for confirmation labels Jul 16, 2024
@brc-dd
Member

brc-dd commented Jul 16, 2024

I'm keeping this open. There should be some sensible defaults here instead of requiring Chinese users to configure it manually.

@brc-dd brc-dd reopened this Jul 16, 2024
@brc-dd brc-dd changed the title Not searchable Better default minisearch tokenizer for Chinese documents Jul 16, 2024
@brc-dd brc-dd added enhancement New feature or request and removed bug Something isn't working labels Jul 16, 2024
@niansi-z
Contributor

niansi-z commented Jul 17, 2024

I tried it out and found that the problem also occurs when a page has no title, and not only with Chinese content. (demo)

@brc-dd
Member

brc-dd commented Jul 17, 2024

With the current logic, titles are needed: there should be an h1 on every page. There is an open PR to make the handling of such content more robust; we will see.

@github-actions github-actions bot added the stale label Sep 1, 2024
@brc-dd brc-dd removed the stale label Sep 1, 2024
@github-actions github-actions bot added the stale label Oct 12, 2024
3 participants