server : merge split UTF-8 token text in verbose JSON #3850

Merged
danbev merged 1 commit into
ggml-org:master from
lyonsno:dragnet/server-utf8-words-3821
Jun 2, 2026

@lyonsno lyonsno commented May 31, 2026

Contributor

Fixes #3821.

This applies the UTF-8 token-boundary merge used by the CLI JSON path to the server `verbose_json` word output. The UTF-8 tail detector now lives in `examples/common-whisper`, so both CLI and server use the same helper.

When a multi-byte character is split across adjacent tokens, the server now emits one merged `segments[].words[]` entry instead of separate invalid UTF-8 fragments. `segments[].tokens` still preserves the raw non-EOT token ids, so a merged word can correspond to more than one token id.
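To illustrate the idea, here is a minimal sketch of a tail detector and a merge loop. It is not the code from this PR: the names `utf8_missing_tail_bytes` and `merge_split_utf8` are hypothetical, and the helper in `examples/common-whisper` may differ in both name and behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns how many bytes are still missing from an
// incomplete multi-byte UTF-8 sequence at the end of `s` (0 if the tail is complete).
size_t utf8_missing_tail_bytes(const std::string & s) {
    size_t i    = s.size();
    size_t cont = 0;
    // Walk back over at most 3 continuation bytes (10xxxxxx) to reach the lead byte.
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return 0; // no lead byte in this string, nothing to wait for
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = 0; // continuation bytes announced by the lead byte
    if      ((lead & 0xE0) == 0xC0) need = 1;
    else if ((lead & 0xF0) == 0xE0) need = 2;
    else if ((lead & 0xF8) == 0xF0) need = 3;
    else return 0; // ASCII or not a valid lead byte
    return cont < need ? need - cont : 0;
}

// Hypothetical merge loop: keeps appending following token texts while the
// accumulated text ends in an incomplete UTF-8 sequence.
std::vector<std::string> merge_split_utf8(const std::vector<std::string> & tokens) {
    std::vector<std::string> words;
    for (size_t i = 0; i < tokens.size(); ++i) {
        std::string text = tokens[i];
        while (utf8_missing_tail_bytes(text) > 0 && i + 1 < tokens.size()) {
            text += tokens[++i];
        }
        words.push_back(text);
    }
    return words;
}
```

With this sketch, four token texts that split the two three-byte characters of " 日本" across token boundaries, followed by "!", collapse into two word strings, while the raw token list keeps all four entries.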

For merged words, metadata follows the existing CLI lead-token convention: `start`, `probability`, and `t_dtw` come from the lead token, while `end` advances to the last consumed token with a valid end time.
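The lead-token convention can be sketched as follows. This is an illustration under stated assumptions, not the server code: `token_info`, `word_entry`, and `merge_tokens_into_word` are hypothetical names, and treating a negative `t1` as "no valid end time" is an assumption made only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-token record (field names chosen for this sketch).
struct token_info {
    std::string text;
    int64_t     t0;    // start time
    int64_t     t1;    // end time; negative means "not valid" (assumption)
    float       p;     // token probability
    int64_t     t_dtw; // DTW timestamp
};

// Hypothetical merged word entry, mirroring the fields named in the description.
struct word_entry {
    std::string word;
    int64_t     start;
    int64_t     end;
    float       probability;
    int64_t     t_dtw;
};

// Merge `count` consecutive tokens starting at `first` into one word.
// Precondition: first < tokens.size().
word_entry merge_tokens_into_word(const std::vector<token_info> & tokens, size_t first, size_t count) {
    const token_info & lead = tokens[first];
    // start, probability and t_dtw are taken from the lead token only.
    word_entry w { lead.text, lead.t0, lead.t1, lead.p, lead.t_dtw };
    for (size_t i = first + 1; i < first + count && i < tokens.size(); ++i) {
        w.word += tokens[i].text;
        if (tokens[i].t1 >= 0) {
            w.end = tokens[i].t1; // advance end only when the token has a valid end time
        }
    }
    return w;
}
```

One consequence of this convention is that the probability of a merged word reflects only its first token, so it is not an aggregate over all consumed tokens.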

Verification:

  • cmake --build build-3821 --target test-common-utf8 whisper-server whisper-cli test-vad -j 4
  • ctest --test-dir build-3821 -L unit --output-on-failure
  • ctest --test-dir build-3821 -R '^test-whisper-cli-tiny\.en$' --output-on-failure
  • git diff --check

@danbev danbev merged commit e5d4412 into ggml-org:master Jun 2, 2026
61 checks passed

Development

Successfully merging this pull request may close these issues.

server: merge tokens split across UTF-8 boundaries in JSON output
