Commit a9ba4b8

Clarifying readme about Chinese space tokenization
1 parent 332a687 commit a9ba4b8

1 file changed: +2 −2 lines

multilingual.md

Lines changed: 2 additions & 2 deletions
@@ -176,8 +176,8 @@ weighted the same way as the data, so low-resource languages are upweighted by
 some factor. We intentionally do *not* use any marker to denote the input
 language (so that zero-shot training can work).
 
-Because Chinese does not have whitespace characters, we add spaces around every
-character in the
+Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace
+characters, we add spaces around every character in the
 [CJK Unicode range](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_\(Unicode_block\))
 before applying WordPiece. This means that Chinese is effectively
 character-tokenized. Note that the CJK Unicode block only includes
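The diff above describes adding spaces around every character in the CJK Unicode range before WordPiece runs, so those characters end up individually tokenized. A minimal sketch of that pre-tokenization step (the helper names and the exact set of code-point ranges below are assumptions for illustration, not BERT's actual source):

```python
def _is_cjk_char(cp):
    # Assumed coverage: CJK Unified Ideographs plus Extension A/B and
    # the Compatibility Ideographs block (an approximation of the range
    # the README links to, not an exhaustive list).
    return (0x4E00 <= cp <= 0x9FFF or
            0x3400 <= cp <= 0x4DBF or
            0x20000 <= cp <= 0x2A6DF or
            0xF900 <= cp <= 0xFAFF)

def add_cjk_spaces(text):
    # Surround each CJK ideograph with spaces; leave other text untouched.
    out = []
    for ch in text:
        if _is_cjk_char(ord(ch)):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

# Downstream whitespace tokenization then yields one token per ideograph:
print(add_cjk_spaces("BERT模型").split())
```

Note that Hangul and kana fall outside these blocks, so Korean and Japanese text written in those scripts is not split this way, which is consistent with the README's point that only characters in the CJK Unicode range are affected.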
