I tried to use it with a SentencePiece-based model (Japanese ALBERT), but it handles the input sentence as a single word.
I think it would be good to use model._model.tokenizer.tokenize instead of str.split.
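A minimal illustration of the problem (not LIT code): whitespace splitting leaves a Japanese sentence as a single token, so there is nothing for LIME to ablate.

```python
# str.split works for space-segmented languages but not for Japanese.
english = "this is a test"
japanese = "これはテストです"

print(english.split())   # ['this', 'is', 'a', 'test']
print(japanese.split())  # ['これはテストです']  -> the whole sentence as one "word"
```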
For LIME (and other ablation-style techniques), we want to tokenize on full words and not word pieces, which the model tokenizer might do. Is there a simple way to do word-based tokenization for non-space segmented languages?
I don't know much about other non-space-segmented languages (maybe Chinese, Thai, ...), but at least in Japanese, "word" is a somewhat ambiguous concept.
To segment text into words, you need a morphological analyser like MeCab together with dictionaries.
The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT one does not.
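A minimal word-segmentation sketch, assuming the fugashi MeCab wrapper and a dictionary such as unidic-lite are installed (`pip install fugashi unidic-lite`); the exact splits depend on the dictionary used.

```python
# MeCab-based word segmentation via the fugashi wrapper.
from fugashi import Tagger

tagger = Tagger()
words = [token.surface for token in tagger("これはテストです")]
print(words)  # e.g. ['これ', 'は', 'テスト', 'です']
```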
Currently, it seems to just use str.split, so it doesn't work with non-space-segmented languages like Japanese:

lit/lit_nlp/components/citrus/lime.py, line 85 in 3eb824b
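A hedged sketch of how a word-level tokenizer could stand in for str.split at the line referenced above: a callable that takes a sentence and returns a list of tokens. It assumes fugashi and unidic-lite as in the sketch above; how it gets wired into lime.py would depend on whether the tokenizer can be passed in or has to be patched at that line.

```python
# Drop-in replacement for str.split with the same calling convention:
# sentence in, list of word-level tokens out.
from fugashi import Tagger

_tagger = Tagger()

def japanese_word_tokenize(sentence: str) -> list:
    """Word-level tokenization for Japanese using MeCab (via fugashi)."""
    return [token.surface for token in _tagger(sentence)]
```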