LIME tokenizer for SentencePiece (or other tokenizer) #361

Open
knok opened this issue Jun 1, 2021 · 3 comments

Comments


knok commented Jun 1, 2021

Currently, it seems to just use str.split, so it doesn't work with non-space-segmented languages like Japanese.

tokenizer: Any = str.split,

I tried to use it with a SentencePiece-based model (Japanese ALBERT), but it handled the input sentence as a single word.
I think it would be good to use model._model.tokenizer.tokenize instead of str.split.
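For reference, a minimal sketch of what that swap might look like, assuming the LIME component accepts any callable mapping a string to a list of tokens (as the str.split default suggests); the model file name and function names below are hypothetical:

```python
# Sketch only: plugging a SentencePiece tokenizer in where LIME's
# default is str.split. The model file name below is hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="albert_ja.model")

def sp_tokenize(text: str):
    # Returns subword pieces (e.g. "▁東京", "都"), not whole words.
    return sp.encode(text, out_type=str)

# The idea would be to pass tokenizer=sp_tokenize instead of the
# default tokenizer=str.split.
```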


knok commented Jun 1, 2021

Unfortunately, the change didn't work well.


jameswex commented Jun 2, 2021

For LIME (and other ablation-style techniques), we want to tokenize on full words and not word pieces, which the model tokenizer might do. Is there a simple way to do word-based tokenization for non-space segmented languages?


knok commented Jun 3, 2021

I don't know much about other non-space-segmented languages (maybe Chinese, Thai, ...), but at least in Japanese, "word" is a somewhat ambiguous concept.
To segment text into words, you need a morphological analyzer such as MeCab together with its dictionaries.

The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT one does not.
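For what it's worth, here is a rough sketch of word-level segmentation with MeCab that could stand in for str.split as the LIME tokenizer callable; it assumes the mecab-python3 package plus an installed dictionary, and the exact token boundaries depend on the dictionary used:

```python
# Rough sketch: word-level Japanese tokenization with MeCab, which
# could be passed as the LIME tokenizer callable instead of str.split.
# Requires mecab-python3 and a dictionary (e.g. unidic-lite).
import MeCab

_wakati = MeCab.Tagger("-Owakati")  # wakati-gaki: space-separated words

def mecab_tokenize(text: str):
    """Split Japanese text into surface-form words."""
    return _wakati.parse(text).strip().split()

# Example (output depends on the dictionary):
# mecab_tokenize("吾輩は猫である")  ->  ['吾輩', 'は', '猫', 'で', 'ある']
```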
