subword # should be an option. #33

Open
FFengIll opened this issue Sep 14, 2023 · 6 comments

Comments

@FFengIll

Many BERT models use ## as the subword symbol, but not all of them do.
Some popular BERT-based models define their own subword symbol.

For example, in e5 the symbol is ▁ (U+2581):

>>> a = '▁'
>>> a.encode('utf-8')
b'\xe2\x96\x81'
@FFengIll
Author

Furthermore, no rule requires the subword symbol to be ##.

@FFengIll
Author

In the model files, the subword symbol is usually called replacement or continuing_subword_prefix.
It appears in tokenizer.json.
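
As a rough illustration of the point above, the subword symbol could be looked up from a parsed tokenizer.json. This is only a sketch: the helper name find_subword_symbol is hypothetical, and the layout it assumes (a "model" section that may carry continuing_subword_prefix, and a "pre_tokenizer" section, possibly a sequence, that may carry replacement) should be checked against the actual file of each model.

```python
import json


def find_subword_symbol(config):
    """Return (field_name, symbol) from a parsed tokenizer.json, or None.

    Hypothetical helper; the key layout is an assumption, not a guarantee.
    """
    # WordPiece-style models: the prefix marks pieces that continue a word.
    model = config.get("model") or {}
    prefix = model.get("continuing_subword_prefix")
    if prefix:
        return ("continuing_subword_prefix", prefix)

    # Metaspace-style tokenizers: the replacement symbol marks word starts.
    pre = config.get("pre_tokenizer") or {}
    # A sequence of pre-tokenizers is assumed to sit under "pretokenizers".
    for entry in pre.get("pretokenizers", [pre]):
        if entry.get("replacement"):
            return ("replacement", entry["replacement"])
    return None


# Minimal stand-ins for real files; in practice use json.load(open(path)).
wordpiece_like = json.loads(
    '{"model": {"type": "WordPiece", "continuing_subword_prefix": "##"}}'
)
metaspace_like = json.loads(
    '{"model": {"type": "Unigram"},'
    ' "pre_tokenizer": {"type": "Metaspace", "replacement": "\\u2581"}}'
)

print(find_subword_symbol(wordpiece_like))  # ('continuing_subword_prefix', '##')
print(find_subword_symbol(metaspace_like))  # ('replacement', '▁')
```

Returning the field name together with the symbol matters because the two fields mean different things: one marks the continuation of a word, the other marks its start.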

@skeskinen
Owner

Hi, I was also wondering about the subword rules with regard to #31.
I remember trying to get the tokens from the tokenizer, like you did in the PR.
But I also remember running into some issue with the subwords when I tried this.

Does the code in #31 handle subwords?
Do you have an idea of how to handle models like e5?

Also, unrelated, but a thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.

@FFengIll
Author

@skeskinen No, #31 only makes the vocab unnecessary (because it may be missing).

This issue is a separate problem with subwords (I found it because I hit too many unknown tokens when using e5).

Below are some token samples from BERT-based models.

In m3e, the subword symbol is ##, as in many BERT models.

"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,

In e5, the subword symbol is ▁, since they trained a new tokenizer (below is an excerpt from tokenizer.json):

      [
        "▁si",
        -7.355116367340088
      ],
      [
        "▁ja",
        -7.370460510253906
      ],
      [
        "▁za",
        -7.37307596206665
      ],
      [
        "▁v",
        -7.385393142700195
      ],
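
To make the difference between the two conventions concrete, here is a minimal sketch of a greedy longest-match split with a configurable subword symbol. It is a simplification and not the code of this project or of either tokenizer: the function tokenize_word, its parameters, and the extra vocab entry "a" are hypothetical, and the scores in the e5 excerpt suggest a score-based model, which this sketch ignores. It only shows where the symbol goes: ## marks pieces that continue a word, while ▁ marks the piece that starts one.

```python
def tokenize_word(word, vocab, prefix="##", marks_word_start=False, unk="[UNK]"):
    """Greedy longest-match split of a single word (illustrative only)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if marks_word_start:
                # Metaspace-like: only the first piece carries the symbol.
                if start == 0:
                    piece = prefix + piece
            elif start > 0:
                # WordPiece-like: every piece after the first carries it.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Entries taken from the samples above; "a" is added for illustration.
m3e_like = {"##a", "03", "09", "08", "28", "##2"}
e5_like = {"▁si", "▁ja", "▁za", "▁v", "a"}

print(tokenize_word("08a", m3e_like))  # ['08', '##a']
print(tokenize_word("va", e5_like, prefix="▁", marks_word_start=True))  # ['▁v', 'a']
print(tokenize_word("si", e5_like))  # ['[UNK]'] when ## is assumed
```

The last call shows how assuming ## against an e5-style vocab can turn an ordinary word into an unknown token, which matches the symptom described above.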

@FFengIll
Author

For now, I do not have a good solution for this issue, so I have not opened a PR for it.
We may need more research and discussion.

@cgisky1980

cgisky1980 commented Sep 19, 2023

For now, I do not have a good solution for this issue, so I have not opened a PR for it. We may need more research and discussion.

Keep it up! We need cross-platform Chinese and English vectorization. The multilingual version of E5 is quite good.
