About the final word embeddings. #27

Open
hasan-sayeed opened this issue Aug 30, 2021 · 4 comments

hasan-sayeed commented Aug 30, 2021

I just trained a model on my own corpus. The corpus contains space group numbers, which I replaced with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with element names. But when I try to get the final embeddings from the model, it says some space group tokens (Xx105, Xx139, etc.) are not in the vocabulary, regardless of their frequency. Why is this happening? I've tried looking through the code and couldn't figure it out.


jdagdelen commented Aug 30, 2021

Are you sure those tokens occur enough times in the corpus to get their own spots in the vocabulary? This repo uses Gensim's Word2Vec implementation, which builds the vocabulary by applying a min_count cutoff. It could be that some of your special tokens don't occur frequently enough in your corpus to make the cut.

See the docs for Gensim's Word2Vec. Check out the min_count, max_vocab_size, and max_final_vocab parameters. You can also use trim_rule to ensure your special tokens are included in the final vocab.
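
A minimal sketch (not mat2vec's actual training script) of how a Gensim trim_rule can force the Xx space-group tokens past the min_count cutoff. The corpus filename, the min_count value, and the token regex are placeholders, not values from this repo:

```python
import re

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.utils import RULE_DEFAULT, RULE_KEEP

# Hypothetical pattern for the special space-group tokens Xx1 ... Xx230.
SPACE_GROUP_TOKEN = re.compile(r"^Xx\d{1,3}$")


def keep_space_groups(word, count, min_count):
    """Always keep space-group tokens; let Gensim decide for everything else."""
    if SPACE_GROUP_TOKEN.match(word):
        return RULE_KEEP
    return RULE_DEFAULT


model = Word2Vec(
    sentences=LineSentence("corpus"),  # placeholder: one pre-tokenized sentence per line
    min_count=5,                       # placeholder value
    trim_rule=keep_space_groups,
)

# Sanity check that a special token made it into the vocabulary
# (gensim >= 4.0 uses key_to_index; gensim 3.x uses model.wv.vocab).
print("Xx105" in model.wv.key_to_index)
```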

@hasan-sayeed

Yeah, those tokens occur more often than my --min_count. I tried using trim_rule as well, but the same thing happens. I also tried with different corpus files (basically deleting half of the data from the original file each time) and it seems to miss different words every time. I'm guessing there is something wrong with my tokenization. This reminds me: when I run the processing step to get my corpus ready and then try to train the model on that file, it throws an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 165: invalid start byte. So I replaced the contents of the corpus_example file with my data and then it ran fine. Could this be the issue? What can I do to solve this UnicodeDecodeError?
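
One hedged way to address that error: byte 0x96 is the en dash in Windows-1252 (cp1252), so the corpus file was most likely saved in that encoding rather than UTF-8. A minimal sketch of re-encoding it, assuming cp1252 is indeed the source encoding (the filenames are placeholders):

```python
# Read the corpus as Windows-1252 and rewrite it as UTF-8.
with open("corpus_raw.txt", "r", encoding="cp1252") as src:
    text = src.read()

with open("corpus_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```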


jdagdelen commented Aug 30, 2021

Your issues seem to be specific to how Gensim is interacting with your corpus and how it builds the vocabulary, not necessarily tokenization. I think you may want to bring this question to the Gensim mailing list/support group. https://groups.google.com/g/gensim

@jdagdelen

FWIW, I don't think the UTF-8 decoding error is related, but to be sure, can you please confirm which version of Python you are using and check that your corpus is encoded properly and doesn't contain any illegal characters?
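
A quick, hypothetical way to check for illegal (non-UTF-8) bytes and report where they occur, so the offending lines can be inspected or cleaned (the filename is a placeholder):

```python
# Scan the corpus line by line and report any bytes that fail UTF-8 decoding.
with open("corpus_utf8.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")
```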
