About the final word embeddings. #27

Open
hasan-sayeed opened this issue Aug 30, 2021 · 4 comments

hasan-sayeed commented Aug 30, 2021

I just trained a model on my own corpus. The corpus contains space group numbers, which I replaced with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with element names. But when I try to get the final embeddings from the model, it says some space group tokens (Xx105, Xx139, etc.) are not in the vocabulary, regardless of their frequency. Why is this happening? I've tried looking through the code and couldn't figure it out.


jdagdelen commented Aug 30, 2021

Are you sure those tokens occur enough times in the corpus to get their own spots in the vocabulary? This repo uses Gensim's Word2Vec implementation, which builds the vocabulary by applying a min_count cutoff. It could be that some of your special tokens don't occur frequently enough in your corpus to make the cut.

See the docs for Gensim's Word2Vec. Check out the min_count, max_vocab_size, and max_final_vocab parameters. You can also use trim_rule to ensure your special tokens are included in the final vocab.
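
A minimal sketch (not mat2vec's actual training script) of how a Gensim trim_rule can force the Xx space-group tokens past the min_count cutoff. The corpus filename, the min_count value, and the token regex are placeholders, not values from this repo:

```python
import re

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.utils import RULE_DEFAULT, RULE_KEEP

# Hypothetical pattern for the special space-group tokens Xx1 ... Xx230.
SPACE_GROUP_TOKEN = re.compile(r"^Xx\d{1,3}$")


def keep_space_groups(word, count, min_count):
    """Always keep space-group tokens; let Gensim decide for everything else."""
    if SPACE_GROUP_TOKEN.match(word):
        return RULE_KEEP
    return RULE_DEFAULT


model = Word2Vec(
    sentences=LineSentence("corpus"),  # placeholder: one pre-tokenized sentence per line
    min_count=5,                       # placeholder value
    trim_rule=keep_space_groups,
)

# Sanity check that a special token made it into the vocabulary
# (gensim >= 4.0 uses key_to_index; gensim 3.x uses model.wv.vocab).
print("Xx105" in model.wv.key_to_index)
```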

@hasan-sayeed

Yeah, those tokens occur more often than my --min_count. I tried using trim_rule as well, but the same thing happens. I also tried with different corpus files (basically deleting half of the data from the original file each time) and it seems to miss different words every time. I'm guessing there is something wrong with my tokenization. This reminds me: when I run the processing step to get my corpus ready and then try to train the model on that file, it throws an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 165: invalid start byte. So I replaced the contents of the corpus_example file with my data and then it ran fine. Could this be the issue? What can I do to solve this UnicodeDecodeError?
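
One hedged way to address that error: byte 0x96 is the en dash in Windows-1252 (cp1252), so the corpus file was most likely saved in that encoding rather than UTF-8. A minimal sketch of re-encoding it, assuming cp1252 is indeed the source encoding (the filenames are placeholders):

```python
# Read the corpus as Windows-1252 and rewrite it as UTF-8.
with open("corpus_raw.txt", "r", encoding="cp1252") as src:
    text = src.read()

with open("corpus_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```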


jdagdelen commented Aug 30, 2021

Your issues seem to be specific to how Gensim is interacting with your corpus and how it builds the vocabulary, not necessarily tokenization. I think you may want to bring this question to the Gensim mailing list/support group. https://groups.google.com/g/gensim

@jdagdelen

FWIW, I don't think the UTF-8 decoding error is related, but to be sure, can you please confirm which version of Python you are using and check that your corpus is encoded properly and doesn't contain any illegal characters?
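
A quick, hypothetical way to check for illegal (non-UTF-8) bytes and report where they occur, so the offending lines can be inspected or cleaned (the filename is a placeholder):

```python
# Scan the corpus line by line and report any bytes that fail UTF-8 decoding.
with open("corpus_utf8.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")
```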
