About the final word embeddings. #27
Comments
Are you sure that those tokens occur enough times in the corpus to get their own spots in the vocabulary? This repo uses Gensim's Word2Vec implementation, which constructs the vocabulary using a min_count cutoff. It could be the case that some of your special tokens don't occur frequently enough in your corpus to make the cut. Check out the docs for Gensim Word2Vec.
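
For illustration, a minimal sketch (assuming the Gensim 4.x API; the toy sentences and token names below are made up) of how min_count decides which tokens get a vocabulary slot:

```python
from gensim.models import Word2Vec

# Toy corpus: "Xx1" appears 5 times, "Xx105" only once.
sentences = [["Xx1", "material", "cubic"]] * 5 + [["Xx105", "material"]]

# Tokens seen fewer than min_count times are dropped when the vocabulary is built.
model = Word2Vec(sentences, vector_size=50, min_count=5, workers=1)

print("Xx1" in model.wv.key_to_index)    # True
print("Xx105" in model.wv.key_to_index)  # False -> model.wv["Xx105"] raises KeyError
```

In Gensim 3.x the vocabulary lives in model.wv.vocab rather than model.wv.key_to_index.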
Yeah, those tokens occur more often than my min_count.
Your issues seem to be specific to how Gensim is interacting with your corpus and how it builds its vocabulary, not necessarily to tokenization. I think you may want to bring this question to the Gensim mailing list/support group: https://groups.google.com/g/gensim
FWIW, I don't think the UTF-8 decoding error is related, but to be sure, can you please confirm which version of Python you are using and check that your corpus is encoded properly and doesn't contain any illegal characters?
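
If it helps, one way to locate bad bytes is to read the corpus in binary mode and attempt strict UTF-8 decoding line by line ("corpus.txt" is just a placeholder path, not a file from this repo):

```python
# Scan a corpus file and report any line that is not valid UTF-8.
with open("corpus.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as e:
            print(f"line {lineno}: {e}")
```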
I just trained a model on my own corpus. It contains space group numbers, which I replaced with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with some element names. But when I tried to get the final embeddings from the model, it says some space group numbers (Xx105, Xx139, etc.) are not in the vocabulary, independent of their frequency! Why is this happening? I've tried to look through the code and couldn't figure it out.
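
For anyone hitting the same error, a small sketch (Gensim 4.x API; "model.bin" is a placeholder path, not part of this repo) of checking which replacement tokens actually made it into the trained vocabulary before requesting their vectors:

```python
from gensim.models import Word2Vec

# Load an already-trained Word2Vec model (placeholder path).
model = Word2Vec.load("model.bin")

space_group_tokens = [f"Xx{i}" for i in range(1, 231)]

# Split the tokens into those present in the vocabulary and those that were dropped.
present = [t for t in space_group_tokens if t in model.wv.key_to_index]
missing = [t for t in space_group_tokens if t not in model.wv.key_to_index]
print(f"{len(present)} in vocabulary, {len(missing)} missing, e.g. {missing[:5]}")

# Safe lookup: only request vectors for tokens that are actually in the vocabulary.
embeddings = {t: model.wv[t] for t in present}
```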