- Word2Vec is a foundational method for learning distributed word representations: it maps each word to a dense vector in a multi-dimensional space. Word2Vec comes in two variants, Continuous Bag of Words (CBoW) and Skip-Gram. CBoW predicts a target word from the words around it (the surrounding, or context, words), while Skip-Gram does the reverse and predicts the surrounding words from the target word. This project implements both the CBoW and Skip-Gram word embedding methods.
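
To make the difference between the two variants concrete, the following is a minimal sketch of how training examples could be built from a tokenized sentence. The function names (`context_pairs`, `cbow_examples`, `skipgram_examples`) are hypothetical and are not taken from `my_cbow.py` or `my_skipgram.py`; the sketch also assumes that `window` counts the words on each side of the target.

```python
def context_pairs(tokens, window=2):
    """Pair every token with the words inside its context window.

    Assumption: `window` is the number of words taken on EACH side.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs


def cbow_examples(tokens, window=2):
    # CBoW: the surrounding words are the input, the target word is the label.
    return [(context, target) for target, context in context_pairs(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-Gram: the target word is the input, each surrounding word is a label.
    return [
        (target, context_word)
        for target, context in context_pairs(tokens, window)
        for context_word in context
    ]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
cbow_data = cbow_examples(tokens, window=2)
skipgram_data = skipgram_examples(tokens, window=2)
```

In an actual model the words would be replaced by vocabulary indices and wrapped in tensors before training; that step is left out here to keep the sketch free of dependencies.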
- text_processing.py
Output format
- output: Tokenized result of the given text (list).
- my_skipgram.py; my_cbow.py
Output format
- output: List of tensors of the input tokens.
- argparse
- stanza
- spacy
- nltk
- gensim
- torch
- nlp_pipeline (str, defaults to "spacy"): Tokenization method (spacy, stanza, nltk, gensim).
- encoding (str, defaults to "bert"): Encoding method (onehot, skipgram, cbow, elmo, bert).
- emb_dim (int, defaults to 10): Size of the word embeddings.
- hidden_dim (int, defaults to 128): Hidden layer size for skipgram and cbow.
- window (int, defaults to 2): Context window size for skipgram and cbow.
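
The parameters above could be exposed on the command line with `argparse`, roughly as sketched below. The `--` flag spelling, the `build_parser` helper, and the use of `choices` to restrict values are assumptions for illustration; the project's actual parser may be set up differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented parameters and defaults."""
    parser = argparse.ArgumentParser(description="Word embedding options")
    parser.add_argument("--nlp_pipeline", type=str, default="spacy",
                        choices=["spacy", "stanza", "nltk", "gensim"],
                        help="Tokenization method")
    parser.add_argument("--encoding", type=str, default="bert",
                        choices=["onehot", "skipgram", "cbow", "elmo", "bert"],
                        help="Encoding method")
    parser.add_argument("--emb_dim", type=int, default=10,
                        help="Size of the word embeddings")
    parser.add_argument("--hidden_dim", type=int, default=128,
                        help="Hidden layer size for skipgram and cbow")
    parser.add_argument("--window", type=int, default=2,
                        help="Context window size for skipgram and cbow")
    return parser


# Example: select CBoW with a wider window; other options keep their defaults.
args = build_parser().parse_args(["--encoding", "cbow", "--window", "3"])
```

Note that the documented default for `encoding` is "bert", so `cbow` or `skipgram` has to be selected explicitly to use the Word2Vec variants.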
