This repo contains a from-scratch Python implementation of word embeddings using two methods:
- Frequency-based embedding: the co-occurrence matrix method, used to obtain embeddings for the words occurring in a given corpus.
- Prediction-based embedding: the Word2vec method for training word representations, implemented here with the CBOW (continuous bag-of-words) approach.
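
CBOW predicts a word from the words around it, so training starts by turning each sentence into (context, target) pairs. The snippet below is only a rough sketch of that step and is not taken from part2.py; the function name `cbow_pairs` and the window size are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            pairs.append((context, target))
    return pairs


pairs = cbow_pairs(["i", "like", "deep", "learning"], window=1)
for context, target in pairs:
    print(context, "->", target)
```

A CBOW model would then average (or sum) the context vectors and be trained to predict the target word from that average.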
 
The code uses the following Python modules:
- numpy
- collections
- re
- sklearn
- gensim
- scipy (imported for the SVD step in Model 1)
 
The models were trained on the following data: LINK

To run the models:
- python3 part1.py - runs Model 1, which uses the co-occurrence matrix and SVD.
- python3 part2.py - runs Model 2, which uses the Word2vec CBOW model.

Link for embeddings - https://drive.google.com/drive/folders/1cK0aUM3likmKcisz2nK9yQlyPqBIioHi?usp=sharing
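
The co-occurrence matrix is built by counting adjacent word pairs in each tokenized review:
for review in splitreviews:
    # Count every pair of neighbouring words in both directions so the matrix stays symmetric.
    for i in range(len(review) - 1):
        matrix[counts[review[i]]][counts[review[i+1]]] += 1
        matrix[counts[review[i+1]]][counts[review[i]]] += 1
Here matrix is a two-dimensional array indexed by word, and counts maps each word to its row/column index in that array.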
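
As an example, consider a corpus of three sentences:
- I enjoy flying.
- I like NLP.
- I like deep learning.

The co-occurrence matrix for these sentences has one row and one column per distinct word, and each entry counts how often the two words appear next to each other.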
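
The matrix for the three example sentences can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming a window of one word on each side and a tokenizer that simply drops the final period and splits on whitespace; the names `tokenized` and `word_to_index` are illustrative and need not match part1.py.

```python
import numpy as np

sentences = ["I enjoy flying.", "I like NLP.", "I like deep learning."]

# Assumed tokenization: drop the trailing period, split on whitespace.
tokenized = [s.rstrip(".").split() for s in sentences]

# Fixed, sorted vocabulary so that row/column order is reproducible.
vocabulary = sorted({word for sent in tokenized for word in sent})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

matrix = np.zeros((len(vocabulary), len(vocabulary)), dtype=int)
for sent in tokenized:
    # Each adjacent pair is counted in both directions (symmetric matrix).
    for left, right in zip(sent, sent[1:]):
        matrix[word_to_index[left], word_to_index[right]] += 1
        matrix[word_to_index[right], word_to_index[left]] += 1

print(vocabulary)
print(matrix)
```

For instance, the entry for the pair ("I", "like") is 2, because "I like" occurs in two of the three sentences.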
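
The matrix is then factorized with SVD, and the first K columns of U are kept as the word embeddings:
from scipy.linalg import svd

# Thin SVD of the co-occurrence matrix.
U, D, VT = svd(matrix, full_matrices=False)

# K is the embedding size. The rows of U must be in the same order as vocabulary.
word_embeddings = {}
for index, word in enumerate(vocabulary):
    word_embeddings[word] = U[index][:K]
word_embeddings is a dictionary where the keys are the words and the values are their embeddings.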
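
To see what truncating to K dimensions does, the step above can be tried on a tiny matrix. The sketch below uses made-up counts for a four-word vocabulary and swaps in `numpy.linalg.svd` for `scipy.linalg.svd` so that it depends on NumPy only; the value of `K` is an assumption.

```python
import numpy as np

# Toy symmetric co-occurrence matrix (made-up counts, for illustration only).
vocabulary = ["camera", "lens", "good", "bad"]
matrix = np.array([
    [0, 3, 2, 1],
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

K = 2  # assumed embedding size

# Singular values in D come back sorted in descending order.
U, D, VT = np.linalg.svd(matrix, full_matrices=False)

# One K-dimensional embedding per word: the first K entries of its row in U.
word_embeddings = {word: U[i, :K] for i, word in enumerate(vocabulary)}

# Rank-K reconstruction, to check how much of the matrix K dimensions retain.
approx = U[:, :K] @ np.diag(D[:K]) @ VT[:K, :]
error = np.linalg.norm(matrix - approx)  # Frobenius norm of the residual

print(word_embeddings["camera"])
print("reconstruction error:", error)
```

The reconstruction error equals the square root of the sum of the squared singular values that were dropped, which gives a simple way to compare different choices of K.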

To find the 10 words most similar to a given word, use the function find_word_embeddings:
from numpy import dot
from numpy.linalg import norm

def find_word_embeddings(searchword):
    # top holds [cosine similarity, word] pairs, best first; placeholders have score 0.
    top = [[0, " "] for _ in range(10)]
    a = word_embeddings[searchword]
    for word in vocabulary:
        if word == searchword:
            continue
        b = word_embeddings[word]
        cos_sim = dot(a, b)/(norm(a)*norm(b))
        # Insert the word at its ranked position and drop the lowest entry.
        for index, item in enumerate(top):
            if cos_sim > item[0]:
                top.insert(index, [cos_sim, word])
                top.pop(10)
                break
    return top
# Example
top = find_word_embeddings("camera")
print(top)

t-SNE plots for Model 1 (co-occurrence matrix) for the words 'camera', 'product', 'good', 'strong' and 'look'.

t-SNE plots for Model 2 (CBOW Word2vec) for the words 'camera', 'product', 'good', 'strong' and 'look'.
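
For larger vocabularies, the nearest-neighbour search in find_word_embeddings could also be written as one matrix operation instead of a Python loop. The following is a sketch under assumptions, not the repo's code: `most_similar` and its `topn` parameter are hypothetical names, the toy vectors are made up, and the search word is assumed to have at least one other word to compare against.

```python
import numpy as np


def most_similar(searchword, word_embeddings, topn=10):
    """Rank all other words by cosine similarity to searchword."""
    words = [w for w in word_embeddings if w != searchword]
    vectors = np.array([word_embeddings[w] for w in words], dtype=float)
    query = np.asarray(word_embeddings[searchword], dtype=float)

    # Cosine similarity of every row with the query; zero vectors score 0.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = np.divide(vectors @ query, norms,
                     out=np.zeros(len(words)), where=norms > 0)

    order = np.argsort(-sims)[:topn]  # indices of the highest scores first
    return [(float(sims[i]), words[i]) for i in order]


# Made-up 2-D embeddings, for illustration only.
toy = {
    "camera": np.array([1.0, 0.0]),
    "lens": np.array([0.9, 0.1]),
    "good": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}
result = most_similar("camera", toy, topn=2)
print(result)
```

Unlike find_word_embeddings, this version does not pad the result with placeholder entries, and it can return words with zero or negative similarity when fewer than `topn` positive matches exist.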


