This project uses word embeddings (word2vec-style word vectors) to study the similarity between words. Using pre-trained word vectors from Stanford's GloVe project (generated by the GloVe unsupervised learning algorithm), it implements a system for finding similar words and solving word analogies. The project showcases skills in handling large datasets, natural language processing, and implementing efficient algorithms.
- Word Similarity Search:
  - Find Closest Words: Given a word, find the n closest words based on word vector similarities.
  - Partial Analogies: Solve analogies of the form "x is to y as z is to _____".
- Efficient Data Handling:
  - Preprocess large word vector files into efficient binary formats using NumPy for faster loading and processing.
- Interactive Command-Line Interface:
  - Provides a user-friendly command-line interface for entering words or partial analogies and getting real-time results.
- Advanced Visualization:
  - Optional visualization of word vectors using PCA to project high-dimensional data into 2D space, enhancing understanding of word relationships.
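
To illustrate the PCA projection mentioned above, here is a minimal sketch that reduces a handful of vectors to 2D coordinates. It is not the project's own code: it computes PCA with NumPy's SVD rather than scikit-learn so the example has a single dependency, the function name `pca_2d` is hypothetical, and the toy vectors are made-up values rather than real GloVe data.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy 4-dimensional "embeddings" (invented values, for illustration only).
words = ["king", "queen", "man", "woman"]
toy = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.1, 0.8, 0.0],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
])

coords = pca_2d(toy)
for word, (x, y) in zip(words, coords):
    print(f"{word:>6}: ({x:+.3f}, {y:+.3f})")
```

The resulting `coords` array has one (x, y) row per word and could be passed to a Matplotlib scatter plot with the words as point labels.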
- Python: Core language used for implementation
- NumPy: Efficient numerical computations and array manipulations
- Natural Language Processing: Application of word embeddings for semantic analysis
- Data Processing: Handling and optimizing large datasets
- Algorithm Design: Implementation of similarity and analogy algorithms
- scikit-learn (sklearn): For PCA-based dimensionality reduction
- Matplotlib: For data visualization
- Efficient Data Loading:
  - Converts the 5 GB text file to an optimized NumPy binary format.
  - Reduces load time from minutes to seconds.
- Smart Vocabulary Management:
  - Restricts vocabulary to common words for faster processing without significant loss of functionality.
- Vector Operations:
  - Implements Euclidean distance calculations for word similarities.
  - Utilizes vector arithmetic for analogy completions.
- User Interface:
  - Interactive command-line interface for word queries and analogies.
- Extensibility:
  - Optional visualization component for word relationships.
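
The vector operations listed above can be sketched as follows. This is an illustrative example under stated assumptions, not the project's implementation: the function names `closest_words` and `solve_analogy` are hypothetical, the analogy target is assumed to be vec(z) + vec(y) - vec(x), and the 2D vectors are invented so the arithmetic is easy to check by hand.

```python
import numpy as np

def closest_words(vectors, words, query_vec, n=5, exclude=()):
    """Return the n words with the smallest Euclidean distance to query_vec."""
    dists = np.linalg.norm(vectors - query_vec, axis=1)
    result = []
    for i in np.argsort(dists):
        if len(result) >= n:
            break
        if words[i] not in exclude:
            result.append(words[i])
    return result

def solve_analogy(vectors, words, index, x, y, z, n=1):
    """Complete 'x is to y as z is to ?' by searching near vec(z) + vec(y) - vec(x)."""
    target = vectors[index[z]] + vectors[index[y]] - vectors[index[x]]
    # The three query words are excluded so they are not returned as answers.
    return closest_words(vectors, words, target, n=n, exclude={x, y, z})

# Toy 2D vectors (invented values, not real GloVe data).
words = ["man", "woman", "king", "queen", "apple"]
vectors = np.array([
    [1.0, 0.0],    # man
    [1.0, 1.0],    # woman
    [3.0, 0.0],    # king
    [3.0, 1.0],    # queen
    [-2.0, -2.0],  # apple
])
index = {w: i for i, w in enumerate(words)}

near_king = closest_words(vectors, words, vectors[index["king"]], n=2, exclude={"king"})
answer = solve_analogy(vectors, words, index, "man", "woman", "king")
print(near_king)  # ['queen', 'man']
print(answer)     # ['queen']
```

Computing all distances in one `np.linalg.norm` call keeps the search vectorized, which matters once the vocabulary holds many thousands of words.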
- Handling Large Datasets:
  - Preprocessed large GloVe word vector files into a more efficient binary format to speed up data loading.
- Efficient Similarity Computations:
  - Implemented Euclidean distance calculations and vector arithmetic for fast and accurate word similarity and analogy solving.
- Interactive User Interface:
  - Developed a command-line interface for a seamless user experience.
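
As a rough illustration of how a command-line interface like the one described above might dispatch queries, the sketch below treats a single word as a similarity query and three words as a partial analogy. The input format, the function names `parse_query` and `run_cli`, and the stand-in handlers are all assumptions made for this example; the actual program's input syntax is not documented here.

```python
def parse_query(line):
    """Classify a line of input: one word is 'similar', three words is 'analogy'."""
    tokens = line.lower().split()
    if len(tokens) == 1:
        return ("similar", tokens)
    if len(tokens) == 3:
        return ("analogy", tokens)
    return ("invalid", tokens)

def run_cli(lines, handlers):
    """Yield one response per input line, using the handler for each query kind."""
    for line in lines:
        kind, tokens = parse_query(line)
        if kind == "invalid":
            yield "Enter one word, or three words x y z for 'x is to y as z is to ?'."
        else:
            yield handlers[kind](*tokens)

# Stand-in handlers; a real program would call its similarity and analogy code.
handlers = {
    "similar": lambda w: f"closest words to {w}",
    "analogy": lambda x, y, z: f"{x} is to {y} as {z} is to ?",
}

outputs = list(run_cli(["King", "man woman king", ""], handlers))
for out in outputs:
    print(out)
```

For interactive use, `run_cli` could be given `sys.stdin` in place of the list of lines, since it accepts any iterable of strings.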
- Implement more advanced text processing techniques (e.g., stemming, removing stop words).
- Enhance analogy solving with support for more complex queries.
- Develop a web-based front-end for a more intuitive user interface.
- Optimize further by incorporating parallel processing for large datasets.
- Clone the repository:
  git clone https://github.com/amitch2019/word-similarity-relationships.git
- Navigate to the project directory:
  cd word-similarity-relationships
- Preprocess the GloVe data:
  python save_np.py ~/data
- Run the word similarity and analogy program:
  python wordsim.py ~/data
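
The internals of `save_np.py` are not shown here, so the following is only a sketch of the kind of preprocessing described earlier: reading a GloVe-style text file (one word followed by space-separated floats per line) and saving it as a NumPy binary file plus a word list. The function name `convert_to_binary`, the file names, and the `max_words` cutoff are hypothetical; the cutoff assumes that the most common words appear first in the file.

```python
import os
import tempfile
import numpy as np

def convert_to_binary(text_path, vectors_path, words_path, max_words=None):
    """Convert 'word v1 v2 ...' text lines into a .npy matrix and a word list."""
    words, rows = [], []
    with open(text_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_words is not None and i >= max_words:
                break  # keep only the first max_words entries
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append([float(v) for v in parts[1:]])
    vectors = np.asarray(rows, dtype=np.float32)
    np.save(vectors_path, vectors)
    with open(words_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))
    return vectors.shape

# Demonstration on a tiny temporary file (invented values, not real GloVe data).
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "vectors.txt")
    vectors_path = os.path.join(tmp, "vectors.npy")
    words_path = os.path.join(tmp, "words.txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\nof 0.4 0.5 0.6\nzyzzyva 0.7 0.8 0.9\n")

    shape = convert_to_binary(text_path, vectors_path, words_path, max_words=2)
    loaded = np.load(vectors_path)
    with open(words_path, encoding="utf-8") as f:
        loaded_words = f.read().split("\n")

print(shape, loaded_words)
```

Loading the saved array with `np.load` avoids re-parsing every float from text on each run, which is where most of the time goes when reading a large vector file.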
- The source code for this project is not publicly available at the moment.
- However, it can be shared with interested persons upon request.
- Please contact me directly to request access to the code.
Contact
For any questions or inquiries, please contact me at chaubey.amit@gmail.com.