ANN

Approximate Nearest Neighbors Algorithms

Overview

This code extracts text from PDF files, preprocesses it, and applies Natural Language Processing (NLP) techniques to vectorize words with several transformer models. Its main purpose is to find the 15 nearest neighbors of a single target word (in this case, "Inflation"). It uses three approximate nearest neighbor libraries to run that search: Annoy, FAISS, and HNSWlib.
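To make the goal concrete, here is a minimal sketch of the exact (brute-force) version of the search that Annoy, FAISS, and HNSWlib approximate. It is not taken from the notebook: the function names, the use of cosine similarity, and the tiny 3-dimensional `toy_vectors` are all assumptions for illustration. Real transformer embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_nearest_neighbors(target, word_vectors, k=15):
    """Score every other word against the target and keep the k most similar."""
    target_vec = word_vectors[target]
    scored = [
        (word, cosine_similarity(target_vec, vec))
        for word, vec in word_vectors.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, made up for this example.
toy_vectors = {
    "inflation": [0.9, 0.1, 0.0],
    "prices": [0.8, 0.2, 0.1],
    "deflation": [0.7, -0.1, 0.2],
    "banana": [0.0, 0.1, 0.9],
}

neighbors = exact_nearest_neighbors("inflation", toy_vectors, k=2)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

The brute-force scan compares the target against every word, so its cost grows linearly with vocabulary size. Approximate indexes trade a small amount of accuracy for much faster lookups on large vocabularies.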

Steps

PDF Text Extraction: Utilizes PyPDF2 to read PDF files and extract text.
Text Tokenization and Preprocessing: Tokenizes the extracted text and filters out non-alphabetic tokens using nltk.
Model Loading and Vectorization: Loads multiple transformer models with their tokenizers (GPT-2 (small, medium, large), RoBERTa (small, large), BERT (small, large), and XLNet) to convert tokens into vectors.
Index Building: Creates indexes using Annoy, FAISS, and HNSWlib for efficient nearest neighbor searches.
Nearest Neighbor Search: Implements search functionality to find words closest to a given target word in the vector space.
Result Saving: Saves the nearest-neighbor results to CSV and Excel files for further analysis.
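The index-building and search steps above can be sketched as follows. This is an illustrative outline rather than the notebook's actual code. The random toy vectors stand in for transformer embeddings, and the helper names are hypothetical. The parameter values (10 trees, `M=16`, `ef_construction=200`, `ef=50`), the distance metrics, and the choice of `faiss.IndexFlatL2` (an exact, brute-force FAISS index) are assumptions. Each backend is imported lazily, so a missing library is skipped rather than raising an error.

```python
import random


def build_toy_vectors(n=200, dim=16, seed=0):
    """Random stand-ins for word embeddings (assumption: real ones come from a transformer)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def search_annoy(vectors, query, k):
    from annoy import AnnoyIndex
    index = AnnoyIndex(len(query), "angular")  # angular distance is cosine-like
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(10)  # 10 trees; more trees give better recall and a larger index
    return index.get_nns_by_vector(query, k)


def search_faiss(vectors, query, k):
    import faiss
    import numpy as np
    data = np.asarray(vectors, dtype="float32")  # FAISS expects float32 arrays
    index = faiss.IndexFlatL2(data.shape[1])     # exact L2 search, used here as a baseline
    index.add(data)
    _, ids = index.search(np.asarray([query], dtype="float32"), k)
    return ids[0].tolist()


def search_hnswlib(vectors, query, k):
    import hnswlib
    import numpy as np
    data = np.asarray(vectors, dtype="float32")
    index = hnswlib.Index(space="cosine", dim=data.shape[1])
    index.init_index(max_elements=len(data), ef_construction=200, M=16)
    index.add_items(data, np.arange(len(data)))
    index.set_ef(50)  # query-time search breadth; must be at least k
    labels, _ = index.knn_query(np.asarray([query], dtype="float32"), k=k)
    return labels[0].tolist()


def search_all(vectors, query, k=15):
    """Run every backend that is installed and return {backend name: neighbor ids}."""
    backends = (("annoy", search_annoy), ("faiss", search_faiss), ("hnswlib", search_hnswlib))
    results = {}
    for name, search in backends:
        try:
            results[name] = search(vectors, query, k)
        except ImportError:
            continue  # library not installed; skip this backend
    return results


vectors = build_toy_vectors()
results = search_all(vectors, vectors[0], k=15)
for name, ids in results.items():
    print(name, ids)
```

In a real run, the integer ids returned by each index would be mapped back to the vocabulary words before saving the results.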

Requirements

Make sure to install the required libraries:

pip install numpy torch PyPDF2 nltk faiss-cpu hnswlib annoy pandas transformers
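After installing, a quick check like the one below can confirm that each package is importable. It is a small convenience sketch, not part of the original project, and the `PIP_TO_MODULE` mapping and `check_requirements` helper are hypothetical names. Note that the `faiss-cpu` package is imported as `faiss`.

```python
import importlib.util

# pip package name -> module name used in `import` statements
PIP_TO_MODULE = {
    "numpy": "numpy",
    "torch": "torch",
    "PyPDF2": "PyPDF2",
    "nltk": "nltk",
    "faiss-cpu": "faiss",
    "hnswlib": "hnswlib",
    "annoy": "annoy",
    "pandas": "pandas",
    "transformers": "transformers",
}


def check_requirements():
    """Return {pip name: True if the module can be found, else False}."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in PIP_TO_MODULE.items()
    }


status = check_requirements()
for pip_name, installed in status.items():
    print(f"{pip_name:13s} {'ok' if installed else 'MISSING'}")
```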

This code is part of a larger project about LLMs and AI Agents.

alizarif/ANN
