A Node.js library that reads sentences from a PDF file and identifies the closest match to a given input sentence using TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity. As an attempt to reduce the influence of embedding models on semantic search, this library matches by the distance score of each token.
- The system uses TF-IDF to transform sentences into numerical vectors for analysis.
- Determines the best matching sentence by calculating cosine similarity.
- Generates a sorted list of sentences along with their similarity scores.
- Node.js installed on your system
- Clone the repository or place the code files into a directory:

```bash
git clone <repository-url>
cd <project-directory>
```

- Install dependencies:

```bash
npm install
```
Run the script using:

```bash
node main.js
```
- Converts text to lowercase.
- Removes punctuation and extra spaces.
- Tokenizes paragraphs into sentences.
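A minimal sketch of these preprocessing steps (the exact implementation in main.js may differ):

```js
// Minimal preprocessing sketch; main.js may differ in details.
function preprocess(paragraph) {
  return paragraph
    .split(/(?<=[.!?])\s+/)        // tokenize the paragraph into sentences
    .map((sentence) =>
      sentence
        .toLowerCase()             // convert to lowercase
        .replace(/[^\w\s]/g, ' ')  // remove punctuation
        .replace(/\s+/g, ' ')      // collapse extra spaces
        .trim()
    );
}
```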
- TF-IDF is used to transform text into a numerical representation.
- Each term's importance is calculated using two factors (one common formulation is sketched after this list):
  - TF (Term Frequency): how often a word appears in a sentence.
  - IDF (Inverse Document Frequency): down-weights words that appear frequently across all sentences.
- The last document added to the TF-IDF model is the query sentence, and similarity is computed between it and each stored sentence.
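One common formulation of these weights looks like this (natural's exact weighting may differ slightly):

TF(t, s) = number of times term t appears in sentence s
IDF(t) = log(N / number of sentences containing t), where N is the total number of sentences
TF-IDF(t, s) = TF(t, s) × IDF(t)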
The similarity between the query vector and each stored sentence's TF-IDF vector is calculated using cosine similarity:

cosine similarity = (A · B) / (||A|| × ||B||), where:

- A = TF-IDF vector of the query sentence
- B = TF-IDF vector of a stored sentence
- ||A|| and ||B|| = magnitude (norm) of each vector
- The result is a similarity score between 0 and 1 (1 being an exact match)
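- For example, if A = [1, 0, 2] and B = [1, 1, 2], then A · B = 5, ||A|| = √5, and ||B|| = √6, giving a similarity of 5 / √30 ≈ 0.91.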
- All sentences are ranked based on their cosine similarity score.
- The highest-scoring sentence is displayed as the best match.
- Other sentences are listed with their similarity scores.
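A hypothetical ranking step might look like the following; it assumes the `tfidf` model and `storedSentences` array described later in this README, plus a `cosineSimilarity()` helper (sketched further below):

```js
// Hypothetical ranking sketch; the query is assumed to be the last document
// added to the model, so its index equals storedSentences.length.
const queryVector = tfidf.listTerms(storedSentences.length);
const results = storedSentences
  .map((sentence, i) => ({
    sentence,
    score: cosineSimilarity(queryVector, tfidf.listTerms(i)),
  }))
  .sort((a, b) => b.score - a.score); // highest similarity first

console.log('Best match:', results[0]);
```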
- natural - NLP library for tokenization and TF-IDF
- csv-parser - Reads CSV files
- pdf-parse - Parses PDF files
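A minimal sketch of extracting text from a PDF with pdf-parse (the file name is a placeholder):

```js
const fs = require('fs');
const pdf = require('pdf-parse');

// Read the PDF and extract its raw text; 'document.pdf' is illustrative.
const dataBuffer = fs.readFileSync('document.pdf');
pdf(dataBuffer).then((data) => {
  console.log(data.text); // extracted text, ready for preprocessing
});
```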
- The cosine similarity metric calculates how similar two sentences or text documents are by comparing the cosine of the angle between their vector representations. The closer the cosine similarity is to 1, the more similar the texts are.
- Data Preprocessing: Tokenizing the sentences and calculating the TF-IDF scores.
- Vectorization: Converting the stored sentences and the query into numerical vectors.
- Cosine Similarity Comparison: The cosine of the angle between the query vector and each stored sentence vector is calculated and compared.
- Before comparing sentences, they need to be tokenized, i.e., split into words.
- Tokenization:

```js
const tokenizer = new natural.WordTokenizer();
```
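- For example, `tokenizer.tokenize('hello, world!')` returns `['hello', 'world']`.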
- TF-IDF Model:

```js
const tfidf = new natural.TfIdf();
```
- TF-IDF stands for Term Frequency-Inverse Document Frequency.
- Term Frequency measures how often a word appears in a sentence.
- Inverse Document Frequency measures the importance of the word across all the stored sentences (to down-weight common words).
- Storing and processing sentences:
```js
// `sentences` holds the preprocessed sentences extracted from the PDF.
sentences.forEach((sentence) => {
  storedSentences.push(sentence.toLowerCase());
  tfidf.addDocument(tokenizer.tokenize(sentence.toLowerCase()));
});
```

- This code converts each sentence to lowercase, tokenizes it, and adds it to the TF-IDF model.
- Now each sentence is transformed into a TF-IDF vector, which is a numerical representation.
- Extracting words and scores from the query:
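A minimal sketch, assuming the query sentence was added to the model last so that its document index equals the number of stored sentences:

```js
// List each term of the query document with its TF-IDF score.
const queryIndex = storedSentences.length; // assumes the query is the last document
tfidf.listTerms(queryIndex).forEach((item) => {
  console.log(`${item.term}: ${item.tfidf.toFixed(4)}`);
});
```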
This project uses the `natural` library in JavaScript to process text data using tokenization and TF-IDF (Term Frequency-Inverse Document Frequency) calculations. The library helps break sentences down into meaningful words and determine their importance within a dataset.
- You'll need to set up the tokenizer like this:

```js
const natural = require('natural');
const tokenizer = new natural.WordTokenizer();
```
- It breaks sentences into individual words, called tokens, which is an essential step for any kind of text processing.
- Tokenized words can be used for further computations, like calculating cosine similarity.
- You'll want to create an instance of TF-IDF, which you can do this way:

```js
const tfidf = new natural.TfIdf();
```
- TF-IDF is about measuring how important a word is in a particular sentence compared to all the sentences in your dataset.
- Each sentence is treated as a separate document.
- You add sentences like this:

```js
tfidf.addDocument(tokenizer.tokenize(sentence.toLowerCase()));
```
- This whole process calculates TF-IDF scores for words in the sentence.
- When you submit a new query sentence, the library generates a TF-IDF vector for it and compares it with the stored sentences.
- To retrieve the TF-IDF vector for a sentence:

```js
const vector = tfidf.listTerms(index);
```

- For the query sentence (the last document added, so its index equals the number of stored sentences):

```js
const queryVector = tfidf.listTerms(storedSentences.length);
```
- These vectors are used in the `cosineSimilarity()` function to determine the most similar stored sentence.
- Cosine similarity measures the similarity between two sentences based on their TF-IDF vectors.
- The function `cosineSimilarity(vector1, vector2)` computes the similarity score between two vectors.
- Higher similarity scores indicate more relevant matches.
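A possible implementation of `cosineSimilarity()` is sketched below; the actual function in main.js may differ. It treats each vector as the array of `{ term, tfidf }` objects returned by `tfidf.listTerms()`:

```js
function cosineSimilarity(vector1, vector2) {
  // Convert each listTerms() result into a term -> score map.
  const toMap = (vector) => new Map(vector.map((item) => [item.term, item.tfidf]));
  const scores1 = toMap(vector1);
  const scores2 = toMap(vector2);

  // Dot product over the terms the two vectors share.
  let dotProduct = 0;
  for (const [term, score] of scores1) {
    if (scores2.has(term)) dotProduct += score * scores2.get(term);
  }

  // Euclidean norm (magnitude) of each vector.
  const norm = (scores) =>
    Math.sqrt([...scores.values()].reduce((sum, v) => sum + v * v, 0));

  const denominator = norm(scores1) * norm(scores2);
  return denominator === 0 ? 0 : dotProduct / denominator; // 0..1, 1 = exact match
}
```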
- Search Engines: Improve search relevance by finding the most similar documents.
- Plagiarism Detection: Compare text similarity in academic or legal cases.
- Semantic Text Matching: Match job descriptions with resumes.
- Data retrieval from disk storage is generally slower than retrieval from RAM. The Latent Search Library improves performance by storing data in memory (RAM), enabling significantly faster access; processing times may vary with the size of the data.
- Extend Functionality: Modify the tokenization, stop-word removal, or similarity computation.
- Integrate with APIs: Connect the library to web applications for real-time text comparison.
- Enhance NLP Pipelines: Use in combination with machine learning models for semantic understanding.
This approach enhances document similarity, making information retrieval more efficient.
- Matches sentences based on the tokenized distance between the user input and the entries in the dataset, achieving a 56% evaluation score without any embedding model involvement and with less expensive NLP algorithms.
- This project is licensed under the GPL License.
- You are free to use, modify, and distribute this software under the terms of that license.