Embeddings-Search

A search algorithm that checks a sentence against a preset database of emails and returns the entries most similar in meaning. Both BERT and GloVe embeddings are implemented, and their results are compared.

Usage guidelines:

  • dataset is a global variable containing all the data the model will use, so set it first:
    dataset = set_dataset(<your_dataset>)
    Note that dataset must be an array (list) of strings.
  • Now, to run the program, run:
    unprep, prep = load_dataset_and_preprocess(dataset)
    • For BERT queries, you will have to run:
      unprep_index = build_bert_embeddings_index(unprep)
      unprep_indx_to_email = build_reference(unprep)
      (For queries on pre-processed data, build prep_index and prep_indx_to_email from prep in the same way.)
    • For GloVe queries, you will have to run:
      embeddings_dict = init_glove_embeddings()
      prep_embeddings, prep_sentence = build_glove_index(prep)
      (Likewise, build unprep_embeddings from unprep for queries on the raw data.)
  • Whenever the dataset changes, rerun the entire pipeline, since all the embedded data is pre-processed and stored in files.
  • To execute a query, call the appropriate search_ function, depending on whether you want GloVe-based or BERT-based search. Both give accurate results, so the choice is left to the user. A full end-to-end sketch follows this list.
    • BERT query (raw data):
      bert_search(query, unprep_index, unprep, unprep_indx_to_email, top_n=4, is_preprocess=False)

    • BERT query (pre-processed data):
      bert_search(query, prep_index, prep, prep_indx_to_email, top_n=4, is_preprocess=True)

    • GloVe query (raw data):
      glove_search(query, unprep_embeddings, unprep, unprep_indx_to_email, top_n=4, is_preprocess=False)

    • GloVe query (pre-processed data):
      glove_search(query, prep_embeddings, prep, prep_indx_to_email, top_n=4, is_preprocess=True)
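Putting the steps together, here is a minimal end-to-end sketch in Python. It assumes only the function names and signatures listed above; the example emails and the query string are placeholders, and the prep_index / prep_indx_to_email lines extend the BERT steps to the pre-processed data by symmetry.

    # Hypothetical three-email dataset; replace with your own list of strings.
    dataset = set_dataset([
        "Meeting moved to Friday at 3 pm, please confirm.",
        "Your invoice for March is attached.",
        "Lunch tomorrow? The usual place works for me.",
    ])

    unprep, prep = load_dataset_and_preprocess(dataset)

    # BERT: build embedding indices over the raw and pre-processed emails.
    unprep_index = build_bert_embeddings_index(unprep)
    unprep_indx_to_email = build_reference(unprep)
    prep_index = build_bert_embeddings_index(prep)   # assumed, by symmetry
    prep_indx_to_email = build_reference(prep)       # assumed, by symmetry

    # GloVe: load pretrained vectors and index the pre-processed emails.
    embeddings_dict = init_glove_embeddings()
    prep_embeddings, prep_sentence = build_glove_index(prep)

    # Query both engines for the 4 closest emails.
    query = "when is the meeting?"
    bert_search(query, unprep_index, unprep, unprep_indx_to_email, top_n=4, is_preprocess=False)
    glove_search(query, prep_embeddings, prep, prep_indx_to_email, top_n=4, is_preprocess=True)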
      

Dependencies:

These libraries are installed each time the program runs. Enabling a GPU runtime on Google Colab is also recommended to speed up the pre-processing.

  • datasets
    • Used to fetch our sample dataset. If you have a custom dataset, you can omit this.
  • nltk
  • numpy
  • pickle
  • mxnet (optional)
    • A GPU is recommended to speed up the pre-processing of your dataset; skip this if you wish.
    • Note that the CUDA builds of mxnet work only with NVIDIA graphics cards.
  • gluonnlp (optional; used together with mxnet)
  • bert-embedding
  • re (regex)
  • pdoc3 (for documentation)
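To confirm that the Colab GPU runtime is actually visible to mxnet before pre-processing, a quick check (a sketch, assuming the CUDA build of mxnet from the install steps below):

    import mxnet as mx

    # num_gpus() reports how many CUDA devices the build can see.
    n = mx.context.num_gpus()
    print(f"GPUs visible to mxnet: {n}")

    # Fall back to CPU if no GPU is available.
    ctx = mx.gpu(0) if n > 0 else mx.cpu()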

Contribution Guidelines:

Creating a PR:
  • Fork the repository to your own account
  • Clone the repository to your local system and make any changes you wish to:
    git clone https://github.com/<your_username>/Search-Engine
  • For each new feature, create a new branch named after the feature:
    git checkout -b <feature_name>
  • Comment the code properly, with all necessary comments wherever needed
  • While coding, follow the flake8 guidelines for cleaner and better code quality.
    • You can install flake8 on your system as:
      pip install flake8

    • You can run it as:
      flake8 path/to/your/code
      
  • Push the changes to your fork (the command is shown after this list)
  • Create a pull request
  • Resolve conflicts as required
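For completeness, the push step as a command; the branch name is the placeholder from the checkout step above:

    git push origin <feature_name>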
To install dependencies:
  • datasets
    pip install datasets
  • nltk
    pip install nltk
  • numpy
    pip install numpy
  • pickle (bundled with Python; the pickle5 backport is only needed on versions before 3.8)
    pip install pickle5
  • mxnet (mxnet-cu101 is the CUDA 10.1 build; use plain mxnet for CPU-only machines)
    pip install mxnet-cu101
  • gluonnlp
    pip install gluonnlp
  • bert-embedding
    pip install bert-embedding --no-deps
  • re (regex)
    pip install regex
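The same installs as a compact sequence, with pdoc3 from the dependency list included; pickle5 is omitted on the assumption of Python 3.8 or newer:

    pip install datasets nltk numpy gluonnlp regex pdoc3
    pip install mxnet-cu101            # or: pip install mxnet (CPU-only)
    pip install bert-embedding --no-deps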
