SLIDES

BLOG

WorkBuddy

WorkBuddy System Architecture

WorkBuddy helps you get to the real work by creating a search engine for company information. It leverages neural information retrieval to search for answers in an embedding space.

How does WorkBuddy work?

Algorithm:

  1. Build embeddings of the query and documents using the DSSM model (see below).
    • The model is trained with cosine similarity and a max-margin loss that separates positive/relevant answers from negative/irrelevant answers. The result of training is a scoring function that, at inference time, outputs a retrieval score for a given query and document.
  2. Cache the document embeddings in NMSLIB.
  3. Given a query, compute its embedding with the DSSM model, then perform a KNN search with NMSLIB.
  4. The result is a retrieval score for each query-document pair (see the sketch below).
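A minimal sketch of steps 2-3 in Python, assuming the document embeddings have already been computed by the trained model; the index parameters and names like doc_embeddings are illustrative, not taken from this repo:

import nmslib
import numpy as np

# Stand-in for the (n_docs, dim) embedding matrix produced by the DSSM model.
doc_embeddings = np.random.rand(1000, 128).astype(np.float32)

# Step 2: cache document embeddings in a cosine-similarity HNSW index.
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(doc_embeddings)
index.createIndex({'post': 2}, print_progress=False)

# Step 3: embed the query (stand-in vector here), then KNN search.
query_vec = np.random.rand(128).astype(np.float32)
ids, distances = index.knnQuery(query_vec, k=5)
scores = 1.0 - distances  # cosinesimil returns distances; convert to scores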

Installation

git clone https://github.com/JacobPolloreno/OfficeAnswers.git
cd OfficeAnswers

Locally with virtualenv

Assumes virtualenv is installed. If not, pip install virtualenv.

virtualenv venv
source venv/bin/activate
bash build/run_build.sh

On AWS with a Conda environment (TensorFlow + Keras, Python 3.6)

source activate <CONDA_ENV_NAME>
bash build/aws_build.sh

Steps to Run

After you run the build script, the WikiQA dataset will have been downloaded to data/raw.

The WikiQA dataset provides the main framework for learning question-answer pairs. It will be augmented with your own custom dataset, the one you want to search. [See below for how to format your custom dataset.]

Step 1: Configuration

Create a copy of the sample config file:

cd configs
cp sample.config custom.config
# edit custom.config
  • Modify line 14, "custom_corpus", with the path to your custom dataset. We recommend placing the dataset in the data/raw folder.
    • e.g. "custom_corpus": "./data/raw/custom_corpus.txt"
Step 2: Prepare and Preprocess
cd OfficeAnswers
python src/main.py configs/custom.config prepare_and_preprocess
Step 3: Train
cd OfficeAnswers
python src/cli.py configs/custom.config train
Step 4: Search
cd OfficeAnswers
python src/cli.py configs/custom.config search

Model - Deep Semantic Similarity Model (DSSM)

DSSM Architecture

The paper describing the model is here.

Why this model?

DSSM builds a scoring function for query-document pairs. The scoring function allows us to rank candidate answers.

We independently learn a representation for the query and for each candidate document, then calculate the similarity between the two estimated representations via a similarity function.

During training, the DSSM model uses a siamese network (a pairwise scenario): two pointwise networks (MLPs) share parameters, and those parameters are updated to minimize a pairwise max-margin loss, as sketched below.
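A minimal sketch of this training setup in Keras, assuming letter-trigram input vectors; the layer sizes follow the paper's diagram, and names like shared_mlp and the MARGIN value are assumptions, not taken from this repo:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

TRIGRAM_DIM = 30000  # letter n-gram space (l1 in the diagram)
MARGIN = 0.5         # max-margin hyperparameter (assumed value)

# Pointwise MLP shared by the query and document towers (siamese weights).
shared_mlp = keras.Sequential([
    layers.Dense(300, activation="tanh"),
    layers.Dense(300, activation="tanh"),
    layers.Dense(128, activation="tanh"),  # final semantic embedding
])

query_in = keras.Input(shape=(TRIGRAM_DIM,), name="query")
pos_in = keras.Input(shape=(TRIGRAM_DIM,), name="positive_doc")
neg_in = keras.Input(shape=(TRIGRAM_DIM,), name="negative_doc")

q = shared_mlp(query_in)
d_pos = shared_mlp(pos_in)
d_neg = shared_mlp(neg_in)

# Cosine similarity between the query and each document embedding.
cos = layers.Dot(axes=-1, normalize=True)
score_pos = cos([q, d_pos])
score_neg = cos([q, d_neg])

# Pairwise max-margin loss: the positive score should beat the
# negative score by at least MARGIN.
hinge = layers.Lambda(
    lambda s: tf.reduce_mean(tf.maximum(0.0, MARGIN - s[0] + s[1]))
)([score_pos, score_neg])

model = keras.Model([query_in, pos_in, neg_in], [score_pos, score_neg])
model.add_loss(hinge)
model.compile(optimizer="adam")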

Input

One query and several documents are input to the model at the same time. Only one document, the one most related to the query, is labeled positive (1); the others are labeled negative (0), i.e. not related to the query.

Queries and documents are mapped onto letter n-gram spaces instead of traditional word spaces.

An n-gram is a sequence of n letters. Typically (as seen in the diagram above), a term vector (X) over a 500K-word vocabulary can be mapped to letter n-gram vectors (l1) of size only around 30K.
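A minimal sketch of this featurization in Python, assuming the letter-trigram hashing described in the paper; the function names are illustrative:

from collections import Counter

def letter_ngrams(word, n=3):
    # Pad with boundary markers so word edges are captured.
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def featurize(text, n=3):
    # Bag-of-letter-n-grams representation of a whole string.
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_ngrams(word, n))
    return counts

print(letter_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']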

Output

For each query-document pair, the scoring function outputs a retrieval score.
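With cosine similarity as the similarity function, a minimal sketch of scoring one pair; embed_query is a hypothetical wrapper around the trained query tower:

import numpy as np

def retrieval_score(q_vec, d_vec):
    # Cosine similarity between the query and document embeddings.
    return float(np.dot(q_vec, d_vec) /
                 (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

# e.g. retrieval_score(embed_query("how are glacier caves formed ?"),
#                      doc_embeddings[i])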

Dataset

  • WikiQA dataset, which contains a set of question-sentence pairs.
    • 3,047 questions and 29,258 sentences, of which 1,473 sentences are labeled as answers to their corresponding questions.
  • Slack data dump of technical questions
  • Survey data of HR question-and-answer pairs

WikiQA was used to build the embeddings, along with the custom data (Slack + HR).

How should my data be formatted?

Raw data should be tab-separated in the following format:

<QUESTION>\t<ANSWER>\n

how are glacier caves formed ?	A partly submerged glacier cave on Perito Moreno Glacier .
how are glacier caves formed ?	The ice facade is approximately 60 m high
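A minimal sketch of reading this format, assuming one tab-separated pair per line; the function name is illustrative:

def load_pairs(path):
    # Each line is "<QUESTION>\t<ANSWER>\n"; split on the first tab only,
    # since answers may contain further whitespace.
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            question, answer = line.split("\t", 1)
            pairs.append((question, answer))
    return pairs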

Testing

cd OfficeAnswers
python -m pytest
