Workbuddy helps you get to the real work by creating a search engine for company information. It leverages neural information retrieval to search for an answer in an embedding space.
Algorithm:
- Build embeddings of the query and documents using a DSSM model (see below).
- Train the model using cosine similarity with a max-margin loss that distinguishes between positive (relevant) and negative (irrelevant) answers. Training yields a scoring function that, at inference time, outputs a retrieval score for a given query and document.
- Cache document embeddings in NMSLIB.
- Given a query, compute its embedding with the DSSM model, then perform a KNN search with NMSLIB.
- The result will be a retrieval score for each query-document pair.
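The retrieval step above can be sketched in plain NumPy. The `embed` function below is a hypothetical stand-in for the DSSM encoder (a deterministic hash-based embedding), and the document matrix stands in for the cached NMSLIB index; the real project uses NMSLIB for approximate KNN search.

```python
import zlib
import numpy as np

def embed(text, dim=128):
    # Hypothetical stand-in for the DSSM encoder: a deterministic,
    # hash-seeded random embedding, just to illustrate the pipeline shape.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# "Cache" document embeddings (NMSLIB would hold these in an ANN index).
docs = ["reset your VPN password", "book a conference room", "expense report policy"]
doc_matrix = np.stack([embed(d) for d in docs])

def knn_search(query, k=2):
    q = embed(query)
    scores = doc_matrix @ q          # cosine similarity (all vectors are unit-norm)
    top = np.argsort(-scores)[:k]    # indices of the k best-scoring documents
    return [(docs[i], float(scores[i])) for i in top]

results = knn_search("how do I reset my VPN password?")
```

Scores fall in [-1, 1] because both query and document vectors are unit-normalized, so the dot product equals the cosine similarity.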
git clone https://github.com/JacobPolloreno/OfficeAnswers.git
cd OfficeAnswers
source venv/bin/activate
bash build/run_build.sh
source activate <CONDA_ENV_NAME>
bash build/aws_build.sh
After you run the build script, the WikiQA dataset will be downloaded to data/raw.
The WikiQA dataset provides the main framework for learning question-answer pairs. It will be augmented with your own custom dataset, which you want to search. [See below to find out how to format your custom dataset.]
Create a copy of the config file:
cd configs
cp sample.config custom.config
# edit custom.config
- Modify line 14 "custom_corpus" with the path to your custom dataset. We recommend placing the dataset in the data/raw folder, e.g. "custom_corpus": "./data/raw/custom_corpus.txt"
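For example, the relevant line of custom.config might look like this (the path is illustrative; keep the rest of the file as copied from sample.config):

```json
{
    "custom_corpus": "./data/raw/custom_corpus.txt"
}
```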
cd OfficeAnswers
python src/main.py configs/custom.config prepare_and_preprocess
cd OfficeAnswers
python src/cli.py configs/custom.config train
cd OfficeAnswers
python src/cli.py configs/custom.config search
The paper describing the model is here.
Why this model?
DSSM builds a scoring function for query and document pairs. The scoring function will allow us to rank candidate answers.
We independently learn a representation for each query and candidate document, then calculate the similarity between the two estimated representations via a similarity function.
The DSSM model uses a siamese network (pairwise scenario) during training: two pointwise networks (MLPs) share parameters, and those parameters are updated to minimize a pairwise max-margin loss.
One query and several documents are input to the model at the same time. Only one document, the one most related to the query, is labeled positive (1). The other documents are all negatives (0), unrelated to the query.
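A minimal sketch of that pairwise max-margin loss, assuming unit-normalized query and document embeddings (the function name and the margin value are illustrative, not the project's exact hyperparameters):

```python
import numpy as np

def max_margin_loss(q, pos_doc, neg_docs, margin=0.5):
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    s_pos = q @ pos_doc
    # Penalize any negative document scoring within `margin` of the positive one.
    losses = [max(0.0, margin - s_pos + q @ n) for n in neg_docs]
    return float(np.mean(losses))

q = np.array([1.0, 0.0])
pos = np.array([0.8, 0.6])      # similar to q -> high cosine similarity
negs = [np.array([0.0, 1.0])]   # orthogonal to q -> zero cosine similarity
loss = max_margin_loss(q, pos, negs)   # positive already beats negative by > margin
```

When the positive document outscores every negative by at least the margin, the loss is zero; otherwise the gradient pushes the positive score up and the negative scores down.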
Queries and documents are mapped onto letter n-gram spaces instead of traditional word spaces.
An n-gram is a sequence of n letters. Typically (as seen in the diagram above), a term vector (x) over a 500K-word vocabulary can be mapped to letter n-gram vectors (l1) of only around 30K dimensions.
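A sketch of the letter-trigram word hashing DSSM uses: each word is wrapped in "#" boundary markers (as in the paper) and split into overlapping 3-letter windows.

```python
def letter_trigrams(word):
    # Wrap the word in boundary markers, then slide a 3-letter window over it.
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

trigrams = letter_trigrams("good")
# "#good#" -> ["#go", "goo", "ood", "od#"]
```

Because there are far fewer distinct letter trigrams than words, this shrinks the input dimensionality while still handling misspellings and out-of-vocabulary words gracefully.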
The scoring function outputs a single relevance score for each query-document pair.
- WikiQA dataset which contains a set of question and sentence pairs.
- 3,047 questions and 29,258 sentences in the dataset, where 1,473 sentences were labeled as answer sentences to their corresponding questions.
- Slack data dump of technical questions
- Survey data of HR question-and-answer pairs
WikiQA was used to build the embeddings along with the custom data (Slack + HR).
Raw data should be tab-separated in the following format:
<QUESTION>\t<ANSWER>\n
how are glacier caves formed ? A partly submerged glacier cave on Perito Moreno Glacier .
how are glacier caves formed ? The ice facade is approximately 60 m high
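A line in this format splits cleanly on the tab character (a minimal parsing sketch; the project's actual preprocessing lives under src/):

```python
def parse_line(line):
    # Each raw line is "<QUESTION>\t<ANSWER>\n".
    question, answer = line.rstrip("\n").split("\t", 1)
    return question, answer

q, a = parse_line(
    "how are glacier caves formed ?\tThe ice facade is approximately 60 m high\n"
)
```

The `split("\t", 1)` call splits only on the first tab, so answers that happen to contain further tab characters stay intact.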
cd OfficeAnswers
python -m pytest