We have different SwiftChat AI chatbots designed to answer user questions. We maintain a question bank with three main columns: "Question", "Expected Answer", and "Chatbot Response". We aim to measure the semantic similarity of the "Chatbot Response" to the "Expected Answer".
Before moving ahead, create a dataset containing the question bank. Here is a sample sheet, along with a script, that you can use for generating answers and as a dataset.
Sheet: Link
To conduct this evaluation, we employ three embedding models: BERT, GloVe, and Bge_large. From each of these models, we calculate two types of metrics:
- Recall: This measures the number of reference items found in the response divided by the size of the reference list.
- Precision: This calculates the number of response items found in the reference list divided by the size of the response list.
Here's how we calculate these scores using BERT:
For Recall: First, both the reference and the response are split at commas into lists. Next, we compute the cosine similarity between each reference item and every response item. If the similarity exceeds a predetermined threshold, set at 0.95, we consider it a match and record a "1" in a separate list; otherwise we record a "0". This process is repeated for each reference item, resulting in a list of "1"s and "0"s. The final score is the total number of "1"s (reference items found in the response) divided by the length of the reference list.
For Precision: As with Recall, we split both the reference and the response into lists. We compute the cosine similarity between each response item and every reference item. If the similarity exceeds the 0.95 threshold, we record a "1" in a separate list; otherwise a "0". This is repeated for each response item, resulting in a list of "1"s and "0"s. The final Precision score is the total number of "1"s (response items found in the reference) divided by the length of the response list.
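As an illustration, here is a minimal sketch of this matching logic. It assumes the bert-base-uncased checkpoint, mean pooling over the last hidden state, and cosine similarity via scipy; the repo's main.py may differ in these details.

import torch
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cosine

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Mean-pooled BERT embedding of a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def bert_scores(reference, response, threshold=0.95):
    """Return (recall, precision) for comma-separated reference/response strings."""
    refs = [r.strip() for r in reference.split(",") if r.strip()]
    resps = [r.strip() for r in response.split(",") if r.strip()]
    ref_emb = [embed(r) for r in refs]
    resp_emb = [embed(r) for r in resps]

    # Recall: a reference item counts as found if any response item exceeds the threshold.
    recall_hits = [
        1 if any(1 - cosine(re, pe) > threshold for pe in resp_emb) else 0
        for re in ref_emb
    ]
    # Precision: a response item counts as found if any reference item exceeds the threshold.
    precision_hits = [
        1 if any(1 - cosine(pe, re) > threshold for re in ref_emb) else 0
        for pe in resp_emb
    ]
    recall = sum(recall_hits) / len(refs) if refs else 0.0
    precision = sum(precision_hits) / len(resps) if resps else 0.0
    return recall, precision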
Now, for the GloVe embedding model:
For Recall: We calculate the score in the same way as the BERT Recall, but this time using GloVe embeddings. In this case, we use a cosine similarity threshold of 0.87.
For Precision: Likewise, we calculate the score as in the BERT Precision, using GloVe embeddings with a cosine similarity threshold of 0.87.
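Conceptually, the only change from the BERT sketch above is the embedding function. Here is a minimal sketch, assuming sentence vectors are built by averaging the glove.6B.100d.txt word vectors (out-of-vocabulary words are skipped) and compared at the 0.87 threshold:

import numpy as np
from scipy.spatial.distance import cosine

def load_glove(path="glove_data/glove.6B.100d.txt"):
    """Load GloVe word vectors into a dict: word -> 100-d numpy array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def glove_embed(text, vectors, dim=100):
    """Average the GloVe vectors of the words in text (zeros if none are found)."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)

# Example match test between one reference item and one response item:
# glove = load_glove()
# sim = 1 - cosine(glove_embed(ref_item, glove), glove_embed(resp_item, glove))
# hit = 1 if sim > 0.87 else 0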
Now, for the Bge_large embedding model:
For Recall: We calculate the score as in the BERT Recall, with a threshold of 0.87, but this time using Bge_large. The model's full name on Hugging Face is BAAI/bge-large-en-v1.5. In this case, we first load the model with sentence-transformers, encode the text with it, and then take the dot product of the encoded vectors to produce the similarity score.
For Precision: Likewise, we calculate the score as in the BERT Precision, using Bge_large embeddings with a threshold of 0.87.
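Here is a minimal sketch of this scoring, assuming the sentence-transformers loading described above; with normalize_embeddings=True the dot product equals the cosine similarity, which is compared against the 0.87 threshold:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def bge_recall_precision(reference, response, threshold=0.87):
    """Recall/Precision for comma-separated reference/response strings."""
    refs = [r.strip() for r in reference.split(",") if r.strip()]
    resps = [r.strip() for r in response.split(",") if r.strip()]
    ref_emb = model.encode(refs, normalize_embeddings=True)
    resp_emb = model.encode(resps, normalize_embeddings=True)
    sims = ref_emb @ resp_emb.T  # (n_refs, n_resps) similarity matrix
    recall = float((sims.max(axis=1) > threshold).sum()) / len(refs)
    precision = float((sims.max(axis=0) > threshold).sum()) / len(resps)
    return recall, precision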
Bleu: We can also compute a Bleu metric with 2-grams (matching two-word sequences).
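A minimal sketch using the Hugging Face evaluate library (installed below), where max_order=2 restricts matching to unigrams and bigrams; the example strings are placeholders:

import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["the chatbot response text"],
    references=[["the expected answer text"]],
    max_order=2,
)
print(result["bleu"])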
Bleurt: We can also compute a Bleurt metric. We use the Bleurt model with different checkpoints, such as bleurt-large-512 and BLEURT-20; for the four other available checkpoints you can refer to this file, where every checkpoint is described.
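Here is a minimal sketch of Bleurt scoring against a downloaded checkpoint such as BLEURT-20 (see Step 6 below). It assumes the google-research bleurt package is installed in addition to the dependencies listed below:

from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")  # path to the extracted checkpoint folder
scores = scorer.score(
    references=["the expected answer text"],
    candidates=["the chatbot response text"],
)
print(scores)  # one float per (reference, candidate) pair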
pip3 install pandas
pip3 install torch
pip3 install numpy
pip3 install transformers
pip3 install scipy
pip3 install scikit-learn
pip3 install evaluate
pip3 install datasets==2.10.0
pip3 install -U sentence-transformers
Demo Video for reference.
Step 1
You need to clone this repo or download it directly. To clone it, you can use the following command.
git clone https://github.com/madgicaltechdom/SwiftChat-AI-Chatbot-Testing.git
Step 2
Then open the cloned folder in VS Code.
Step 3
Then add your CSV file to the data folder.
Step 4
Then you need to download the GloVe file; you can use this link. After downloading the zip file, extract it and place the glove.6B.100d.txt file in the glove_data folder of the working directory.
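If the link is unavailable, the same archive can usually be fetched from the Stanford NLP site (an assumption, not part of the original instructions):
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip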
Step 5
Then go to the main.py file. Add the reference column name on line 19, the model response column name on line 20, and the metric type on line 21.
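The exact variable names on those lines depend on the repo's main.py; purely as a hypothetical illustration, the edit might look like this, using the column names from the question bank above:

reference_column = "Expected Answer"   # line 19: reference column name
response_column = "Chatbot Response"   # line 20: model response column name
metric = "Bert_metric"                 # line 21: metric type, e.g. Bert_metric, Glove_metric, Bleu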
Step 6
If you are using a Bleurt model, you first need to download the desired checkpoint and save it in the working directory.
For example, for BLEURT-20 you can use this command.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
Or for bleurt-large-512 you can use this command.
wget https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip
unzip bleurt-large-512.zip
Or you can download it directly from this file.
Then define the checkpoint name in the main.py file on line 42.
Step 7
Then run this command in the terminal.
python main.py
The script generates an output CSV file named response.csv with additional columns containing the evaluation scores for the chosen metric.
The additional columns are as follows:
- If we choose Bert_metric:
  - Precision scores Bert: the Precision score using BERT.
  - Recall scores Bert: the Recall score using BERT.
- If we choose Glove_metric:
  - Precision scores Glove: the Precision score using GloVe.
  - Recall scores Glove: the Recall score using GloVe.
- If we choose Bge_large_metric:
  - Precision scores Bge_large: the Precision score using Bge_large.
  - Recall scores Bge_large: the Recall score using Bge_large.
- If we choose Bleu:
  - Bleu Score: the Bleu score.
- If we choose Bleurt:
  - Bleurt Score: the Bleurt score.
I took information from this article. For the bge_large embedding: https://huggingface.co/BAAI/bge-large-en-v1.5