Duplicate Question Pair (Quora) Classification Using NLP

Overview

This project was developed for a Kaggle competition hosted by Quora. The objective is to determine whether two questions have the same semantic meaning, so that similar questions can be grouped together. The competition involved extensive feature engineering, requiring the creation of 21 custom features from the provided dataset.

Dataset Overview

The dataset for this project was provided by Quora for the competition. It consists of three columns:

question1 (str)	question2 (str)	isDuplicate (t/f)
...	...	...

Link to Dataset

Pre-Processing

Replacing special characters
Removing some characters that aren't in Unicode
Converting numbers to their string names
Expanding short forms (contractions) into their long forms
Removing HTML tags
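
The steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name `preprocess`, the three lookup tables, and the reading of "characters that aren't in Unicode" as "non-ASCII characters" are all assumptions made for the example.

```python
import html
import re

# Hypothetical, deliberately tiny lookup tables (assumptions, not the project's lists).
SPECIAL_CHARS = {"%": " percent ", "$": " dollar ", "@": " at ", "&": " and "}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "what's": "what is", "i'm": "i am"}
NUMBER_NAMES = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
                "6": "six", "7": "seven", "8": "eight", "9": "nine", "10": "ten"}


def preprocess(question: str) -> str:
    # Decode entities such as &amp;, then remove HTML tags.
    text = re.sub(r"<[^>]+>", " ", html.unescape(question))
    # Normalise curly apostrophes so contractions survive the ASCII filter below.
    text = text.replace("\u2019", "'").lower()
    # Assumption: drop every non-ASCII character.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace special characters with their names.
    for char, name in SPECIAL_CHARS.items():
        text = text.replace(char, name)
    # Expand known contractions; unknown ones are left untouched.
    text = re.sub(r"[a-z]+'[a-z]+", lambda m: CONTRACTIONS.get(m.group(), m.group()), text)
    # Spell out standalone numbers found in the lookup table; others stay as digits.
    text = re.sub(r"\b\d+\b", lambda m: NUMBER_NAMES.get(m.group(), m.group()), text)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


print(preprocess("<b>What's</b> 2 % of $10?"))  # what is two percent of dollar ten?
```

A real pipeline would need much larger contraction and number tables; the order of the steps matters, since tags and apostrophes have to be handled before the character filter runs.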

Custom Features Overview

Advanced Features

Token Features

cwc_min: This is the ratio of the number of common words to the length of the smaller question
cwc_max: This is the ratio of the number of common words to the length of the larger question
csc_min: This is the ratio of the number of common stop words to the smaller stop word count among the two questions
csc_max: This is the ratio of the number of common stop words to the larger stop word count among the two questions
ctc_min: This is the ratio of the number of common tokens to the smaller token count among the two questions
ctc_max: This is the ratio of the number of common tokens to the larger token count among the two questions
last_word_eq: 1 if the last word in the two questions is the same, 0 otherwise
first_word_eq: 1 if the first word in the two questions is the same, 0 otherwise
Total 8
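
As an illustration, the eight token features could be computed along these lines. This is a sketch under stated assumptions: "words" is read as non-stop-word tokens, "tokens" as all whitespace-separated tokens, and the small `STOP_WORDS` set is a hypothetical stand-in for whatever stop word list the project uses. The helper names `safe_ratio` and `token_features` are also made up for the example.

```python
# Hypothetical stop word list, kept short for the example.
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "in", "to", "what", "how", "do", "i"})

FEATURE_NAMES = ("cwc_min", "cwc_max", "csc_min", "csc_max",
                 "ctc_min", "ctc_max", "last_word_eq", "first_word_eq")


def safe_ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of dividing by zero.
    return numerator / denominator if denominator else 0.0


def token_features(q1: str, q2: str) -> dict:
    t1, t2 = q1.lower().split(), q2.lower().split()
    if not t1 or not t2:
        # No tokens to compare: every feature is zero.
        return {name: 0.0 for name in FEATURE_NAMES}

    words1 = {t for t in t1 if t not in STOP_WORDS}  # non-stop words
    words2 = {t for t in t2 if t not in STOP_WORDS}
    stops1 = {t for t in t1 if t in STOP_WORDS}      # stop words
    stops2 = {t for t in t2 if t in STOP_WORDS}

    common_words = len(words1 & words2)
    common_stops = len(stops1 & stops2)
    common_tokens = len(set(t1) & set(t2))

    return {
        "cwc_min": safe_ratio(common_words, min(len(words1), len(words2))),
        "cwc_max": safe_ratio(common_words, max(len(words1), len(words2))),
        "csc_min": safe_ratio(common_stops, min(len(stops1), len(stops2))),
        "csc_max": safe_ratio(common_stops, max(len(stops1), len(stops2))),
        "ctc_min": safe_ratio(common_tokens, min(len(t1), len(t2))),
        "ctc_max": safe_ratio(common_tokens, max(len(t1), len(t2))),
        "last_word_eq": int(t1[-1] == t2[-1]),
        "first_word_eq": int(t1[0] == t2[0]),
    }


features = token_features("what is the capital of india",
                          "what is the capital city of india")
print(features["cwc_min"], features["last_word_eq"])  # 1.0 1
```

The min variants reward a short question that is fully contained in a longer one, while the max variants penalise the extra words in the longer question.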

Length Based Features

mean_len: Mean of the length of the two questions (number of words)
abs_len_diff: Absolute difference between the length of the two questions (number of words)
longest_substr_ratio: Ratio of the length of the longest common substring of the two questions to the length of the smaller question
Total 3
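
A possible way to compute these three features is sketched below. It uses `difflib.SequenceMatcher` from the standard library to find the longest common substring, which may differ from the library the project relies on. Measuring the substring and the "smaller question" in characters is an assumption, and the function name `length_features` is hypothetical.

```python
from difflib import SequenceMatcher


def length_features(q1: str, q2: str) -> dict:
    t1, t2 = q1.split(), q2.split()

    # Longest contiguous block of characters shared by both questions.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))

    # Assumption: the "smaller question" is measured in characters here.
    shorter = min(len(q1), len(q2))

    return {
        "mean_len": (len(t1) + len(t2)) / 2,      # in words
        "abs_len_diff": abs(len(t1) - len(t2)),   # in words
        "longest_substr_ratio": match.size / shorter if shorter else 0.0,
    }


print(length_features("how do i learn python", "how do i learn java"))
```

In this example the two questions share the 15-character prefix "how do i learn " and the shorter question has 19 characters, so the ratio is 15/19.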

Fuzzy Features

fuzz_ratio: fuzz_ratio score from fuzzywuzzy
fuzz_partial_ratio: fuzz_partial_ratio from fuzzywuzzy
token_sort_ratio: token_sort_ratio from fuzzywuzzy
token_set_ratio: token_set_ratio from fuzzywuzzy
Total 4
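
To give a feel for what these scores measure, here is a rough standard-library approximation of a plain ratio and a token-sort ratio. It is not fuzzywuzzy's implementation and its scores may differ from the library's; the function names `approx_ratio` and `approx_token_sort_ratio` are hypothetical. The partial and token-set variants are more involved and are left out.

```python
from difflib import SequenceMatcher


def approx_ratio(s1: str, s2: str) -> int:
    # Similarity of the two strings as a whole, scaled to 0..100.
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def approx_token_sort_ratio(s1: str, s2: str) -> int:
    # Sort the tokens first so that word order no longer affects the score.
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    return approx_ratio(sort_tokens(s1), sort_tokens(s2))


print(approx_ratio("learn python", "python learn"))
print(approx_token_sort_ratio("learn python", "python learn"))
```

The second call scores the pair as identical because both strings contain the same tokens, which is the property that makes token-based scores useful for reworded questions.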

Basic Features

Length of each question
Total number of words
Count of common words
Word share: ratio of common words to total words
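
A minimal sketch of these basic features follows. The feature names, the choice to measure question length in characters, and the use of unique lowercase words for the common and total counts are assumptions for illustration; the README does not spell them out.

```python
def basic_features(q1: str, q2: str) -> dict:
    # Unique lowercase words in each question.
    w1 = set(q1.lower().split())
    w2 = set(q2.lower().split())

    common = len(w1 & w2)
    total = len(w1) + len(w2)

    return {
        "q1_len": len(q1),                  # characters (assumption)
        "q2_len": len(q2),
        "q1_num_words": len(q1.split()),
        "q2_num_words": len(q2.split()),
        "word_common": common,
        "word_total": total,
        "word_share": common / total if total else 0.0,
    }


print(basic_features("what is ai", "what is ml"))
```

Here the two questions share 2 of their 6 unique words in total, giving a word share of about 0.33.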

How to Run

Follow these steps to run the project locally:

Clone the repository:

git clone https://github.com/srikharshashi/quora_duplicate_pair_classification
cd quora_duplicate_pair_classification

Install dependencies:

pip install -r requirements.txt

Run with Streamlit:

streamlit run infer.py

Screenshots

[Leave empty for future insertion of images]

Accuracy

The project achieved the following accuracies on the test set:

Model	Accuracy
Random Forest	78.79%
XGBoost	80.1%
