This project was developed for a Kaggle competition hosted by Quora. The objective is to determine if two questions have the same semantic meaning, with the goal of grouping similar questions together. The competition involved extensive feature engineering, requiring the creation of 21 custom features from the provided dataset.
The dataset for this project was provided by Quora for the competition. It consists of three columns:
question1 (str) | question2 (str) | isDuplicate (t/f) |
---|---|---|
... | ... | ... |
- Replacing special characters
- Removing some characters that aren't in Unicode
- Reducing numbers to actual string names
- Replacing short forms with their long forms
- Removing HTML tags
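
The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `preprocess`, the two lookup tables, the number handling, and the choice to drop non-ASCII characters are all assumptions made for the example.

```python
import re

# Hypothetical lookup tables; the project's real mappings are not shown here.
SPECIAL_CHARS = {"%": " percent ", "$": " dollar ", "@": " at ", "&": " and "}
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "what's": "what is",
    "i'm": "i am",
}


def preprocess(question: str) -> str:
    """Normalize a question string before feature extraction."""
    q = question.lower().strip()

    # Remove HTML tags.
    q = re.sub(r"<[^>]+>", " ", q)

    # Replace special characters with words.
    for char, word in SPECIAL_CHARS.items():
        q = q.replace(char, word)

    # Spell out round thousands and millions (one assumed reading of
    # "reducing numbers to string names").
    q = re.sub(r"(\d+),000,000\b", r"\1 million", q)
    q = re.sub(r"(\d+),000\b", r"\1 thousand", q)

    # Drop characters that cannot be represented in ASCII (assumption).
    q = q.encode("ascii", "ignore").decode("ascii")

    # Expand short forms token by token; split/join also collapses whitespace.
    return " ".join(CONTRACTIONS.get(token, token) for token in q.split())
```

For example, `preprocess("<b>What's</b> 5,000 $?")` returns `"what is 5 thousand dollar ?"`.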
- cwc_min: This is the ratio of the number of common words to the length of the smaller question
- cwc_max: This is the ratio of the number of common words to the length of the larger question
- csc_min: This is the ratio of the number of common stop words to the smaller stop word count among the two questions
- csc_max: This is the ratio of the number of common stop words to the larger stop word count among the two questions
- ctc_min: This is the ratio of the number of common tokens to the smaller token count among the two questions
- ctc_max: This is the ratio of the number of common tokens to the larger token count among the two questions
- last_word_eq: 1 if the last word in the two questions is the same, 0 otherwise
- first_word_eq: 1 if the first word in the two questions is the same, 0 otherwise
- Total 8
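
A rough sketch of how these eight token features might be computed is shown below. Several details are assumptions rather than the project's confirmed method: "words" are taken to mean tokens that are not stop words, the small `STOP_WORDS` set stands in for a full stop word list, and `EPS` is a hypothetical constant added to each denominator to avoid division by zero.

```python
# Hypothetical stop word list; a real project would likely use a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "in", "of", "to"}
EPS = 1e-4  # assumed small constant that keeps denominators non-zero

FEATURE_NAMES = [
    "cwc_min", "cwc_max", "csc_min", "csc_max",
    "ctc_min", "ctc_max", "last_word_eq", "first_word_eq",
]


def token_features(q1: str, q2: str) -> dict:
    """Compute the eight token-overlap features for one question pair."""
    t1, t2 = q1.split(), q2.split()
    if not t1 or not t2:
        return dict.fromkeys(FEATURE_NAMES, 0.0)

    # Split each question into non-stop words and stop words.
    w1 = {t for t in t1 if t not in STOP_WORDS}
    w2 = {t for t in t2 if t not in STOP_WORDS}
    s1 = {t for t in t1 if t in STOP_WORDS}
    s2 = {t for t in t2 if t in STOP_WORDS}

    common_words = len(w1 & w2)
    common_stops = len(s1 & s2)
    common_tokens = len(set(t1) & set(t2))

    return {
        "cwc_min": common_words / (min(len(w1), len(w2)) + EPS),
        "cwc_max": common_words / (max(len(w1), len(w2)) + EPS),
        "csc_min": common_stops / (min(len(s1), len(s2)) + EPS),
        "csc_max": common_stops / (max(len(s1), len(s2)) + EPS),
        "ctc_min": common_tokens / (min(len(t1), len(t2)) + EPS),
        "ctc_max": common_tokens / (max(len(t1), len(t2)) + EPS),
        "last_word_eq": int(t1[-1] == t2[-1]),
        "first_word_eq": int(t1[0] == t2[0]),
    }
```

The function expects questions that have already been lowercased and cleaned, so that token comparisons are not affected by case or punctuation.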
- mean_len: Mean of the length of the two questions (number of words)
- abs_len_diff: Absolute difference between the length of the two questions (number of words)
- longest_substr_ratio: Ratio of the length of the longest common substring of the two questions to the length of the smaller question
- Total 3
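
The three length features could be computed along these lines. This sketch uses the standard library's `difflib.SequenceMatcher` to find the longest common substring, which may differ from what the project uses. It also assumes that the substring ratio is measured in characters, while the other two features count words as stated above.

```python
from difflib import SequenceMatcher


def length_features(q1: str, q2: str) -> dict:
    """Compute mean_len, abs_len_diff and longest_substr_ratio for a pair."""
    n1, n2 = len(q1.split()), len(q2.split())

    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long strings.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))

    # Assumption: the "smaller question" is the one with fewer characters.
    shorter = min(len(q1), len(q2))

    return {
        "mean_len": (n1 + n2) / 2,
        "abs_len_diff": abs(n1 - n2),
        "longest_substr_ratio": match.size / shorter if shorter else 0.0,
    }
```

For the pair `"how do i learn python"` and `"how do i learn java quickly"`, the longest common substring is `"how do i learn "` (15 characters), giving a ratio of 15/21.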
- fuzz_ratio: fuzz_ratio score from fuzzywuzzy
- fuzz_partial_ratio: fuzz_partial_ratio from fuzzywuzzy
- token_sort_ratio: token_sort_ratio from fuzzywuzzy
- token_set_ratio: token_set_ratio from fuzzywuzzy
- Total 4
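
As a minimal sketch, the four fuzzy features map onto four scoring functions in fuzzywuzzy's `fuzz` module, each returning an integer score from 0 to 100. The wrapper name `fuzzy_features` is hypothetical, and the example assumes the `fuzzywuzzy` package is installed.

```python
def fuzzy_features(q1: str, q2: str) -> dict:
    """Compute the four fuzzywuzzy similarity scores for a question pair."""
    # Imported inside the function so that importing this module does not
    # fail when fuzzywuzzy is missing; the call itself still requires it.
    from fuzzywuzzy import fuzz

    return {
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
    }
```

Two identical questions score 100 on all four measures, while the token-based scores are less sensitive to word order than `fuzz_ratio`.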
- Length of each question
- Total number of words
- Count of common words
- Word share: ratio of common words to total words
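
These basic features might look like the sketch below. The feature names and the exact definitions are assumptions: length is measured in characters, words are counted as unique lowercase tokens, and word share is the common-word count divided by the combined word count of both questions.

```python
def basic_features(q1: str, q2: str) -> dict:
    """Compute simple length and word-overlap features for a question pair."""
    # Unique lowercase tokens in each question.
    w1 = set(q1.lower().split())
    w2 = set(q2.lower().split())

    common = len(w1 & w2)
    total = len(w1) + len(w2)

    return {
        "q1_len": len(q1),  # length in characters
        "q2_len": len(q2),
        "word_total": total,
        "word_common": common,
        "word_share": common / total if total else 0.0,
    }
```

With this definition, two identical questions have a word share of 0.5, since every common word is counted once in the numerator and twice in the denominator.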
Follow these steps to run the project locally:
- Clone the repository:
  git clone https://github.com/srikharshashi/quora_duplicate_pair_classification
  cd quora_duplicate_pair_classification
- Install dependencies:
  pip install -r requirements.txt
- Run with Streamlit:
  streamlit run infer.py
The project achieved the following accuracies on the test set:
Model | Accuracy |
---|---|
Random Forest | 78.79% |
XGBoost | 80.1% |