Programming Tag Generator from Quora Questions

Project Overview
This project aims to process and analyze a large dataset of questions sourced from Quora, specifically focusing on programming-related questions. The goal is to extract meaningful tags (or keywords) from each question, including both its title and body, using Natural Language Processing (NLP) techniques and machine learning models. These tags can then be used to categorize and index questions effectively, enhancing searchability and organization.

The project involves cleaning, preprocessing, and transforming the raw text data into useful features, followed by tag generation using both statistical methods (e.g., TF-IDF) and machine learning approaches. It specifically handles questions that may include programming code and aims to preserve relevant keywords while excluding unnecessary information such as HTML tags.

#Key Features

Data Preprocessing:

Cleaning raw text by removing HTML tags, special characters, and unnecessary code blocks.
Tokenizing and stemming words for more efficient text processing.
Removing stopwords to retain only meaningful keywords.

Handling Programming Code:

Detecting and extracting programming code segments embedded in questions.
Separating code from the question content to ensure accurate tag extraction.
Counting and analyzing questions containing code.

TF-IDF Vectorization:

Converting the cleaned questions into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer.
Using n-grams (unigrams, bigrams, trigrams) to capture both individual words and multi-word phrases.

Tag Extraction:

Generating tags based on keyword frequency across the entire dataset.
Employing multilabel classification techniques to assign the most relevant tags to each question.
Selecting the top n most frequent tags for a more focused analysis.

Performance Metrics:

Calculating and reporting statistics such as the average length of questions before and after processing.
Measuring the percentage of questions that contain programming code.
Evaluating how many questions are explained by the top n tags.

Technologies Used

Python: Primary programming language.
NLTK: Natural Language Toolkit for text processing.
scikit-learn: For machine learning, specifically for TF-IDF vectorization and multilabel classification.
SQL: For storing processed data (optional).
pandas: For data manipulation and handling.
NumPy: For numerical operations.
re: Regular expressions for text cleaning.

Dataset

The dataset used in this project consists of questions and answers from Quora, with a focus on programming-related content. Each entry in the dataset includes the following fields:

Title: The title of the question.
Question: The body of the question, which may include embedded code.
Tags: Pre-existing tags or labels for the question.

Future Enhancements

Improve the tag recommendation system by training more advanced machine learning models like BERT or GPT.
Enhance code snippet extraction and handling to better categorize questions with significant code content.
Incorporate deep learning techniques to automatically generate more sophisticated tags based on question content.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md
fb_tags_quora.ipynb		fb_tags_quora.ipynb
lr_with_equal_weight.pkl		lr_with_equal_weight.pkl
tag.png		tag.png
tag_counts_dict_dtm.csv		tag_counts_dict_dtm.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Programming Tag Generator from Quora Questions

Data Preprocessing:

Handling Programming Code:

TF-IDF Vectorization:

Tag Extraction:

Performance Metrics:

Technologies Used

Dataset

Future Enhancements

License

About

Releases

Packages

Languages

License

mihirupadhyay/Programming-Tag-Generator

Folders and files

Latest commit

History

Repository files navigation

Programming Tag Generator from Quora Questions

Data Preprocessing:

Handling Programming Code:

TF-IDF Vectorization:

Tag Extraction:

Performance Metrics:

Technologies Used

Dataset

Future Enhancements

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages