News Classification Dataset Data Source: https://www.kaggle.com/amananandrai/ag-news-classification-dataset/notebooks
The News Classification dataset consists of news articles of four classes: "World", "Sports", "Business", and "Sci/Tech".
Given a title and a description, we have to determine which news category the article belongs to.
Since we predict one of several categories from the given information, this is a multiclass classification problem.
Data.shape: Train.csv + Test.csv = 120000 + 7600 = 127600 rows.
Data.columns: Class Index, Title, Description
Data.info(): Independent: Title, Description ---> object ; Dependent: Class Index ---> int64
As this is a multiclass classification problem, we are going to use:
1: Multiclass confusion matrix
2: Precision, Recall, F1-score
3: Accuracy score, Error score (1 - accuracy)
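The metrics above can be sketched with scikit-learn; the `y_true` / `y_pred` arrays here are placeholders standing in for real model output, and the class indices 1-4 follow the dataset's Class Index column.

```python
# Sketch: multiclass evaluation with scikit-learn on placeholder predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [1, 2, 3, 4, 1, 2, 3, 4]   # true class indices (placeholder)
y_pred = [1, 2, 3, 4, 1, 2, 4, 3]   # hypothetical predictions

# 1: 4x4 multiclass confusion matrix (rows = true class, cols = predicted)
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

# 2: per-class precision / recall / F1 (average=None -> one value per class)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 2, 3, 4], average=None, zero_division=0)

# 3: accuracy and error score
acc = accuracy_score(y_true, y_pred)
err = 1 - acc
print(cm)
print(acc, err)   # -> 0.75 0.25 for this toy example
```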
1: Load the dataset ----> .csv format
2: Perform Exploratory Data Analysis:
a] Check whether the dataset has a balanced distribution for each news label
b] Check for null values
c] Plot the distribution of data points among the news labels.
d] Use word clouds to observe the most frequent words in each class.
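A minimal sketch of steps a]-c] with pandas, using a tiny in-memory stand-in for train.csv (the column names follow the AG News layout):

```python
# Sketch: loading and basic EDA checks on a toy stand-in for train.csv.
import pandas as pd

df = pd.DataFrame({
    "Class Index": [1, 2, 3, 4, 1, 2],
    "Title": ["UN meets", "Cup final", "Stocks rise", "New chip",
              "Peace talks", "Match recap"],
    "Description": ["..."] * 6,
})
# In the real project: df = pd.read_csv("train.csv")

print(df.isnull().sum())                 # b] null-value check per column
print(df["Class Index"].value_counts())  # a]/c] class balance / distribution
# d] word clouds would use e.g. the `wordcloud` package per class (omitted here)
```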
Pre-processing:
1: Expand contractions and replace symbols ---> .replace('%', 'percent'), .replace('$', 'dollar')
2: Remove HTML tags, links, URLs.
3: Remove punctuation.
4: Remove stop words ----> (is, the, are .....)
5: Perform stemming to reduce different inflected forms of a word (e.g. "play", "playing", "played") to a single root word.
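The five steps above can be sketched as one function; this is an illustrative pipeline assuming NLTK's PorterStemmer is available, and the stop-word set here is a tiny sample rather than a full list.

```python
# Sketch of the preprocessing pipeline: symbol replacement, HTML/URL removal,
# punctuation removal, stop-word removal, and stemming.
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"is", "the", "are", "a", "an", "and", "of", "to"}  # sample set
stemmer = PorterStemmer()

def preprocess(text):
    text = text.replace("%", " percent ").replace("$", " dollar ")  # step 1
    text = re.sub(r"<[^>]+>", " ", text)          # step 2: strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # step 2: strip links/URLs
    text = re.sub(r"[^a-zA-Z ]", " ", text)       # step 3: drop punctuation
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]  # step 4
    return " ".join(stemmer.stem(w) for w in tokens)  # step 5: stemming

print(preprocess("Stocks are <b>rising</b> 5% today! http://x.co the markets"))
```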
Divide the dataset into 3 parts, Dtrain, Dval, and Dtest, in a 60:20:20 ratio.
Dtrain: the largest portion of the data, used for training so the model can learn from it.
Dval: after training on Dtrain, we validate on Dval to check whether the model has learned properly (and to tune hyper-parameters).
Dtest: unseen data, used only for the final evaluation.
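The 60:20:20 split can be done with two calls to scikit-learn's `train_test_split`; the `X`/`y` arrays below are placeholders (note the second call uses 0.25, since 25% of the remaining 80% equals 20% overall):

```python
# Sketch: 60/20/20 train/val/test split via two stratified splits.
from sklearn.model_selection import train_test_split

X = list(range(100))             # placeholder features
y = [i % 4 for i in range(100)]  # placeholder labels (4 classes)

# First split off 20% as the unseen test set (Dtest)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
# Then split the remaining 80% into train/val (0.25 * 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # -> 60 20 20
```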
For creating a model:
1: TF-IDF
2: Uni-grams, bi-grams, n-grams
3: Selecting max_features for the vectorizer
https://niharjamdar.medium.com/tf-idf-term-frequency-and-inverse-document-frequency-56a0289d2fb6
After applying these NLP techniques to convert words into vectors,
we will select the features by tuning max_features and max_df as hyper-parameters.
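A sketch of this featurization step with scikit-learn's `TfidfVectorizer`; the corpus and hyper-parameter values here are illustrative, not tuned.

```python
# Sketch: TF-IDF featurization with uni-grams and bi-grams.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks rise on strong earnings",
    "team wins the cup final",
    "new chip boosts phone speed",
]

vec = TfidfVectorizer(
    ngram_range=(1, 2),   # uni-grams and bi-grams
    max_features=5000,    # keep only the top-k terms by frequency
    max_df=0.95,          # drop terms appearing in >95% of documents
)
X = vec.fit_transform(docs)
print(X.shape)            # sparse (n_docs, n_features) TF-IDF matrix
```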
The more features we keep, the more the model can learn from; after vectorization, apply a machine learning algorithm:
1: Logistic Regression
2: Decision Tree
3: Stochastic Gradient Descent
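The three classifiers above can be fitted on the TF-IDF features as follows; the toy corpus and labels are placeholders for the real preprocessed splits, and `SGDClassifier` here is scikit-learn's SGD-trained linear classifier.

```python
# Sketch: fitting Logistic Regression, Decision Tree, and SGD classifiers
# on TF-IDF features from a toy stand-in corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

docs = ["stocks rise", "cup final win", "un peace talks", "new chip launch",
        "markets fall", "match result", "election vote", "rocket launch"]
labels = [3, 2, 1, 4, 3, 2, 1, 4]   # placeholder Class Index labels

X = TfidfVectorizer().fit_transform(docs)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "sgd": SGDClassifier(random_state=0),   # linear model trained with SGD
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))     # training accuracy on toy data
```

In the real project each model would be trained on Dtrain, tuned on Dval, and compared on Dtest using the metrics listed earlier.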