
KAGGLE.

All or most of the projects I have uploaded on Kaggle will be collected here.

YOU CAN CHECK OUT MY KAGGLE AT:


https://www.kaggle.com/darshanprabhu09


N1 | Cracking the Code of Accuracy | Unveiling MSE, MAE, RMSE :

https://www.kaggle.com/code/darshanprabhu09/mse-mae-rmse-for-the-housing-dataset

I am happy to announce, and to ask for your reviews of, my very first implementation of metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Squared Error (MSE) in Python. I used a housing dataset of 5,000 records. Below I also share the statistical and theoretical knowledge I gained along the way:

RMSE (Root mean squared error)

Root mean square error or root mean square deviation is one of the most commonly used measures 
for evaluating the quality of predictions. It shows how far predictions fall from measured
true values using Euclidean distance.

To compute RMSE, calculate the residual (the difference between prediction and truth) for each
data point, square it, compute the mean of the squared residuals, and take the square root of
that mean. RMSE is commonly used in supervised learning applications, as it requires true
measurements at every predicted data point.

Root mean square error can be expressed as

RMSE = sqrt( (1/N) * Σ_i ( ŷ(i) − y(i) )² )

where N is the number of data points, y(i) is the i-th measurement, and ŷ(i) is its corresponding prediction.

Note: RMSE is NOT scale invariant and hence comparison of models using this 
measure is affected by the scale of the data. For this reason, RMSE is
commonly used over standardized data.

Mean Absolute Error (MAE):

In the context of machine learning, absolute error refers to the magnitude
of difference between the prediction of an observation and the true value 
of that observation. MAE takes the average of absolute errors for a group 
of predictions and observations as a measurement of the magnitude of 
errors for the entire group. MAE is also referred to as the L1 loss function.

Mean Squared Error (MSE):

MSE is the average squared difference between the estimated values and the actual
values. It is a risk function, corresponding to the expected value of the
squared error loss. The fact that MSE is almost always strictly positive
(and not zero) is because of randomness or because the estimator does not
account for information that could produce a more accurate estimate.
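For illustration, here is a minimal NumPy sketch of how these three metrics can be computed (the true values and predictions below are made-up numbers, not the notebook's actual data):

import numpy as np

# Hypothetical true values and model predictions, for illustration only.
y_true = np.array([250000.0, 310000.0, 180000.0, 420000.0])
y_pred = np.array([245000.0, 330000.0, 175000.0, 400000.0])

residuals = y_pred - y_true          # prediction errors
mse = np.mean(residuals ** 2)        # Mean Squared Error
mae = np.mean(np.abs(residuals))     # Mean Absolute Error
rmse = np.sqrt(mse)                  # Root Mean Squared Error

print(f"MSE:  {mse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")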

N2 | Titans in graphs | Exploring the Time Series :

https://www.kaggle.com/code/darshanprabhu09/n2-titans-in-graphs-exploring-the-time-series/notebook

Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at consistent intervals over a set period of time rather than just recording the data points intermittently or randomly. However, this type of analysis is not merely the act of collecting data over time.

What sets time series data apart from other data is that the analysis can show how variables change over time. In other words, time is a crucial variable because it shows how the data adjusts over the course of the data points as well as the final results. It provides an additional source of information and a set order of dependencies between the data.

Time series analysis typically requires a large number of data points to ensure consistency and reliability. An extensive data set ensures you have a representative sample size and that analysis can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers and can account for seasonal variance. Additionally, time series data can be used for forecasting—predicting future data based on historical data.
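As a small sketch of what this looks like in practice (not the notebook's exact code; the file and column names are placeholders), a time series can be loaded, indexed by date, and resampled with pandas:

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names, for illustration only.
df = pd.read_csv('timeseries.csv', parse_dates=['date'], index_col='date')

# Resample the observations to monthly means to smooth out noise.
monthly = df['value'].resample('M').mean()

# A 3-month rolling average highlights the underlying trend.
rolling = monthly.rolling(window=3).mean()

monthly.plot(label='Monthly mean')
rolling.plot(label='3-month rolling mean')
plt.legend()
plt.title('Time series trend')
plt.show()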

N3|Webscraping Wonders|Web based data extraction

https://www.kaggle.com/code/darshanprabhu09/n3-webscraping-wonders-web-based-data-extraction

Web scraping is the process of extracting data from websites, either through API requests or by parsing HTML files with BeautifulSoup4 (or another HTML parser) :D . I hope this helps you get the idea of extracting data from websites.

  • WEB SCRAPING :
Before we jump into the code, let's learn a little about web scraping and how
it can be useful for data analysis. Working in the field of data analysis,
I always wondered whether we needed to have the data on our local PC to
perform operations such as alterations and removing inconsistencies, gaps,
and redundancies. It turns out there is a technique known as "WEB SCRAPING".
Do go through the small snippet of code I have created using Python 3.9.2.

To know more about this module refer : https://pypi.org/project/beautifulsoup4/

  • Libraries and modules and how to install them:
1.) beautifulsoup4 4.12.2 : Beautiful Soup is a library that makes it
easy to scrape information from web pages. It sits atop an HTML or
XML parser, providing Pythonic idioms for iterating, searching,
and modifying the parse tree. In short, it lets us read the HTML of
any particular website. To install this library, use the command below:


 pip install beautifulsoup4
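Once installed, a minimal sketch of how requests and BeautifulSoup fit together looks like this (the URL is a placeholder, not the notebook's actual target site):

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to scrape.
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out the page title and all hyperlinks.
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.string if soup.title else 'No title found')

for link in soup.find_all('a'):
    print(link.get('href'))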


N4|Assets Insights model|Webscraping|Time series.

https://www.kaggle.com/code/darshanprabhu09/n4-financial-analysis-data-extraction-from-website

Financial Analysis: Balance Sheet and Asset Trends


This Python code retrieves the balance sheet statement for a
specified company and performs various financial analyses, 
including calculating total current assets, total current
liabilities, cash debt difference, percentage of intangible 
assets, and plotting quarterly asset data.

Prerequisites


- Python 3.x
- Requests library (`pip install requests`)
- Matplotlib library (`pip install matplotlib`)

Instructions


1. Update the `api_key` variable with your Financial Modeling
   Prep API key.
2. Specify the `company` variable with the ticker symbol of
    the desired company.
3. Define the number of `years` for the analysis.
4. Run the code.

Analysis Results


- Total current assets of the specified company.
- Total current liabilities of the specified company.
- Cash debt difference (cash and cash equivalents minus total
   debt) for the specified company.
- Percentage of intangible assets out of total assets for the
   specified company.
- Quarterly assets data plotted in a bar chart.

Example Output

Total current assets of GOOG: $XXX,XXX,XXX
Total current liabilities of GOOG: $XXX,XXX,XXX
Cash debt difference: $XXX,XXX,XXX
Percentage intangible: XX.XX%
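As a rough sketch of the kind of request and plot involved (assuming the Financial Modeling Prep balance-sheet endpoint; the API key is a placeholder and the exact response field names may differ):

import requests
import matplotlib.pyplot as plt

api_key = 'YOUR_API_KEY'   # placeholder: your Financial Modeling Prep API key
company = 'GOOG'
years = 4

# Assumed endpoint shape; check the Financial Modeling Prep docs for the exact URL.
url = (f'https://financialmodelingprep.com/api/v3/balance-sheet-statement/'
       f'{company}?period=quarter&limit={years * 4}&apikey={api_key}')
statements = requests.get(url, timeout=10).json()

latest = statements[0]
print(f"Total current assets of {company}: ${latest['totalCurrentAssets']:,}")
print(f"Total current liabilities of {company}: ${latest['totalCurrentLiabilities']:,}")
print(f"Cash debt difference: ${latest['cashAndCashEquivalents'] - latest['totalDebt']:,}")
print(f"Percentage intangible: {latest['intangibleAssets'] / latest['totalAssets'] * 100:.2f}%")

# Bar chart of quarterly total assets.
dates = [s['date'] for s in reversed(statements)]
assets = [s['totalAssets'] for s in reversed(statements)]
plt.bar(dates, assets)
plt.xticks(rotation=45)
plt.title(f'Quarterly total assets of {company}')
plt.tight_layout()
plt.show()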



N5|Data Visualization using Graphs| Visual methods | Version 5 :

https://www.kaggle.com/code/darshanprabhu09/n5-data-visualization-cookbook-matplotlib-seaborn

Data Visualization Toolkit

This repository contains a collection of data visualization techniques implemented in Python. Each visualization technique is demonstrated with example code snippets and sample datasets.

List of Visualization Techniques


1. Histogram: Visualize the distribution of a continuous variable using bars.
2. Bar Chart: Compare categorical data using rectangular bars.
3. Pie Chart: Represent data as slices of a circular pie.
4. Box Plot: Display the distribution of a dataset using quartiles and outliers.
5. Scatterplot: Plot individual data points in a two-dimensional space.
6. Line Plot: Visualize the trend of a variable over time or a continuous scale.
7. Horizontal Bar Chart: Compare categorical data using horizontal bars.
8. Violin Plot: Combine a box plot and a kernel density plot to represent the distribution of a variable.
9. Word Cloud: Create a visual representation of text data, where the size of each word indicates its frequency.
10. Heatmap: Display data in a matrix using color-coded cells to represent values.
11. 3D Scatter Plot: Plot three-dimensional data points in a three-dimensional space.
12. Stacked Area Chart: Visualize the cumulative contribution of multiple variables over time or a continuous scale.
13. Radar Chart: Display multivariate data on a polar coordinate system using polygonal shapes.
14. Tree Map: Represent hierarchical data as nested rectangles.
15. Polar Plot: Plot data in a circular or polar coordinate system.
16. Stream Plot: Visualize the flow or movement of particles in a vector field.
17. Network Graph: Display interconnected nodes and relationships in a network.
18. Sankey Plot: Illustrate the flow or transfer of entities through a system using flowing lines.
19. Hexbin Plot: Create a two-dimensional histogram using hexagonal bins to represent data density.
20. Polar Contour Plot: Combine the polar coordinate system with contour lines to visualize three-dimensional data.
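To give a flavour of the code behind these techniques, here is a small self-contained sketch covering two of them (histogram and heatmap) on randomly generated data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=500)   # synthetic continuous variable

# Histogram: distribution of a continuous variable.
plt.figure()
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram of a synthetic variable')
plt.show()

# Heatmap: colour-coded matrix of values.
matrix = rng.random((6, 6))
plt.figure()
sns.heatmap(matrix, annot=True, cmap='viridis')
plt.title('Heatmap of a random 6x6 matrix')
plt.show()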

Getting Started

  1. Clone the repository: git clone <repository-url>
  2. Install the required dependencies: pip install -r requirements.txt
  3. Explore the code and examples for each visualization technique.
  4. Customize and adapt the code for your specific data and visualization needs.

Contributing

Contributions to this repository are welcome! If you have any new visualization techniques to add or improvements to the existing ones, feel free to submit a pull request.

License

Happy visualizing!

N6|Pandas Playground|EDA|Constant Experimentation | Version 7 :

https://www.kaggle.com/code/darshanprabhu09/n6-pandas-playground-eda-constant-experimentation


1. INSTALLING PANDAS

To install pandas, you can use the following command:

pip install pandas

Make sure you have pip installed on your system before running the command.


2. READING THE FILE

To read a file using pandas, you can use the read_csv() function. Here's an example:

import pandas as pd

df = pd.read_csv('your_file.csv')

Replace 'your_file.csv' with the actual file path or name.


3. HEAD()

The head() function is used to display the first few rows of the DataFrame. By default, it shows the first 5 rows. Example:

df.head()

4. TAIL()

The tail() function is used to display the last few rows of the DataFrame. By default, it shows the last 5 rows. Example:

df.tail()

5. DTYPES

The dtypes attribute is used to display the data types of each column in the DataFrame. Example:

df.dtypes

6. PRINTING COLUMNS WITH PARTICULAR DTYPES

6.1 - PRINT ROWS AND COLUMNS HAVING THE DATATYPE AS "OBJECT"

To print rows and columns having the data type as "object", you can use the following code:

object_columns = df.select_dtypes(include=['object'])
print(object_columns)

6.2 - PRINT ROWS AND COLUMNS HAVING THE DATATYPE AS "int64"

To print rows and columns having the data type as "int64", you can use the following code:

int_columns = df.select_dtypes(include=['int64'])
print(int_columns)

7. RETRIEVING COLUMN HEADERS

7.1 - INT64

To retrieve column headers with the data type "int64", you can use the following code:

int_columns = df.select_dtypes(include=['int64'])
headers = int_columns.columns
print(headers)

7.2 - OBJECT

To retrieve column headers with the data type "object", you can use the following code:

object_columns = df.select_dtypes(include=['object'])
headers = object_columns.columns
print(headers)

8. FORMATTING DATATYPES

To format the data types of columns in the DataFrame, you can use the astype() function. Example:

df['column_name'] = df['column_name'].astype('new_data_type')

Replace 'column_name' with the actual column name and 'new_data_type' with the desired data type.


9. ROUNDING OF THE DATA

To round the values in a column, you can use the round() function. Example:

df['column_name'] = df['column_name'].round(decimals=2)

Replace 'column_name' with the actual column name and adjust the decimals parameter as per your requirement.


10. BEGINNING & END OF DATA

To view the beginning and end of the DataFrame, you can use the head() and tail() functions together. Example:

print("Beginning of Data:")
print(df.head())

print("End of Data:")
print(df.tail())

11. EXPLORATORY DATA ANALYSIS

11.1 - Description about index of data

To get a description of the index of the DataFrame, you can use the index attribute. Example:

print(df.index)

11.2 - Statistical description of DataFrame

To get a statistical description of the DataFrame, you can use the describe() function. Example:

print(df.describe())

11.3 - TO GET TOTAL NUMBER OF ROWS

To get the total number of rows in the DataFrame, you can use the shape attribute. Example:

total_rows = df.shape[0]
print(total_rows)

12. GROUPBY()

The groupby() function is used to group data based on one or more columns. Example:

grouped_data = df.groupby('column_name')
print(grouped_data.size())  # number of rows in each group

Replace 'column_name' with the actual column name you want to group by.


13. CROSSTAB()

The crosstab() function is used to compute a cross-tabulation of two or more factors. Example:

cross_tab = pd.crosstab(df['column1'], df['column2'])
print(cross_tab)

Replace 'column1' and 'column2' with the actual column names you want to create the cross-tabulation for.


14. WHERE()

The where() function is used to filter rows based on a condition. Example:

filtered_data = df.where(df['column_name'] > 0)
print(filtered_data)

Replace 'column_name' with the actual column name and adjust the condition as per your requirement.


15. NUNIQUE()

The nunique() function is used to count the number of unique values in each column. Example:

unique_counts = df.nunique()
print(unique_counts)

16. VALUE_COUNTS()

The value_counts() function is used to count the occurrences of each unique value in a column. Example:

value_counts = df['column_name'].value_counts()
print(value_counts)

Replace 'column_name' with the actual column name.


17. CUT()

The cut() function is used to segment and bin continuous data into discrete intervals. Example:

bins = [0, 50, 100, 150]
labels = ['Low', 'Medium', 'High']
df['bin_column'] = pd.cut(df['column_name'], bins=bins, labels=labels)

Replace 'column_name' with the actual column name and adjust the bins and labels parameters as per your requirement.


N7|SKlearn Pipeline models|Prediction|Modelling. | Version 8 :

https://www.kaggle.com/code/darshanprabhu09/n7-sklearn-pipeline-models-prediction-modelling

Table of Contents

Introduction

Data Description

Importing Libraries

Data Collection & Preparation

Exploratory Data Analysis
  5.1. Training Data
  5.2. Testing Data
  5.3. Conversion to Numpy Array
  5.4. Scaling

Model Building
  6.1. Linear Regression
    6.1.1. Model Overview
    6.1.2. Training and Evaluation
  6.2. KNN Model (K-Nearest Neighbors)
    6.2.1. Model Overview
    6.2.2. Training and Evaluation
  6.3. Random Forest Regression
    6.3.1. Model Overview
    6.3.2. Training and Evaluation

Conclusion

Future Improvements

References

Note

Introduction

The California Housing Price Prediction project aims to develop machine learning models that can predict housing prices in different neighborhoods of California. The project utilizes a dataset containing various attributes such as location coordinates, housing median age, total rooms, total bedrooms, population, households, median income, median house value, and ocean proximity. By analyzing the data and building predictive models, we can gain insights into the factors that affect housing prices and create accurate price predictions.

Data Description

The dataset used in this project provides information on housing attributes in different neighborhoods of California. Here is a summary of the dataset's features:

  • Longitude: The longitude coordinates of the housing location.
  • Latitude: The latitude coordinates of the housing location.
  • Housing Median Age: The median age of houses in a specific neighborhood.
  • Total Rooms: The total number of rooms in a housing unit.
  • Total Bedrooms: The total number of bedrooms in a housing unit.
  • Population: The population count in a specific neighborhood.
  • Households: The number of households in a specific neighborhood.
  • Median Income: The median income of households in a specific neighborhood.
  • Median House Value: The median value of houses in a specific neighborhood.
  • Ocean Proximity: The proximity of a housing unit to the ocean.

Importing Libraries

To perform data analysis and build machine learning models, several Python libraries are imported, including NumPy, Pandas, and scikit-learn (sklearn). These libraries provide various functions and tools for data manipulation, analysis, and model building.

Data Collection & Preparation

The dataset used in this project is collected from the California Housing dataset. The data is divided into a training set and a testing set. The necessary data preprocessing steps, such as data cleaning and feature scaling, are performed to prepare the data for analysis and model building.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is conducted to gain insights and understand the dataset better. This includes analyzing both the training and testing data, examining descriptive statistics, visualizing the data through plots and charts, and checking for missing values. EDA helps in understanding the distribution of variables, identifying patterns, and identifying any data issues that need to be addressed.

5.1. Training Data

The training data is analyzed and visualized to understand the distribution and relationships between different features. Summary statistics are calculated, and visualizations such as histograms, scatter plots, and correlation matrices are created to explore the data.

5.2. Testing Data

Similar to the training data, the testing data is also analyzed and visualized to gain insights into the dataset. The distribution and relationships between features are examined using summary statistics, histograms, scatter plots, and other relevant visualizations.

5.3. Conversion to Numpy Array

To prepare the data for model training and evaluation, the datasets are converted into Numpy arrays. The features and target variables are separated, and the necessary transformations are applied to ensure compatibility with the machine learning models.

5.4. Scaling

To ensure that the features are on a similar scale and to improve model performance, feature scaling techniques such as StandardScaler and MinMaxScaler are applied. These techniques normalize the features to have zero mean and unit variance or scale them to a specific range, respectively.

Model Building

In this section, three machine learning models are built to predict housing prices: Linear Regression, KNN Regression, and Random Forest Regression. Each model is briefly described, and the steps involved in training and evaluating the models are outlined.

6.1. Linear Regression

6.1.1. Model Overview

Linear Regression is a statistical modeling technique used to analyze the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting linear relationship between the input variables and the output variable. In the context of machine learning, linear regression is often used as a predictive model to estimate or predict the value of a continuous target variable based on input features.

6.1.2. Training and Evaluation

The Linear Regression model is trained using the training data, and the model's performance is evaluated using evaluation metrics such as mean absolute error. The training and testing errors are calculated, providing insights into how well the model fits the training data and generalizes to unseen data.

6.2. KNN Model (K-Nearest Neighbors)

6.2.1. Model Overview

KNN (K-Nearest Neighbors) is a non-parametric machine learning algorithm used for both classification and regression tasks. It makes predictions based on the similarity of input data points to their k nearest neighbors. In the case of regression, the predicted value is the average of the values of the k nearest neighbors.

6.2.2. Training and Evaluation

The KNN model is trained using the training data, and the model's performance is evaluated using evaluation metrics such as mean absolute error. The optimal value of k is determined through cross-validation, and the model's performance is assessed on the testing data.

6.3. Random Forest Regression

6.3.1. Model Overview

Random Forest Regression is an ensemble learning method that combines multiple decision trees to create a more robust and accurate predictive model. Each tree in the random forest is trained on a random subset of the training data and uses a random subset of features for splitting decisions. The final prediction is obtained by averaging the predictions of all the individual trees.

6.3.2. Training and Evaluation

The Random Forest Regression model is trained using the training data, and the model's performance is evaluated using evaluation metrics such as mean absolute error. The number of trees in the random forest and other hyperparameters are tuned to optimize the model's performance. The trained model is then evaluated on the testing data.
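The outline above maps onto scikit-learn fairly directly. Below is a minimal sketch (not the notebook's exact code) that wraps scaling and each model in a Pipeline and compares mean absolute error on the California Housing data bundled with scikit-learn:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load the California Housing data shipped with scikit-learn.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'KNN Regression': KNeighborsRegressor(n_neighbors=5),
    'Random Forest Regression': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, estimator in models.items():
    # Chain scaling and the estimator into a single pipeline.
    pipeline = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
    pipeline.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, pipeline.predict(X_test))
    print(f'{name}: MAE = {mae:.3f}')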

Conclusion

In this project, we developed machine learning models to predict housing prices in different neighborhoods of California. By analyzing the dataset and building predictive models, we gained insights into the factors that affect housing prices. The models, including Linear Regression, KNN Regression, and Random Forest Regression, were trained and evaluated on the dataset, providing accurate price predictions. The project demonstrates the application of machine learning in real estate and can be used as a basis for further research and analysis.

Future Improvements

While the developed models provide accurate predictions, there are several areas for future improvement. These include:

  • Collecting additional data to improve the model's accuracy and generalization.
  • Exploring more advanced machine learning algorithms to compare and enhance the predictive performance.
  • Conducting feature engineering to create new meaningful features that can better capture the relationship between housing attributes and prices.
  • Incorporating domain knowledge and external data sources to enrich the dataset and improve the model's predictive power.
  • Fine-tuning the hyperparameters of the models to further optimize their performance.
  • Implementing ensemble methods to combine the predictions of multiple models for improved accuracy and robustness.
  • Deploying the models in a production environment, such as a web application, to provide real-time housing price predictions and assist users in decision-making.

References

Include a list of the references used in the project, such as research papers, articles, and online resources. Cite the sources accurately to acknowledge the contributions of others and avoid plagiarism.

Note

Add any additional notes or important information related to the project, such as data limitations, assumptions made, or any other relevant details that might be useful for readers.

This README.md provides an overview of the California Housing Price Prediction project, including the data used, data preprocessing steps, exploratory data analysis, model building, and future improvement possibilities. It serves as a guide for understanding the project structure and findings.

N8|Kernels Unveil the Future of Drug | CHEMISTRY. | Version 4 :

https://www.kaggle.com/code/darshanprabhu09/n8-kernels-unveil-the-future-of-drug-chemistry

I performed EDA on the data, then built a Linear Regression model and a Random Forest model, compared their accuracy, and chose the final model to use. I would really appreciate it if you took some time to check out my work and share your reviews or suggestions.

About the dataset :

The given data appears to be a tabular representation with several columns. Here is a description of each column:

MolLogP: This column represents the calculated logarithm of the partition coefficient (LogP) for the chemical compounds. LogP is a measure of the compound's lipophilicity or hydrophobicity, indicating its tendency to dissolve in oil or water.

MolWt: This column represents the molecular weight of the compounds. Molecular weight is the sum of the atomic weights of all atoms in a molecule and is an important property in determining the physical and chemical characteristics of a compound.

NumRotatableBonds: This column indicates the number of rotatable bonds in the chemical compounds. Rotatable bonds refer to single bonds that can rotate freely without breaking, and they play a role in the compound's flexibility and conformational changes.

AromaticProportion: This column represents the proportion of aromatic atoms present in the compounds. Aromaticity is a property associated with compounds containing conjugated ring systems, such as benzene rings.

logS: This column represents the logarithm of the solubility values for the compounds. Solubility is a measure of how well a compound dissolves in a given solvent.

Label: This column appears to provide ranges or intervals for the solubility values (logS) or molecular weights (MolWt) in the dataset.

Count: This column represents the count or frequency of compounds falling within the respective range or interval specified by the label.
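As a hedged sketch of the modelling step described above (the CSV path is a placeholder, and R² is used here just as one convenient accuracy measure), the descriptors can be used to predict logS like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Placeholder path; the notebook loads the solubility dataset on Kaggle.
df = pd.read_csv('solubility.csv')

X = df[['MolLogP', 'MolWt', 'NumRotatableBonds', 'AromaticProportion']]
y = df['logS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit both models and compare their test-set performance.
for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    print(f'{type(model).__name__}: R^2 = {score:.3f}')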

N9|Feature Engineering|PCA|DIMENSION-REDUCTION :

https://www.kaggle.com/code/darshanprabhu09/n9-feature-engineering-pca-dimension-reduction

Feature engineering is the process of making changes to the data you are going to use. When this is done by reducing or removing unwanted features in the dataset, the process is also known as "DIMENSIONALITY REDUCTION". Dimensionality reduction reduces the number of dimensions or features in a dataframe so that the model is less likely to overfit the data. Feature engineering has several components, which are as follows:

  1. Principal component analysis (PCA) :
      Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high
      number of dimensions/features per observation, increasing the interpretability of data while preserving
      the maximum amount of information, and enabling the visualization of multidimensional data.


     PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by 
     linearly transforming the data into a new coordinate system where (most of) the variation in the data 
     can be described with fewer dimensions than the initial data. Many studies use the first two principal
     components in order to plot the data in two dimensions and to visually identify clusters of closely
     related data points. Principal component analysis has applications in many fields such as population 
     genetics, microbiome studies, and atmospheric science.
  2. Dimensionality Reduction :
Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining
as much of the important information as possible. In other words, it is a process of transforming 
high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.

In machine learning, high-dimensional data refers to data with a large number of features or variables.
The curse of dimensionality is a common problem in machine learning, where the performance of the 
model deteriorates as the number of features increases. This is because the complexity of the model 
increases with the number of features, and it becomes more difficult to find a good solution. In 
addition, high-dimensional data can also lead to overfitting, where the model fits the training
data too closely and does not generalize well to new data.

Dimensionality reduction can help to mitigate these problems by reducing the complexity of the model and 
improving its generalization performance. There are two main approaches to dimensionality reduction:
feature selection and feature extraction.

  3. Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to the
problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most 
important features. There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features based on their relevance 
to the target variable, wrapper methods use the model performance as the criteria for selecting 
features, and embedded methods combine feature selection with the model training process.

  4. Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original 
features. The goal is to create a set of features that captures the essence of the original 
data in a lower-dimensional space. There are several methods for feature extraction, including 
principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed 
stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the variance as possible.

  5. Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that while the data
in a higher dimensional space is mapped to data in a lower dimension space, the variance 
of the data in the lower dimensional space should be maximum.



It involves the following steps:

  • Construct the covariance matrix of the data.
  • Compute the eigenvectors of this matrix.
  • Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data. Hence, we are left with a smaller number of eigenvectors, and there may have been some data loss in the process. But the most important variances should be retained by the remaining eigenvectors.
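Those steps can be written out directly with NumPy, as in the illustrative sketch below (random data, keeping the top two components):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # synthetic data: 100 samples, 5 features

# Step 1: centre the data and construct the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: compute eigenvalues and eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: keep the eigenvectors with the largest eigenvalues and project onto them.
order = np.argsort(eigenvalues)[::-1]
top_components = eigenvectors[:, order[:2]]
X_reduced = X_centered @ top_components

explained = eigenvalues[order[:2]].sum() / eigenvalues.sum()
print(f'Variance retained by 2 components: {explained:.1%}')
print(X_reduced.shape)   # (100, 2)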

N10| Analysis on alcohol consumption by students. | Version 4 :

https://www.kaggle.com/code/darshanprabhu09/n10-analysis-on-alcohol-consumption-by-students

Analysis of Alcohol Consumption by Students

This project explores the student alcohol consumption dataset, examining demographic and family-related attributes such as school, sex, age, family size, parents' cohabitation status, parents' education levels, travel time, and past failures, and how they relate to student outcomes.

Table of Contents

  • Importing Libraries
  • Frequency Distribution for Categorical Columns
  • Correlation Analysis
  • Gender-Based Analysis
  • Count Plot of Schools
  • Count Plot of Sex
  • Box Plot of Age
  • Count Plot of Family Size
  • Count Plot of Parent's Cohabitation Status
  • Count Plot of Mother's Education Level
  • Count Plot of Father's Education Level
  • Count Plot of Travel Time
  • Count Plot of Failures
  • Summary Statistics for Numerical Columns
  • Correlation Heatmap
  • Distribution of Age
  • Average Mother's Education Level by Family Size
  • Sex Count and Percentage
  • Cross-Tabulation
  • Box Plot - Mother's Education Level
  • Violin Plot - Age by Sex
  • Logistic Regression
  • Decision Tree Classifier
  • Conclusion

1 - Importing Libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data into a DataFrame
data = pd.read_csv('/kaggle/input/student-alcohol-consumption/student-alcohol-consumption.csv')

2 - Frequency Distribution for Categorical Columns

numeric_columns = ['age', 'Medu', 'Fedu', 'traveltime', 'failures']
categorical_columns = ['school', 'sex', 'famsize', 'Pstatus']
# Frequency distribution for categorical columns
for column in categorical_columns:
    freq_dist = data[column].value_counts()
    print(f"\nFrequency Distribution for {column}:")
    print(freq_dist)

3 - Correlation Analysis

# Correlation Analysis
corr_matrix = data[numeric_columns].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

correlation = data[['Medu', 'Fedu']].corr()
print(correlation)

4 - Gender-Based Analysis

# Gender-Based Analysis
gender_grades = data.groupby('sex')['failures'].mean()
print("\nAverage Grades by Gender:")
print(gender_grades)

5 - Count Plot of Schools

# Count plot of schools
sns.countplot(data=data, x='school')
plt.title('Count Plot of Schools')
plt.show()

6 - Count Plot of Sex

# Count plot of sex
sns.countplot(data=data, x='sex')
plt.title('Count Plot of Sex')
plt.show()

7 - Box Plot of Age

# Box plot of age
sns.boxplot(data=data, x='age')
plt.title('Box Plot of Age')
plt.show()

8 - Count Plot of Family Size

# Count plot of family size
sns.countplot(data=data, x='famsize')
plt.title('Count Plot of Family Size')
plt.show()

9 - Count Plot of Parent's Cohabitation Status

# Count plot of parent's cohabitation status
sns.countplot(data=data, x='Pstatus')
plt.title("Count Plot of Parent's Cohabitation Status")
plt.show()

10 - Count Plot of Mother's Education Level

# Count plot of mother's education level
sns.countplot(data=data, x='Medu')
plt.title("Count Plot of Mother's Education Level")
plt.show()

11 - Count Plot of Father's Education Level

# Count plot of father's education level
sns.countplot(data=data, x='Fedu')
plt.title("Count Plot of Father's Education Level")
plt.show()

12 - Count Plot of Travel Time

# Count plot of travel time
sns.countplot(data=data, x='traveltime')
plt.title('Count Plot of Travel Time')
plt.show()

13 - Count Plot of Failures

# Count plot of failures
sns.countplot(data=data, x='failures')
plt.title('Count Plot of Failures')
plt.show()

14 - Summary Statistics for Numerical Columns

numerical_columns = ['age', 'Medu', 'Fedu', 'traveltime', 'failures']
numerical_summary = data[numerical_columns].describe()
print(numerical_summary)

15 - Correlation Heatmap

correlation_matrix = data[numerical_columns].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

16 - Distribution of Age

sns.histplot(data=data, x='age', bins=10)
plt.title('Distribution of Age')
plt.show()

17 - Average Mother's Education Level by Family Size

sns.barplot(data=data, x='famsize', y='Medu')
plt.title("Average Mother's Education Level by Family Size")
plt.show()

18 - Sex Count and Percentage

sex_count = data['sex'].value_counts()
sex_percentage = data['sex'].value_counts(normalize=True) * 100
print("Sex Count:\n", sex_count)
print("\nSex Percentage:\n", sex_percentage)

19 - Cross-Tabulation

cross_tab = pd.crosstab(data['sex'], data['Pstatus'])
print(cross_tab)

20 - Box Plot - Mother's Education Level

sns.boxplot(data=data, x='Medu')
plt.title("Box Plot - Mother's Education Level")
plt.show()

21 - Violin Plot - Age by Sex

sns.violinplot(data=data, x='sex', y='age')
plt.title("Violin Plot - Age by Sex")
plt.show()

22 - Logistic Regression

import pandas as pd
import statsmodels.api as sm

# Load the dataset
data = pd.read_csv('/kaggle/input/student-alcohol-consumption/student-alcohol-consumption.csv')

# Convert 'failures' to binary format (0 or 1)
data['failures'] = data['failures'].apply(lambda x: 0 if x == 0 else 1)

# Convert categorical variables to dummy variables
categorical_vars = ['school', 'sex', 'famsize', 'Pstatus']
data = pd.get_dummies(data, columns=categorical_vars, drop_first=True)

# Define the predictors and target variable
X = data[['age', 'Medu', 'Fedu', 'traveltime']]
X = sm.add_constant(X)
y = data['failures']

# Perform logistic regression
model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

23 - Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = data[['age', 'Medu', 'Fedu', 'traveltime']]
y = data['failures']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

24 - Conclusion

In this analysis, we explored various aspects of the student alcohol consumption dataset. We started by importing the necessary libraries and loading the data. Then, we performed frequency distribution analysis for categorical columns and correlation analysis for numerical columns. We also conducted gender-based analysis and visualized the data using count plots, box plots, and violin plots. Additionally, we calculated summary statistics, created a correlation heatmap, and examined the distribution of age. Finally, we applied logistic regression and decision tree classification models to predict student failures.

Overall, this analysis provides insights into the factors that may influence student alcohol consumption and academic performance. Further analysis and modeling can be conducted to explore these relationships in more detail.

N11 | Feature engineering | PCA | Dimensionality Reduction :

https://www.kaggle.com/code/darshanprabhu09/n11-feature-engineering-pca-dimensions

Feature engineering is the process of making changes to the data you are going to use. When this is done by reducing or removing unwanted features in the dataset, the process is also known as "DIMENSIONALITY REDUCTION". Dimensionality reduction reduces the number of dimensions or features in a dataframe so that the model is less likely to overfit the data. Feature engineering has several components, which are as follows:

  1. Principal component analysis (PCA) :
      Principal component analysis (PCA) is a popular technique for analyzing large 
      datasets containing a high number of dimensions/features per observation, 
      increasing the interpretability of data while preserving the maximum amount 
      of information, and enabling the visualization of multidimensional data.


     PCA is a statistical technique for reducing the dimensionality of a dataset. 
     This is accomplished by linearly transforming the data into a new coordinate 
     system where (most of) the variation in the data can be described with fewer 
     dimensions than the initial data. Many studies use the first two principal
     components in order to plot the data in two dimensions and to visually 
     identify clusters of closely related data points. Principal component
     analysis has applications in many fields such as population 
     genetics, microbiome studies, and atmospheric science.
     
  2. Dimensionality Reduction :
Dimensionality reduction is a technique used to reduce the number of features
in a dataset while retaining as much of the important information as possible. 
In other words, it is a process of transforming high-dimensional data into a
lower-dimensional space that still preserves the 
essence of the original data.

In machine learning, high-dimensional data refers to data with a large number 
of features or variables. The curse of dimensionality is a common problem in
machine learning, where the performance of the model deteriorates as the 
number of features increases. This is because the complexity of the model 
increases with the number of features, and it becomes more difficult to find 
a good solution. In addition, high-dimensional data can also lead to
overfitting, where the model fits the training data too closely 
and does not generalize well to new data.

Dimensionality reduction can help to mitigate these problems by reducing 
the complexity of the model and improving its generalization performance. 
There are two main approaches to dimensionality reduction: feature 
selection and feature extraction.

  3. Feature Selection:
Feature selection involves selecting a subset of the original features
that are most relevant to the problem at hand. The goal is to reduce
the dimensionality of the dataset while retaining the most important
features. There are several methods for feature selection, including
filter methods, wrapper methods, and embedded methods. Filter methods
rank the features based on their relevance to the target variable, 
wrapper methods use the model performance as the criteria for selecting 
features, and embedded methods combine feature selection with the model
training process.

  4. Feature Extraction:
Feature extraction involves creating new features by combining or transforming
the original features. The goal is to create a set of features that captures 
the essence of the original data in a lower-dimensional space. There are 
several methods for feature extraction, including principal component analysis 
(PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor 
embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the 
variance as possible.

  5. Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition
that while the data in a higher dimensional space is mapped to data in 
a lower dimension space, the variance of the data in the lower 
dimensional space should be maximum.


It involves the following steps:

  • Construct the covariance matrix of the data.
  • Compute the eigenvectors of this matrix.
  • Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data. Hence, we are left with a smaller number of eigenvectors, and there may have been some data loss in the process. But the most important variances should be retained by the remaining eigenvectors.
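In practice these steps are wrapped up by scikit-learn's PCA class; a minimal sketch on synthetic data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # synthetic data: 100 samples, 5 features

# Reduce to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # variance captured by each component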

N12| IMAGE CLASSIFIER | KERAS |AUGMENTATION | ML .

This project demonstrates image classification of cats and dogs using the Xception model. The code performs the following steps:

  1. Data Preparation:

    • Creates a directory structure for storing the training, validation, and test datasets.
    • Moves the original training and test images to their respective directories.
    • Splits the test dataset into validation and test subsets.
  2. Model Building:

    • Uses the Xception model for transfer learning.
    • Freezes the layers of the base model.
    • Adds a Flatten layer and a Dense layer with softmax activation for classification.
  3. Data Augmentation and Preprocessing:

    • Uses the ImageDataGenerator class to generate augmented images for training data.
    • Preprocesses the input images using the Xception-specific preprocess_input function.
  4. Training and Saving the Model:

    • Fits the training data to the model with a specified number of epochs.
    • Adds a checkpoint to save the best model based on validation accuracy.
    • Saves the trained model and creates a runtime model for inference.
  5. Prediction and Evaluation:

    • Uses the saved runtime model to make predictions on the test dataset.
    • Converts the predicted values into cat and dog classes.
    • Evaluates the accuracy of the model by comparing the predicted labels with the true labels.
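A condensed sketch of the model-building, augmentation, and training steps above (directory names, image size, and hyperparameters are placeholders, not the notebook's exact values):

import tensorflow as tf
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

# Transfer learning: frozen Xception base plus a small classification head.
base = Xception(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(2, activation='softmax'),   # two classes: cat and dog
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Augmented training data and plain validation data, both Xception-preprocessed.
train_gen = ImageDataGenerator(preprocessing_function=preprocess_input,
                               rotation_range=20, horizontal_flip=True)
val_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_data = train_gen.flow_from_directory('data/train', target_size=(299, 299),
                                           batch_size=32, class_mode='categorical')
val_data = val_gen.flow_from_directory('data/validation', target_size=(299, 299),
                                       batch_size=32, class_mode='categorical')

# Keep the best weights according to validation accuracy.
checkpoint = tf.keras.callbacks.ModelCheckpoint('best_model.h5',
                                                monitor='val_accuracy',
                                                save_best_only=True)
model.fit(train_data, validation_data=val_data, epochs=5, callbacks=[checkpoint])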

Requirements

  • Python (3.7 or later)
  • TensorFlow (2.5.0 or later)
  • Keras (2.5.0 or later)

Usage

  1. Set up the project environment and install the required dependencies.
  2. Prepare the data by executing the data preparation code.
  3. Build and train the model using the provided code.
  4. Make predictions on the test dataset and evaluate the model's accuracy.

For detailed code implementation and usage instructions, refer to the code comments and documentation.

License

This project is licensed under the MIT License.

N13|Amazon stocks|PyTorch|LSTM|Neural Network | Version 6 :
