shreeyajoshi2013/Python-Data-Preprocessing-with-Abalone-Data

Python Data Preprocessing with Abalone dataset

This dataset contains physical measurements of abalone (a type of marine snail), used to predict their age (the 'Rings' feature in the dataset) from the other features. This project assesses the data and preprocesses it to make it usable for classification tasks. Normalization is applied as the final step.

The dataset is taken from link.

Built with

  • Google Colab

Libraries used

  • Pandas
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • SciPy

Highlights

  • Summary statistics
  • Pandas boxplot
  • Seaborn violinplot
  • Seaborn pairplot
  • SciPy zscore function

What is being done?

  1. Loading the dataset and exploring the features
    The data and its features are explored with basic data exploration functions.
  2. Checking for missing data
    Checks show that the dataset has no missing values.
  3. Checking if this is a balanced dataset
    A bar plot of the number of abalone for each ring count shows that the dataset is not balanced: the data points are not evenly distributed across ring counts.
  4. Computing summary statistics
    The boxplot shows that a few features ('Whole weight' and 'Shucked Weight') have a large range of values. The violinplot shows positive skew for most features, the exceptions being 'Length' and 'Diameter'.
  5. Checking for outliers
    The pairplot shows that a few features have a small number of outliers, the most prominent being in 'Height'.
  6. Normalization
    Z-score normalization is applied to the data. Plotting two of the features before and after normalization shows that the shape of the distribution changes little, while the axis scale changes substantially.
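
The missing-value check, balance check, skew computation, and z-score normalization described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is an invented stand-in for the Abalone data, and the variable names (`missing_per_column`, `ring_counts`, `normalized`) are assumptions made for this example.

```python
import pandas as pd
from scipy.stats import zscore

# Toy stand-in for the Abalone data; the values are invented for illustration.
df = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080, 0.095],
    "Rings": [15, 7, 9, 10, 7, 8],
})
features = ["Length", "Height"]

# Step 2: count missing values in each column.
missing_per_column = df.isnull().sum()

# Step 3: count samples per 'Rings' value; uneven counts indicate imbalance.
# (ring_counts.plot(kind="bar") would draw the corresponding bar plot.)
ring_counts = df["Rings"].value_counts().sort_index()

# Step 4: sample skewness of each feature (positive means a longer right tail).
skewness = df[features].skew()

# Step 6: z-score each feature column, (x - mean) / std, using SciPy's
# default population standard deviation (ddof=0).
normalized = pd.DataFrame(
    zscore(df[features].to_numpy(), axis=0),
    columns=features,
    index=df.index,
)

print(missing_per_column)
print(ring_counts)
print(skewness)
print(normalized.describe())
```

After z-scoring, each feature has mean 0 and unit standard deviation, which is why the plots keep their shape while the axis scale changes.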

Conclusion

  • The data has no missing values, so no interpolation is needed.
  • As the dataset is not balanced, more data might be needed for a confident and reliable prediction of the age of abalone from the features.
  • The dataset contains a few outliers, so it needs further cleaning.
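
As one possible starting point for the further cleaning mentioned above, outliers could be flagged with the same SciPy `zscore` function. This is only a sketch under assumptions: the project identifies outliers visually with the pairplot, the threshold of 3 is a common convention rather than something taken from this project, and the 'Height' values below are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Invented 'Height' values with one deliberately extreme entry at the end.
heights = [0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125,
           0.125, 0.150, 0.140, 0.110, 0.145, 0.100, 1.130]
df = pd.DataFrame({"Height": heights})

# Assumed cutoff: flag points more than 3 standard deviations from the mean.
THRESHOLD = 3.0

# Absolute z-score of every value in the column.
abs_z = np.abs(zscore(df["Height"].to_numpy()))
outlier_mask = abs_z > THRESHOLD

outliers = df[outlier_mask]   # rows flagged as outliers
cleaned = df[~outlier_mask]   # rows kept after removing them

print(outliers)
print(len(cleaned))
```

Whether flagged rows should be dropped, capped, or kept is a judgment call that depends on whether the extreme values are measurement errors or genuine specimens.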
