This dataset contains physical measurements of abalone (a type of marine snail), used to predict their age (the 'Rings' feature in the dataset) from the other features. This project assesses and preprocesses the data to make it usable for classification tasks. Normalization is applied as the final step.
The dataset is taken from link.
- Google Colab
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- SciPy
- Summary statistics
- Pandas boxplot
- Seaborn violinplot
- Seaborn pairplot
- SciPy zscore function
- Loading the dataset and exploring the features
The data and its features are explored with basic data exploration functions.
- Checking for missing data
The checks show that the dataset has no missing values.
- Checking if this is a balanced dataset
A bar plot of the number of abalone per ring count shows that the dataset is not balanced: the data points are not evenly distributed across ring values.
- Computing summary statistics
The boxplot shows that a few features ('Whole weight' and 'Shucked weight') have a large range of values. The violinplot shows that the skew is positive for most features, with two exceptions ('Length' and 'Diameter').
- Checking for outliers
The pairplot shows that a few features have a few outliers, the most notable being 'Height'.
- Normalization
Z-score normalization is applied to the data. Plotting two of the features before and after normalization shows that the shape of the distribution barely changes, while the axis scale changes substantially.
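The non-plotting steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is a hypothetical stand-in for the abalone table, and the variable names are assumptions. Z-score normalization rescales each feature as z = (x - mean) / std, which is why the shape of a distribution is preserved while the axis scale changes.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical stand-in for the abalone table (the real data would be
# loaded from the dataset file, e.g. with pd.read_csv).
df = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080, 0.095],
    "Whole weight": [0.514, 0.2255, 0.677, 0.516, 0.205, 0.3515],
    "Rings": [15, 7, 9, 10, 7, 8],
})

# Basic exploration: first rows and per-column summary statistics.
print(df.head())
summary = df.describe()
print(summary)

# Missing-data check: number of null entries per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Balance check: how many samples fall on each ring count.
ring_counts = df["Rings"].value_counts().sort_index()
print(ring_counts)

# Skewness of each feature (positive values indicate a longer right tail).
features = df.drop(columns="Rings")
skewness = features.skew()
print(skewness)

# Z-score normalization, column by column (SciPy's default uses ddof=0).
normalized = pd.DataFrame(
    zscore(features.to_numpy(), axis=0),
    columns=features.columns,
    index=features.index,
)
print(normalized.round(3))
```

After normalization each feature has mean 0 and unit standard deviation, so features measured on different scales become directly comparable.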
- The data does not have any missing values, so no interpolation is needed.
- As the dataset is not balanced, more data might be needed for a confident and reliable prediction of the age of abalone from the features.
- The dataset contains a few outliers, so the data needs further cleaning.
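
One possible way to approach the further cleaning is to flag values whose absolute z-score exceeds a threshold. The sketch below is only an illustration under stated assumptions: the height values are made up, and the threshold of 3 is a common rule of thumb rather than something this project has settled on.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical 'Height' values with one implausibly large entry at the end.
heights = pd.Series(
    [0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125, 0.125, 0.150,
     0.140, 0.110, 0.135, 0.145, 0.100, 0.130, 0.085, 0.100, 0.080, 1.130],
    name="Height",
)

# Assumed cutoff: treat |z| > 3 as an outlier.
THRESHOLD = 3.0

# Absolute z-score of every sample.
abs_z = np.abs(zscore(heights.to_numpy()))

# Boolean mask marking the outliers, and the series with them removed.
outlier_mask = abs_z > THRESHOLD
cleaned = heights[~outlier_mask]

print("Flagged as outliers:", heights[outlier_mask].tolist())
print("Remaining samples:", len(cleaned))
```

Because the mean and standard deviation are themselves pulled by extreme values, a median-based or IQR-based rule may be more robust on small samples; which rule suits this dataset is left open here.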