Abalone Dataset: Predict the Ring age in years
"Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem."
Source: https://archive.ics.uci.edu/ml/datasets/abalone
Name / Data Type / Measurement Unit / Description
Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years (ring-age)
Ignore the +1.5 in ring-age and use the raw data.
Overview
Create a report that covers the Data Processing and Modelling tasks given below. You are required to develop a set of models for regression and classification tasks on the same dataset. For classification, you will build a model that classifies two age groups, i.e. below 7 and above 7 years of ring-age. This is group work with at least two group members, who will be assigned randomly.
Data processing (30 Marks):
Clean the data (e.g. convert M and F to 0 and 1). You can do this with code or a simple find and replace (5 Marks).
Develop a correlation map using a heatmap and discuss major observations (5 Marks).
Pick two of the most correlated features (negative and positive) and create a scatter plot with ring-age. Discuss major observations (10 Marks).
Create histograms of the two most correlated features, and the ring-age. What are the major observations? (5 Marks)
Create a 60/40 train/test split that takes a random seed based on the experiment number, so that every experiment gets a new dataset split (5 Marks).
Add any other visualisation of the dataset you find appropriate (OPTIONAL).
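The cleaning and splitting steps above can be sketched without any libraries as follows. This is only an illustration under stated assumptions: the rows are made-up toy values, the function names are hypothetical, and the code for infants (I -> 2) is a choice made here because the brief only fixes M and F. With scikit-learn, train_test_split(..., test_size=0.4, random_state=experiment_number) plays the same role as the split function below.

```python
import random

# Assumed encoding: the brief only asks for M -> 0 and F -> 1;
# mapping I (infant) to 2 is a choice made for this sketch.
SEX_CODES = {"M": 0, "F": 1, "I": 2}


def encode_sex(rows):
    """Return copies of the rows with the Sex column (index 0) as an integer."""
    return [[SEX_CODES[row[0]]] + list(row[1:]) for row in rows]


def train_test_split_60_40(rows, experiment_number):
    """Shuffle with a seed tied to the experiment number, then cut at 60%."""
    rng = random.Random(experiment_number)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Made-up toy rows (Sex, Length, Rings); NOT taken from the real dataset.
toy_rows = [
    ["M", 0.40, 9], ["F", 0.55, 12], ["I", 0.30, 5], ["F", 0.60, 14],
    ["M", 0.50, 10], ["I", 0.25, 4], ["M", 0.45, 8], ["F", 0.65, 16],
    ["I", 0.35, 6], ["M", 0.52, 11],
]

clean = encode_sex(toy_rows)
train, test = train_test_split_60_40(clean, experiment_number=0)
print(len(train), len(test))
```

Reusing the same experiment number reproduces the same split, which keeps each experiment repeatable while different experiment numbers reshuffle the data.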
Modelling (70 Marks):
Develop a linear regression model using all features for ring-age, using 60 percent of the data picked randomly for training and the remainder for testing. Visualise your model prediction using appropriate plots. Report the correct metrics for the given model (i.e. RMSE and R-squared score for regression; classification score, AUC score and ROC plot for classification). (30 Marks)
Compare the linear/logistic regression models with all features, i) without normalising the input data (taken from Step 1), ii) with normalising the input data. (5 Marks)
Develop a linear/logistic regression model with two selected input features from the data processing step. (5 Marks)
Compare the best approach from the above investigations with a neural network trained with SGD. You need to run some trial experiments to determine the optimal hyperparameters, i.e. the number of hidden neurons and layers, the learning rate, etc. You can discuss your results and major observations about the trial experiments (15 Marks). Note: trial-and-error runs do not need multiple experiments (i.e. 30 experiments).
Discuss the neural network results alongside the linear model results. Discuss how you could improve the model further (15 Marks).
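The metrics named in the modelling tasks can be written out in a few lines, which may help when checking library output. The sketch below is a plain-Python illustration, not a required implementation: the toy values and scores are made up, and treating rings >= 7 as the positive class is an assumption, since the brief does not say which group a ring-age of exactly 7 belongs to. In scikit-learn the corresponding functions live in sklearn.metrics (mean_squared_error, r2_score, accuracy_score, roc_auc_score, roc_curve).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(labels, predicted_labels):
    """Fraction of correctly classified samples."""
    return sum(l == p for l, p in zip(labels, predicted_labels)) / len(labels)


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy values, for illustration only.
rings_true = [5, 9, 12, 6]
rings_pred = [6.0, 8.0, 11.0, 7.0]

# Assumption: rings >= 7 is class 1, rings < 7 is class 0.
labels = [1 if r >= 7 else 0 for r in rings_true]
scores = [0.2, 0.55, 0.9, 0.6]  # hypothetical classifier probabilities
predicted_labels = [1 if s >= 0.5 else 0 for s in scores]

print(rmse(rings_true, rings_pred), r_squared(rings_true, rings_pred))
print(accuracy(labels, predicted_labels), auc(labels, scores))
```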
In each of the above investigations, run 30 experiments (3 experiments in case you face wall time issues on Ed) and report the mean and std of the RMSE and R-squared score of the train and test datasets.
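One way to organise the repeated experiments is a loop over seeds that collects the metrics and then summarises them. The sketch below assumes a hypothetical run_experiment function and uses synthetic stand-in data with a one-feature least-squares fit, purely to keep the example self-contained; in the actual work the body would train and evaluate the chosen model on the abalone data. Note that statistics.stdev is the sample standard deviation (n - 1 denominator).

```python
import math
import random
import statistics


def run_experiment(seed, n=50):
    """Hypothetical single experiment: seeded 60/40 split, one-feature
    least-squares fit, RMSE on the train and test parts."""
    rng = random.Random(seed)
    # Synthetic stand-in data (y = 2x + 1 + noise); replace with real features.
    xs = [rng.uniform(0, 1) for _ in range(n)]
    data = [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in xs]
    rng.shuffle(data)
    cut = int(round(0.6 * n))
    train, test = data[:cut], data[cut:]

    # Closed-form simple linear regression on the training part only.
    mx = statistics.mean(x for x, _ in train)
    my = statistics.mean(y for _, y in train)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    intercept = my - slope * mx

    def rmse(part):
        return math.sqrt(statistics.mean((y - (slope * x + intercept)) ** 2
                                         for x, y in part))

    return rmse(train), rmse(test)


# 30 experiments, one per seed (use range(3) if wall time is an issue).
results = [run_experiment(seed) for seed in range(30)]
train_rmse = [r[0] for r in results]
test_rmse = [r[1] for r in results]

summary = {
    "train": (statistics.mean(train_rmse), statistics.stdev(train_rmse)),
    "test": (statistics.mean(test_rmse), statistics.stdev(test_rmse)),
}
for name, (mean, std) in summary.items():
    print(f"{name} RMSE: mean={mean:.4f} std={std:.4f}")
```

The same pattern extends to R-squared or classification scores by returning more values from each experiment and summarising each list in turn.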
Do not insert screenshots of results from the terminal in the report. You will lose marks. RMSE and R-squared values should be properly documented and inserted into proper Tables.
Do not include any code in the report. You should upload running code in Edstem only. The code should run in Edstem, or be accompanied by a screenshot of it running in a terminal in case you use your laptop and libraries such as PyTorch.
You will not be penalised for improper code structure etc.; marks will only be deducted if the code does not run, or if no evidence of the code running (a screenshot) is provided in the case you are using a laptop.
Please do not dump too many plots in the report and make it messy. Keep the report neat, with informative and neatly designed plots that all have Figure numbers, labels, axis labels etc. All Figures need to be cited and discussed in the text.
If many figures are auto-generated, it is fine to leave them in Ed, but you need to put the best ones in the report, i.e. those that fit your story and the goals of the assessment.
As noted here: https://stackoverflow.com/questions/56069689/how-to-normalize-the-columns-of-a-dataframe-using-sklearn-preprocessing-normaliz preprocessing.normalize() normalises along rows by default, and you need to change the axis as shown there to normalise along columns, which makes more sense. It is better to use MinMaxScaler(): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html and normalise along columns. You will not have any marks deducted if you do not use MinMaxScaler.
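To make "normalising along columns" concrete, the sketch below rescales each feature column to the range 0 to 1 in plain Python, which is what MinMaxScaler does with its default feature_range. The function names and toy values are hypothetical. Taking the minimum and maximum from the training rows only and reusing them for the test rows is an assumption of this sketch (a common practice, not something the brief prescribes).

```python
def fit_min_max(rows):
    """Return the per-column minimum and maximum of a list of feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale every column with (value - min) / (max - min).
    A constant column (max == min) is mapped to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Made-up toy features (two columns on very different scales).
train_features = [[0.25, 10.0], [0.5, 30.0], [0.75, 20.0]]
test_features = [[1.0, 40.0]]

mins, maxs = fit_min_max(train_features)
train_scaled = apply_min_max(train_features, mins, maxs)
test_scaled = apply_min_max(test_features, mins, maxs)

print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Each column is scaled independently, so a feature measured in grams and one measured in mm end up on the same 0 to 1 scale; row-wise normalisation would instead mix the features of a single sample.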
Technical issues
Learning rate in Keras: https://stackoverflow.com/questions/59737875/keras-change-learning-rate
Stats model package: https://www.statsmodels.org/stable/index.html (in case you want to show which features are significant - this is not required but can be done if you want)
You can find examples of Tables and Figures in the papers below:
Chandra R; Ranjan M, 2022, 'Artificial intelligence for topic modelling in Hindu philosophy: Mapping themes between the Upanishads and the Bhagavad Gita', PLoS ONE, vol. 17, pp. e0273476, http://dx.doi.org/10.1371/journal.pone.0273476
Chandra R; Jain A; Chauhan DS, 2022, 'Deep learning via LSTM models for COVID-19 infection forecasting in India', PLoS ONE, vol. 17, http://dx.doi.org/10.1371/journal.pone.0262708
Chandra R; Krishna A, 2021, 'COVID-19 sentiment analysis via deep learning during the rise of novel cases', PLoS ONE, vol. 16, pp. e0255615, http://dx.doi.org/10.1371/journal.pone.0255615
Chandra R; Saini R, 2021, 'Biden vs Trump: Modeling US General Elections Using BERT Language Model', IEEE Access, vol. 9, pp. 128494 - 128505, http://dx.doi.org/10.1109/ACCESS.2021.3111035
Chandra R; Goyal S; Gupta R, 2021, 'Evaluation of Deep Learning Models for Multi-Step Ahead Time Series Prediction', IEEE Access, vol. 9, pp. 83105 - 83123, http://dx.doi.org/10.1109/ACCESS.2021.3085085