Table of Contents
- Introduction
- Project Overview
- Dataset
- Analysis Techniques
- Visualizations
- Model Inference and Results
The machine learning models implemented in this project have not performed well and require further optimization.
This project aims to analyze the sales of best-selling books across various dimensions such as genre, author, language, and publication year. By leveraging data analysis and machine learning techniques, the project seeks to uncover trends, patterns, and insights in book sales data.
The Book Sales Analysis project utilizes a dataset containing information about best-selling books, including their titles, authors, genres, sales figures, and publication details. The analysis aims to answer key questions about sales trends, popular genres, and factors influencing book sales.
The dataset used for this analysis includes the following columns:
- Book: Title of the book
- Author(s): Author(s) of the book
- Original Language: Language in which the book was originally published
- First Published: Year the book was first published
- Approximate Sales in Millions: Estimated sales figures
- Genre: Genre classification of the book
The project employs various data analysis techniques, including:
- Descriptive statistics to summarize the data
- Data visualization to illustrate trends and patterns
- Machine learning models to predict book sales and classify genres
The analysis includes several visualizations to help interpret the data:
- Bar charts to show sales trends over decades.
- Stacked bar charts to illustrate genre evolution.
- Line charts for sales predictions.
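The decade-level bar chart relies on grouping each book's First Published year into a decade and summing its sales. The following is a minimal sketch of that aggregation step, not the project's actual code: the function names `decade_of` and `sales_by_decade` are hypothetical, and the rows are made-up placeholders rather than values from the dataset.

```python
from collections import defaultdict


def decade_of(year: int) -> int:
    """Map a publication year to the first year of its decade (1997 -> 1990)."""
    return (year // 10) * 10


def sales_by_decade(rows):
    """Sum sales (in millions) per decade from (year, sales) pairs."""
    totals = defaultdict(float)
    for year, sales in rows:
        totals[decade_of(year)] += sales
    # Sort by decade so a chart built from the result reads left to right.
    return dict(sorted(totals.items()))


# Placeholder rows for illustration only; not taken from the project's dataset.
toy_rows = [(1943, 10.0), (1947, 5.0), (1997, 20.0), (1999, 2.5), (2003, 8.0)]
totals = sales_by_decade(toy_rows)
print(totals)
```

The resulting mapping of decade to total sales can be handed to any plotting library; with matplotlib, for instance, the keys would serve as bar positions and the values as bar heights.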
Using a Random Forest Regressor to predict book sales, we achieved a Root Mean Squared Error (RMSE) of approximately 17.96, expressed in the same unit as the target (millions of copies). RMSE is the square root of the mean squared difference between predicted and actual sales figures, so large errors weigh more heavily than small ones; lower values indicate better predictive accuracy.
An RMSE of 17.96 corresponds to a typical prediction error of roughly 18 million copies. Relative to the sales figures in the dataset, an error of that size can be significant, so the model's accuracy is moderate at best and leaves clear room for improvement.
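To make the metric concrete, here is a small sketch of how RMSE is computed and how it compares with the mean absolute error (MAE). The helper names and the toy numbers are assumptions for illustration; they are not the project's predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error, shown for comparison with RMSE."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy sales figures in millions (illustrative only).
toy_actual = [10.0, 20.0, 30.0, 100.0]
toy_predicted = [12.0, 18.0, 33.0, 60.0]

toy_rmse = rmse(toy_actual, toy_predicted)
toy_mae = mae(toy_actual, toy_predicted)
print(f"RMSE={toy_rmse:.2f}  MAE={toy_mae:.2f}")
```

In this toy case a single badly mispredicted bestseller pushes the RMSE well above the MAE. That is worth keeping in mind when reading the 17.96 figure, since a few very high-selling titles could account for much of the error.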
For genre classification, we evaluated our model's performance using precision, recall, and F1-score metrics across various genres. The classification report provides insights into how well the model is performing in predicting different genres:
| Genre | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Autobiography | 0.00 | 0.00 | 0.00 | 1 |
| Children's Literature | 0.00 | 0.00 | 0.00 | 2 |
| Children's fiction | 0.00 | 0.00 | 0.00 | 1 |
| Erotica | 0.00 | 0.00 | 0.00 | 1 |
| Fantasy | 1.00 | 0.50 | 0.67 | 4 |
| ... | ... | ... | ... | ... |
| Unknown | 0.27 | 1.00 | 0.43 | 9 |
| Young Adult novel, adventure, dystopian, sci-fi | 0.00 | 0.00 | 0.00 | 1 |
The model achieved an accuracy of 31% across all genres.
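Macro average (every genre counts equally):
- Precision: 0.06
- Recall: 0.07
- F1-score: 0.05

Weighted average (each genre weighted by its support; the weighted recall equals the overall accuracy):
- Precision: 0.18
- Recall: 0.31
- F1-score: 0.19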
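Classification reports commonly summarize per-class scores with a macro average, where every class counts equally, and a weighted average, where each class counts in proportion to its support. The sketch below shows the difference using only the F1-scores and supports of the rows visible in the table above. The elided rows are left out, so the results will not match the project's reported averages, and the function names are illustrative.

```python
def macro_average(scores):
    """Unweighted mean: every class contributes equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by the number of true examples (support) per class."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# (genre, F1-score, support) for the visible table rows only.
visible_rows = [
    ("Autobiography", 0.00, 1),
    ("Children's Literature", 0.00, 2),
    ("Children's fiction", 0.00, 1),
    ("Erotica", 0.00, 1),
    ("Fantasy", 0.67, 4),
    ("Unknown", 0.43, 9),
    ("Young Adult novel, adventure, dystopian, sci-fi", 0.00, 1),
]

f1_scores = [f1 for _, f1, _ in visible_rows]
supports = [n for _, _, n in visible_rows]

macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)
print(f"macro F1={macro_f1:.3f}  weighted F1={weighted_f1:.3f}")
```

Because the many single-example genres with an F1-score of 0 pull the macro average down, it sits well below the weighted average, which is dominated by the larger Unknown and Fantasy classes.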
- Imbalanced Dataset: The classification results point to class imbalance. Most genres in the table have a support of only 1 or 2, while Unknown has 9. With so few examples per genre, the model has little to learn from, which likely hurt its ability to predict the less frequent genres.
- Genre Performance: Among the genres shown, Fantasy had the highest precision (1.00) and F1-score (0.67), with a recall of 0.50. Unknown had the highest recall (1.00) but a precision of only 0.27, which suggests the model over-predicts that class. Most other genres scored 0 on both precision and recall, meaning the model never predicted them correctly.
- Need for Improvement: Given the results, further optimization strategies, such as oversampling minority classes, using different classification algorithms, or fine-tuning hyperparameters, may be necessary to improve classification performance across all genres.
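One of the strategies mentioned above, oversampling minority classes, can be sketched as follows. This is plain random oversampling with replacement, written with the standard library only. The function name `random_oversample`, the seed, and the toy labels are assumptions for illustration and do not reflect the project's pipeline.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies with replacement to reach the target count.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            out_samples.append(item)
            out_labels.append(label)
    return out_samples, out_labels


# Toy labels mirroring the kind of imbalance seen in the report (illustrative).
toy_labels = ["Unknown"] * 9 + ["Fantasy"] * 4 + ["Erotica"] * 1
toy_samples = list(range(len(toy_labels)))  # stand-ins for feature rows

balanced_samples, balanced_labels = random_oversample(toy_samples, toy_labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data. For a genre with a single example, it can only repeat that one row, so collecting more data or merging closely related genre labels may help more.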
The analyses provide insights into book sales trends and genre classification. The sales prediction model reaches only moderate accuracy, with an RMSE of about 18 million copies, and the genre classification model shows significant room for improvement, primarily due to the imbalanced dataset and the complexity of genre categorization.