Given a dataset with information about stores, build a Machine Learning Model to predict the revenue of other shops of the same brand.
- ⏱️ 6 hours from the dataset release to complete the full process (EDA, modeling, evaluation).
- 📈 After the 6 hours, a new validation dataset with no labels (sales data) was released. We had 30 minutes to apply our models, predict the revenue of each store in the new dataset, and submit our work.
- 📤 Submissions were individual and included:
- Output file with the index of the stores in the validation dataset and our sales prediction for each store.
- Pickled version of the model built and any feature engineering steps used (a minimal pickling sketch follows this list).
- ✅ The teaching staff compared the predictions with the real data and communicated the final scores.
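As an illustration of the pickled-model requirement, here is a minimal sketch of serializing a fitted model with pickle. The file name, model choice, and toy data are illustrative, not the exact submission:

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for the engineered training features and the sales labels.
rng = np.random.default_rng(0)
X_train = rng.random((100, 5))
y_train = rng.random(100) * 1000

# Fit the model (the real submission used the tuned Gradient Boosting model).
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Serialize the fitted model for submission (illustrative file name).
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# The teaching staff can then load the model and apply it to the unlabeled validation data.
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict(X_train[:5]))
```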
Achieved a score of 0.9839 (98.4%).
Used a Decision Tree Regression model with Gradient Boosting (highest score after evaluating multiple models).
(See details in the .ipynb files)
- Understood the business problem and data provided. Identified it as a regression problem (labels are sales, and all other attributes are features).
- Performed EDA:
- Exploration (shape, data types, nulls, etc.)
- Data cleaning. The dataset quality was good, so this step went smoothly. Changes included dropping unnecessary columns.
- Feature engineering. Converted the DateTime column into ordinal values and used one-hot encoding (dummy variables) to convert all categorical features into numerical values (a sketch follows this list).
- Analysis. Used visualization (matplotlib, seaborn) to analyze correlations, find outliers, etc.
- Prepared data (feature selection, train/test splitting).
- Trained and evaluated models. Since this is a regression problem, the following models were tested: Linear Regression, Decision Tree Regressor, K-Neighbors Regressor, Ridge, Lasso. The best results were obtained with Decision Trees (score 0.88, no overfitting); a model-comparison sketch follows this list.
- Applied ensemble methods to optimize the Decision Tree model (Random Forest, XGBoost, Gradient Boosting). Best results were achieved with Gradient Boosting (score 0.95, no overfitting).
- Tuned hyperparameters. After plotting scores against max_depth to find the optimal value and avoid overfitting, tested other hyperparameters (e.g. n_estimators, random_state); a tuning sketch follows this list. Achieved a final score of 0.9839 (98.4%) on the validation dataset.
- Hyperparameter Optimization. As a next step, I would've continued tuning hyperparameters (see constraints below).
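The snippets below are minimal sketches of the steps above, using toy data in place of the real training set; the actual code lives in the .ipynb files. First, the feature-engineering step (column names follow the dataset description further down; the values are made up):

```python
import pandas as pd

# Toy rows mimicking the structure of the training data (values are made up).
df = pd.DataFrame({
    "Store_ID": [1, 2],
    "Day_of_week": [3, 4],
    "Date": ["2015-01-07", "2015-01-08"],
    "Nb_customers_on_day": [555, 620],
    "Open": [1, 1],
    "Promotion": [1, 0],
    "State_holiday": ["0", "a"],
    "School_holiday": [1, 1],
    "Sales": [4500.0, 5120.0],
})

# Convert the Date column into ordinal values (one integer per calendar day).
df["Date"] = pd.to_datetime(df["Date"]).map(pd.Timestamp.toordinal)

# One-hot encode the remaining categorical column into numerical dummies.
df = pd.get_dummies(df, columns=["State_holiday"], prefix="State_holiday")

print(df.head())
```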
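Next, a sketch of the data preparation, model comparison, and ensemble steps. Synthetic data from `make_regression` stands in for the engineered features, and XGBoost is omitted to keep the sketch to scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the engineered store features and sales.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Train/test split, as in the data-preparation step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline regressors that were compared, plus two of the ensembles.
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and report train/test R^2 to spot overfitting.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```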
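Finally, a sketch of the max_depth analysis, plotting train and test scores per depth to see where overfitting starts (again on synthetic data; the depth range is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

depths = range(1, 11)
train_scores, test_scores = [], []
for depth in depths:
    gb = GradientBoostingRegressor(max_depth=depth, n_estimators=100, random_state=42)
    gb.fit(X_train, y_train)
    train_scores.append(gb.score(X_train, y_train))
    test_scores.append(gb.score(X_test, y_test))

# A growing gap between the curves signals overfitting; pick max_depth before it opens up.
plt.plot(depths, train_scores, label="train R^2")
plt.plot(depths, test_scores, label="test R^2")
plt.xlabel("max_depth")
plt.ylabel("R^2 score")
plt.legend()
plt.show()
```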
Faced constraints due to the performance of the computer used (MacBook Pro, early 2015) and the limited time available to train the models.
With more time and/or a faster machine, I would have used hyperparameter optimization (Random Search / Grid Search) to better tune the hyperparameters for this challenge and improve the results.
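For reference, a minimal Grid Search sketch of what that next step could look like (the parameter grid and the synthetic data are illustrative, not results from the challenge):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the engineered store features and sales labels.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Illustrative grid; a real run would cover more values (or use RandomizedSearchCV).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```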
| Feature | Description |
|---|---|
| Store_ID | Shop's unique identifier |
| Day_of_week | Encoded from 0 to 6 |
| Date | Day, month, and year of the data point |
| Nb_customers_on_day | Number of customers that showed up that day |
| Open | Binary variable (0 if closed, 1 if open) |
| Promotion | Binary variable (0 if no promotions that day, 1 if yes) |
| State_holiday | Encoded 0, a, b, c (0 if no holiday; otherwise the letter indicates which state holiday it was) |
| School_holiday | Binary variable (0 if holiday, 1 if not) |
The first dataset, used to build the model, also included the labels (the Sales of each store).
Thanks to all my bright and supportive bootcamp classmates! 🧑💻 👩💻
Even though this was an individual competition, we worked collaboratively and exchanged ideas. This allowed us to complete the challenge ahead of time and obtain higher scores than previous cohorts.