Project developed for 'Knowledge Extraction and Machine Learning', a fifth year subject @FEUP. Made in collaboration with @cyrilico.
A summary of the theoretical material is available here.
The folder `/research-project` contains the materials that were necessary to develop the ECAC research project (2nd project).
Component | Grade |
---|---|
Project | 20 |
Classification | 17 |
To run the Jupyter notebooks, one must first run the following commands in a terminal with Python 3 available:
- In Mac/Linux

```bash
python3 -m venv venv
. venv/bin/activate
pip install -U -r requirements.txt
jupyter notebook
```
- In Windows

```bat
py -3 -m venv venv
venv\Scripts\activate
pip install -U -r requirements.txt
jupyter notebook
```
When finished, the virtual environment can be deactivated by running:

```bash
deactivate
```
In the Jupyter notebook web page, open the pre_processing.ipynb file first and simply run all cells.
Then open the prediction.ipynb file. Note that this file uses the data output by the preprocessing step. Again run all cells, but pay attention to the comments along the way highlighting important cells that can be changed to better suit your needs, for example:
```python
# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_DT()
```
After running all cells, the predictions are written to the file you indicated, in the desired format.
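Since the notebook centres on that `classifier = create_DT()` line, a minimal sketch of how the swap might look is shown below. Only the `create_DT` name comes from the notebook; its body and the alternative `create_RF` factory are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch only: swapping the classifier used in prediction.ipynb.
# `create_DT` is the name used in the notebook; the parameters below and the
# alternative `create_RF` factory are assumptions, not the project's code.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def create_DT():
    # Shallow, class-weighted tree (assumed configuration).
    return DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=42)

def create_RF():
    # Hypothetical alternative: a class-weighted Random Forest.
    return RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)

# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_RF()
```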
Final presentation slides available here.
Final leaderboards available here - placed 9️⃣.
- ❗ : Submissions selected for competition scoring. Note that we did not have access to the private scores when choosing the two submissions.
Public Score | Private Score | Local Score | Date | Changes from previous submission |
---|---|---|---|---|
0.59259 | 0.57160 | Not recorded | 23.09.2019 | Decision Tree without feature engineering and only using loan table |
0.61049 | 0.59876 | Not recorded | 23.09.2019 | Joined account table, replaced loan date with the number of days since account creation and categorized the account's frequency
0.56543 | 0.61728 | Not recorded | 24.09.2019 | Added categorical columns and a column with the number of days since the first loan ever
0.62839 | 0.65864 | Not recorded | 24.09.2019 | Removed number of days since first loan ever; added the number of account users and their credit card types from the respective tables, re-added loan date.
0.50000 | 0.50000 | Not recorded | 25.09.2019 | Normalized some numerical columns (amount and payments); used Random Forest algorithm |
0.62839 | 0.58888 | Not recorded | 26.09.2019 | Added new features (such as monthly_loan, monthly_loan-to-monthly_receiving & monthly_only_receiving), removed ones without impact and changed to Decision Tree
0.59259 | 0.63209 | Not recorded | 26.09.2019 | Removed loan_id feature |
0.57716 | 0.60802 | Not recorded | 27.09.2019 | Fixed merge of tables in previous submission |
0.75370 | 0.75308 | Not recorded | 29.09.2019 | Added transactions table and reworked the flow of the entire project, making it way easier to customize |
0.81728 | 0.75679 | Not recorded | 29.09.2019 | Added demographic table |
0.84135 | 0.77716 | Not recorded | 30.09.2019 | Removed redundant features, changed join on district_id of account to district_id of client |
0.88148 | 0.68148 | Not recorded | 01.10.2019 | Experimented with grid-search hyperparameter tuning
0.85925 | 0.73518 | Not recorded | 03.10.2019 | Changed classification model after grid searching; kept the Decision Tree as it had better performance
0.64197 | 0.59876 | Not recorded | 04.10.2019 | Implemented PCA |
0.83580 | 0.80555 | 0.781090 | 04.10.2019 | Increased local score using feature selection |
0.89259 | 0.75555 | 0.832430 | 04.10.2019 | Added class weighting to RandomForest and GradientBoosting |
0.85617 | 0.73765 | 0.848035 | 09.10.2019 | Now considering households and pensions. Fixed numerical imputation not working correctly. |
0.82839 | 0.72530 | 0.862035 | 10.10.2019 | Experimented with undersampling
0.79444 | 0.64012 | 0.840876 | 10.10.2019 | Added bank demographic data |
❗ 0.90123 | 0.79506 | 0.842036 | 11.10.2019 | Heavy feature engineering. Consistent results locally. |
0.88333 | 0.81666 | 0.852039 | 11.10.2019 | Small improvement locally using feature selection and feature engineering. |
0.72530 | 0.71913 | 0.841861 | 12.10.2019 | Heavy feature selection. Removing features without correlation to loan status. |
0.77020 | 0.73333 | Not recorded | 15.10.2019 | Hardcore feature selection. Using only 7 features. |
0.85000 | 0.81049 | 0.824199 | 17.10.2019 | Fixed some local bugs. Heavy feature selection, both automatic and manual. |
0.79753 | 0.68827 | 0.828777 | 18.10.2019 | Very consistent results. S'more feature engineering and selection. |
0.77160 | 0.75617 | 0.799563 | 19.10.2019 | Decision Tree of depth 2. Constant AUC of 80%, probably small error interval. |
0.78353 | 0.68353 | 0.937524 | 21.10.2019 | Applied backward elimination. Using LinearRegression. Constant local score. |
0.70432 | 0.58271 | 0.860821 | 21.10.2019 | Feature selection using backward elimination and RFE on LogisticRegression |
0.71913 | 0.83395 | 0.845231 | 24.10.2019 | Using the most consistent local setup with SMOTETomek sampling and Gradient Boosting.
❗ 0.85864 | 0.74012 | 0.867982 | 24.10.2019 | Best local scoring setup. |
0.83209 | 0.78641 | 0.864521 | 25.10.2019 | Random Forest with SMOTEENN and Filter Method as feature selection. Locally consistent (see the sketch after this table).
0.74074 | 0.79506 | 0.850971 | 25.10.2019 | Best local Decision Tree, with SMOTEENN and Filter Method as feature selection. Likely to overfit.
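The later submissions combine resampling (SMOTETomek/SMOTEENN), class weighting and filter-method feature selection. As a rough sketch of what such a setup can look like with scikit-learn and imbalanced-learn, the pipeline below is illustrative only: the input file, column names and parameters are assumptions, not the project's exact configuration.

```python
# Illustrative sketch only: resampling + filter-method feature selection +
# class-weighted Random Forest, in the spirit of the later submissions.
# File name, label column and parameters are assumptions.
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

data = pd.read_csv("preprocessed_loans.csv")   # hypothetical output of pre_processing.ipynb
X = data.drop(columns=["status"])              # "status" assumed to be the loan label
y = data["status"]

pipe = Pipeline(steps=[
    ("resample", SMOTEENN(random_state=42)),               # combined over/under-sampling
    ("select", SelectKBest(score_func=f_classif, k=10)),   # filter-method feature selection
    ("clf", RandomForestClassifier(n_estimators=200,
                                   class_weight="balanced",
                                   random_state=42)),
])

# AUC is the competition metric, so score the whole pipeline with 5-fold CV.
scores = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5)
print("Local AUC: %.3f ± %.3f" % (scores.mean(), scores.std()))
```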
- Fundamental Techniques of Feature Engineering for Machine Learning
- Principal Component Analysis in 6 steps
- How to Handle Imbalanced Data in Classification Problems
- Finding Correlation Between Many Variables (Multidimensional Dataset) with Python
- Automated Feature Engineering in Python
- A simple guide to creating Predictive Models in Python, Part-1
- Automated Machine Learning Hyperparameter Tuning in Python
- Hyperparameter Tuning
- Feature Selection and Dimensionality Reduction
- Hyperparameter Tuning the Random Forest in Python
- Understanding PCA (Principal Component Analysis) with Python
- Loan Default Prediction and Identification of Interesting Relations between Attributes of Peer-to-Peer Loan Applications
- Imbalanced Classes: Part 2
- Feature Selection with sklearn and Pandas
- Feature Selection Using Random forest