Name	Name	Last commit message	Last commit date
Latest commit History 121 Commits
Fast.ai	Fast.ai
GeneralML	GeneralML
H2O	H2O
Kaggle	Kaggle
PyCaret	PyCaret
Pytorch	Pytorch
tensorflow	tensorflow
README.md	README.md
requirements.txt	requirements.txt

Python Portfolio

Portfolio of data science projects from either original work or revised for a study and learning purpose. Portfolio in this repo is presented in the form of iPython Notebooks and .py files.

Notebooks can be found in the each folder under the framework name (H2O, Pytorch, Tensorflow, etc)

For a detailed code example and explanation, please refer to readme.md file shown below.

Note: Data used in the projects is for demo purposes only

Motivation

This repository was origianlly to have a record of project progress and my own learning process, but I found that it would be helpful to who wants to improve data-science skills to next-level, as it contains a numerious real-life data science example and notebooks created by @hyunjoonbok and codes borrowed from authors who produced state-of-the-art results.

As Python is nowadays a go-to for Data-Science. I have managed to use the best out of Python to use its full functionality for not only simple EDA, but building a complex ML/DL models.

Below examples include the intense usage of industry-hot frameworks (i.e. Pytorch, Fast.ai, H2O, Grandient Boosting, etc) to produce ready-to-use results.

ALL Portfolio
Technologies
Reference
Contact

Projects

2020 Edition

Cohort Analysis

Cohort Analysis - Customer Retention (1)
Cohort Analysis - Customer Retention (2)
Cohort Analysis - Customer Segmentation (1)
Cohort Analysis - Customer Segmentation (2)

Suppose that we have a company that selling some of the product, and you want to know how well does the selling performance of the product. We have the data that can we analyze, but what kind of analysis that we can do? We can segment customers based on their buying behavior on the market. These notebook introduces several ways to segment the users and better understand their retention, using the K-Means algorithm in Python. Using a industry marketing data, We create cohorts to understand metrics like customer retention rates, the average quantity purchased, the average price, etc. The notebook covers a full analysis (preprocessing + visualization + interpretation) to do customer segmentation step-by-step.

Nov 23, 2020

Marketing Channel Attribution with Markov Chains:

In businses marketing scenarios, it is quite impoerntat to track the conversion from the user base through the advertisement money spent. But it is more valuable to know **how** each of those conversion are made, so that the further actions are taken in on-going basis (i.e. budget adjustment). The notebook introduces a concept of 'Markov Chains' to understand the attributions in marketing. By using these transition probabilities, we can identify the statistical impact a single channel has on our total conversions.

Oct 29, 2020

Market Basket Analysis 101 with Real Example - Association rules, Lift, Confidence, Support:

The notebook has the implementation of Basket anayiss in real-world data using Python. It goes over the concept, along with key terms and metrics aimed at giving a sense of what “association” in a rule means and some ways to quantify the strength of this association. The entire data mining process (preprocessing + visualization + interpretation) are clearly explained.

Oct 17, 2020

Simple Text Mining concept and Practice from scratch:

The purpose of the notebook is to get the very basic understading of the methods behind the common NLP project. It introduces 3 different approches that could be taken when performing a text-mining, from it's concept and actual implementation of codes: 1) Frequency of appearance of two words, 2) Statistical method of extracting connection, 3) Word2vec (DL). It also narrates 2 prerequisite steps to be taken before performing text-mining: 1) Select the target word 2) Choose the context: choose what is the sentence about

Oct 9, 2020

Time Series Forecasting With Prophet in Python - WIP:

Sept 7, 2020

Machine Learning / Deep Learning with H2O

Complete guide to Machine Learning with H2O (AutoML)
Machine Learning Regression problem with H2O (XGBoost & Deeplearning)

In this notebook, we will use the subset of the Freddie Mac Single-Family dataset to try to predict the interest rate for a loan using H2O's XGBoost and Deep Learning models. We will explore how to use these models for a regression problem, and we will also demonstrate how to use H2O's grid search to tune the hyper-parameters of both models. We're going to use machine learning with H2O-3 to predict the interest rate for each loan. To do this, we will build two regression models: an XGBoost model and a Deep Learning model that will help us find the interest rate that a loan should be assigned. Complete this tutorial to see how we achieved those results. Also, we go over H2O's AutoML solution, which is an automated algorithm for automating the machine learning workflow, which includes some light data preparation such as imputing missing data, standardization of numeric features, and one-hot encoding categorical features. It also provides automatic training, hyper-parameter optimization, model search, and selection under time, space, and resource constraints. H2O's AutoML further optimizes model performance by stacking an ensemble of models. H2O AutoML trains one stacked ensemble based on all previously trained models and another one on the best model of each family.

Aug 11, 2020

Explainable Machine Learning with SHAP

Understand Classification Model with SHAP
Understand Regression Model with SHAP
SHAP Decision Plots in Depth

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. Here, we look at the implementation of Tree SHAP, a fast and exact algorithm to compute SHAP values for trees and ensembles of trees. We have 3 different basic examples (regression / classifcation / more in-depth graphics) that can be applied to visualizaing the model.

Aug 8, 2020

Multi-Class Text Classification 1 (with PySpark and Doc2Vec):

In this notebook, we utilize Apache Spark's machine learning library (MLlib) with PySpark to tackle NLP problem and how to simulate Doc2Vec inside Spark envioronment. Apache Spark is a famous distributed competiting system to to scale up any data processing solutions. Spark also provides a Machine-learning powered library called 'MLlib'. We utilize Spark Machine Learning Library (Spark MLlib) to look at 3297 labeled sentences, and classify them into 5 different categories.

Jul 7, 2020

Multi-class Text Classification 2 (with PySpark, MLlib, SparkSQL):

Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time. We use Spark Machine Learning Library (Spark MLlib) to classify crime rescription into 33 categories.

Jul 6, 2020

Retail Price Recommendation model with Gradient Boosting Tree:

Mercari (Japan’s biggest shopping app) would like to offer pricing suggestions to sellers, but this is not easy because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace. In this machine learning project, we will build a model that automatically suggests the right product prices. Here we build a complete price recommendation model leveraging LightGBM.

Jul 4, 2020

Full Pytorch Implementation of Recommender System (Collaborative Filtering):

We utilize Pytoch's embeddings layers to build a simple recommendation system. Our model will predict user ratings for specific movies.

Jul 2, 2020

End-to-End Machine Learning Model using PySpark and MLlib:

We build a complete ML model (Binary Classification with Imbalanced Classes problem) leveraging Spark's computation. Full cycle of ML (EDA, feature engineering, model building) is covered. In-Memory computation and Parallel-Processing are some of the major reasons that Apache Spark has become very popular in the big data industry to deal with data products at large scale and perform faster analysis

Jun 28, 2020

ML Model for predicting a Crew Size:

EDA-focused regression model building to predict a ship's Crew Size. CSV Dataset included in a same folder.

Jun 23, 2020

Simple Text Mining concept and practice from scratch:

This notebook shows 3 different approches that could be taken when performing a text-mining, from it's concept and actual implementation of codes. Text mining is an approach to find a relationship between two words in a given sentence. It could be found by using: 1) Frequency of appearance of two words 2) Statistical method of extracting connection 3) Word2vec (DL)

Jun 19, 2020

Reuters News Text Classification in Tensorflow Keras:

We build a text classifcation on Reuters News (available through sklearn) based on LSTM method using Tensorflow

Jun 14, 2020

Text Classification with MLP (MultiLayer Perceptron) in Tensorflow Keras:

Simple MLP model in Tensorflow to solve the text classification problem. Here, we will use the texts_to_matrx() function in Keras to perform text-classification.

Jun 14, 2020

Recommedation System - Collaborative Filtering

FastAI Implementation
Surprise Library Implementation

Experiment with the MovieLens 100K Data to provide movie recommendations for users based on different settings (Item-based, user-based, etc)

Jun 10, 2020

TabNet in Pytorch - New Solution to Tabular ML problems:

An introduction of TabNet, which is a neural-net based algorithm to be readily used in Tabular dataset Machine Learning problems (most common in Kaggle Competitions). A Pytorch Implementation with a Toy example (adult census income dataset and forest cover type dataset) are shown in this notebook, along with a basic architecture and workflow.

Jun 7, 2020

Image Segmentation using a modified U-Net in Tensorflow:

An image segmentation task with Oxford-IIIT Pet Dataset to build a model that genenarte masks around the pet images and eventaully segment the image itself. Built using MobileNetV2 pretrained on ImageNet.

Jun 5, 2020

Rock Paper Scissors (using MobileNetV2) in Tensorflow 2.0:

CNN model using Tensorflow that recognizes Rock-Paper-Scissors. Built using MobileNetV2 pretrained on ImageNet.

Jun 4, 2020

Few Shot Learning (N-shot) in Pytorch:

Pytoch implementation of N-shot learning. We look at image classification of word image in many different languages (Omniglot Dataset) to and build the model that determines which of the evaluatiion set classes the sample belongs to.

Jun 2, 2020

f-AnoGAN in Pytorch:

Concept and codes for the fast unsupervised anomaly detection with generative adversarial networks (GAN), which is widely used for real-time anomaly detection applications. Uses "DCGAN" model, which is State-of-the-Art GAN model.

May 27, 2020

Transfer Learning in Pytorch by building CIFAR-10 model:

Transfer learning explained. Modify a few last layers to fit-in to my own dataset.

May 26, 2020

Pytorch Training Loop Implementation:

A simple walkthrough of training loops and metrics used in learning in Pytorch, follow by a complete example in the last using CIFAR-10 dataset.

May 16, 2020

Recommender System (Collaborative filtering):

A complete guide to recommendation system using Collaborative Filtering: Matrix Factorization. Concepts that are used in industry are explained, and compare model/metrics and build prediction algorithm.

May 7, 2020

Neural Transfer Using PyTorch (VGG19):

Style transfer in practice using Pytorch using pretrained VGG19 model.

Apr 31, 2020

Pytorch Training in Pratice:

Going through a complete modeling step in Pytorch based on MNIST dataset. Can grasp a general idea of Pytorch concept.

Apr 24, 2020

Tensorboard usage in Pytorch:

How to use Tensorboard in Jupyter notebook when training a model in Pytorch.

Apr 14, 2020

Google-play App Review Sentiment Analysis with BERT:

3-way polarity (positive, neutral, negative) sentiment analysis system for Google-Play App reviews. Use Pytorch to get review in JSON, data-preprocess, Create pytorch dataloader , train/evaluate the model. Evaluate the errors and testing on the raw text data in the end.

Mar 5, 2020

Credit Card Fraud Detection using Keras (Imbalanced response)

Buiding a Fraud Detection model using a sample Credit Card transaction data from Kaggle. The data is highly imbalanced, so it shows how to adjust sampling to solve the problem. Then we check important metrics needed to be evalulated (fp/tp/precision/recall, etc)

Reference: [Kaggle CreditCard data](https://www.kaggle.com/mlg-ulb/creditcardfraud/) May 20, 2020

(Kaggle) Handwritten_Image_Classification (Grapheme language):

Pytorch version of builing a CNN model to classify a image of a langauge. Complete model building from loading/defining/transforming data to create and train model. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle.

Jan 4, 2020

(Kaggle) M5_Forecasting:

From Walmart sales data, forecast daily sales for the next 28 days. The data, covers stores in three US States (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. Pre-process (Feature Enginenering / Hyperparameter Optimization) given data and used LGB/XGB ensemble to generate a final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.

Mar 24, 2020

(Kaggle) NCAAW® 2020 ML Competition:

To forecast the outcomes of March-Madness during rest of 2020's NCAAW games. Covers all team-by-team season games results data. Pre-processing of tabular data and ensemble of LGB/XGB generates a final submission. From [Google Cloud & NCAA® ML Competition 2020-NCAAW](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-womens-tournament/overview) in Kaggle. *Update: this competition was cancelled in Mar.2020 due to the COVID-19.*

Feb 27, 2020

Text Classification_final (Language Model):

2-way polarity (positive, negative) classification system for tweets. Using Fast.ai framework to fine-tune a language model and build a classification model with close to 80% accuracy.

Feb 21, 2020

2019 and older

Machine Learning
```
 Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
```
Tabular data / Collaborative filtering:

Finding a customer who's income level. Simple ML Classification problem tackled with Fast.ai API. Executable to almost all types of tabular data to naively achieve a good baseline model in a few lines of code. Also, collaborative filtering is when you're tasked to predict how much a user is going to like a certain item. Here I looked at "MovieLens" dataset to predict the rating a user would give a particular movie (from 0 to 5)
May 10, 2018
(Kaggle) Handwritten_Image_Classification (Grapheme language):

Use Fast.ai to build a CNN model to classify a image of a langauge. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle. Includes Load image / Genearte custom loss function / Train & Test data using Fast.ai.
Jan 3, 2020
(Kaggle) NY Taxi Trip Duration:

To Forecast total ridetime of taxi trips in New York City. Covers both Fast.ai and LGB version of solving the problem. From [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration) in Kaggle.
August 5, 2019

Deep Learning
```
  Library / Tools: Pytorch, cv2, Keras, fast.ai, pandas, numpy, Pandas, Matplotlib
```
Image Restoration_and_Enhancement using Generative Adversarial Network(GANs):

Use Fast.ai framework to load image data, create generator/discriminator from images. Then create a model with a custom GAN loss function. Check error and improve on test image sets.
June 13, 2019
DCGAN - Generate_Fake_Images:

Based on a set of celebrity images, we are generating a new set of fake images. Then compare Real Images vs. Fake Images create generator/discriminator from images. Used Pytorch to load image / create Generator/Discriminator and training loop.
June 24, 2019
MNIST CNN, Skip-connection (U-NET):

Use Fast.ai framework that's built on top of pytorch, to build a simple MNIST CNN model. Use Skip-connection to build a simpel conv-nn, which achieve a state-of-the-art result (99.6% accuracy on test-set).
June 30, 2019
Simple CNN data Augmentation:

Image-Augmentation on CNN model is one of the most important feature engineering steps. Here I looked at how image tranformation can be done with a built-in. Wider range of selection are availalbe in [fast.ai-vision-transform](https://docs.fast.ai/vision.transform.html) except the ones shown. *Things to add*: How ["Albumentation"](https://github.com/albumentations-team/albumentations) library can be used within Fast.ai framework.
Nov 12, 2019
(Kaggle) MNIST Digit Recognizer:

Kaggle version of MNIST. Use Fast.ai and transfer learning to solve.
December 5, 2017

Time Series
```
 Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
```
(Kaggle) Sales Prediction on store items:

Using Fast.ai to expand a tabular data to utilize many of columns in order to predict sales on stroes based on different situations like promotion, seaons, holidays, etc. Insights are from [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)
December 5, 2015
(Kaggle) M5_Forecasting:

From Walmart sales data, forecast daily sales for the next 28 days. The data, covers stores in three US States (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. Pre-process (Feature Enginenering / Hyperparameter Optimization) given data and used LGB/XGB ensemble to generate a final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.
Mar 24, 2020

NLP/TextClassification
```
 Library / Tools: Pytorch, transformers, fast.ai, tqdm, pandas, numpy, pygments, google_play_scraper, albumentations, joblib, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
```
BERT-base: classify twitter sentiment:

Used Pytorch to encode/tokenize/train/evaluate model. The most simple version
December 5, 2019
BERT-large: classify twitter sentiment:

Using large BERT (takes longer)
December 7, 2019

Micellenous
```
 Library / Tools: pandas, numpy, elasticsearch, datetime
```
ElasticSearch connections with Python:

Use of Python language to pull data directly from ELK stack. Origianlly came in to JSON format, convert it to Dataframe and do simple EDA / Visualization.
December 12, 2019

Technologies

Fast.ai
Pytorch
Tensorflow
Keras
CV2
tqdm
pandas
numpy
albumentations
datetime
xgboost
lightgbm
scikit-learn
optuna
Seaborn
Matplotlib
elasticsearch
And More...

Reference

Deep Learning Model Implementation Zoo (Tensorflow 1 and Pytorch) Github

TO-DOs

List of features ready and TODOs for future development

Tableau Public - Add visulization using work data : in progress
Python Dash for intractive wep-app : in progress
Data cleaning .ipynbs : in progress

Contact

Created by @hyunjoonbok - feel free to contact me!

hyunjoonbok/Python-Projects

Folders and files

Latest commit

History

Repository files navigation

Python Portfolio

Motivation

Table of contents

Projects

2020 Edition

Cohort Analysis

Marketing Channel Attribution with Markov Chains:

Market Basket Analysis 101 with Real Example - Association rules, Lift, Confidence, Support:

Simple Text Mining concept and Practice from scratch:

Time Series Forecasting With Prophet in Python - WIP:

Machine Learning / Deep Learning with H2O

Explainable Machine Learning with SHAP

Multi-Class Text Classification 1 (with PySpark and Doc2Vec):

Multi-class Text Classification 2 (with PySpark, MLlib, SparkSQL):

Retail Price Recommendation model with Gradient Boosting Tree:

Full Pytorch Implementation of Recommender System (Collaborative Filtering):

End-to-End Machine Learning Model using PySpark and MLlib:

ML Model for predicting a Crew Size:

Simple Text Mining concept and practice from scratch:

Reuters News Text Classification in Tensorflow Keras:

Text Classification with MLP (MultiLayer Perceptron) in Tensorflow Keras:

Recommedation System - Collaborative Filtering

TabNet in Pytorch - New Solution to Tabular ML problems:

Image Segmentation using a modified U-Net in Tensorflow:

Rock Paper Scissors (using MobileNetV2) in Tensorflow 2.0:

Few Shot Learning (N-shot) in Pytorch:

f-AnoGAN in Pytorch:

Transfer Learning in Pytorch by building CIFAR-10 model:

Pytorch Training Loop Implementation:

Recommender System (Collaborative filtering):

Neural Transfer Using PyTorch (VGG19):

Pytorch Training in Pratice:

Tensorboard usage in Pytorch:

Google-play App Review Sentiment Analysis with BERT:

Credit Card Fraud Detection using Keras (Imbalanced response)

(Kaggle) Handwritten_Image_Classification (Grapheme language):

(Kaggle) M5_Forecasting:

(Kaggle) NCAAW® 2020 ML Competition:

Text Classification_final (Language Model):

2019 and older

Machine Learning

Tabular data / Collaborative filtering:

(Kaggle) Handwritten_Image_Classification (Grapheme language):

(Kaggle) NY Taxi Trip Duration:

Deep Learning

Image Restoration_and_Enhancement using Generative Adversarial Network(GANs):

DCGAN - Generate_Fake_Images:

MNIST CNN, Skip-connection (U-NET):

Simple CNN data Augmentation:

(Kaggle) MNIST Digit Recognizer:

Time Series

(Kaggle) Sales Prediction on store items:

(Kaggle) M5_Forecasting:

NLP/TextClassification

BERT-base: classify twitter sentiment:

BERT-large: classify twitter sentiment:

Micellenous

ElasticSearch connections with Python:

Technologies

Reference

TO-DOs

Contact

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages