Python Portfolio

A portfolio of data science projects, either original work or existing work revised for study and learning purposes. The projects in this repo are presented as IPython notebooks and .py files.

For detailed code examples and images, please refer to the readme.md file in each folder under the framework names.

Note: The data used in the projects is for learning and demo purposes only.


Motivation

This repository originally served as a record of my project progress and learning process, but I found that it could also help anyone who wants to take their data science skills to the next level. It contains numerous real-life data science examples: notebooks created by @hyunjoonbok, and code borrowed from authors who produced state-of-the-art results.

Python is nowadays the go-to language for data science. I have tried to make the most of it, not only for simple EDA but also for building complex ML/DL models.

The examples below make heavy use of popular industry frameworks (e.g. Pytorch, Fast.ai, H2O, Gradient Boosting) to produce ready-to-use results.


Projects

2020 Edition

Sept 7, 2020

Machine Learning / Deep Learning with H2O

Complete guide to Machine Learning with H2O (AutoML)
Machine Learning Regression problem with H2O (XGBoost & Deeplearning)

In this notebook, we use a subset of the Freddie Mac Single-Family dataset to predict the interest rate of each loan with H2O-3. We build two regression models, an XGBoost model and a Deep Learning model, and demonstrate how to use H2O's grid search to tune the hyper-parameters of both. We also go over H2O's AutoML, which automates the machine learning workflow, including light data preparation such as imputing missing data, standardizing numeric features, and one-hot encoding categorical features. It also provides automatic training, hyper-parameter optimization, model search, and selection under time, space, and resource constraints. AutoML further optimizes model performance by stacking: it trains one stacked ensemble based on all previously trained models and another on the best model of each family.

Aug 11, 2020

Explainable Machine Learning with SHAP

Understand Classification Model with SHAP
Understand Regression Model with SHAP
SHAP Decision Plots in Depth

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. Here, we look at the implementation of Tree SHAP, a fast and exact algorithm to compute SHAP values for trees and ensembles of trees. We have 3 basic examples (regression / classification / more in-depth graphics) that can be applied to visualizing a model.
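To make the idea of Shapley values concrete, here is a minimal sketch that computes them by brute force for a tiny model. It is not Tree SHAP and does not use the `shap` library: it simply averages each feature's marginal contribution over every feature ordering. The `toy_model` function, the input `x`, and the all-zero `baseline` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start from the baseline input
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its actual value
            new = predict(current)
            phi[i] += new - prev      # marginal contribution of feature i
            prev = new
    return [p / len(orderings) for p in phi]


def toy_model(features):
    # Hypothetical model: a linear part plus an interaction between a and c.
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [3.0, 3.0, 1.0]
```

The values add up to `toy_model(x) - toy_model(baseline)`, which is the additivity property that SHAP plots rely on. The brute-force loop grows factorially with the number of features, which is why specialized algorithms such as Tree SHAP matter in practice.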

Aug 8, 2020

In this notebook, we use Apache Spark's machine learning library (MLlib) with PySpark to tackle an NLP problem and show how to simulate Doc2Vec inside a Spark environment. Apache Spark is a well-known distributed computing system for scaling up data processing solutions. We use MLlib to look at 3297 labeled sentences and classify them into 5 different categories.

Jul 7, 2020

Apache Spark is quickly gaining steam both in the headlines and in real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential to be able to stream and analyze it in real time. We use the Spark Machine Learning Library (Spark MLlib) to classify crime descriptions into 33 categories.

Jul 6, 2020

Mercari (Japan’s biggest shopping app) would like to offer pricing suggestions to sellers, but this is not easy because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace. In this machine learning project, we will build a model that automatically suggests the right product prices. Here we build a complete price recommendation model leveraging LightGBM.

Jul 4, 2020

We use Pytorch's embedding layers to build a simple recommendation system. Our model predicts user ratings for specific movies.

Jul 2, 2020

We build a complete ML model (a binary classification problem with imbalanced classes) leveraging Spark's computation. The full ML cycle (EDA, feature engineering, model building) is covered. In-memory computation and parallel processing are some of the major reasons Apache Spark has become very popular in the big data industry for handling data products at large scale and performing faster analysis.

Jun 28, 2020

An EDA-focused regression model to predict a ship's crew size. The CSV dataset is included in the same folder.

Jun 23, 2020

This notebook shows 3 different approaches to text mining, from the concept to the actual implementation in code. Here, text mining means finding a relationship between two words in a given sentence. The relationship can be found using: 1) the frequency of appearance of the two words, 2) a statistical method of extracting the connection, 3) Word2vec (DL).
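The first, frequency-based approach can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook's code: the `cooccurrence_counts` helper and the sample sentences are hypothetical, and "relationship" is taken to mean that two words appear in the same sentence.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of words."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair is counted once per sentence
        # and always stored in the same (alphabetical) order.
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


sentences = [
    "spark runs machine learning jobs",
    "machine learning needs data",
    "spark processes streaming data",
]
pair_counts = cooccurrence_counts(sentences)
print(pair_counts[("learning", "machine")])  # 2
print(pair_counts[("data", "spark")])        # 1
```

Raw counts favor frequent words, which is one reason the statistical and Word2vec approaches listed above are worth comparing against this baseline.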

Jun 19, 2020

We build an LSTM-based text classification model on Reuters News (available through sklearn) using Tensorflow.

Jun 14, 2020

A simple MLP model in Tensorflow to solve a text classification problem. Here, we use the texts_to_matrix() function in Keras to turn the texts into model inputs for classification.
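As a rough picture of what a binary document-term matrix looks like, here is a plain-Python sketch. It is a simplification and not the Keras implementation: the helpers `build_vocab` and `texts_to_binary_matrix` are hypothetical, and the column order (first appearance of each word) is an assumption that differs from how Keras assigns word indices.

```python
def build_vocab(texts):
    """Map each distinct word to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def texts_to_binary_matrix(texts, vocab):
    """One row per text, one column per word; 1 if the word occurs in the text."""
    matrix = []
    for text in texts:
        row = [0] * len(vocab)
        for word in text.lower().split():
            if word in vocab:
                row[vocab[word]] = 1
        matrix.append(row)
    return matrix


texts = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(texts)
matrix = texts_to_binary_matrix(texts, vocab)
print(vocab)   # {'good': 0, 'movie': 1, 'bad': 2, 'plot': 3}
print(matrix)  # [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
```

Each row is a fixed-length vector, which is what lets a simple MLP consume texts of different lengths.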

Jun 14, 2020

Recommendation System - Collaborative Filtering

FastAI Implementation
Surprise Library Implementation

Experiment with the MovieLens 100K Data to provide movie recommendations for users based on different settings (Item-based, user-based, etc)

Jun 10, 2020

An introduction to TabNet, a neural-network-based algorithm that can be readily used for machine learning problems on tabular datasets (the most common kind in Kaggle competitions). A Pytorch implementation with toy examples (the adult census income dataset and the forest cover type dataset) is shown in this notebook, along with the basic architecture and workflow.

Jun 7, 2020

An image segmentation task on the Oxford-IIIT Pet Dataset: we build a model that generates masks around the pets in the images and eventually segments the image itself. Built using MobileNetV2 pretrained on ImageNet.

Jun 5, 2020

CNN model using Tensorflow that recognizes Rock-Paper-Scissors. Built using MobileNetV2 pretrained on ImageNet.

Jun 4, 2020

A Pytorch implementation of N-shot learning. We look at image classification of written characters from many different languages (the Omniglot dataset) and build a model that determines which of the evaluation set classes a sample belongs to.
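One common way to decide which class a sample belongs to from only a few labeled examples is nearest-prototype classification. The sketch below illustrates that idea on hand-made 2-D vectors. It is an assumption about the general technique, not the notebook's model: in practice the vectors would come from a trained embedding network, and the `support` data here is hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(query, support):
    """Assign the query to the class whose prototype (mean support vector) is closest."""
    prototypes = {label: mean_vector(examples) for label, examples in support.items()}
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-shot support set: two example embeddings per class.
support = {
    "class_a": [[0.0, 0.1], [0.2, 0.0]],
    "class_b": [[1.0, 0.9], [0.8, 1.0]],
}
prediction = classify([0.9, 0.8], support)
print(prediction)  # class_b
```

Adding a new class only requires a handful of support examples for it, with no retraining of the classifier itself.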

Jun 2, 2020

Concepts and code for fast unsupervised anomaly detection with generative adversarial networks (GANs), an approach widely used in real-time anomaly detection applications. Uses the "DCGAN" model, a widely used GAN architecture.

May 27, 2020

Transfer learning explained. We modify the last few layers of a pretrained model to fit my own dataset.

May 26, 2020

A simple walkthrough of the training loops and metrics used for learning in Pytorch, followed by a complete example at the end using the CIFAR-10 dataset.

May 16, 2020

A complete guide to recommendation systems using collaborative filtering with matrix factorization. We explain concepts that are used in industry, compare models and metrics, and build a prediction algorithm.
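The core of matrix factorization can be sketched without any framework: each user and each item gets a small vector of latent factors, and the predicted rating is their dot product. The code below is a minimal illustration trained with plain SGD. The ratings, hyper-parameters, and function names (`train_mf`, `predict`, `rmse`) are hypothetical and are not taken from the notebook.

```python
import math
import random


def predict(user_f, item_f, u, i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(user_f[u], item_f[i]))


def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit latent factors with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    user_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    item_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(user_f, item_f, u, i)
            for f in range(n_factors):
                pu, qi = user_f[u][f], item_f[i][f]
                # Gradient step on squared error with L2 regularization.
                user_f[u][f] += lr * (err * qi - reg * pu)
                item_f[i][f] += lr * (err * pu - reg * qi)
    return user_f, item_f


def rmse(ratings, user_f, item_f):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (user, item, rating) triples on a 3 x 3 user-item grid.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
rmse_before = rmse(ratings, *train_mf(ratings, 3, 3, epochs=0))
user_f, item_f = train_mf(ratings, 3, 3)
rmse_after = rmse(ratings, user_f, item_f)
print(rmse_before, rmse_after)
```

Once trained, `predict(user_f, item_f, u, i)` can score user-item pairs that were never rated, which is what turns the factorization into a recommender. The error here is measured on the training ratings only, so a real comparison of models would need a held-out set.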

May 7, 2020

Style transfer in practice with Pytorch, using a pretrained VGG19 model.

Apr 31, 2020

Going through a complete modeling workflow in Pytorch based on the MNIST dataset, to get a general idea of Pytorch concepts.

Apr 24, 2020

How to use Tensorboard in a Jupyter notebook when training a model in Pytorch.

Apr 14, 2020

A 3-way polarity (positive, neutral, negative) sentiment analysis system for Google Play app reviews. We use Pytorch to load the reviews from JSON, preprocess the data, create a Pytorch dataloader, and train/evaluate the model. At the end, we evaluate the errors and test on raw text data.

Mar 5, 2020

Building a fraud detection model using sample credit card transaction data from Kaggle. The data is highly imbalanced, so the notebook shows how to adjust sampling to address the problem. Then we check the important metrics that need to be evaluated (fp/tp/precision/recall, etc.).
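A minimal sketch of the two ideas mentioned here, sampling adjustment and precision/recall, in plain Python. Random oversampling of the minority class is used as one possible sampling adjustment, which is an assumption (the notebook may use a different strategy). The labels, predictions, and helper names are hypothetical, with label 1 standing for fraud.

```python
import random


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows (label 1) until both classes have equal size."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(rows, labels) if l == 1]
    majority = [(r, l) for r, l in zip(rows, labels) if l == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


# Hypothetical labels: 8 legitimate transactions, 2 fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.5 0.5

rows = list(range(10))
balanced_rows, balanced_labels = random_oversample(rows, y_true)
print(len(balanced_labels), sum(balanced_labels))  # 16 8
```

In this toy example the accuracy is 80% even though half of the fraud cases are missed, which is why precision and recall are more informative than accuracy on imbalanced data. Oversampling should be applied to the training split only, so that duplicated rows do not leak into evaluation.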

Reference: [Kaggle CreditCard data](https://www.kaggle.com/mlg-ulb/creditcardfraud/)

May 20, 2020

A Pytorch version of building a CNN model to classify images of handwritten graphemes. Covers complete model building, from loading/defining/transforming the data to creating and training the model. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle.

Jan 4, 2020

From Walmart sales data, we forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. We pre-process the given data (feature engineering / hyperparameter optimization) and use an LGB/XGB ensemble to generate the final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.

Mar 24, 2020

Forecasting the outcomes of March Madness for the rest of 2020's NCAAW games. Covers all team-by-team season game results data. Pre-processing of the tabular data and an LGB/XGB ensemble generate the final submission. From [Google Cloud & NCAA® ML Competition 2020-NCAAW](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-womens-tournament/overview) in Kaggle. *Update: this competition was cancelled in Mar. 2020 due to COVID-19.*

Feb 27, 2020

A 2-way polarity (positive, negative) classification system for tweets. We use the Fast.ai framework to fine-tune a language model and build a classification model with close to 80% accuracy.

Feb 21, 2020

2019 and older

  • Machine Learning

     Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
    

    Finding a customer's income level: a simple ML classification problem tackled with the Fast.ai API. The approach applies to almost all types of tabular data and naively achieves a good baseline model in a few lines of code. Also covered is collaborative filtering, where the task is to predict how much a user is going to like a certain item. Here I looked at the "MovieLens" dataset to predict the rating a user would give a particular movie (from 0 to 5).

    May 10, 2018

    Use Fast.ai to build a CNN model to classify images of handwritten graphemes. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle. Includes loading images, generating a custom loss function, and training & testing with Fast.ai.

    Jan 3, 2020

    Forecasting the total ride time of taxi trips in New York City. Covers both a Fast.ai and an LGB version of solving the problem. From [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration) in Kaggle.

    August 5, 2019

  • Deep Learning

      Library / Tools: Pytorch, cv2, Keras, fast.ai, pandas, numpy, Matplotlib
    

    Use the Fast.ai framework to load image data and create a generator/discriminator from images. Then create a model with a custom GAN loss function. Check the error and improve on test image sets.

    June 13, 2019

    Based on a set of celebrity images, we generate a new set of fake images, then compare real images vs. fake images. Pytorch is used to load the images, create the generator/discriminator, and run the training loop.

    June 24, 2019

    Use the Fast.ai framework, which is built on top of Pytorch, to build a simple MNIST CNN model. Skip connections are used to build a simple conv-net that achieves high accuracy (99.6% on the test set).

    June 30, 2019

    Image augmentation is one of the most important feature engineering steps for CNN models. Here I looked at how image transformation can be done with built-in functionality. A wider range of transforms than the ones shown is available in [fast.ai-vision-transform](https://docs.fast.ai/vision.transform.html). *Things to add*: how the ["Albumentation"](https://github.com/albumentations-team/albumentations) library can be used within the Fast.ai framework.

    Nov 12, 2019

    Kaggle version of MNIST. Use Fast.ai and transfer learning to solve.

    December 5, 2017

  • Time Series

     Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
    

    Using Fast.ai to expand tabular data and make use of many columns in order to predict sales at stores based on different situations like promotions, seasons, holidays, etc. Insights are from [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

    December 5, 2015

    From Walmart sales data, we forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. We pre-process the given data (feature engineering / hyperparameter optimization) and use an LGB/XGB ensemble to generate the final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.

    Mar 24, 2020

  • NLP/TextClassification

     Library / Tools: Pytorch, transformers, fast.ai, tqdm, pandas, numpy, pygments, google_play_scraper, albumentations, joblib, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
    

    Used Pytorch to encode/tokenize/train/evaluate the model. The simplest version.

    December 5, 2019

    Using large BERT (takes longer)

    December 7, 2019

  • Miscellaneous

     Library / Tools: pandas, numpy, elasticsearch, datetime
    

    Use of Python to pull data directly from the ELK stack. The data originally comes in JSON format; it is converted to a DataFrame for simple EDA / visualization.

    December 12, 2019

Technologies

  • Fast.ai
  • Pytorch
  • Tensorflow
  • Keras
  • CV2
  • tqdm
  • pandas
  • numpy
  • albumentations
  • datetime
  • xgboost
  • lightgbm
  • scikit-learn
  • optuna
  • Seaborn
  • Matplotlib
  • elasticsearch
  • And More...

Reference

  • Deep Learning Model Implementation Zoo (Tensorflow 1 and Pytorch) Github

TO-DOs

List of features ready and TODOs for future development

  • Tableau Public - Add visualization using work data : in progress
  • Python Dash for interactive web app : in progress
  • Data cleaning .ipynbs : in progress

Contact

Created by @hyunjoonbok - feel free to contact me!