amitp-ai/UCSDX_Mini_Projects

Machine Learning Mini-Projects as part of UCSDX Machine Learning Engineer Bootcamp

Basic Info:

  • Most of these mini-projects were done in Colab; projects 5, 6, and 14 (the Spark/SQL ones) were done in Databricks.

Project Information:

  1. APIs to Retrieve Data
    "In this project, APIs are used to retrieve data from the web."

  2. Data Wrangling with Pandas
    "In this data wrangling mini-project, various transformation and visualization techniques using pandas are practiced. Knowing how to use these techniques is especially helpful for organizations that keep the majority of their data in relational databases and flat files. This mini-project has been adapted from the Brandon Rhodes tutorial and focuses on movie data from IMDb."

  3. JSON-Based Data Exercise Using Pandas
    "This World Bank dataset from a school quality improvement project in Ethiopia is a good example of a real-life dataset that one is likely to encounter as an AI/ML engineer."

  4. Web Scraping
    "In this mini-project, a dataset is created through web scraping. As is typically the case with web scraping, the collected data is messy and unruly, so it requires a lot of wrangling before it can be run through a model."

  5. PySpark/SQL
    "In this project we use Databricks and work through a series of exercises using Spark SQL. The purpose of this project is to get familiar with the Spark SQL interface, which scales easily, making it great for working with huge datasets."

  6. Data Wrangling at Scale with Spark
    "For this project we use Databricks and work with real-world datasets from NASA HTTP logs. The purpose of this project is to become familiar with both structured and unstructured data, analyze large-scale data with Spark, practice more advanced wrangling and cleaning techniques, and work on data transformation.

    This is a good project on building an Extract-Transform-Load (ETL) pipeline at scale using PySpark. In particular, a large dataset containing over 3 million records of web server logs is downloaded (i.e. extracted/sourced) from NASA's website. Each record is a raw string that is transformed into 7 distinct columns/fields using regular expressions. The dataset is further transformed by dropping/imputing null values. Exploratory data analysis (EDA) is then performed on the transformed dataset. Finally, the transformed dataset is saved (i.e. loaded) in the Databricks filestore system in two different formats, CSV and JSON, for future use."

  7. Linear Regression "This project is on linear regression. It also delves into generating statistically significant models."

  8. Logistic Regression "This project is on classification using logistic regression."

  9. Tree Based Algorithms "This is a project on tree-based algorithms such as decision trees, random forests, and gradient boosting. Boosting is an ensemble learning technique in which weak learners are trained sequentially, each one correcting the errors of the learners before it. Because the process is sequential, it is relatively time and resource intensive, but it often produces highly accurate models. Some of the highest-performing boosting libraries are XGBoost, LightGBM, and CatBoost."

  10. Clustering "This is a project on various clustering algorithms such as K-means, affinity propagation, DBSCAN, etc. Different techniques for selecting the number of clusters, such as the elbow, silhouette, and gap statistic methods, are also analyzed."

  11. Anomaly Detection "This is a project on anomaly detection for univariate and multivariate datasets. Various methods such as basic statistical analysis, random forest, cluster-based local outlier factor (CBLOF), and autoencoders are studied."

  12. Recommendation Systems "In this project we use the IMDb and Netflix movie datasets to build four kinds of movie recommender: a global recommendation system, a content-based recommendation system, a collaborative filtering system, and a deep-learning-based hybrid recommender system."

  13. Time Series Analysis "In this project we predict IBM's stock price using traditional statistical models such as ARIMA as well as using deep learning based models such as LSTM."

  14. Spark ML "In this project we use the US Census data and PySpark ML to predict whether an individual earns less than or more than $50k. The dataset has over 32,000 rows, each containing many different features such as age, education, occupation, marital status, native country, race, etc."
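
As a rough illustration of the regular-expression step described in project 6, here is a minimal sketch in plain Python (not the notebook's actual PySpark code). It assumes the log records follow the Common Log Format. The choice of the 7 field names, the `parse_log_line` helper, and the sample line are all hypothetical and used only for illustration.

```python
import re
from typing import Optional

# Assumption: each record follows the Common Log Format.
# The 7 named groups below are an illustrative choice of fields.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)\s*(?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<content_size>\d+|-)$'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one raw log string into 7 fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # "-" means the size was not logged; keep it as None so that it can be
    # dropped or imputed in a later cleaning step.
    size = record["content_size"]
    record["content_size"] = None if size == "-" else int(size)
    # Some requests omit the protocol; store None instead of an empty string.
    record["protocol"] = record["protocol"] or None
    return record


# Made-up sample line in Common Log Format (not taken from the real dataset).
SAMPLE = (
    'host.example.com - - [01/Aug/1995:00:00:01 -0400] '
    '"GET /images/logo.gif HTTP/1.0" 200 1839'
)

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

In Spark, the same pattern would typically be applied column by column rather than line by line, but the idea of turning one raw string into typed fields, with missing values left as nulls for later cleaning, is the same.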
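
The sequential nature of boosting mentioned in project 9 can be sketched with a toy example. This is a minimal, standard-library-only illustration of gradient boosting for squared-error regression with one-split decision stumps. It is not the code from the notebook and not how XGBoost, LightGBM, or CatBoost are implemented. The function names, the toy data, and the hyperparameter values are assumptions made for the sketch.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a 1-D feature by minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Train stumps one after another, each on the residuals left by the previous ones."""
    base = mean(ys)  # start from a constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up toy data: a noisy step function.
TOY_X = [1, 2, 3, 4, 5, 6, 7, 8]
TOY_Y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

if __name__ == "__main__":
    for rounds in (0, 1, 10, 50):
        model = fit_boosted(TOY_X, TOY_Y, n_rounds=rounds)
        print(rounds, mse(model, TOY_X, TOY_Y))
```

Each round depends on the predictions of all earlier rounds, which is why boosting cannot be parallelized across learners the way a random forest can.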
