Machine Learning using PySpark

Awantik Das edited this page Oct 8, 2018 · 3 revisions
  1. Fundamentals of Spark & Machine Learning
  • Understanding Big Data, Hadoop & Spark. Core data structures such as RDDs and DataFrames. Distributed data structures for machine learning such as Vectors & Matrices. Black-box introduction to machine learning. Understanding the machine learning pipeline & its stages: Data Ingestion, Streaming, Wrangling, Visualization, Preprocessing, Training Models, Validation & Deployment.
  2. Data Wrangling & Visualization
  • Using DataFrames to understand, clean & summarize data. Data merging, duplicate removal, statistical data analysis. Visualizing information with matplotlib.
  3. Data Pre-processing
  • Numerical data scaling & normalization. Dealing with categorical data. Handling images. OneHotEncoder, VectorAssembler. Everything required to get data ready for machine learning. Dealing with text: TF-IDF, CountVectorizer, HashingTF.
  4. Feature Selection & Extraction
  • Spark deals with large datasets, so selecting the important feature columns matters. VectorSlicer, RFormula, Correlation, ChiSqSelector, PCA, SVD.
  5. Linear Models for Classification & Regression
  • Understanding linear models such as linear regression, logistic regression and regularized regression. Intuition about how distributed learning works. Problem solving using these models.
  6. Spark Pipeline, GridSearch, Model Validation & Persistence
  • Connecting transformers with estimators in a pipeline. Hyper-parameter tuning using grid search. Cross-validation for finding the best model. Persisting models.
  7. Naive Bayes, Trees & Ensemble Methods
  • Fundamentals of Naive Bayes and decision trees. Understanding ensemble learning methods such as RandomForest and GBT. Understanding the distributed implementation of these algorithms. Problem solving using these algorithms.
  8. Clustering
  • Unsupervised learning, clustering, Bisecting KMeans, Gaussian Mixture Models, LDA. Customer segmentation using clustering methods.
  9. Recommendation Engine
  • Content-based recommendation, collaborative filtering, the cold-start problem, distance vectors for product similarity, the ALS model.
  10. Deep Learning in Spark
  • Understanding the perceptron. Understanding deep neural networks. Introduction to TensorFlow. Deep Learning Pipeline on Spark. TensorFlow on Spark.