Machine Learning using PySpark
Awantik Das edited this page Oct 8, 2018
- Fundamentals of Spark & Machine Learning
- Understanding Big Data, Hadoop & Spark. Core data structures like RDDs and DataFrames. Distributed data types for machine learning, like Vectors & Matrices. Black-box introduction to machine learning. Understanding the machine learning pipeline & its stages: data ingestion, streaming, wrangling, visualization, preprocessing, model training, validation & deployment.
- Data Wrangling & Visualization
- Using DataFrames to understand, clean & summarize data. Data merging, duplicate deletion, statistical data analysis. Visualizing information with matplotlib.
- Data Pre-processing
- Numerical data scaling & normalization. Dealing with categorical data. Handling images. OneHotEncoder, VectorAssembler. Everything required to get data ready for machine learning. Dealing with text: TF-IDF, CountVectorizer, HashingTF.
- Feature Selection & Extraction
- Spark deals with large datasets, so selecting the important feature columns matters. VectorSlicer, RFormula, Correlation, ChiSqSelector, PCA, SVD.
- Linear Models for Classification & Regression
- Understanding linear models like linear regression, logistic regression & regularized regression. Intuition about how distributed learning works. Problem solving using these models.
- Spark Pipeline, Grid Search, Model Validation & Persistence
- Connecting transformers with estimators in a pipeline. Hyper-parameter tuning using grid search. Persisting models. Cross-validation for finding the best model.
- Naive Bayes, Trees & Ensemble Methods
- Fundamentals of Naive Bayes & decision trees. Understanding ensemble learning methods like RandomForest & GBT. Understanding the distributed implementation of these algorithms. Problem solving using these algorithms.
- Clustering
- Unsupervised Learning, Clustering, Bisecting KMeans, Gaussian Mixture Models, LDA. Customer segmentation using clustering methods
- Recommendation Engine
- Content Based Recommendation, Collaborative Filtering, Cold start Problem, Distance Vectors for product similarity, ALS Model.
- Deep Learning in Spark
- Understanding the perceptron. Understanding deep neural networks. Introduction to TensorFlow. Deep Learning Pipeline on Spark. TensorFlow on Spark.