I offer two versions of the code, in HTML and in Scala. This repository is the final project for the course Machine Learning for Big Data in the Master 2 Artificial Intelligence and Data Science (IASD). The project explores the implementation and optimization of K-Means clustering algorithms using Apache Spark in Scala, and investigates different variants of the Stochastic Gradient Descent (SGD) algorithm. The focus is on optimizing performance for large-scale data processing in distributed environments.
BigData Final Project: K-Means Clustering and Gradient Descent Variants in Spark

Welcome to the repository for the BigData Final Project!
Table of Contents
Part 1: K-Means Implementation and Optimization in Spark Scala
  - Baseline Implementation
  - Performance Analysis
  - Optimization and Justification
  - K-means++ Implementation
  - DataFrame-Based Implementation

Part 2: K-Means Implementation Using DataFrames or DataSets
  - DataFrame/DataSet Implementation
  - Performance Comparison

Part 3: Gradient Descent Variants
  - Momentum and Nesterov Variants of SGD
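
For readers new to the Part 3 topic, the difference between the momentum and Nesterov updates can be sketched in a few lines of plain Scala. This is only an illustration on a toy one-dimensional objective, f(w) = 0.5 * w^2, and it does not use Spark. The names `MomentumSketch`, `grad`, `momentumStep`, `nesterovStep`, and `run`, as well as the learning rate and momentum values, are assumptions made for this example and are not taken from the project code.

```scala
// Illustrative sketch only: classical momentum vs. Nesterov momentum
// on the toy objective f(w) = 0.5 * w * w. All names and hyperparameters
// below are hypothetical and not taken from this repository.
object MomentumSketch {
  // Gradient of f(w) = 0.5 * w * w
  def grad(w: Double): Double = w

  // Classical momentum: the gradient is evaluated at the current point w.
  //   v <- mu * v - lr * grad(w);  w <- w + v
  def momentumStep(w: Double, v: Double, lr: Double, mu: Double): (Double, Double) = {
    val vNext = mu * v - lr * grad(w)
    (w + vNext, vNext)
  }

  // Nesterov momentum: the gradient is evaluated at the look-ahead point w + mu * v.
  //   v <- mu * v - lr * grad(w + mu * v);  w <- w + v
  def nesterovStep(w: Double, v: Double, lr: Double, mu: Double): (Double, Double) = {
    val vNext = mu * v - lr * grad(w + mu * v)
    (w + vNext, vNext)
  }

  // Apply a step function `steps` times, starting from w0 with zero velocity,
  // and return the final parameter value.
  def run(step: (Double, Double, Double, Double) => (Double, Double),
          w0: Double, lr: Double, mu: Double, steps: Int): Double =
    (1 to steps).foldLeft((w0, 0.0)) { case ((w, v), _) => step(w, v, lr, mu) }._1

  def main(args: Array[String]): Unit = {
    val wMomentum = run(momentumStep, 5.0, 0.1, 0.9, 200)
    val wNesterov = run(nesterovStep, 5.0, 0.1, 0.9, 200)
    println(s"momentum: $wMomentum, nesterov: $wNesterov")
  }
}
```

The only difference between the two step functions is where the gradient is evaluated. In a stochastic setting, `grad` would be replaced by a mini-batch gradient estimate; how that is computed on Spark is left to the project code.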
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.