95-885, Fall 2025 – Carnegie Mellon University
This repository contains comprehensive coursework and hands-on implementations for 95-885 Data Science & Big Data. It includes Python notebooks, code files, assignments, and practice exercises covering a wide range of topics across:
- Probability & statistical modeling
- Algorithms & optimization
- Applied machine learning
- Big Data processing & distributed systems
- Applied computing & data engineering
- Designing end-to-end data science and machine learning solutions, from data ingestion and preprocessing to modeling, evaluation, and deployment. Projects reflect real-world use cases and are suited to both industry problems and academic research challenges.
- Hands-on practice with Big Data tools, including:
  - Apache Spark for distributed data processing
  - Hadoop ecosystem tools
  - Cloud data handling
- Building production-ready pipelines using tools like Pandas, Scikit-learn, PySpark, and Hadoop streaming, and integrating them with machine learning models.
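
As a rough illustration of the map/reduce pattern behind Hadoop streaming, the sketch below simulates a word count in a single Python process. It is not taken from the course materials: the names `mapper`, `reducer`, and `run_local` are hypothetical, and a real Hadoop streaming job would run the mapper and reducer as separate scripts that read from stdin and write tab-separated key/value lines to stdout.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (word, 1) pairs, mirroring the 'word<TAB>1' lines a streaming mapper prints."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Sum counts per word; the input must already be sorted by key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (standing in for the shuffle) -> reduce in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["big data big ideas", "data pipelines"]
    print(run_local(sample))
```

The explicit sort stands in for the shuffle phase, which is what lets the reducer assume that all pairs for a given key arrive together.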
Class-Labs/: Jupyter Notebooks used in class labs and practical projects
Assignments/: Clean, tested Python scripts and reports
Documents/: Sample Research Papers
projects/: Capstone or course mini-projects on real datasets
By the end of this course, learners will be able to:
- Handle massive datasets and perform distributed computation
- Apply statistical methods and ML models to large-scale problems
- Understand performance bottlenecks in data pipelines
- Translate academic theory into practical, scalable solutions