EDA-Netflix-Dataset-using-PySpark-on-Docker

This project demonstrates how to perform Exploratory Data Analysis (EDA) on the Netflix dataset using PySpark in a Jupyter Notebook environment. It involves setting up Spark, loading a dataset, performing basic data cleaning, and visualizing the results. Everything runs inside a Docker container.

How Does it Work?

  1. Install Jupyter notebook.

  2. Install Docker Desktop.

Download The Dataset

  1. Download the "Netflix Movies and TV Shows" CSV file used for the EDA:
https://www.kaggle.com/datasets/shivamb/netflix-shows 

Pull Jupyter pyspark notebook Image

  1. Pull the jupyter/pyspark-notebook image with the following command:
docker pull jupyter/pyspark-notebook

Run the PySpark Container with VS Code Integration

  1. Start the PySpark container with Docker:
docker run -p 8888:8888 -p 4040:4040 -v C:\Users\user\Desktop\EDA:/home/jovyan/work --name pyspark_container jupyter/pyspark-notebook
  2. After running this command, a link containing an access token will be printed at the end of the output. Copy it and open it in your browser. It will look like this:
http://localhost:8888/?token=your-token

Once a Spark session is running, the Spark Web UI is also available (no token required) at:

http://localhost:4040

Explanation

-p 8888:8888 #Maps the Jupyter Notebook port.
-p 4040:4040 #Maps the Spark Web UI port.
-v C:\Users\user\Desktop\EDA:/home/jovyan/work #Shares your local folder with the container.
--name pyspark_container #Assigns a name to the container.
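
To confirm the volume mount worked, you can run a quick cell once the notebook is open. This is a minimal sketch, assuming the -v flag above (so the shared local folder appears at /home/jovyan/work inside the container):

# List the contents of the mounted folder inside the container
import os

for name in os.listdir("/home/jovyan/work"):
    print(name)  # the downloaded Netflix CSV should appear here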

Load the Dataset with PySpark in Jupyter Notebook

  1. Once the dataset is accessible inside the Docker container, use the following PySpark code to load it:
# Import required libraries
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("Netflix EDA").getOrCreate()

# Load the dataset from the mounted folder (adjust the file name to match your CSV)
df = spark.read.csv("/home/jovyan/work/netflix.csv", header=True, inferSchema=True)

# Show the first few rows
df.show()
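
Before moving on, it helps to sanity-check the load and do the basic cleaning mentioned above. A minimal sketch, assuming the df DataFrame from the snippet above and the standard column names of the Kaggle dataset (director, cast, country, etc.):

# Inspect the schema and size of the loaded DataFrame
df.printSchema()
print("Rows:", df.count(), "Columns:", len(df.columns))

# Count missing values per column
from pyspark.sql import functions as F
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Basic cleaning: drop exact duplicates and fill missing text fields
df_clean = df.dropDuplicates().fillna("Unknown", subset=["director", "cast", "country"])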

Perform EDA

  1. EDA is performed on the dataset in the .ipynb notebook included in this repository (a short sketch of typical steps is shown after this list).

  2. After performing the EDA, upload the notebook to GitHub.
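
As referenced above, here is a minimal sketch of the kind of aggregations and visualization the notebook performs, assuming the df_clean DataFrame from the previous snippet and the standard Kaggle column names (type, release_year, country, rating):

from pyspark.sql import functions as F

# Movies vs. TV shows
df_clean.groupBy("type").count().show()

# Number of titles per release year, most recent first
df_clean.groupBy("release_year").count().orderBy(F.desc("release_year")).show(10)

# Top 10 countries by number of titles
df_clean.groupBy("country").count().orderBy(F.desc("count")).show(10)

# Convert a small aggregate to pandas for plotting (requires matplotlib)
ratings_pd = df_clean.groupBy("rating").count().toPandas()
ratings_pd.plot(kind="bar", x="rating", y="count", title="Titles per rating")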