EDA-Netflix-Dataset-using-PySpark-on-Docker

This project demonstrates how to perform Exploratory Data Analysis (EDA) on the Netflix dataset using PySpark in a Jupyter Notebook environment. It involves setting up Spark, loading a dataset, performing basic data cleaning, and visualizing the results. Everything runs inside a Docker container.

How Does it Work?

  1. Install Jupyter notebook.

  2. Install Docker Desktop.

Download The Dataset

  1. Download the "Netflix Movies and TV Shows" CSV file used for the EDA:
https://www.kaggle.com/datasets/shivamb/netflix-shows 

Pull Jupyter pyspark notebook Image

  1. Pull the jupyter/pyspark-notebook image with the following command:
docker pull jupyter/pyspark-notebook

Run the PySpark Container with VS Code Integration

  1. Start the PySpark container with Docker:
docker run -p 8888:8888 -p 4040:4040 -v C:\Users\user\Desktop\EDA:/home/jovyan/work --name pyspark_container jupyter/pyspark-notebook
  2. After running this command, a link containing an access token will be printed at the end of the output. Copy it and open it in your browser. It will look like this:
http://localhost:8888/?token=your-token

Once a Spark session is running, the Spark Web UI is also available (no token required) at:

http://localhost:4040

Explanation

-p 8888:8888 #Maps the Jupyter Notebook port.
-p 4040:4040 #Maps the Spark Web UI port.
-v C:\Users\user\Desktop\EDA:/home/jovyan/work #Shares your local folder with the container.
--name pyspark_container #Assigns a name to the container.
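
To confirm the volume mount worked, you can run a quick cell once the notebook is open. This is a minimal sketch, assuming the -v flag above (so the shared local folder appears at /home/jovyan/work inside the container):

# List the contents of the mounted folder inside the container
import os

for name in os.listdir("/home/jovyan/work"):
    print(name)  # the downloaded Netflix CSV should appear here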

Load the Dataset with PySpark in Jupyter Notebook

  1. Once the dataset is accessible inside the Docker container, use the following PySpark code to load it:
# Import required libraries
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("Netflix EDA").getOrCreate()

# Load the dataset from the mounted folder (adjust the file name to match your CSV)
df = spark.read.csv("/home/jovyan/work/netflix.csv", header=True, inferSchema=True)

# Show the first few rows
df.show()
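
Before moving on, it helps to sanity-check the load and do the basic cleaning mentioned above. A minimal sketch, assuming the df DataFrame from the snippet above and the standard column names of the Kaggle dataset (director, cast, country, etc.):

# Inspect the schema and size of the loaded DataFrame
df.printSchema()
print("Rows:", df.count(), "Columns:", len(df.columns))

# Count missing values per column
from pyspark.sql import functions as F
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Basic cleaning: drop exact duplicates and fill missing text fields
df_clean = df.dropDuplicates().fillna("Unknown", subset=["director", "cast", "country"])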

Perform EDA

  1. EDA is performed on the dataset in the .ipynb notebook included in this repository (a short sketch of typical steps is shown after this list).

  2. After performing the EDA, upload the notebook to GitHub.
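
As referenced above, here is a minimal sketch of the kind of aggregations and visualization the notebook performs, assuming the df_clean DataFrame from the previous snippet and the standard Kaggle column names (type, release_year, country, rating):

from pyspark.sql import functions as F

# Movies vs. TV shows
df_clean.groupBy("type").count().show()

# Number of titles per release year, most recent first
df_clean.groupBy("release_year").count().orderBy(F.desc("release_year")).show(10)

# Top 10 countries by number of titles
df_clean.groupBy("country").count().orderBy(F.desc("count")).show(10)

# Convert a small aggregate to pandas for plotting (requires matplotlib)
ratings_pd = df_clean.groupBy("rating").count().toPandas()
ratings_pd.plot(kind="bar", x="rating", y="count", title="Titles per rating")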