This project demonstrates how to perform Exploratory Data Analysis (EDA) on the Netflix dataset using PySpark in a Jupyter Notebook environment. It involves setting up Spark, loading a dataset, performing basic data cleaning, and visualizing the results. Everything runs inside a Docker container.

TahirZia-1/EDA-Netflix-Dataset-using-PySpark-on-Docker

EDA-Netflix-Dataset-using-PySpark-on-Docker

How Does it Work?

  1. Install Jupyter Notebook.

  2. Install Docker Desktop.

Download The Dataset

  1. Download the "Netflix Movies and TV Shows" CSV file for EDA from Kaggle:
https://www.kaggle.com/datasets/shivamb/netflix-shows

Pull Jupyter pyspark notebook Image

  1. Pull the jupyter/pyspark-notebook image with the following command:
docker pull jupyter/pyspark-notebook

Run the PySpark Container with VS Code Integration

  1. Start the container with Docker. This launches the Jupyter server with PySpark available inside the container:
docker run -p 8888:8888 -p 4040:4040 -v C:\Users\user\Desktop\EDA:/home/jovyan/work --name pyspark_container jupyter/pyspark-notebook
  2. When the container starts, Jupyter prints a link containing an access token at the end of its log output. Copy it and open it in your browser. The link will look like this:
http://localhost:8888/?token=your-token

The Spark Web UI is served on port 4040 only while a Spark session is active, and it does not use the Jupyter token:

http://localhost:4040

Explanation

-p 8888:8888 # Publishes the Jupyter Notebook port to the host.
-p 4040:4040 # Publishes the Spark Web UI port to the host.
-v C:\Users\user\Desktop\EDA:/home/jovyan/work # Mounts your local folder at /home/jovyan/work inside the container.
--name pyspark_container # Assigns a name to the container.
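
If you prefer to launch the container from a script, the flags explained above can be assembled in Python. This is only a sketch: `build_docker_run_args` is a hypothetical helper, not part of this repository, and the default values simply mirror the command shown earlier.

```python
import shlex


def build_docker_run_args(
    host_dir,
    container_dir="/home/jovyan/work",
    name="pyspark_container",
    image="jupyter/pyspark-notebook",
):
    """Return the docker run argument list (hypothetical helper)."""
    return [
        "docker", "run",
        "-p", "8888:8888",                    # Jupyter Notebook port
        "-p", "4040:4040",                    # Spark Web UI port
        "-v", f"{host_dir}:{container_dir}",  # share the local folder
        "--name", name,                       # container name
        image,
    ]


if __name__ == "__main__":
    # Example host folder taken from the command above; adjust to your machine.
    args = build_docker_run_args(r"C:\Users\user\Desktop\EDA")
    # Print the command instead of running it, so nothing starts by accident.
    print(shlex.join(args))
```

To actually start the container, the list could be passed to `subprocess.run(args, check=True)`.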

Load the Dataset with PySpark in Jupyter Notebook

  1. Once the dataset is accessible inside the Docker container, use the following PySpark code to load it:
# Import required libraries
from pyspark.sql import SparkSession

# Initialize a Spark session that runs locally inside the container
spark = SparkSession.builder.master("local").appName("Netflix EDA").getOrCreate()

# Load the dataset from the folder mounted with -v in the docker run command.
# The file name is an assumption: use the name of the CSV you downloaded.
df = spark.read.csv("/home/jovyan/work/netflix.csv", header=True, inferSchema=True)

# Show the first 20 rows
df.show()
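
Before reading the file, it can help to confirm where the CSV actually sits inside the container. The sketch below assumes the local folder is mounted at /home/jovyan/work, as in the docker run command above; `find_csv_files` is a hypothetical helper introduced only for illustration.

```python
from pathlib import Path


def find_csv_files(root):
    """List all .csv files under root, or return [] if root is not a directory."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(str(p) for p in root_path.rglob("*.csv"))


if __name__ == "__main__":
    # Mount point assumed from the -v flag of the docker run command.
    for csv_path in find_csv_files("/home/jovyan/work"):
        print(csv_path)
```

Any path printed here can be passed directly to `spark.read.csv`.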

Perform EDA

  1. The EDA itself is performed in the .ipynb notebook included in this repository.

  2. After performing the EDA, upload the notebook to GitHub.
