EDA-Netflix-Dataset-using-PySpark-on-Docker

This project demonstrates how to perform Exploratory Data Analysis (EDA) on the Netflix dataset using PySpark in a Jupyter Notebook environment. It involves setting up Spark, loading the dataset, performing basic data cleaning, and visualizing the results. Everything runs inside a Docker container.

How Does it Work?

Install Jupyter notebook.
Install Docker Desktop.
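
To confirm both prerequisites are installed, a small check like the one below may help. It is only a sketch: the helper name `missing_tools` is made up for illustration, and it assumes the tools are exposed on your PATH under the usual command names `docker` and `jupyter`.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed command names for the two prerequisites listed above.
    missing = missing_tools(["docker", "jupyter"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Docker and Jupyter are both available.")
```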

Download The Dataset

Download the "Netflix Movies and TV Shows" CSV file for EDA:

https://www.kaggle.com/datasets/shivamb/netflix-shows
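
Before sharing the folder with the container, it can be useful to confirm that the download is a readable CSV. The sketch below uses only the Python standard library. The helper name `peek_csv` is hypothetical, and the file name `netflix_titles.csv` is an assumption about how the Kaggle download was saved.

```python
import csv
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # the first line holds the column names
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # Assumed location: adjust to wherever you saved the download.
    csv_path = Path("netflix_titles.csv")
    if csv_path.exists():
        header, rows = peek_csv(csv_path)
        print(f"{len(header)} columns: {header}")
        for row in rows:
            print(row)
    else:
        print(f"{csv_path} not found; download it from Kaggle first.")
```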

Pull Jupyter pyspark notebook Image

Pull the jupyter/pyspark-notebook image with the following command:

docker pull jupyter/pyspark-notebook

Run the PySpark Container with VS Code Integration

Start the container with Docker. It launches a Jupyter server with PySpark available:

docker run -p 8888:8888 -p 4040:4040 -v C:\Users\user\Desktop\EDA:/home/jovyan/work --name pyspark_container jupyter/pyspark-notebook

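Once the container has started, a link is printed at the end of the output. Copy it and open it in your browser. The link will look like this:

http://localhost:8888/?token=your-token

The Spark Web UI is served at the address below while a Spark session is running. It does not use a token:

http://localhost:4040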
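
If a link does not open, it may help to check whether the ports are reachable from your machine. The following is a minimal sketch using the standard library; the helper name `is_port_open` is made up for illustration. Note that the Spark Web UI port typically only answers while a Spark application is running, so "not reachable" on 4040 is expected before a Spark session has been created.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8888 and 4040 are the ports published by the docker run command.
    for name, port in [("Jupyter", 8888), ("Spark Web UI", 4040)]:
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```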

Explanation

-p 8888:8888 #Maps the Jupyter Notebook port so you can open the notebook in your browser.

-p 4040:4040 #Maps the Spark Web UI port.

-v C:\Users\user\Desktop\EDA:/home/jovyan/work #Shares your local folder with the container.

--name pyspark_container #Assigns a name to the container.
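
To make the role of each flag concrete, the sketch below assembles the same command from its parts. The function `build_docker_run` is a hypothetical helper written for this illustration, not part of Docker or of this project.

```python
def build_docker_run(image, name, ports, volumes):
    """Assemble the argument list for `docker run` from port and volume mappings."""
    args = ["docker", "run"]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]  # publish a port
    for host_dir, container_dir in volumes:
        args += ["-v", f"{host_dir}:{container_dir}"]  # share a folder
    args += ["--name", name, image]
    return args


command = build_docker_run(
    image="jupyter/pyspark-notebook",
    name="pyspark_container",
    ports=[(8888, 8888), (4040, 4040)],
    volumes=[(r"C:\Users\user\Desktop\EDA", "/home/jovyan/work")],
)
print(" ".join(command))
```

The printed line matches the command shown earlier. The same list could be passed to `subprocess.run(command, check=True)` to launch the container from Python.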

Load the Dataset in pyspark in Jupyter Notebook

Once the dataset is accessible inside the Docker container, use the following PySpark code to load it:

# Import required libraries
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("Netflix EDA").getOrCreate()

# Load the dataset from the folder shared with -v
# (assumes the Kaggle file was saved as netflix_titles.csv in that folder)
df = spark.read.csv("/home/jovyan/work/netflix_titles.csv", header=True, inferSchema=True)

# Show the first few rows
df.show()

Perform EDA

The EDA is performed in the eda.ipynb notebook included in this repository.
After completing the analysis, upload your work to GitHub.

Repository: TahirZia-1/EDA-Netflix-Dataset-using-PySpark-on-Docker
