This project uses Google Cloud Platform (GCP) to host Apache Spark and Hadoop via Dataproc. Using the Google Cloud Console, we can submit jobs that emulate a user loading their "For You" page.
Our system uses PySpark to run a job that recommends movies to a selected user based on their previous movie ratings. The job operates on a movie dataset and a user dataset stored in Google Cloud Storage (GCS), accessed via a bucket path.
How it works:
- Data Loading: The job loads the necessary data files from GCS.
- Filtering and Recommendation: The job takes a user ID and generates a personalized list of 10 movies for that user. It identifies the user's preferred genres from their previous ratings of 5 or higher, then randomly selects 10 movies in those genres whose average rating is also 5 or higher (a sketch of this logic follows the list).
- Output: The system generates and presents a list of 10 movies to recommend to the selected user. The output includes details such as the movie title, release date, vote average, and genre(s).
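The following is a minimal sketch of that flow, not the actual rec_movies.py. The column names (user_id, rating, genre, title, release_date, vote_average) and file layout are assumptions for illustration; check the script for the real schema.

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rec_movies_sketch").getOrCreate()
user_id = int(sys.argv[1])  # user ID passed as the job argument

# Load both datasets from the bucket (paths use the sample bucket below).
movies = spark.read.csv("gs://genre-movies/master_movies.csv", header=True, inferSchema=True)
users = spark.read.csv("gs://genre-movies/users_list.csv", header=True, inferSchema=True)

# Genres the selected user has previously rated 5 or higher.
liked_genres = (
    users.filter((F.col("user_id") == user_id) & (F.col("rating") >= 5))
         .select("genre")
         .distinct()
)

# Candidate movies: in a liked genre, with an average rating of 5 or higher.
candidates = movies.join(liked_genres, on="genre").filter(F.col("vote_average") >= 5)

# Pick 10 candidates at random and show the fields described above.
recommendations = candidates.orderBy(F.rand()).limit(10)
recommendations.select("title", "release_date", "vote_average", "genre").show(truncate=False)
```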
Create a new bucket and upload the following files (example commands follow the list):
rec_movies.py
master_movies.csv
users_list.csv
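For example, with gsutil (the bucket name genre-movies matches the sample command below; substitute your own):

```
gsutil mb -l us-central1 gs://genre-movies
gsutil cp rec_movies.py master_movies.csv users_list.csv gs://genre-movies/
```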
How to submit jobs?
- Submit a job via the Google Cloud Console:
gcloud dataproc jobs submit pyspark <gs://genre-movies/rec_movies.py> --cluster=<apache-hadoop-instance> --region=<us-central1> -- <user_id>
Replace each <value> according to your setup; in particular, replace the bucket path (genre-movies in our case) with your own bucket path.
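For example, with the sample values above and a hypothetical user ID of 42:

gcloud dataproc jobs submit pyspark gs://genre-movies/rec_movies.py --cluster=apache-hadoop-instance --region=us-central1 -- 42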
How to create your own data with The Movie Database?
- Create an API key for TMDB and put it inside a .env file:

API_KEY=1234
- Run fetch_movies.py:

python3 fetch_movies.py
- Run create_usr_data.py:

python3 create_usr_data.py

Edit the variable "user_genre_preferences" to build a custom user movie list (example below).
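The exact shape of user_genre_preferences depends on create_usr_data.py; a hypothetical example, assuming it maps user IDs to preferred genres:

```python
# Hypothetical structure; check create_usr_data.py for the real one.
user_genre_preferences = {
    1: ["Action", "Science Fiction"],
    2: ["Comedy", "Romance"],
    3: ["Horror", "Thriller"],
}
```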
- Create a Dataproc cluster (an example command follows this list).
- Create a bucket and upload the required files.
- Start the cluster and navigate to the Google Cloud Shell.
- Run the script, passing the user ID as an argument.
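For reference, a small single-node cluster matching the sample names above can be created from Cloud Shell with:

gcloud dataproc clusters create apache-hadoop-instance --region=us-central1 --single-node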