
Distributed System For Movie Recommendations

Developed by: Sebastian Corcino and Gaozong Lo

About The Project

This project uses Google Cloud Platform (GCP) to host Apache Spark and Hadoop via Dataproc. Using the Google Cloud Console, we submit jobs that emulate a user loading their home screen, i.e. their "For You" page.

Our system uses PySpark to create a job that recommends movies to a selected user based on their previous movie ratings. The job operates on a movie dataset and a user dataset stored in a Google Cloud Storage (GCS) bucket, which the PySpark job reads via the bucket path.

How it works:

  1. Data Loading: The job loads the necessary data files from the GCS bucket.
  2. Filtering and Recommendation: The job takes a user ID and generates a personalized list of 10 movies. It identifies the user’s preferred genres from movies they previously rated 5 or higher, then selects 10 random movies in those genres that also have an average rating of 5 or higher (see the sketch after this list).
  3. Output: The job presents the list of 10 recommended movies, including the movie title, release date, vote average, and genre(s).
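The filtering step can be sketched in PySpark roughly as follows. This is a minimal illustration, not the actual contents of rec_movies.py; the column names (user_id, movie_id, rating, genre, vote_average) and the bucket paths are assumptions about the CSV schemas.

    # Minimal PySpark sketch of the recommendation flow (column names are assumptions).
    import sys

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("movie-recommendations").getOrCreate()
    user_id = sys.argv[1]  # passed after "--" in the gcloud command

    # 1. Data Loading: read the CSVs from the bucket.
    movies = spark.read.csv("gs://genre-movies/master_movies.csv", header=True, inferSchema=True)
    users = spark.read.csv("gs://genre-movies/users_list.csv", header=True, inferSchema=True)

    # 2. Filtering: genres of movies this user rated 5 or higher.
    liked = users.filter((F.col("user_id") == user_id) & (F.col("rating") >= 5))
    liked_genres = [r["genre"] for r in liked.join(movies, "movie_id").select("genre").distinct().collect()]

    # Recommendation: 10 random, highly rated movies from those genres.
    candidates = movies.filter(F.col("genre").isin(liked_genres) & (F.col("vote_average") >= 5))
    recommendations = candidates.orderBy(F.rand()).limit(10)

    # 3. Output: title, release date, vote average, and genre.
    recommendations.select("title", "release_date", "vote_average", "genre").show(truncate=False)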

Creating GCP Bucket

Create a new bucket and upload the following files (a CLI sketch follows the list):

  • rec_movies.py
  • master_movies.csv
  • users_list.csv
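
If you prefer the gcloud CLI over the Cloud Console, the bucket can be created and populated roughly like this (the bucket name and location are placeholders; adjust them to your setup):

    gsutil mb -l us-central1 gs://genre-movies
    gsutil cp rec_movies.py master_movies.csv users_list.csv gs://genre-movies/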

Usage

How to submit jobs?

  • Submit a job via the Google Cloud Console:

      gcloud dataproc jobs submit pyspark <gs://genre-movies/rec_movies.py> --cluster=<apache-hadoop-instance> --region=<us-central1> -- <user_id>
    

    Replace each <placeholder> according to your setup, and replace the bucket path (in our case genre-movies) with your own bucket path.
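
    For example, with the bucket genre-movies, a cluster named apache-hadoop-instance in us-central1, and user ID 42 (all illustrative values), the command would look like:

      gcloud dataproc jobs submit pyspark gs://genre-movies/rec_movies.py --cluster=apache-hadoop-instance --region=us-central1 -- 42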

Creating Data

How to create your own data with The Movie Database?

  1. Create an API key for TMDB and put it inside a .env file:

    API_KEY=1234

  2. Run fetch_movies.py (a sketch of the fetch step follows this list):

    python3 fetch_movies.py
    
  3. Run create_usr_data.py

    python3 create_usr_data.py
    

    Edit the user_genre_preferences variable to customize the generated user movie lists.
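
As a rough sketch of how a fetch script like fetch_movies.py might pull movie data from TMDB: the endpoint and response fields below follow TMDB's public v3 API, but the .env handling (python-dotenv) and the output columns are assumptions, not necessarily what our scripts do.

    # Hypothetical sketch: fetch popular movies from TMDB and write a CSV.
    import csv
    import os

    import requests
    from dotenv import load_dotenv

    load_dotenv()  # reads API_KEY from the .env file
    api_key = os.environ["API_KEY"]

    movies = []
    for page in range(1, 6):  # a few pages, for illustration only
        resp = requests.get(
            "https://api.themoviedb.org/3/movie/popular",
            params={"api_key": api_key, "page": page},
        )
        resp.raise_for_status()
        movies.extend(resp.json()["results"])

    with open("master_movies.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["movie_id", "title", "release_date", "vote_average", "genre_ids"])
        for m in movies:
            writer.writerow([m["id"], m["title"], m["release_date"], m["vote_average"],
                             "|".join(str(g) for g in m["genre_ids"])])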

How to Run Our System

  1. Create a Dataproc cluster (see the command sketch below).
  2. Create a bucket and upload the required files.
  3. Start the cluster and open the Google Cloud Shell.
  4. Run the script, passing the user ID as an argument.
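
The cluster in step 1 can also be created from the Cloud Shell; the cluster name and region below are placeholders, and --single-node is just one way to keep a test cluster small:

    gcloud dataproc clusters create apache-hadoop-instance --region=us-central1 --single-node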

(back to top)
