This project uses Google Cloud Platform (GCP) to host Apache Spark and Hadoop via Dataproc. Using the Google Cloud Console, we can submit jobs that emulate a user loading their "For You" page.
Our system uses PySpark to run a job that recommends movies to a selected user based on their previous movie ratings. The job operates on a movie dataset and a user dataset stored in Google Cloud Storage (GCS), accessed via a bucket path.
How it works:
- Data Loading: The job loads the necessary data files from GCS.
- Filtering and Recommendation: The job takes a user ID and generates a personalized list of 10 movies for that user. It identifies the user's preferred genres from their previous ratings of 5 or higher, then randomly selects 10 movies in those genres whose average rating is also 5 or higher (a sketch of this logic follows the list).
- Output: The system generates and presents a list of 10 movies to recommend to the selected user. The output includes details such as the movie title, release date, vote average, and genre(s).
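The following is a minimal sketch of that flow, not the actual rec_movies.py. The column names (user_id, rating, genre, title, release_date, vote_average) and file layout are assumptions for illustration; check the script for the real schema.

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rec_movies_sketch").getOrCreate()
user_id = int(sys.argv[1])  # user ID passed as the job argument

# Load both datasets from the bucket (paths use the sample bucket below).
movies = spark.read.csv("gs://genre-movies/master_movies.csv", header=True, inferSchema=True)
users = spark.read.csv("gs://genre-movies/users_list.csv", header=True, inferSchema=True)

# Genres the selected user has previously rated 5 or higher.
liked_genres = (
    users.filter((F.col("user_id") == user_id) & (F.col("rating") >= 5))
         .select("genre")
         .distinct()
)

# Candidate movies: in a liked genre, with an average rating of 5 or higher.
candidates = movies.join(liked_genres, on="genre").filter(F.col("vote_average") >= 5)

# Pick 10 candidates at random and show the fields described above.
recommendations = candidates.orderBy(F.rand()).limit(10)
recommendations.select("title", "release_date", "vote_average", "genre").show(truncate=False)
```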
Create a new bucket and upload the following files (example commands follow the list):
rec_movies.py
master_movies.csv
users_list.csv
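For example, with gsutil (the bucket name genre-movies matches the sample command below; substitute your own):

```
gsutil mb -l us-central1 gs://genre-movies
gsutil cp rec_movies.py master_movies.csv users_list.csv gs://genre-movies/
```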
How to submit jobs?
- Submit a job via the Google Cloud Console:
gcloud dataproc jobs submit pyspark <gs://genre-movies/rec_movies.py> --cluster=<apache-hadoop-instance> --region=<us-central1> -- <user_id>
Replace each <value> according to your setup; in particular, replace the bucket path (genre-movies in our case) with your own bucket path.
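For example, with the sample values above and a hypothetical user ID of 42:

gcloud dataproc jobs submit pyspark gs://genre-movies/rec_movies.py --cluster=apache-hadoop-instance --region=us-central1 -- 42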
How to create your own data with The Movie Database?
- Create an API key for TMDB and put it inside a .env file:

API_KEY=1234
- Run fetch_movies.py:

python3 fetch_movies.py
- Run create_usr_data.py:

python3 create_usr_data.py

Edit the variable "user_genre_preferences" to build a custom user movie list (example below).
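The exact shape of user_genre_preferences depends on create_usr_data.py; a hypothetical example, assuming it maps user IDs to preferred genres:

```python
# Hypothetical structure; check create_usr_data.py for the real one.
user_genre_preferences = {
    1: ["Action", "Science Fiction"],
    2: ["Comedy", "Romance"],
    3: ["Horror", "Thriller"],
}
```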
- Create a Dataproc cluster (an example command follows this list).
- Create a bucket and upload the required files.
- Start the cluster and navigate to the Google Cloud Shell.
- Run the script, passing the user ID as an argument.
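For reference, a small single-node cluster matching the sample names above can be created from Cloud Shell with:

gcloud dataproc clusters create apache-hadoop-instance --region=us-central1 --single-node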