Ajay026/Azure-Project-on-Movielens-Data

Azure-Project-on-Movielens-Data

Build a movie recommender system on Azure, using Spark SQL to analyse the MovieLens dataset. Deploy Azure Data Factory and data pipelines, and visualise the analysis.

๐—ช๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐——๐—ฎ๐˜๐—ฎ๐˜€๐—ฒ๐˜ ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€?
Dataset Analysis is defined as manipulating or processing unstructured data or raw data to draw valuable insights and conclusions that will help derive critical decisions that will add some business value. The dataset analysis process is followed by organizing the dataset, transforming the dataset, visualizing the dataset, and finally modeling the dataset to derive predictions for solving the business problems, making informed decisions, and effectively planning for the future.

๐—ช๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐——๐—ฎ๐˜๐—ฎ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ:
Data pipeline involves extracting or capturing data using various tools, storing raw data, cleaning, validating data, transforming data into a query-worthy format, visualizing KPIs, and orchestration of the above process. It refers to a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real-time (or streaming) instead of batches.

๐—ช๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐˜๐—ต๐—ฒ ๐—”๐—ด๐—ฒ๐—ป๐—ฑ๐—ฎ ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—ฝ๐—ฟ๐—ผ๐—ท๐—ฒ๐—ฐ๐˜?
The Agenda of the project involves deriving Movie Recommendations using Python and Spark on Microsoft Azure. We first understand the problem and download the Movielens dataset from the grouplens website. Then a subscription is set up for using Microsoft Azure, and categorization of resources is done into a resource group. A standard storage account is a setup to store all the data required for serving movie recommendations using Python and Spark on Azure, followed by creating a standard storage blob account in the same resource group. Firstly, we make containers in a standard storage account and standard storage blob account and upload the movielens zip file dataset in its standard storage blob account. Then we create an Azure data factory, a copy data pipeline, and start link storage for standard blob storage account in the Azure data factory. We are copying data from Azure blob storage to Azure data lake storage using a copy data pipeline in the Azure data factory. It is followed by creating the databricks workspace, cluster on databricks, and accessing Azure data lake storage from databricks. We are creating mount points and extracting the zip file to get CSV files. Finally, we upload files into databricks, read the datasets into Spark dataframes in databricks, and analyze the dataset to get the movie recommendations.
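As a rough illustration of the mount-point step, the sketch below builds the two arguments a Databricks mount usually needs. It assumes the lake is an ADLS Gen2 account (which uses `abfss://` URIs); the helper `adls_mount_args` and the container, account, and mount names are hypothetical and do not come from the project.

```python
def adls_mount_args(container: str, account: str, mount_name: str) -> dict:
    """Build the source URI and mount point for an ADLS Gen2 container."""
    return {
        "source": f"abfss://{container}@{account}.dfs.core.windows.net/",
        "mount_point": f"/mnt/{mount_name}",
    }


# Hypothetical names, used only for illustration.
args = adls_mount_args("movielens", "examplelake", "movielens")
print(args["source"])
print(args["mount_point"])

# Inside a Databricks notebook, the mount itself would look roughly like this
# (auth_configs stands for your own credential settings and is not defined here):
# dbutils.fs.mount(source=args["source"],
#                  mount_point=args["mount_point"],
#                  extra_configs=auth_configs)
```

The `dbutils` call is left commented out because it only exists inside a Databricks runtime and needs real credentials.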

Usage of Dataset:
Here we use the MovieLens data in the following ways:

● Extraction: The MovieLens zip file is extracted to obtain the CSV files in two ways: through the Databricks File System (DBFS) and through the Azure Data Factory (ADF) copy pipeline.

● Transformation and Load: The uploaded dataset is read into Spark DataFrames in Databricks. The tags data is also read into Spark, and the output is displayed as a bar chart. Finally, the dataset is analyzed with Spark in Databricks to produce movie recommendations.
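The extraction step can be sketched with Python's standard `zipfile` module. This is a minimal example, not the project's notebook code: the helper name `extract_csvs` is hypothetical, and the paths you pass in are up to you.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_csvs(zip_path: str, out_dir: str) -> List[str]:
    """Extract only the .csv members of an archive and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip READMEs and directory entries; keep the CSV data files.
            if name.lower().endswith(".csv"):
                extracted.append(archive.extract(name, path=out))
    return sorted(extracted)
```

On Databricks, a mounted location is typically reachable from plain Python under a `/dbfs/mnt/...` path, so both arguments could point there; the exact paths depend on how the mount was created.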

๐——๐—ฎ๐˜๐—ฎ ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€:
โ— From the grouplens website, data is downloaded containing names of movies, ratings given to the movies, links of the movies, and tags assigned to the movies.
โ— Resource manager is created in Azure to categorize the resources required, followed by a Storage account.
โ— The Copy Data pipeline is created to copy the data from Azure blob storage to Azure data lake storage in the Azure data factory.
โ— The Databricks workspace and cluster are created and accessed Azure data lake storage from databricks followed by the creation of Mount pairs.
โ— The extraction process is done by extracting the Movielens data zip file to get the CSV files out of it using the Databricks file system(DFS) and using the Azure data factory(ADF).
โ— In the transformation and load process, the uploaded dataset in Spark is read into Spark dataframes. Data tags are read into Spark in Databricks.
โ— Finally, data is analyzed into Spark in Databricks using mount points, and data is visualized using bar charts.
