📺 Youtube Analysis 📽️
*For problem statement 1.
We analyze the data to identify the top 5 categories in which the most number of videos are uploaded. The dataset is gathered using the YouTube API and stored in Hadoop Distributed File System(HDFS). MapReduce algorithm is applied to process the dataset and identify the video categories.
-
Open terminal 🖥️
-
hadoop fs -mkdir /youtube
-
hadoop fs -put /home/mangal/Desktop/youtubedata.txt /youtube
-
Open bashrc file by gedit ~/.bashrc and type ///// don't add this path if it is already exist export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
-
Then source ~/.bashrc
-
cd Desktop 📀
-
Now compile youtube1.java file suppose that file is on Desktop
hadoop com.sun.tools.javac.Main youtube1.java
-
To combine all class files
jar cf wc.jar youtube1*.class
-
To execute
hadoop jar wc.jar youtube1 /youtube/youtubedata.txt /top5rating
-
hadoop fs -ls / //// to check output folder is there or not
-
hadoop fs -cat /top5rating/part-r-00000
For problem statement 2****
-
Open terminal
-
hadoop fs -mkdir /youtube
-
hadoop fs -put /home/mangal/Desktop/youtubedata.txt /youtube
-
Open bashrc file by gedit ~/.bashrc and type ///// don't add this path if it is already exist export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
-
Then source ~/.bashrc
-
cd Desktop
-
Now compile youtube2.java file suppose that file is on Desktop
hadoop com.sun.tools.javac.Main youtube2.java
-
To combine all class files
jar cf wc.jar youtube2*.class
-
To execute 💽
hadoop jar wc.jar youtube2 /youtube/youtubedata.txt /videorating
-
hadoop fs -ls / //// to check output folder is there or not
-
hadoop fs -cat /videorating/part-r-00000
Hadoop Hadoop is a distributed computing Framework developed and maintained by The Apache Software Foundation written in Java. Hadoop consists of HDFS and MapReduce and is genrally deployed in a group of machines called cluster. Initially, GFS and MapReduce were built to empower Google Search. HDFS stands for Hadoop Distributed File System and is used to store data across multiple disks.MapReduce is a way to parallelize Data processing tasks.
MapReduce Algorithm MapReduce Algorithm consists of Map() procedure that performs filtering and sorting of input data and Reduce() performs summary\aggregate function per (key, value) pair.
This project implements Hadoop MapReduce algorithm on the YouTube data and display the result on a web server.