YouTube has more than 2 billion monthly active users and more than 122 million daily active users, who watch more than a billion hours of video per day. As the world's second most visited website, with over 26 billion videos, its influence is apparent. The primary components of a video are its thumbnail, its title, its comments, and its statistics, such as likes, dislikes, and views.
Some content creators publish videos with purposefully deceptive titles and thumbnails in order to get users to click on them, much as some news headlines misrepresent the content of the article. Such exaggerated videos are known as clickbait.
Because different people interpret clickbait differently, the definition used in this project is: a video that uses an exaggerated title and thumbnail to deceive the audience into watching it, and that does not deliver on what the title and thumbnail promised.
Main Notebooks to check out:
- Cognitive evidences
- Data Exploration
- Ensemble Learning
- Data Preprocessing & ML estimators
- Title Classifier
I created a method to retrieve the thumbnail, title, and statistics of a given video using the YouTube Data API v3. The non-clickbait videos were chosen at random from the Explore page and verified to make sure they weren't deceptive.
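The retrieval step might look roughly like the sketch below. It is not the project's actual code: the function names (`build_request_url`, `parse_video_item`, `fetch_video`) and the thumbnail size preference are assumptions made for illustration. It calls the `videos` endpoint of the YouTube Data API v3 with the `snippet` and `statistics` parts and treats any count missing from the response (`dislikeCount` in particular may be absent) as 0.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_request_url(video_id, api_key):
    """Build a videos.list URL asking for the snippet and statistics parts."""
    query = urllib.parse.urlencode(
        {"part": "snippet,statistics", "id": video_id, "key": api_key}
    )
    return f"{API_URL}?{query}"


def parse_video_item(item):
    """Pull the title, a thumbnail URL, and numeric statistics out of one item."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    thumbnails = snippet.get("thumbnails", {})
    # Hypothetical preference order: take the largest thumbnail that is present.
    thumbnail_url = None
    for size in ("maxres", "standard", "high", "medium", "default"):
        if size in thumbnails:
            thumbnail_url = thumbnails[size].get("url")
            break
    return {
        "id": item.get("id"),
        "title": snippet.get("title", ""),
        "thumbnail_url": thumbnail_url,
        # The API returns counts as strings; missing fields default to 0.
        "views": int(stats.get("viewCount", 0)),
        "likes": int(stats.get("likeCount", 0)),
        "dislikes": int(stats.get("dislikeCount", 0)),
        "comments": int(stats.get("commentCount", 0)),
    }


def fetch_video(video_id, api_key):
    """Fetch one video; returns None when the API has no item for the ID."""
    with urllib.request.urlopen(build_request_url(video_id, api_key)) as response:
        payload = json.load(response)
    items = payload.get("items", [])
    return parse_video_item(items[0]) if items else None
```

Keeping the parsing separate from the network call means the parser can be checked against a saved JSON response without spending API quota.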
Obtaining the clickbait videos was a Catch-22: the ultimate objective of this investigation is to build a machine learning system that can automatically gather clickbait and non-clickbait videos, but training such a system first requires a set of labeled clickbait videos.
Clickbait videos were therefore identified by the combination of their title and thumbnail, and to lessen bias, the videos were chosen manually by two separate people. To avoid giving any one theme too much weight, videos were sampled from a variety of genres.
- Non-clickbait video titles contain nouns like "game", "official", and "highlights", and have relevant comments that match the context of the video.
- Clickbait video titles contain terms like "prank", "hack", or "gone wrong", with comments similar to "fake video", "do not agree", and "wrong thumbnail".
I developed a new feature, the Dislike to Like Ratio, because clickbait videos sometimes have a high dislike count. All of the statistics are scaled to a normal distribution and shuffled randomly. Other video metadata, including ID and Favorites, is dropped. Emojis and other non-ASCII characters are deleted from all titles, since they can cause problems when titles are used as file names on Windows file systems.
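A minimal sketch of these preprocessing steps, using only the standard library, is shown below. The helper names, the z-score scaling, and the fallback used when a video has zero likes are assumptions for illustration; the notebooks may use a different scaler or handle that edge case differently.

```python
import random
import statistics


def dislike_like_ratio(dislikes, likes):
    """Dislikes per like; assumed fallback is the raw dislike count when likes is 0."""
    return dislikes / likes if likes else float(dislikes)


def standardize(values):
    """Rescale a column to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


def strip_non_ascii(title):
    """Drop emojis and any other character outside the ASCII range."""
    return title.encode("ascii", errors="ignore").decode("ascii").strip()


def preprocess(videos, seed=0):
    """Add the ratio feature, drop unused metadata, scale, and shuffle."""
    rows = []
    for video in videos:
        # Only the fields listed here are kept; ID and Favorites are left out.
        rows.append({
            "title": strip_non_ascii(video["title"]),
            "views": video["views"],
            "likes": video["likes"],
            "dislikes": video["dislikes"],
            "dislike_like_ratio": dislike_like_ratio(
                video["dislikes"], video["likes"]
            ),
        })
    for column in ("views", "likes", "dislikes", "dislike_like_ratio"):
        scaled = standardize([row[column] for row in rows])
        for row, value in zip(rows, scaled):
            row[column] = value
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows
```

The ratio is computed before scaling so that it reflects the raw counts rather than the standardized ones.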
- Ensemble of ML model estimators (Random Forest, K Nearest Neighbors, Support Vector Classifier, XGBoost, Logistic Regression, Gaussian Naive Bayes) for video statistics classification, using soft voting because we want to display probabilities.
- Feedforward neural network for video title classification using the Google Universal Sentence Encoder, which produces 512-dimensional embeddings and is trained with a DAN (Deep Averaging Network) encoder for language classification tasks. It is a 1 GB model, so it takes a while to load at first.
- Combining both with my own custom ensemble model. Used this to convert logits to logs
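Soft voting averages the class probabilities predicted by each estimator instead of counting hard votes, which is what makes a final probability available for display. The sketch below illustrates the idea in plain Python. The `combine` function and its `title_weight` parameter are hypothetical, since the custom ensemble's actual weighting is not described here.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model clickbait probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    if not probabilities or len(weights) != len(probabilities):
        raise ValueError("need at least one probability and one weight per probability")
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


def combine(stat_probabilities, title_probability, title_weight=0.5):
    """Blend the statistics ensemble with the title classifier.

    title_weight is an assumed parameter: the share of the final score
    given to the title classifier.
    """
    stats_probability = soft_vote(stat_probabilities)
    return soft_vote(
        [stats_probability, title_probability],
        [1.0 - title_weight, title_weight],
    )
```

For example, `combine([0.2, 0.4, 0.9], 0.9)` first averages the three statistics-based probabilities and then blends that average equally with the title classifier's probability.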
- Analyzing distributions and creating correlations for both numerical and natural language data
- Fetching & Parsing through JSON data with the YouTube API
- Ensembling machine learning estimators and TensorFlow Hub NLP models together
A potential software solution is a web application that any YouTube user could use. The UI of the web app would be optimised for mobile devices, given that more than 70% of YouTube viewing time occurs on mobile devices, though it would also work fine on desktop.