This repository was created as part of the Data Mining course of the Computer Science master’s program at TH Köln.
This repository examines Kaggle's Trending YouTube Video Statistics data set. The data is analyzed and, using both the given attributes and attributes derived from them, the time a video takes to start trending after publication is predicted. The data set only contains videos that have actually been trending. Further information can be found in the Business Understanding section.
The evaluation focuses primarily on the data for Germany. The evaluation section also includes a comparison with selected other regions. Several algorithms are used for the predictions, with a focus on classifiers.
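As a rough illustration of the prediction target, the following sketch derives the number of days between publication and trending for a single record and maps it to a class label. It is not taken from the project notebooks. The field formats (`publish_time` as an ISO 8601 timestamp, `trending_date` as `yy.dd.mm`) and the class boundaries in `to_class` are assumptions made for this example.

```python
from datetime import datetime


def days_until_trending(publish_time: str, trending_date: str) -> int:
    """Return the number of calendar days between publication and trending.

    Assumed formats (not confirmed by this README):
      publish_time  -> "2017-11-13T17:13:01.000Z" (ISO 8601, UTC)
      trending_date -> "17.14.11" (yy.dd.mm)
    """
    published = datetime.strptime(publish_time, "%Y-%m-%dT%H:%M:%S.%fZ").date()
    trending = datetime.strptime(trending_date, "%y.%d.%m").date()
    return (trending - published).days


def to_class(days: int) -> str:
    """Map a day count to a class label.

    The bin edges are hypothetical and only show how a regression-style
    target can be turned into a classification target.
    """
    if days <= 1:
        return "fast"
    if days <= 5:
        return "medium"
    return "slow"


if __name__ == "__main__":
    days = days_until_trending("2017-11-13T17:13:01.000Z", "17.14.11")
    print(days, to_class(days))
```

Binning the target this way is one option for using classifiers on a quantity that is naturally numeric; the actual class definition used in the project may differ.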
0_data/
Contains the data processed in the project.
1_business_understanding/
Information about the business understanding.
2_data_understanding/
Information about the data understanding, as well as notebooks for the exploration and visual processing of the data.
3_data_preparation/
Notebooks to prepare the data for the modeling phase.
4_modeling/
Notebooks and scripts for the application of prediction models, as well as the optimization of the feature set and the model parameters.
5_evaluation/
Notebooks for working out acquired insights and peculiarities, as well as the visual representation of these.
Programs:
Additional Python packages:
- numpy
- pandas
- scipy
- matplotlib
- seaborn
- pycountry
- scikit-learn (imported as sklearn)
- xgboost
For a reproducible environment:
pip install -r requirements.txt
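After installing, a quick check like the following can confirm that the packages are importable before opening the notebooks. This is a minimal sketch, not part of the repository. The module names in `REQUIRED` are assumed import names based on the package list above, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names; these can differ from the names used with pip
# (for example, scikit-learn is imported as sklearn).
REQUIRED = [
    "numpy",
    "pandas",
    "scipy",
    "matplotlib",
    "seaborn",
    "pycountry",
    "sklearn",
    "xgboost",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```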