IT4142E Data Science Capstone, Hanoi University of Science and Technology, 2021
Note that the data crawler in this project runs independently of the demo, so you have to run it manually. All other
steps are demonstrated visually in our demo, and the src
folder is for reference purposes only.
This section therefore covers only how to run the data crawler.
cd src/crawler
scrapy crawl full2ImdbCrawler
Since the data crawler outputs two different files, we need to join them into a single final dataset:
cd ..
python join_data.py
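The join step above can be pictured with the following minimal sketch. It is not the contents of join_data.py: the key column `movie_id`, the other column names, and the helper names `join_rows` and `join_csv` are hypothetical, chosen only to illustrate an inner join of two crawler outputs on a shared key using the Python standard library.

```python
import csv
import io


def join_rows(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on the column `key`."""
    # Index the right-hand rows by key for constant-time lookup.
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns from the right row are appended after the left ones.
            joined.append({**row, **match})
    return joined


def join_csv(left_file, right_file, out_file, key):
    """Read two CSV streams, join them on `key`, and write one CSV stream.

    Returns the number of joined rows written.
    """
    left_rows = list(csv.DictReader(left_file))
    right_rows = list(csv.DictReader(right_file))
    joined = join_rows(left_rows, right_rows, key)
    if joined:
        writer = csv.DictWriter(out_file, fieldnames=list(joined[0].keys()))
        writer.writeheader()
        writer.writerows(joined)
    return len(joined)


if __name__ == "__main__":
    # Hypothetical in-memory stand-ins for the two crawler output files.
    first = io.StringIO("movie_id,title\ntt1,A\ntt2,B\n")
    second = io.StringIO("movie_id,gross\ntt1,100\n")
    result = io.StringIO()
    count = join_csv(first, second, result, key="movie_id")
    print(count, "row(s) joined")
    print(result.getvalue())
```

An inner join drops rows that appear in only one of the two files. Whether the real script keeps or drops such rows is not stated here, so treat that choice as an assumption of the sketch.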
├── dataset          Dataset files in .csv
│   ├── extracted
│   ├── processed
│   └── **/*.csv
├── demo             Demo website source
├── notebook         Jupyter notebooks
├── src              Data crawler and other source code, for reference only
└── README.md        Project overview
After data collection: data_joined.csv
After data cleaning: cleaned_data.csv
For Machine Learning: feature_extracted.csv
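To check which of the three files listed above are present in a local checkout, a small helper along these lines may be useful. It is only a sketch: the function name `locate_stage_files` is hypothetical, and it assumes the files sit somewhere under the dataset folder, so it searches that folder recursively rather than relying on an exact path.

```python
from pathlib import Path

# File names taken from the list above, keyed by pipeline stage.
STAGE_FILES = {
    "data collection": "data_joined.csv",
    "data cleaning": "cleaned_data.csv",
    "machine learning": "feature_extracted.csv",
}


def locate_stage_files(dataset_dir):
    """Return a dict mapping each stage to the first matching path, or None."""
    root = Path(dataset_dir)
    found = {}
    for stage, name in STAGE_FILES.items():
        # rglob searches the directory and all of its subdirectories.
        matches = sorted(root.rglob(name))
        found[stage] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for stage, path in locate_stage_files("dataset").items():
        print(f"{stage}: {path if path else 'not found'}")
```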
Visit the project demo website at theobmgit.github.io/it4142e-bor.github.io/