The goal is to build an ETL pipeline that scrapes a website, processes the scraped data, and finally loads the result into a database of choice.
- Python
- requests, requests-html
- BeautifulSoup
- concurrent.futures
- sqlite3
- pandas
- re module
- PyCharm IDE
- Multithreading
- OOP
- SQL
We will create an ETL class that encapsulates the extract, transform, and load logic for a single page. Once we have the list of all such pages, we scale up with multithreading: one object of the ETL class is created per page, and the pipelines for all pages run concurrently across a pool of threads.
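The approach above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `PageETL` and `run_all`, the `pages` table, and the placeholder transform step are all assumptions made for the example. The page fetcher is passed in as a callable so the sketch runs without network access.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class PageETL:
    """Extract -> transform -> load for one page (hypothetical class name)."""

    def __init__(self, url, fetch, db_path, lock):
        self.url = url
        self.fetch = fetch        # callable: url -> page HTML as a string
        self.db_path = db_path
        self.lock = lock          # serialises writes to the SQLite file

    def extract(self):
        return self.fetch(self.url)

    def transform(self, html):
        # Placeholder processing (an assumption): trim whitespace, keep the length.
        return (self.url, len(html.strip()))

    def load(self, row):
        # Each call opens its own connection; the lock avoids concurrent writers.
        with self.lock:
            conn = sqlite3.connect(self.db_path)
            try:
                with conn:  # commits on success, rolls back on error
                    conn.execute(
                        "INSERT OR REPLACE INTO pages (url, length) VALUES (?, ?)",
                        row,
                    )
            finally:
                conn.close()

    def run(self):
        row = self.transform(self.extract())
        self.load(row)
        return row


def run_all(urls, fetch, db_path, max_workers=4):
    """Create one PageETL per URL and run them on a thread pool."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, length INTEGER)"
        )
    conn.close()

    lock = threading.Lock()
    jobs = [PageETL(url, fetch, db_path, lock) for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input jobs
        return list(pool.map(PageETL.run, jobs))
```

In a real run, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and `transform` would hold the BeautifulSoup and regex parsing for the target site.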
- Web scraping and website inspecting
- Multithreading
- ETL building
- Creating Object-Oriented Programs
- Creating partial functions with functools.partial
- Storing data into RDBMS
- Regex pattern building (Basics)
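As a small illustration of how two of the items above fit together, the sketch below uses `functools.partial` to pre-fill a compiled regex so that the resulting one-argument callable can be handed to a thread pool's `map`. The price pattern, the function name `extract_prices`, and the sample HTML strings are made up for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical pattern: a dollar amount such as $12.50 or $3
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(html, pattern, as_type=float):
    """Return every captured amount in html, converted with as_type."""
    return [as_type(match) for match in pattern.findall(html)]


# partial() fixes the pattern argument, leaving a function of html only
parse_page = partial(extract_prices, pattern=PRICE_RE)

pages = ["<b>$12.50</b> and <i>$3</i>", "<span>no price</span>"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(parse_page, pages))

print(results)
```

`Executor.map` passes one item from the iterable per call, which is why fixing the remaining arguments with `partial` is convenient here.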