This repository is an improvement of the original GW2-SRS project that makes use of AWS S3 to store data such as txt files of URLs and the CSV and JSON files produced by the ETL. Beyond that, this project also intends to implement Docker as the container medium and Airflow to execute the ETL at monthly intervals (31 days).
This module has been kept largely the same as the original, with just a few changes to adapt it to the new working environment and to support some new features.
An S3 bucket is used to store both the raw and the clean data, so the project can later pivot to other tools such as AWS database options (DynamoDB or an RDBMS), as illustrated in the sketch below.
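As an illustration, a minimal boto3 sketch of how an ETL artifact could be pushed to the bucket; the bucket name, key, and file paths here are placeholders, not the project's real values.

```python
# Minimal sketch of persisting raw/clean ETL output to S3 with boto3.
# Bucket name, key prefix, and file names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def upload_artifact(local_path: str, bucket: str, key: str) -> None:
    """Upload a local file (e.g. a CSV/JSON result or a txt URL list) to S3."""
    s3.upload_file(local_path, bucket, key)

# Example usage with made-up names:
# upload_artifact("output/clean_data.csv", "gw2-srs-bucket", "clean/clean_data.csv")
```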
The intention behind the use of Docker is to make the ETL functional anywhere at any time: by packaging it in a Docker container, it becomes possible to execute the ETL on other machines such as an AWS EC2 instance.
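A hypothetical Dockerfile along these lines could package the ETL; the `etl.py` entry point and `requirements.txt` file are assumptions about the project layout, not confirmed by the repository.

```dockerfile
# Hedged sketch: containerize the ETL so it runs identically on a laptop
# or an EC2 instance. File names below are assumptions about the layout.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project and run the (assumed) ETL entry point.
COPY . .
CMD ["python", "etl.py"]
```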
To make the ETL far more autonomous, the plan is to implement an Airflow DAG. By having Airflow execute the ETL monthly (roughly 31 days between runs), the pipeline not only keeps running but also keeps adding new data. A sketch of such a DAG follows.
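A hedged sketch of the kind of DAG this describes, assuming Airflow 2; the DAG id, start date, and `run_etl` callable are illustrative placeholders rather than the project's actual code.

```python
# Sketch of an Airflow DAG that triggers the ETL every 31 days.
# dag_id, start_date, and run_etl are assumptions for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # Placeholder for the project's ETL entry point.
    ...

with DAG(
    dag_id="gw2_srs_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(days=31),  # roughly one run per month
    catchup=False,                         # do not backfill missed runs
) as dag:
    etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)
```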
- This ETL runs in batch mode; no streaming is used, since the data must first be published on the web before it can be collected.
- The ETL uses Python's logging module to write errors and informational messages to a log file (see the sketch below).
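A minimal sketch of this kind of logging setup; the log file name and message format are assumptions rather than the project's actual configuration.

```python
# Hedged example of file-based logging as described above.
# "etl.log" and the format string are placeholder choices.
import logging

logging.basicConfig(
    filename="etl.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

logger.info("ETL run started")           # value/progress information
try:
    raise ValueError("example failure")
except ValueError:
    logger.exception("ETL step failed")  # errors, with traceback attached
```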