
PROJECT: Toronto climate ETL data pipeline that extracts climate data from an API, transforms the data by combining the climate data files using Python and shell scripts, and loads the transformed data into a local output folder

Author: 👤 Joshua Omolewa

1. Business Scenario

The company requires a data engineer to obtain Toronto climate data from the Canadian Climate API, concatenate it into a single file, and generate log files for error tracking. To download the weather data manually, visit https://climate.weather.gc.ca/historical_data/search_historic_data_e.html.

2. Business Requirements

Download the data from the Canadian Climate API. Concatenate the downloaded data files into one final CSV file, called all_years.csv, as output. Upload the scripts and the final all_years.csv to the GitHub repository.

3. Deliverable

Upload the shell script, the Python script, and all_years.csv to the GitHub repository.

Shell script: controls every operation, including downloading the data, setting up logging, and running the Python script.
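A minimal sketch of that orchestration, assuming the input/output/log folder layout described in this README; the file names and the download URL here are illustrative, not the verbatim contents of the repo's script:

```bash
#!/bin/bash
# Sketch of the shell orchestration: download, log, run Python, report.
# Folder names and the endpoint are assumptions; see the repo's script.
set -e
mkdir -p input output log
LOG_FILE="log/pipeline_$(date '+%Y%m%d_%H%M%S').log"

BASE_URL="https://climate.weather.gc.ca/climate_data/bulk_data_e.html"

for year in 2020 2021 2022; do
    # February (Month=2) hourly (timeframe=1) data for station 48549;
    # the Day value is ignored for hourly downloads
    wget -O "input/toronto_${year}_02.csv" \
        "${BASE_URL}?format=csv&stationID=48549&Year=${year}&Month=2&Day=14&timeframe=1&submit=Download+Data" \
        >> "$LOG_FILE" 2>&1
done

# Concatenate the downloaded files into output/all_years.csv
./python_script.py >> "$LOG_FILE" 2>&1

echo "SUCCESS"
```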

Python script: concatenates all the downloaded data into one file.
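A minimal sketch of the concatenation step using pandas, assuming the input/output folder layout above (the actual python_script.py may be implemented differently):

```python
#!/usr/bin/env python3
# Sketch of the concatenation step; folder names follow this README's
# layout, but the real python_script.py may differ.
import glob

import pandas as pd

# Read every downloaded CSV from the input folder, in a stable order
frames = [pd.read_csv(path) for path in sorted(glob.glob("input/*.csv"))]

# Stack the per-year files into one table and write the combined output
all_years = pd.concat(frames, ignore_index=True)
all_years.to_csv("output/all_years.csv", index=False)
```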

all_years.csv: the output file generated by concatenating the downloaded files.

4. Specification Detail

The data required is from station ID 48549, for the years 2020 to 2022, February only, in hourly format. The output file will be named all_years.csv.

Please note the following parameters when using the climate data API (see the shell script):

  • year: the year of the data to download (e.g. 2020, 2021, 2022)
  • month = 2: refers to February
  • Day: day of the month; the value of the "day" variable is not used and can be arbitrary
  • timeframe = 1: hourly data
  • timeframe = 2: daily data
  • timeframe = 3: monthly data
  • stationID: the station ID; for another station, change the value of the stationID variable
  • format = [csv|xml]: the output format; for data in XML format, change the value of the format variable to xml in the URL
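Putting these parameters together, a single February 2020 request for station 48549 might look like the line below. The bulk_data_e.html endpoint shown is the commonly documented Environment Canada bulk-download URL; confirm it against the shell script in this repo:

```bash
# Day=14 is a placeholder; its value is ignored for hourly downloads
curl -o "input/toronto_2020_02.csv" \
  "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=48549&Year=2020&Month=2&Day=14&timeframe=1&submit=Download+Data"
```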

Project Architecture

5. STEPS USED TO COMPLETE THIS PROJECT

  • Download the data with the shell script into the input folder on an Ubuntu virtual machine (VM) and automate the log-generation process
  • Execute the Python script ./python_script.py from the shell script to concatenate the data in the input folder into one file called all_years.csv and store the transformed data in the output folder
  • The shell script prints SUCCESS if all operations complete successfully
  • Upload the files to the GitHub repo using git push (see the git commands below)
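For the final step, the upload amounts to standard git commands. A minimal example, where shell_script.sh is a hypothetical name for the shell script and the branch name may differ:

```bash
# Stage the scripts and the combined output file, then push to GitHub
# (shell_script.sh is an assumed file name, not confirmed by this README)
git add shell_script.sh python_script.py output/all_years.csv
git commit -m "Add ETL scripts and combined climate data"
git push origin main
```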

Note: the pipeline can be automated using a cron job if needed (see the example crontab entry below).
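For example, a crontab entry like this one (the install path is hypothetical) would run the pipeline every day at 06:00 and append its output to a log file:

```bash
# Hypothetical crontab entry: minute hour day-of-month month day-of-week command
0 6 * * * /home/ubuntu/toronto_etl/shell_script.sh >> /home/ubuntu/toronto_etl/cron.log 2>&1
```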

PROJECT FILES

PROJECT BEING EXECUTED ON SHELL

FINAL SCRIPT IMAGE

Follow Me On

Show your support

Give a ⭐️ if this project helped you!
