Objective of this project: Make NOAA's Global Historical Climatalogy Network Daily (GHCN-D) data readily available to query for all locations on a range of granularity levels. NOAA provides free access to this data which is updated on a daily basis. I used Python here to convert the files to .csv and then SQL to build a database for querying. Later on I subsetted the data and uploaded it into DataBricks community edition using Python, then Scala to explore lambda functions.
LINK TO DATA FILES: download "ghcnd_all.tar.gz" --> ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily
Note as of Jan 2021: it appears the ftp link is no longer available, if so use https://registry.opendata.aws/noaa-ghcn/
Highly recommended, at least skim it. CLICK LINK TO SEE .TXT FILE
For an overview on converting the ghcnd-all.tar.gz
dataset to .csv,
see my other repo by clicking here. This
guide also includes information about uploading this dataset into a Postgres database, which
can then be used in conjunction with the SQL queries in the SQL folder.
This database hosts an extensive amount of climate records and individual characteristics that could prove useful to diverse research projects and study. I have tried to make the process of setting this up as simple as possible to understand to benefit other researchers who want to work with it. Some possible use cases:
- Subsetting this data for a certain region or set of stations to support research
- Querying this data for wind power research (best areas with steady, fast wind?)
- Querying this data for solar power research
- Tracking effects of climate change worldwide based on any characteristic (see doc about data contents)
Complete:
- Convert NOAA's .txt fact files into .csv (fixed-width, using Pandas)
- Fact file imports into Postgres database using COPY
- SQL script to build Postgres DB, tables for fact files
- Entity-Relationship Diagram / ERD below
- Postgres import of full dataset (roughly 25gb after
.csv
conversion) - Optimized queries for entire country yearly averages
- Exported GSN-station-only data to
.csv
- Ingested GSN-station-only data into Apache Spark as Dataframe using Python
- Ingested GSN-station-only data back into PostgreSQL as a separate entity
- Compared query execution times running on Spark SQL vs. PostgreSQL
- Ingested GSN-station-only data into Apache Spark as DataSet using Scala
- Mimicked SQL queries in Spark in Scala using
lambda
functions
Need to add to this ReadMe:
- Intro to NOAA, the data, limitations, etc
- Use of sed & stat error
- Query performance (20 min down to 5!)
- Main query explanation (
buildAVERAGES.sql
)
TODO Coding:
- Queries for #-year averages (of all stations or GSN, etc.)
- Check temperature anomalies with query for entire dataset on Postgres