related link: https://www.nytimes.com/interactive/2020/09/18/opinion/wildfire-hurricane-climate.html
Data found via: https://www.climate.gov/maps-data/dataset/daily-temperature-and-precipitation-reports-data-tables
Downloaded on 2020-07-15:
- ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2019.csv.gz
- ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/ghcn-daily-by_year-format.rtf
gunzip 2019.csv.gz
md5 2019.csv
## MD5 (2019.csv) = d5b1fbeda6962341efff7fc48bcaa958
wc 2019.csv
## 34188762 34188762 1196914814 2019.csv
Cool, 34M rows.
Text from ghcn-daily-by_year-format.rtf:
The following information serves as a definition of each field in one
line of data covering one station-day. Each field described below is
separated by a comma ( , ) and follows the order presented in this
document.
ID = 11 character station identification code
YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
ELEMENT = 4 character indicator of element type
DATA VALUE = 5 character data value for ELEMENT
M-FLAG = 1 character Measurement Flag
Q-FLAG = 1 character Quality Flag
S-FLAG = 1 character Source Flag
OBS-TIME = 4-character time of observation in hour-minute format (i.e.
0700 = 7:00 am)
See section III of the GHCN-Daily readme.txt file for an explanation
of ELEMENT codes and their units as well as the M-FLAG, Q-FLAGS and
S-FLAGS.
The OBS-TIME field is populated with the observation times contained
in NOAA/NCDC’s Multinetwork Metadata System (MMS).
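When this gets loaded in Python later, that layout maps straight onto a pandas read; a minimal sketch (the column names are mine, not from the file):
import pandas as pd

cols = ["id", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
# keep date and obs_time as strings so YYYYMMDD and leading zeros survive
df = pd.read_csv("2019.csv", names=cols, dtype={"date": str, "obs_time": str})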
Downloading:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/readme.txt
Got "Version 3.26." And it has info on how to cite this data.
mv ~/Downloads/readme.txt ./ghcn-readme.txt
Okay... Simple stats!
# count the unique weather stations
time cut -d, -f1 2019.csv | sort | uniq | wc
## 39602 39602 475224
## real 2m46.411s
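# (sort | uniq over 34M rows is slow, largely due to locale-aware collation; LC_ALL=C sort -u would speed these up)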
# count the unique dates
time cut -d, -f2 2019.csv | sort | uniq | wc
## 365 365 3285
## real 1m35.366s
# Good!
# Suggests 34188762 / 39602 / 365 ≈ 2.4 rows per station-day?
# Surely not?
# count the unique "elements"
time cut -d, -f3 2019.csv | sort | uniq | wc
## 70 70 350
## real 2m38.814s
# count the unique data values
time cut -d, -f4 2019.csv | sort | uniq | wc
## 4455 4455 21905
## real 3m48.674s
# count the unique measurement flags (M-FLAGs)
time cut -d, -f5 2019.csv | sort | uniq | wc
## 5 4 9
## real 1m10.871s
# count the unique quality flags (Q-FLAGs)
time cut -d, -f6 2019.csv | sort | uniq | wc
## 15 14 29
## real 1m5.142s
# count the unique source flags (S-FLAGs)
time cut -d, -f7 2019.csv | sort | uniq | wc
## 12 12 24
## real 1m18.996s
# count the unique OBS-TIMEs
time cut -d, -f8 2019.csv | sort | uniq | wc
## 40 39 196
## real 1m14.728s
# anything in a "field 9"?
time cut -d, -f9 2019.csv | sort | uniq | wc
## 1 0 1
## real 0m44.193s
time cut -d, -f9 2019.csv | sort | uniq -c
## 34188762
## real 0m43.326s
time cut -d, -f10 2019.csv | sort | uniq -c
## 34188762
## real 0m48.951s
Tables!
# table of 70 unique "elements"
time cut -d, -f3 2019.csv | sort | uniq -c | sort -rh
## 10360416 PRCP
## 4414862 TMIN
## 4374247 TMAX
## 4201205 SNOW
## 3122707 SNWD
## 2307503 TAVG
## 1728384 TOBS
## 459047 WESD
## 361753 AWND
## 334625 WSF2
## 334506 WDF2
## 331469 WSF5
## 331425 WDF5
## 272900 WESF
## 154984 WT01
## 152689 WSFG
## 145615 WDFG
## 109630 DAPR
## 108680 MDPR
## 91107 PGTM
## 64920 WT03
## 49543 SX32
## 49207 SN32
## 46274 EVAP
## 37386 WT08
## 34952 WDMV
## 22068 MXPN
## 21608 MNPN
## 19480 SX52
## 19100 SN52
## 18921 WT02
## 13595 WSFI
## 13109 AWDR
## 7560 DWPR
## 7398 WT06
## 5869 SX31
## 5869 SN31
## 4597 SN33
## 4593 SX33
## 4444 WT04
## 4315 WT11
## 2741 WT05
## 2693 MDTN
## 2693 DATN
## 2647 MDTX
## 2647 DATX
## 2547 SX53
## 2547 SN53
## 2006 SN51
## 2004 SX51
## 1806 SX35
## 1440 SN35
## 1290 PSUN
## 1286 TSUN
## 1180 SX55
## 1148 SN55
## 1089 SX36
## 1078 WT09
## 971 THIC
## 757 SX56
## 726 SN56
## 723 SN36
## 91 WT07
## 31 SX57
## 27 WT10
## 21 WT16
## 4 MDSF
## 4 DASF
## 2 WT18
## 1 WT15
##
## real 2m27.033s
Okay well TMIN and TMAX have got to be what I'm looking for, right?
From the readme:
The five core elements are:
PRCP = Precipitation (tenths of mm)
SNOW = Snowfall (mm)
SNWD = Snow depth (mm)
TMAX = Maximum temperature (tenths of degrees C)
TMIN = Minimum temperature (tenths of degrees C)
Taking a look...
grep TMIN 2019.csv | head -3
## AE000041196,20190101,TMIN,140,,,S,
## AEM00041194,20190101,TMIN,185,,,S,
## AEM00041217,20190101,TMIN,163,,,S,
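# (the first two ID characters are the FIPS country code, so AE = United Arab Emirates; a value of 140 means 14.0 °C)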
Measurement flags (field 5) from readme:
Blank = no measurement information applicable
B = precipitation total formed from two 12-hour totals
D = precipitation total formed from four six-hour totals
H = represents highest or lowest hourly temperature (TMAX or TMIN)
or the average of hourly values (TAVG)
K = converted from knots
L = temperature appears to be lagged with respect to reported
hour of observation
O = converted from oktas
P = identified as "missing presumed zero" in DSI 3200 and 3206
T = trace of precipitation, snowfall, or snow depth
W = converted from 16-point WBAN code (for wind direction)
Quality flags (field 6) from readme:
Blank = did not fail any quality assurance check
D = failed duplicate check
G = failed gap check
I = failed internal consistency check
K = failed streak/frequent-value check
L = failed check on length of multiday period
M = failed megaconsistency check
N = failed naught check
O = failed climatological outlier check
R = failed lagged range check
S = failed spatial consistency check
T = failed temporal consistency check
W = temperature too warm for snow
X = failed bounds check
Z = flagged as a result of an official Datzilla
investigation
What does that "S" mean?
time cut -d, -f7 2019.csv | sort | uniq -c | sort -rh
## 10198314 7 # US cooperative summary of the day
## 7791663 N # Community Collaborative Rain, Hail, and Snow (CoCoRaHS)
## 3308803 S # Global Summary of the Day (NCDC DSI-9618)
## 3229521 W # WBAN/ASOS Summary of the Day from NCDC's ISD
## 2045245 E # European Climate Assessment and Dataset
## 1996312 a # Australian data from the Australian Bureau of Meteorology
## 1966640 T # SNOwpack TELemetry (SNOTEL) data
## 1902334 C # Environment Canada
## 1334856 U # Remote Automatic Weather Station (RAWS) data
## 244630 H # High Plains Regional Climate Center real-time data
## 167829 R # All-Russian Research Institute
## 2615 Z # Datzilla official additions or replacements
##
## real 1m1.723s
Source flags (field 7) from readme:
Blank = No source (i.e., data value missing)
0 = U.S. Cooperative Summary of the Day (NCDC DSI-3200)
6 = CDMP Cooperative Summary of the Day (NCDC DSI-3206)
7 = U.S. Cooperative Summary of the Day -- Transmitted
via WxCoder3 (NCDC DSI-3207)
A = U.S. Automated Surface Observing System (ASOS)
real-time data (since January 1, 2006)
a = Australian data from the Australian Bureau of Meteorology
B = U.S. ASOS data for October 2000-December 2005 (NCDC
DSI-3211)
b = Belarus update
C = Environment Canada
D = Short time delay US National Weather Service CF6 daily
summaries provided by the High Plains Regional Climate
Center
E = European Climate Assessment and Dataset (Klein Tank
et al., 2002)
F = U.S. Fort data
G = Official Global Climate Observing System (GCOS) or
other government-supplied data
H = High Plains Regional Climate Center real-time data
I = International collection (non U.S. data received through
personal contacts)
K = U.S. Cooperative Summary of the Day data digitized from
paper observer forms (from 2011 to present)
M = Monthly METAR Extract (additional ASOS data)
m = Data from the Mexican National Water Commission (Comision
Nacional del Agua -- CONAGUA)
N = Community Collaborative Rain, Hail, and Snow (CoCoRaHS)
Q = Data from several African countries that had been
"quarantined", that is, withheld from public release
until permission was granted from the respective
meteorological services
R = NCEI Reference Network Database (Climate Reference Network
and Regional Climate Reference Network)
r = All-Russian Research Institute of Hydrometeorological
Information-World Data Center
S = Global Summary of the Day (NCDC DSI-9618)
NOTE: "S" values are derived from hourly synoptic reports
exchanged on the Global Telecommunications System (GTS).
Daily values derived in this fashion may differ significantly
from "true" daily data, particularly for precipitation
(i.e., use with caution).
s = China Meteorological Administration/National Meteorological
Information Center/Climatic Data Center (http://cdc.cma.gov.cn)
T = SNOwpack TELemetry (SNOTEL) data obtained from the U.S.
Department of Agriculture's Natural Resources Conservation Service
U = Remote Automatic Weather Station (RAWS) data obtained
from the Western Regional Climate Center
u = Ukraine update
W = WBAN/ASOS Summary of the Day from NCDC's Integrated
Surface Data (ISD).
X = U.S. First-Order Summary of the Day (NCDC DSI-3210)
Z = Datzilla official additions or replacements
z = Uzbekistan update
When data are available for the same time from more than one source,
the highest priority source is chosen according to the following
priority order (from highest to lowest):
Z,R,D,0,6,C,X,W,K,7,F,B,M,m,r,E,z,u,b,s,a,G,Q,I,A,N,T,U,H,S
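If rows from multiple sources ever do collide, that priority order is easy to turn into a rank lookup in the notebook; a quick Python sketch (the string is just the order above with the commas dropped):
priority = "ZRD06CXWK7FBMmrEzubsaGQIANTUHS"  # highest to lowest, from the readme
rank = {flag: i for i, flag in enumerate(priority)}  # lower rank = higher priority
# e.g. keep the row with the minimum rank per (station, date, element)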
Okay really let's just split out the TMIN and TMAX stuff.
grep TMIN 2019.csv > 2019_TMIN.csv &
grep TMAX 2019.csv > 2019_TMAX.csv &
wc 2019_*
## 4374247 4374247 159624414 2019_TMAX.csv
## 4414862 4414862 160644176 2019_TMIN.csv
# Matches the earlier element counts from field 3, so grep didn't pick up any spurious matches:
## 4374247 TMAX
## 4414862 TMIN
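# (the cut -c1-4 below grabs the right-aligned count column that uniq -c prints; the padding width varies between uniq implementations)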
time cat 2019_TMIN.csv | cut -d, -f1 | sort | uniq -c | cut -c1-4 | sort -rh | uniq -c
## 5143 365
## 991 364
## 537 363
## 357 362
## 260 361
## 215 360
## 160 359
## 147 358
## 134 357
## 125 356
## 95 355
## ...
# Out of this total number of stations:
time cat 2019_TMIN.csv | cut -d, -f1 | sort | uniq -c | wc
## 13927 27854 236759
time cat 2019_TMAX.csv | cut -d, -f1 | sort | uniq -c | cut -c1-4 | sort -rh | uniq -c
## 5205 365
## 1006 364
## 531 363
## 369 362
## 277 361
## 217 360
## 187 359
## 158 358
## 141 357
## 128 356
## 108 355
## 85 354
## ...
time cat 2019_TMAX.csv | cut -d, -f1 | sort | uniq -c | wc
## 13861 27722 235637
Hmm hmm hmm... I could use just the stations with complete data. But I could also cover even more stations by including the ones with some missing days!
What are my temperature bounds going to be?
I was thinking 40 and 80 degrees F, but now it's in C... I could convert, I guess?
I'll just convert my targets, so 40 F is about 4 C (39.2 F) and 80 F is about 27 C (80.6 F).
I can adjust these later too. Keep them as variables.
cold, hot = 4, 27
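Applying those in the notebook means remembering the file stores tenths of °C; a minimal sketch (filenames from above, column names mine):
import pandas as pd

cold, hot = 4, 27  # thresholds in °C
cols = ["id", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
tmin = pd.read_csv("2019_TMIN.csv", names=cols)
too_cold = tmin[tmin["value"] < cold * 10]  # data values are tenths of °C
# (same idea for 2019_TMAX.csv against hot * 10)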
Oh by the way: downloaded this station info, which has lat/lon.
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
wc ghcnd-stations.txt
## 115082 816901 9897052 ghcnd-stations.txt
Lots of stations!
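It's a fixed-width file, so pandas' read_fwf fits; a sketch using the column positions from section IV of ghcn-readme.txt (worth double-checking against the readme; names are mine):
import pandas as pd

# 0-based, end-exclusive spans for ID, LATITUDE, LONGITUDE, ELEVATION, STATE, NAME
stations = pd.read_fwf(
    "ghcnd-stations.txt",
    colspecs=[(0, 11), (12, 20), (21, 30), (31, 37), (38, 40), (41, 71)],
    names=["id", "lat", "lon", "elev", "state", "name"],
)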
Is the TMIN/TMAX data all unique, as in just one reading per station-day?
wc 2019_T*
## 4374247 4374247 159624414 2019_TMAX.csv
## 4414862 4414862 160644176 2019_TMIN.csv
cut -d, -f1-2 2019_TMAX.csv | sort | uniq | wc
## 4374247 4374247 91859187
cut -d, -f1-2 2019_TMIN.csv | sort | uniq | wc
## 4414862 4414862 92712102
Looks good! The line counts match the unique (station, date) counts, so no duplicate entries to worry about!
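And since (station, date) is unique in each file, pairing them up in the notebook is a clean merge; a sketch:
import pandas as pd

cols = ["id", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
tmin = pd.read_csv("2019_TMIN.csv", names=cols)
tmax = pd.read_csv("2019_TMAX.csv", names=cols)
both = tmin.merge(tmax, on=["id", "date"], suffixes=("_min", "_max"))
# inner join: keeps only station-days that have both readings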
Working in analyze*.ipynb
...
I want more data!
Finding data via ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
FTP site is being flaky...
Downloading on 2020-08-18:
- ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2018.csv.gz
- ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2017.csv.gz
- ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
- and so on, through 2010
gunzip ~/Downloads/*gz
mv ~/Downloads/201?.csv ./
Ten years of weather data should fill in the map a little...
Returning to this on Monday 2020-12-28...
l ????.csv
## 2010.csv
## 2011.csv
## 2012.csv
## 2013.csv
## 2014.csv
## 2015.csv
## 2016.csv
## 2017.csv
## 2018.csv
## 2019.csv
# -h suppresses the filename prefix on matched lines (these run in the background; wait for them to finish before using the outputs)
grep -h TMIN ????.csv > 201X_TMIN.csv &
grep -h TMAX ????.csv > 201X_TMAX.csv &
grep -h PRCP ????.csv > 201X_PRCP.csv &