
Commit 76705b8

Merge branch 'tiliaChronFixes' into production
2 parents: 0f1bb81 + 9b4cfbd

34 files changed: +15877 −1019 lines changed

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -20,3 +20,15 @@ helpers/settings.yaml
 fixGeorge.sh

 helpers/localDup.sh
+
+*.tar
+
+helpers/archives/
+
+*.gz
+
+.vscode/
+
+helpers/figshareUpload/settings.yaml
+
+helpers/figshareUpload/lib/__pycache__/

Proposals/ostracode_support/EANOD published data June 2024.csv

Lines changed: 2188 additions & 0 deletions
Large diffs are not rendered by default.

Proposals/ostracode_support/NODE database 22May2024.csv

Lines changed: 10362 additions & 0 deletions
Large diffs are not rendered by default.
171 KB · Binary file not shown.

Proposals/uncertainty/uncertainty.svg

Lines changed: 415 additions & 0 deletions
Lines changed: 76 additions & 0 deletions
---
title: "A New Neotoma Uncertainty Model"
format: pdf
---

# Adding Uncertainty to Neotoma

Reporting uncertainty for measured values is critical. We need it associated directly with individual measurements, we need to identify the type of uncertainty, and, potentially, the source of the uncertainty (method of calculation, etc.). This means that for any uncertainty measurement we need a link to the sample and the variable being measured, a fixed set of uncertainty measures (standard deviations, standard errors), and a way to freely define the source of the uncertainty (or perhaps, again, a fixed set of sources). So, it should be possible to report the following:

| reference                       | value | units | uncertainty reported | source                                   |
|---------------------------------|-------|-------|----------------------|------------------------------------------|
| Pinus count for sample 1223445  | 12    | NISP  | 1 SD                 | Maher nomograms (cf. Maher Jr 1972)      |
| pH for sample 23244             | .02   | pH    | 95% CI               | Reported instrumental error from device  |
| NaOH for sample 23244           | .02   | ug    | 95% CI               | Reported instrumental error from device  |

## Table modifications

The uncertainty must be linked to `ndb.data.dataid`, because it modifies the `ndb.data.value` for that variable & sample. We can generally assume that the units for the uncertainty are the same as the units associated with the variable; however, uncertainty may also be expressed as a percent value. Given this, we will create a new table that links to the `ndb.data.dataid` primary key. This allows us to traverse to the `ndb.variables` entry for the record (to retrieve the taxonomic information), and potentially link to the variable units if they are equivalent.

Given this data model:

* The table `ndb.data` remains as is.
* The table `ndb.variables` remains as is.
* We add a new table `ndb.datauncertainties` that uses `fk(dataid)` (the `fk(variableid)` is implied).
* The table has columns `uncertaintyvalue`, `uncertaintyunitid`, `uncertaintybasisid` and `notes`, along with the standard `recdatecreated` and `recdatemodified`.

Uncertainty rows will inherit information from the `ndb.variables` row, so the assumption is that the uncertainty is reported in the same units (and for the same taxon) as the `ndb.data.value`.

![Overall structure of the tables](uncertainty.svg)
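
As a sketch of how values and uncertainties could be retrieved together under this model (assuming the `ndb.datauncertainties` and `ndb.uncertaintybases` tables defined later in this proposal):

```SQL
-- Sketch only: return each measured value for a sample alongside any reported
-- uncertainty and the basis on which it was reported.
SELECT dt.dataid,
       var.variableid,
       dt.value,
       du.uncertaintyvalue,
       ub.uncertaintybasis
FROM ndb.data AS dt
INNER JOIN ndb.variables AS var ON var.variableid = dt.variableid
LEFT OUTER JOIN ndb.datauncertainties AS du ON du.dataid = dt.dataid
LEFT OUTER JOIN ndb.uncertaintybases AS ub ON ub.uncertaintybasisid = du.uncertaintybasisid
WHERE dt.sampleid = 1223445;  -- the sample from the example table above
```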

### Example Table

| column             | type    | nulls | default | children | parents              | comments                |
|--------------------|---------|-------|---------|----------|----------------------|-------------------------|
| dataid             | integer | F     | null    |          | ndb.data             | fk(dataid)              |
| uncertaintyvalue   | float   | F     |         |          |                      | The value is required.  |
| uncertaintyunitid  | integer | F     |         |          | ndb.variableunits    | fk(variableunitsid)     |
| uncertaintybasisid | integer | F     |         |          | ndb.uncertaintybases | fk(uncertaintybasisid)  |
| notes              | text    | T     | null    |          |                      |                         |

#### Proposed `ndb.uncertaintybases.uncertaintybasis` values

Proposed values will come from the standard ways uncertainty is reported:

* 1 Standard Deviation
* 2 Standard Deviations
* 3 Standard Deviations
* Mean square error

```SQL
CREATE TABLE IF NOT EXISTS ndb.uncertaintybases (
  uncertaintybasisid SERIAL PRIMARY KEY,
  uncertaintybasis text,
  CONSTRAINT uniquebasis UNIQUE (uncertaintybasis)
);

INSERT INTO ndb.uncertaintybases (uncertaintybasis)
VALUES ('1 Standard Deviation'),
       ('2 Standard Deviations'),
       ('3 Standard Deviations'),
       ('1 Standard Error');
```
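
Downstream inserts would then reference these rows by id; a minimal lookup sketch:

```SQL
-- Example: look up the id used when recording a 1 SD uncertainty.
SELECT uncertaintybasisid
FROM ndb.uncertaintybases
WHERE uncertaintybasis = '1 Standard Deviation';
```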

### Proposed `ndb.datauncertainties` structure

| dataid | uncertaintyvalue | uncertaintyunitid | uncertaintybasisid | notes | . . . |

```SQL
CREATE TABLE IF NOT EXISTS ndb.datauncertainties (
  dataid INTEGER REFERENCES ndb.data(dataid),
  uncertaintyvalue float,
  uncertaintyunitid integer REFERENCES ndb.variableunits(variableunitsid),
  uncertaintybasisid integer REFERENCES ndb.uncertaintybases(uncertaintybasisid),
  notes text,
  CONSTRAINT uniqueentryvalue UNIQUE (dataid, uncertaintyunitid, uncertaintybasisid)
);
```
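
A minimal usage sketch (the `dataid`, the unit label, and the `variableunits` column name are illustrative assumptions only):

```SQL
-- Sketch only: record a 1 SD uncertainty of 2 NISP against a hypothetical data row.
-- Assumes ndb.variableunits stores its label in a column named variableunits.
INSERT INTO ndb.datauncertainties
  (dataid, uncertaintyvalue, uncertaintyunitid, uncertaintybasisid, notes)
VALUES
  (1234567,  -- hypothetical ndb.data.dataid
   2,
   (SELECT variableunitsid FROM ndb.variableunits WHERE variableunits = 'NISP'),
   (SELECT uncertaintybasisid FROM ndb.uncertaintybases WHERE uncertaintybasis = '1 Standard Deviation'),
   'Estimated from Maher (1972) nomograms');
```
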
Lines changed: 34 additions & 0 deletions
---
title: "Untitled"
format: html
---

# Adding Uncertainty to Neotoma

Reporting uncertainty is critical. We need it associated directly with individual measurements, and we need to identify the type of uncertainty.

## Table modifications

The table `ndb.data` needs two new columns: `uncertaintyvalue` and `uncertaintytype`.

These values will inherit information from the `ndb.variables` row, so the assumption is that the uncertainty is reported in the same units (and for the same taxon) as the `ndb.data.value`.

![Overall structure of the tables](uncertainty.svg)

### Proposed `ndb.data` structure:

| dataid | sampleid | variableid | value | uncertaintyvalue | uncertaintybasisid | . . . |
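
Under this earlier approach, the change would amount to something like the following (a sketch only; the column types and the foreign-key target are assumptions, not settled design):

```SQL
-- Sketch of the column-based alternative: add uncertainty directly to ndb.data.
-- Types and the referenced table/column are assumed for illustration.
ALTER TABLE ndb.data
    ADD COLUMN uncertaintyvalue float,
    ADD COLUMN uncertaintybasisid integer REFERENCES ndb.uncertaintybasis(uncertaintybasisid);
```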

### Proposed `ndb.uncertaintybasis` structure:

| uncertaintybasisid | uncertaintybasis | . . . |

#### Proposed `ndb.uncertaintybasis.uncertaintybasis` values:

Proposed values will come from the standard ways uncertainty is reported:

* 1 Standard Deviation
* 2 Standard Deviations
* 3 Standard Deviations
* Mean square error

datachecks/clean_empty_strings.py

Lines changed: 73 additions & 0 deletions
"""_Check for non-breaking spaces in text fields_

This issue arose as part of some text-searching that a
user was doing. The code here connects to a PostgreSQL
database, finds text columns that contain non-breaking or
other unusual Unicode space characters, and replaces those
characters with plain spaces.
"""
import json
import psycopg2
from psycopg2 import sql

print("\nRunning database tests.")
with open('../connect_remote.json', encoding='UTF-8') as f:
    data = json.load(f)

conn = psycopg2.connect(**data)
conn.autocommit = True
cur = conn.cursor()

# List all text columns in the database (excluding system and staging schemas).
TEXT_COLS = """
    select col.table_schema,
           col.table_name,
           col.ordinal_position as column_id,
           col.column_name,
           col.data_type,
           col.character_maximum_length as maximum_length
    from information_schema.columns col
    join information_schema.tables tab on tab.table_schema = col.table_schema
                                       and tab.table_name = col.table_name
                                       and tab.table_type = 'BASE TABLE'
    where col.data_type in ('character varying', 'character',
                            'text', '"char"', 'name')
      and col.table_schema not in ('information_schema', 'pg_catalog', 'public', 'tmp', 'pglogical')
    order by col.table_schema,
             col.table_name,
             col.ordinal_position;"""

cur.execute(TEXT_COLS)
tables = cur.fetchall()

# Find rows where a column contains a non-breaking or zero-width space character.
EMPTY_SPACE = """
    SELECT {}
    FROM {}
    WHERE {} ~ '.*[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF].*'"""

runcounter = []

for row in tables:
    tableobj = {'schema': row[0], 'table': row[1], 'column': row[3]}
    cur.execute(
        sql.SQL(EMPTY_SPACE).format(sql.Identifier(row[3]),
                                    sql.Identifier(row[0], row[1]),
                                    sql.Identifier(row[3])))
    tableobj.update({'rows': cur.fetchall()})
    runcounter.append(tableobj)

# Keep only the columns where at least one affected row was found.
FIELDS = list(filter(lambda x: len(x['rows']) > 0, runcounter))

print(f"Found {len(FIELDS)} columns containing non-breaking space characters.")

# Replace every non-breaking / zero-width space with a plain space
# (the 'g' flag replaces all occurrences within each value).
UPDATE_QUERY = """
    UPDATE {}
    SET {} = regexp_replace({},
                            '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]',
                            ' ', 'g')
    WHERE {} ~ '.*[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF].*'"""

for row in FIELDS:
    cur.execute(
        sql.SQL(UPDATE_QUERY).format(sql.Identifier(row['schema'], row['table']),
                                     sql.Identifier(row['column']),
                                     sql.Identifier(row['column']),
                                     sql.Identifier(row['column'])))
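
For a single text column, the cleanup statement the script composes is roughly equivalent to the following (a sketch only; the table and column are hypothetical, and PostgreSQL's `\u` regex escapes stand in for the literal characters Python substitutes into the query string):

```SQL
-- Sketch of the per-column cleanup: swap non-breaking and zero-width space
-- characters for a plain space. Table and column names are hypothetical.
UPDATE ndb.sites
SET sitedescription = regexp_replace(sitedescription,
      '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]', ' ', 'g')
WHERE sitedescription ~ '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]';
```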

function/ap/dailyquerytable.sql

Lines changed: 97 additions & 0 deletions
CREATE FUNCTION ap.dailyquerytable(_interval VARCHAR)
RETURNS TABLE (siteid INT,
               sitename VARCHAR,
               datasetid INT,
               chronologyid INT,
               altitude FLOAT,
               datasettype VARCHAR,
               databaseid INT,
               collectionunitid INT,
               colltype VARCHAR,
               depenvt VARCHAR,
               geog GEOGRAPHY,
               older FLOAT,
               younger FLOAT,
               agetype VARCHAR,
               publications INT[],
               taxa INT[],
               keywords INT[],
               contacts INT[],
               collectionunit JSONB,
               geopol INT[])
AS $$
  WITH allids AS (
    SELECT st.siteid,
           unnest(array_append(gp.geoout, gp.geoin::int)) AS geopol
    FROM ndb.sites AS st
    INNER JOIN ndb.sitegeopolitical AS sgp ON st.siteid = sgp.siteid
    INNER JOIN ndb.geopoliticalunits AS gpu ON gpu.geopoliticalid = sgp.geopoliticalid
    INNER JOIN ndb.geopaths AS gp ON gp.geoin = sgp.geopoliticalid
  ),
  sgp AS (
    SELECT siteid, array_agg(DISTINCT geopol) AS geopol
    FROM allids
    GROUP BY siteid
  )
  SELECT st.siteid,
         st.sitename,
         ds.datasetid,
         chron.chronologyid,
         st.altitude,
         dst.datasettype,
         dsdb.databaseid,
         cu.collectionunitid,
         cut.colltype,
         dvt.depenvt,
         st.geog,
         arg.older,
         arg.younger,
         agetypes.agetype,
         array_remove(array_agg(DISTINCT dspb.publicationid), NULL) AS publications,
         array_remove(array_agg(DISTINCT var.taxonid), NULL) AS taxa,
         array_remove(array_agg(DISTINCT smpkw.keywordid), NULL) AS keywords,
         array_remove(array_agg(DISTINCT dpi.contactid) || array_agg(DISTINCT sma.contactid), NULL) AS contacts,
         jsonb_build_object('collectionunitid', cu.collectionunitid,
                            'collectionunit', cu.collunitname,
                            'handle', cu.handle,
                            'collectionunittype', cut.colltype,
                            'datasets', json_agg(DISTINCT jsonb_build_object('datasetid', ds.datasetid,
                                                                             'datasettype', dst.datasettype))) AS collectionunit,
         sgp.geopol
  FROM ndb.sites AS st
  LEFT OUTER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
  LEFT OUTER JOIN ndb.collectiontypes AS cut ON cut.colltypeid = cu.colltypeid
  LEFT OUTER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
  LEFT OUTER JOIN ndb.depenvttypes AS dvt ON dvt.depenvtid = cu.depenvtid
  LEFT OUTER JOIN ndb.datasetpis AS dpi ON dpi.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.datasettypes AS dst ON dst.datasettypeid = ds.datasettypeid
  LEFT OUTER JOIN ndb.datasetdatabases AS dsdb ON ds.datasetid = dsdb.datasetid
  LEFT OUTER JOIN ndb.datasetpublications AS dspb ON dspb.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.chronologies AS chron ON chron.collectionunitid = ds.collectionunitid
  LEFT OUTER JOIN ndb.dsageranges AS arg ON ds.datasetid = arg.datasetid AND chron.agetypeid = arg.agetypeid
  LEFT OUTER JOIN ndb.agetypes AS agetypes ON agetypes.agetypeid = arg.agetypeid
  LEFT OUTER JOIN ndb.samples AS smp ON smp.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.sampleanalysts AS sma ON sma.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.samplekeywords AS smpkw ON smpkw.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.data AS dt ON dt.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.variables AS var ON var.variableid = dt.variableid
  LEFT OUTER JOIN sgp AS sgp ON st.siteid = sgp.siteid
  WHERE ds.recdatemodified > current_date - (_interval || 'day')::INTERVAL OR
        smp.recdatemodified > current_date - (_interval || 'day')::INTERVAL OR
        st.recdatemodified > current_date - (_interval || 'day')::INTERVAL
  GROUP BY st.siteid,
           cu.collectionunitid,
           st.sitename,
           ds.datasetid,
           cut.colltype,
           chron.chronologyid,
           dsdb.databaseid,
           st.altitude,
           dst.datasettype,
           st.geog,
           arg.older,
           arg.younger,
           agetypes.agetype,
           sgp.geopol,
           dvt.depenvt
$$ LANGUAGE sql;
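
A usage sketch: `_interval` is the look-back window in days, passed as text.

```SQL
-- Example call: rows for sites, collection units and datasets touched in the last 7 days.
SELECT siteid, sitename, datasetid, datasettype
FROM ap.dailyquerytable('7');
```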

function/ap/dailysummaries.sql

Lines changed: 19 additions & 0 deletions
CREATE OR REPLACE FUNCTION ap.dailysummaries(_interval VARCHAR DEFAULT '1')
RETURNS TABLE (dbdate DATE, sites BIGINT, datasets BIGINT, publications BIGINT, observations BIGINT)
AS $$
  SELECT DISTINCT date_trunc('day', ds.recdatecreated)::date AS dbdate,
         COUNT(DISTINCT st.siteid) AS sites,
         COUNT(DISTINCT ds.datasetid) AS datasets,
         COUNT(DISTINCT pu.publicationid) AS publications,
         COUNT(DISTINCT dt.dataid) AS observations
  FROM ndb.sites AS st
  INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
  INNER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
  INNER JOIN ndb.datasetpublications AS dspu ON dspu.datasetid = ds.datasetid
  INNER JOIN ndb.publications AS pu ON pu.publicationid = dspu.publicationid
  INNER JOIN ndb.analysisunits AS au ON au.collectionunitid = cu.collectionunitid
  INNER JOIN ndb.samples AS smp ON smp.analysisunitid = au.analysisunitid
  INNER JOIN ndb.data AS dt ON dt.sampleid = smp.sampleid
  WHERE ds.recdatecreated > current_date - (_interval || 'day')::INTERVAL
  GROUP BY date_trunc('day', ds.recdatecreated)
$$ LANGUAGE sql;
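
A usage sketch for the summary function:

```SQL
-- Example call: daily counts of sites, datasets, publications and observations
-- added over the last 30 days.
SELECT *
FROM ap.dailysummaries('30')
ORDER BY dbdate;
```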
