
Commit 76705b8

Merge branch 'tiliaChronFixes' into production
2 parents: 0f1bb81 + 9b4cfbd

34 files changed: +15877 −1019 lines changed

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -20,3 +20,15 @@ helpers/settings.yaml
 fixGeorge.sh

 helpers/localDup.sh
+
+*.tar
+
+helpers/archives/
+
+*.gz
+
+.vscode/
+
+helpers/figshareUpload/settings.yaml
+
+helpers/figshareUpload/lib/__pycache__/

Proposals/ostracode_support/EANOD published data June 2024.csv

Lines changed: 2188 additions & 0 deletions
Large diffs are not rendered by default.

Proposals/ostracode_support/NODE database 22May2024.csv

Lines changed: 10362 additions & 0 deletions
Large diffs are not rendered by default.
171 KB · Binary file not shown.

Proposals/uncertainty/uncertainty.svg

Lines changed: 415 additions & 0 deletions
Lines changed: 76 additions & 0 deletions
---
title: "A New Neotoma Uncertainty Model"
format: pdf
---

# Adding Uncertainty to Neotoma

Reporting uncertainty for measured values is critical. We need it associated directly with individual measurements, we need to identify the type of uncertainty, and, potentially, the source of the uncertainty (method of calculation, etc.). This means that for any uncertainty measurement we need a link to the sample and the variable being measured, a fixed set of uncertainty measures (standard deviations, standard errors), and a way to freely define the source of the uncertainty (or perhaps, again, a fixed set of sources). So, it should be possible to report the following:

| reference                       | value | units | uncertainty reported | source                                   |
|---------------------------------|-------|-------|----------------------|------------------------------------------|
| Pinus count for sample 1223445  | 12    | NISP  | 1 SD                 | Maher nomograms (cf. Maher Jr 1972)      |
| pH for sample 23244             | .02   | pH    | 95% CI               | Reported instrumental error from device  |
| NaOH for sample 23244           | .02   | ug    | 95% CI               | Reported instrumental error from device  |

## Table modifications

The uncertainty must be linked to `ndb.data.dataid`, because it modifies the `ndb.data.value` for that variable & sample. We can generally assume that the units for the uncertainty are the same as the units associated with the variable; however, uncertainty may also be expressed as a percent value. Given this, we will create a new table that links to the `ndb.data.dataid` primary key. This allows us to traverse to the `ndb.variables` entry for the record (to retrieve the taxonomic information), and potentially link to the variable units if they are equivalent.

Given this data model:

* The table `ndb.data` remains as is.
* The table `ndb.variables` remains as is.
* We add a new table `ndb.datauncertainties` that uses `fk(dataid)` (the `fk(variableid)` is implied).
* The table has columns `uncertaintyvalue`, `uncertaintyunitid`, `uncertaintybasisid` and `notes`, along with the standard `recdatecreated` and `recdatemodified`.

Uncertainty rows will inherit information from the `ndb.variables` row, so the assumption is that the uncertainty is reported in the same units (and for the same taxon) as the `ndb.data.value`.

![Overall structure of the tables](uncertainty.svg)
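
As a sketch of how values and uncertainties could be retrieved together under this model (assuming the `ndb.datauncertainties` and `ndb.uncertaintybases` tables defined later in this proposal):

```SQL
-- Sketch only: return each measured value for a sample alongside any reported
-- uncertainty and the basis on which it was reported.
SELECT dt.dataid,
       var.variableid,
       dt.value,
       du.uncertaintyvalue,
       ub.uncertaintybasis
FROM ndb.data AS dt
INNER JOIN ndb.variables AS var ON var.variableid = dt.variableid
LEFT OUTER JOIN ndb.datauncertainties AS du ON du.dataid = dt.dataid
LEFT OUTER JOIN ndb.uncertaintybases AS ub ON ub.uncertaintybasisid = du.uncertaintybasisid
WHERE dt.sampleid = 1223445;  -- the sample from the example table above
```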

### Example Table

| column             | type    | nulls | default | children | parents              | comments                |
|--------------------|---------|-------|---------|----------|----------------------|-------------------------|
| dataid             | integer | F     | null    |          | ndb.data             | fk(dataid)              |
| uncertaintyvalue   | float   | F     |         |          |                      | The value is required.  |
| uncertaintyunitid  | integer | F     |         |          | ndb.variableunits    | fk(variableunitsid)     |
| uncertaintybasisid | integer | F     |         |          | ndb.uncertaintybases | fk(uncertaintybasisid)  |
| notes              | text    | T     | null    |          |                      |                         |

#### Proposed `ndb.uncertaintybases.uncertaintybasis` values

Proposed values will come from the standard ways uncertainty is reported:

* 1 Standard Deviation
* 2 Standard Deviations
* 3 Standard Deviations
* Mean square error

```SQL
CREATE TABLE IF NOT EXISTS ndb.uncertaintybases (
  uncertaintybasisid SERIAL PRIMARY KEY,
  uncertaintybasis text,
  CONSTRAINT uniquebasis UNIQUE (uncertaintybasis)
);

INSERT INTO ndb.uncertaintybases (uncertaintybasis)
VALUES ('1 Standard Deviation'),
       ('2 Standard Deviations'),
       ('3 Standard Deviations'),
       ('1 Standard Error');
```
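
Downstream inserts would then reference these rows by id; a minimal lookup sketch:

```SQL
-- Example: look up the id used when recording a 1 SD uncertainty.
SELECT uncertaintybasisid
FROM ndb.uncertaintybases
WHERE uncertaintybasis = '1 Standard Deviation';
```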

### Proposed `ndb.datauncertainties` structure

| dataid | uncertaintyvalue | uncertaintyunitid | uncertaintybasisid | notes | . . . |

```SQL
CREATE TABLE IF NOT EXISTS ndb.datauncertainties (
  dataid INTEGER REFERENCES ndb.data(dataid),
  uncertaintyvalue float,
  uncertaintyunitid integer REFERENCES ndb.variableunits(variableunitsid),
  uncertaintybasisid integer REFERENCES ndb.uncertaintybases(uncertaintybasisid),
  notes text,
  CONSTRAINT uniqueentryvalue UNIQUE (dataid, uncertaintyunitid, uncertaintybasisid)
);
```
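
A minimal usage sketch (the `dataid`, the unit label, and the `variableunits` column name are illustrative assumptions only):

```SQL
-- Sketch only: record a 1 SD uncertainty of 2 NISP against a hypothetical data row.
-- Assumes ndb.variableunits stores its label in a column named variableunits.
INSERT INTO ndb.datauncertainties
  (dataid, uncertaintyvalue, uncertaintyunitid, uncertaintybasisid, notes)
VALUES
  (1234567,  -- hypothetical ndb.data.dataid
   2,
   (SELECT variableunitsid FROM ndb.variableunits WHERE variableunits = 'NISP'),
   (SELECT uncertaintybasisid FROM ndb.uncertaintybases WHERE uncertaintybasis = '1 Standard Deviation'),
   'Estimated from Maher (1972) nomograms');
```
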
Lines changed: 34 additions & 0 deletions
---
title: "Untitled"
format: html
---

# Adding Uncertainty to Neotoma

Reporting uncertainty is critical. We need it associated directly with individual measurements, and we need to identify the type of uncertainty.

## Table modifications

The table `ndb.data` needs two new columns: `uncertaintyvalue` and `uncertaintytype`.

These values will inherit information from the `ndb.variables` row, so the assumption is that the uncertainty is reported in the same units (and for the same taxon) as the `ndb.data.value`.

![Overall structure of the tables](uncertainty.svg)

### Proposed `ndb.data` structure:

| dataid | sampleid | variableid | value | uncertaintyvalue | uncertaintybasisid | . . . |
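
Under this earlier approach, the change would amount to something like the following (a sketch only; the column types and the foreign-key target are assumptions, not settled design):

```SQL
-- Sketch of the column-based alternative: add uncertainty directly to ndb.data.
-- Types and the referenced table/column are assumed for illustration.
ALTER TABLE ndb.data
    ADD COLUMN uncertaintyvalue float,
    ADD COLUMN uncertaintybasisid integer REFERENCES ndb.uncertaintybasis(uncertaintybasisid);
```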

### Proposed `ndb.uncertaintybasis` structure:

| uncertaintybasisid | uncertaintybasis | . . . |

#### Proposed `ndb.uncertaintybasis.uncertaintybasis` values:

Proposed values will come from the standard ways uncertainty is reported:

* 1 Standard Deviation
* 2 Standard Deviations
* 3 Standard Deviations
* Mean square error

datachecks/clean_empty_strings.py

Lines changed: 73 additions & 0 deletions
"""_Check for non-breaking spaces in text fields_

This issue arose as part of some text-searching that a
user was doing. The code here connects to a PostgreSQL
database, finds text columns that contain non-breaking or
other unusual Unicode space characters, and replaces those
characters with plain spaces.
"""
import json
import psycopg2
from psycopg2 import sql

print("\nRunning database tests.")
with open('../connect_remote.json', encoding='UTF-8') as f:
    data = json.load(f)

conn = psycopg2.connect(**data)
conn.autocommit = True
cur = conn.cursor()

# List all text columns in the database (excluding system and staging schemas).
TEXT_COLS = """
    select col.table_schema,
           col.table_name,
           col.ordinal_position as column_id,
           col.column_name,
           col.data_type,
           col.character_maximum_length as maximum_length
    from information_schema.columns col
    join information_schema.tables tab on tab.table_schema = col.table_schema
                                       and tab.table_name = col.table_name
                                       and tab.table_type = 'BASE TABLE'
    where col.data_type in ('character varying', 'character',
                            'text', '"char"', 'name')
      and col.table_schema not in ('information_schema', 'pg_catalog', 'public', 'tmp', 'pglogical')
    order by col.table_schema,
             col.table_name,
             col.ordinal_position;"""

cur.execute(TEXT_COLS)
tables = cur.fetchall()

# Find rows where a column contains a non-breaking or zero-width space character.
EMPTY_SPACE = """
    SELECT {}
    FROM {}
    WHERE {} ~ '.*[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF].*'"""

runcounter = []

for row in tables:
    tableobj = {'schema': row[0], 'table': row[1], 'column': row[3]}
    cur.execute(
        sql.SQL(EMPTY_SPACE).format(sql.Identifier(row[3]),
                                    sql.Identifier(row[0], row[1]),
                                    sql.Identifier(row[3])))
    tableobj.update({'rows': cur.fetchall()})
    runcounter.append(tableobj)

# Keep only the columns where at least one affected row was found.
FIELDS = list(filter(lambda x: len(x['rows']) > 0, runcounter))

print(f"Found {len(FIELDS)} columns containing non-breaking space characters.")

# Replace every non-breaking / zero-width space with a plain space
# (the 'g' flag replaces all occurrences within each value).
UPDATE_QUERY = """
    UPDATE {}
    SET {} = regexp_replace({},
                            '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]',
                            ' ', 'g')
    WHERE {} ~ '.*[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF].*'"""

for row in FIELDS:
    cur.execute(
        sql.SQL(UPDATE_QUERY).format(sql.Identifier(row['schema'], row['table']),
                                     sql.Identifier(row['column']),
                                     sql.Identifier(row['column']),
                                     sql.Identifier(row['column'])))
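
For a single text column, the cleanup statement the script composes is roughly equivalent to the following (a sketch only; the table and column are hypothetical, and PostgreSQL's `\u` regex escapes stand in for the literal characters Python substitutes into the query string):

```SQL
-- Sketch of the per-column cleanup: swap non-breaking and zero-width space
-- characters for a plain space. Table and column names are hypothetical.
UPDATE ndb.sites
SET sitedescription = regexp_replace(sitedescription,
      '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]', ' ', 'g')
WHERE sitedescription ~ '[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]';
```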

function/ap/dailyquerytable.sql

Lines changed: 97 additions & 0 deletions
CREATE FUNCTION ap.dailyquerytable(_interval VARCHAR)
RETURNS TABLE (siteid INT,
               sitename VARCHAR,
               datasetid INT,
               chronologyid INT,
               altitude FLOAT,
               datasettype VARCHAR,
               databaseid INT,
               collectionunitid INT,
               colltype VARCHAR,
               depenvt VARCHAR,
               geog GEOGRAPHY,
               older FLOAT,
               younger FLOAT,
               agetype VARCHAR,
               publications INT[],
               taxa INT[],
               keywords INT[],
               contacts INT[],
               collectionunit JSONB,
               geopol INT[])
AS $$
  WITH allids AS (
    SELECT st.siteid,
           unnest(array_append(gp.geoout, gp.geoin::int)) AS geopol
    FROM ndb.sites AS st
    INNER JOIN ndb.sitegeopolitical AS sgp ON st.siteid = sgp.siteid
    INNER JOIN ndb.geopoliticalunits AS gpu ON gpu.geopoliticalid = sgp.geopoliticalid
    INNER JOIN ndb.geopaths AS gp ON gp.geoin = sgp.geopoliticalid
  ),
  sgp AS (
    SELECT siteid, array_agg(DISTINCT geopol) AS geopol
    FROM allids
    GROUP BY siteid
  )
  SELECT st.siteid,
         st.sitename,
         ds.datasetid,
         chron.chronologyid,
         st.altitude,
         dst.datasettype,
         dsdb.databaseid,
         cu.collectionunitid,
         cut.colltype,
         dvt.depenvt,
         st.geog,
         arg.older,
         arg.younger,
         agetypes.agetype,
         array_remove(array_agg(DISTINCT dspb.publicationid), NULL) AS publications,
         array_remove(array_agg(DISTINCT var.taxonid), NULL) AS taxa,
         array_remove(array_agg(DISTINCT smpkw.keywordid), NULL) AS keywords,
         array_remove(array_agg(DISTINCT dpi.contactid) || array_agg(DISTINCT sma.contactid), NULL) AS contacts,
         jsonb_build_object('collectionunitid', cu.collectionunitid,
                            'collectionunit', cu.collunitname,
                            'handle', cu.handle,
                            'collectionunittype', cut.colltype,
                            'datasets', json_agg(DISTINCT jsonb_build_object('datasetid', ds.datasetid,
                                                                             'datasettype', dst.datasettype))) AS collectionunit,
         sgp.geopol
  FROM ndb.sites AS st
  LEFT OUTER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
  LEFT OUTER JOIN ndb.collectiontypes AS cut ON cut.colltypeid = cu.colltypeid
  LEFT OUTER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
  LEFT OUTER JOIN ndb.depenvttypes AS dvt ON dvt.depenvtid = cu.depenvtid
  LEFT OUTER JOIN ndb.datasetpis AS dpi ON dpi.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.datasettypes AS dst ON dst.datasettypeid = ds.datasettypeid
  LEFT OUTER JOIN ndb.datasetdatabases AS dsdb ON ds.datasetid = dsdb.datasetid
  LEFT OUTER JOIN ndb.datasetpublications AS dspb ON dspb.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.chronologies AS chron ON chron.collectionunitid = ds.collectionunitid
  LEFT OUTER JOIN ndb.dsageranges AS arg ON ds.datasetid = arg.datasetid AND chron.agetypeid = arg.agetypeid
  LEFT OUTER JOIN ndb.agetypes AS agetypes ON agetypes.agetypeid = arg.agetypeid
  LEFT OUTER JOIN ndb.samples AS smp ON smp.datasetid = ds.datasetid
  LEFT OUTER JOIN ndb.sampleanalysts AS sma ON sma.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.samplekeywords AS smpkw ON smpkw.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.data AS dt ON dt.sampleid = smp.sampleid
  LEFT OUTER JOIN ndb.variables AS var ON var.variableid = dt.variableid
  LEFT OUTER JOIN sgp AS sgp ON st.siteid = sgp.siteid
  WHERE ds.recdatemodified > current_date - (_interval || 'day')::INTERVAL OR
        smp.recdatemodified > current_date - (_interval || 'day')::INTERVAL OR
        st.recdatemodified > current_date - (_interval || 'day')::INTERVAL
  GROUP BY st.siteid,
           cu.collectionunitid,
           st.sitename,
           ds.datasetid,
           cut.colltype,
           chron.chronologyid,
           dsdb.databaseid,
           st.altitude,
           dst.datasettype,
           st.geog,
           arg.older,
           arg.younger,
           agetypes.agetype,
           sgp.geopol,
           dvt.depenvt
$$ LANGUAGE sql;
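
A usage sketch: `_interval` is the look-back window in days, passed as text.

```SQL
-- Example call: rows for sites, collection units and datasets touched in the last 7 days.
SELECT siteid, sitename, datasetid, datasettype
FROM ap.dailyquerytable('7');
```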

function/ap/dailysummaries.sql

Lines changed: 19 additions & 0 deletions
CREATE OR REPLACE FUNCTION ap.dailysummaries(_interval VARCHAR DEFAULT '1')
RETURNS TABLE (dbdate DATE, sites BIGINT, datasets BIGINT, publications BIGINT, observations BIGINT)
AS $$
  SELECT DISTINCT date_trunc('day', ds.recdatecreated)::date AS dbdate,
         COUNT(DISTINCT st.siteid) AS sites,
         COUNT(DISTINCT ds.datasetid) AS datasets,
         COUNT(DISTINCT pu.publicationid) AS publications,
         COUNT(DISTINCT dt.dataid) AS observations
  FROM ndb.sites AS st
  INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
  INNER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
  INNER JOIN ndb.datasetpublications AS dspu ON dspu.datasetid = ds.datasetid
  INNER JOIN ndb.publications AS pu ON pu.publicationid = dspu.publicationid
  INNER JOIN ndb.analysisunits AS au ON au.collectionunitid = cu.collectionunitid
  INNER JOIN ndb.samples AS smp ON smp.analysisunitid = au.analysisunitid
  INNER JOIN ndb.data AS dt ON dt.sampleid = smp.sampleid
  WHERE ds.recdatecreated > current_date - (_interval || 'day')::INTERVAL
  GROUP BY date_trunc('day', ds.recdatecreated)
$$ LANGUAGE sql;
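
A usage sketch for the summary function:

```SQL
-- Example call: daily counts of sites, datasets, publications and observations
-- added over the last 30 days.
SELECT *
FROM ap.dailysummaries('30')
ORDER BY dbdate;
```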
