Description
@gsrohde commented on Tue Feb 17 2015
date, dateloc, time, timeloc, date_year, date_month, date_day, time_hour, time_minute
- Decide what constraints to use
- Clean up data if needed
- Add to migration
mean
- Clean up NULLs
- Add not-null constraint to migration
n
- Decide if n can be 1; change 1's to NULL if not. Similarly for 0.
- Add CHECK (n >= 2) or CHECK (n >= 1) or CHECK (n >= 0)
stat, statname, n consistency
- Decide on consistency constraints
- Fix violations
- Add needed constraints
cultivar_id, specie_id consistency
- Fix cases where cultivar_id = 0
- Fix cases where specie_id is NULL but cultivar_id is not
- Fix inconsistent species-cultivar references
- Figure out how to write a consistency constraint
- Add it to migration
checked
- Decide how to handle NULLs
- Add not null constraint
access_level
- Decide how to handle cases where access_level = 0
- Change column type to the new DOMAIN level_of_access
Other NULLs
- Clean up NULLs in key columns (or rethink whether they need to be non-null)
Details
date, dateloc, time, timeloc, date_year, date_month, date_day
See discussion in GH #239 and in section 3.26 of https://www.overleaf.com/2086241dwjyrd#/5297403/.
UPDATE 9/15/2015
There are serious issues with the way we use date and time fields in the traits and yields tables. I've made a separate issue for this.
mean
29 rows have NULLs in the mean column. Should these be deleted? 18 of these rows have non-null stat values.
n
See discussion in GH #231.
stat, statname, n consistency
See discussion in GH #231.
cultivar_id, specie_id consistency
This is discussed in the context of foreign-key constraints. See GH # 175.
To see inconsistencies in human-readable form, use
SELECT t_sp.scientificname AS "species referred to by traits table", c_sp.scientificname AS "species matching cultivar", c.name FROM traits t JOIN cultivars c ON t.cultivar_id = c.id JOIN species t_sp ON t_sp.id = t.specie_id JOIN species c_sp ON c.specie_id = c_sp.id WHERE t.specie_id != c.specie_id;
checked
checked is NULL in 143 rows.
access_level
access_level = 0 in 2195 rows.
Other NULLs
This supplements the information discussed in GH issues #175 (about the foreign keys in the traits table) and #204 (about the candidate key for the traits table).
As noted in issue #204, NULLs are rampant in this table, even in columns that make up the proposed key. To get an idea of the scope of the problem, run
SELECT COUNT(*) AS "total number of rows", SUM((site_id IS NULL)::int) AS "site_id NULLs", SUM((specie_id IS NULL)::int) AS "species_id NULLs", SUM((citation_id IS NULL)::int) AS "citation_id NULLs", SUM((cultivar_id IS NULL)::int) AS "cultivar_id NULLs", SUM((treatment_id IS NULL)::int) AS "treatment_id NULLs", SUM((date IS NULL)::int) AS "date NULLs", SUM((time IS NULL)::int) AS "time NULLs", SUM((variable_id IS NULL)::int) AS "variable_id NULLs", SUM((entity_id IS NULL)::int) AS "entity_id NULLs", SUM((method_id IS NULL)::int) AS "method_id NULLs", SUM((date_year IS NULL)::int) AS "date_year NULLs", SUM((date_month IS NULL)::int) AS "date_month NULLs", SUM((date_day IS NULL)::int) AS "date_day NULLs", SUM((time_hour IS NULL)::int) AS "time_hour NULLs", SUM((time_minute)::int) AS "time_minute" FROM traits;
The result is
-[ RECORD 1 ]--------+------
total number of rows | 32027
site_id NULLs | 8364
species_id NULLs | 79
citation_id NULLs | 30
cultivar_id NULLs | 23571
treatment_id NULLs | 8120
date NULLs | 13694
time NULLs | 5290
variable_id NULLs | 1
entity_id NULLs | 28374
method_id NULLs | 29176
date_year NULLs | 11022
date_month NULLs | 13559
date_day NULLs | 15354
time_hour NULLs | 29120
time_minute | 15893
(In psql, to see results in this format, use the "\x" command before running the query.)
To get a better idea of how many rows contain no date information or no time information, run
SELECT COUNT(*) AS "total number of rows", SUM((date IS NULL AND date_year IS NULL AND date_month IS NULL AND date_day IS NULL)::int) AS "rows with no date info", SUM((time IS NULL AND time_hour IS NULL AND time_minute IS NULL)::int) AS "rows with no time into" FROM traits;
This yields
-[ RECORD 1 ]----------+------
total number of rows | 32027
rows with no date info | 9787
rows with no time into | 4988