Skip to content

traits value constraints: needed cleanup and decisions #6

Open
@dlebauer

Description

@dlebauer

@gsrohde commented on Tue Feb 17 2015

date, dateloc, time, timeloc, date_year, date_month, date_day, time_hour, time_minute

  • Decide what constraints to use
  • Clean up data if needed
  • Add to migration

mean

  • Clean up NULLs
  • Add not-null constraint to migration

n

  • Decide if n can be 1; change 1's to NULL if not. Similarly for 0.
  • Add CHECK (n >= 2) or CHECK (n >= 1) or CHECK (n >= 0)

stat, statname, n consistency

  • Decide on consistency constraints
  • Fix violations
  • Add needed constraints

cultivar_id, specie_id consistency

  • Fix cases where cultivar_id = 0
  • Fix cases where specie_id is NULL but cultivar_id is not
  • Fix inconsistent species-cultivar references
  • Figure out how to write a consistency constraint
  • Add it to migration

checked

  • Decide how to handle NULLs
  • Add not null constraint

access_level

  • Decide how to handle cases where access_level = 0
  • Change column type to the new DOMAIN level_of_access

Other NULLs

  • Clean up NULLs in key columns (or rethink whether they need to be non-null)

Details

date, dateloc, time, timeloc, date_year, date_month, date_day

See discussion in GH #239 and in section 3.26 of https://www.overleaf.com/2086241dwjyrd#/5297403/.

UPDATE 9/15/2015

There are serious issues with the way we use date and time fields in the traits and yields tables. I've made a separate issue for this.

mean

29 rows have NULLs in the mean column. Should these be deleted? 18 of these rows have non-null stat values.

n

See discussion in GH #231.

stat, statname, n consistency

See discussion in GH #231.

cultivar_id, specie_id consistency

This is discussed in the context of foreign-key constraints. See GH # 175.

To see inconsistencies in human-readable form, use

SELECT t_sp.scientificname AS "species referred to by traits table", c_sp.scientificname AS "species matching cultivar", c.name FROM traits t JOIN cultivars c ON t.cultivar_id = c.id JOIN species t_sp ON t_sp.id = t.specie_id JOIN species c_sp ON c.specie_id = c_sp.id WHERE t.specie_id != c.specie_id;

checked

checked is NULL in 143 rows.

access_level

access_level = 0 in 2195 rows.

Other NULLs

This supplements the information discussed in GH issues #175 (about the foreign keys in the traits table) and #204 (about the candidate key for the traits table).

As noted in issue #204, NULLs are rampant in this table, even in columns that make up the proposed key. To get an idea of the scope of the problem, run

SELECT COUNT(*) AS "total number of rows", SUM((site_id IS NULL)::int) AS "site_id NULLs", SUM((specie_id IS NULL)::int) AS "species_id NULLs", SUM((citation_id IS NULL)::int) AS "citation_id NULLs", SUM((cultivar_id IS NULL)::int) AS "cultivar_id NULLs", SUM((treatment_id IS NULL)::int) AS "treatment_id NULLs", SUM((date IS NULL)::int) AS "date NULLs", SUM((time IS NULL)::int) AS "time NULLs", SUM((variable_id IS NULL)::int) AS "variable_id NULLs", SUM((entity_id IS NULL)::int) AS "entity_id NULLs", SUM((method_id IS NULL)::int) AS "method_id NULLs", SUM((date_year IS NULL)::int) AS "date_year NULLs", SUM((date_month IS NULL)::int) AS "date_month NULLs", SUM((date_day IS NULL)::int) AS "date_day NULLs", SUM((time_hour IS NULL)::int) AS "time_hour NULLs", SUM((time_minute)::int) AS "time_minute" FROM traits;

The result is

-[ RECORD 1 ]--------+------
total number of rows | 32027
site_id NULLs        | 8364
species_id NULLs     | 79
citation_id NULLs    | 30
cultivar_id NULLs    | 23571
treatment_id NULLs   | 8120
date NULLs           | 13694
time NULLs           | 5290
variable_id NULLs    | 1
entity_id NULLs      | 28374
method_id NULLs      | 29176
date_year NULLs      | 11022
date_month NULLs     | 13559
date_day NULLs       | 15354
time_hour NULLs      | 29120
time_minute          | 15893

(In psql, to see results in this format, use the "\x" command before running the query.)

To get a better idea of how many rows contain no date information or no time information, run

SELECT COUNT(*) AS "total number of rows", SUM((date IS NULL AND date_year IS NULL AND date_month IS NULL AND date_day IS NULL)::int) AS "rows with no date info", SUM((time IS NULL AND time_hour IS NULL AND time_minute IS NULL)::int) AS "rows with no time into" FROM traits;

This yields

-[ RECORD 1 ]----------+------
total number of rows   | 32027
rows with no date info | 9787
rows with no time into | 4988

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions