Automatic QC/filtering of NASIS records w/ fetchNASIS rmHzErrors ETC. #160

brownag · 2021-01-21T03:40:25Z

brownag
Jan 21, 2021
Maintainer

To remove, or not to remove, that is the question. Whether 'tis nobler to lose more than half your data... The slings and arrows of outrageous fortune... or to take arms against a sea of troubles... And "fix" the "errors." - someone

We do in defined cases.

The sequence and the specific cutoffs for what is fixable need to be discussed, and those parameters need to be documented once they have been.

This is me trying to get more discussion going outside the context of "issues", especially about fundamental aspects of important functions, like the default behavior of fetchNASIS.

The following issues are related.

error in fetchNASIS with diagHzBoolean #158 (NASIS-local), opened Jan 20, 2021 by @smroecker
This issue is resolved, but we can probably better document the cases when fetchNASIS does remove profiles with horizon errors via rmHzErrors.

Dissaggregated glossic horizons unsupported? #122 (NASIS-local), opened on Jan 22, 2020 by @phytoclast

get_hz_data_from_NASIS_db - issues with join to phsample #120 (NASIS-local), opened on Jan 17, 2020 by @brownag

new functions for common QC of pedon / component data #89 (NASIS-local), opened on Jan 31, 2019 by @dylanbeaudette

normalize parent material, geomorp, ecosite, flattening strategies #84, opened on Oct 31, 2018 by @dylanbeaudette

should `NA` be interpreted as FALSE in .diagHzLongtoWide()? #59, opened on Feb 23, 2018 by @dylanbeaudette

brownag · 2021-01-28T23:22:31Z

brownag
Jan 28, 2021
Maintainer Author

RE @smroecker's bug report #158: you were right to be suspicious of these results, as they probably would not have been the same historically. I am sorry this passed my initial sniff test... I should have looked beyond rmHzErrors, at the specifics of why these pedons were coming back with depthLogic and overlapOrGap errors from aqp::hzDepthTests.

After speaking with @jskovlin today, I identified an issue with the join logic for horizon data being used inside .fetchNASIS_pedons.

The details of the bug have to do with nuances of base::merge behavior when some of the join IDs are absent from the RHS of the join (i.e. they are all-NA rows in the resulting joined data.frame).
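To make the all-NA row case concrete, here is a minimal base R sketch. The two tables and their values are made up for illustration (the column names just mimic NASIS-style IDs and depths); this is not the actual join inside .fetchNASIS_pedons, only the general base::merge behavior being described.

```r
# Hypothetical site-level IDs; pedon 3 has no horizon records.
site <- data.frame(peiid = c(1, 2, 3))

# Hypothetical horizon records (the right-hand side of the join).
hz <- data.frame(
  peiid  = c(1, 1, 2),
  hzdept = c(0, 10, 0),
  hzdepb = c(10, 30, 25)
)

# A left join keeps peiid 3 and fills its horizon columns with NA.
joined <- merge(site, hz, by = "peiid", all.x = TRUE)

# Flag the placeholder "horizon" rows that carry no depth information.
all_na <- is.na(joined$hzdept) & is.na(joined$hzdepb)
joined[all_na, ]
```

A row like the one flagged here looks like a horizon with missing depths to any downstream depth-logic check, so a pedon can be reported as having horizon errors even though the real problem is simply that it has no horizon records.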

Since Summer 2020, the SoilProfileCollection horizons<- join method handles this case "correctly". bf40686 converts fetchNASIS to use horizons<- (after building a minimal SPC) as opposed to base::merge() before making the SPC.

A corresponding version bump of soilDB to 2.6.0 has been made. I certainly wish I had caught this before the 2.5.9 release, and Stephen provided me the opportunity, but alas I missed it.

That said, this is very timely considering a) we didn't find it during the stats class and b) the plans to incorporate much more rigorous unit testing of NASIS functionality via DBI interface + SQLite backend etc. in 2.6.x. Upgrading soilDB to take advantage of the new aqp integrity features should also reduce the amount of duplication of logic-checking and join code across the two packages.

0 replies

dylanbeaudette · 2021-01-29T17:51:14Z

dylanbeaudette
Jan 29, 2021
Maintainer

After talking with @brownag, I'd support the following:

rmHzErrors=FALSE as the default for fetchNASIS
checking for hz errors happens at the very end (applied to the SPC, with new, optimized methods) but only when rmHzErrors=TRUE
future plans for fetchNASIS involve data-getting vs. data-cleaning / filling
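
A rough sketch of the "check at the very end, and only when asked" idea, in base R. The function names `profile_is_valid()` and `finalize_profiles()` are hypothetical, the data are made up, and the validity test is a deliberately simplified stand-in rather than the aqp implementation; it operates on a plain data.frame instead of an SPC to stay self-contained.

```r
# Hypothetical horizon table: profile "b" has a gap between 20 and 25 cm.
hz <- data.frame(
  id     = c("a", "a", "b", "b"),
  top    = c(0, 15, 0, 25),
  bottom = c(15, 40, 20, 60)
)

# Simplified stand-in for a depth-logic test (NOT the aqp implementation):
# TRUE when depths are present, top < bottom, and adjacent horizons abut.
profile_is_valid <- function(top, bottom) {
  if (anyNA(top) || anyNA(bottom)) return(FALSE)
  o <- order(top)
  top <- top[o]
  bottom <- bottom[o]
  n <- length(top)
  all(top < bottom) && all(bottom[-n] == top[-1])
}

# Hypothetical wrapper: checks run last, and only when rmHzErrors = TRUE.
finalize_profiles <- function(hz, rmHzErrors = FALSE) {
  if (!rmHzErrors) return(hz)
  ok <- vapply(split(hz, hz$id),
               function(p) profile_is_valid(p$top, p$bottom),
               logical(1))
  hz[hz$id %in% names(ok)[ok], , drop = FALSE]
}

nrow(finalize_profiles(hz))                          # all 4 rows kept by default
unique(finalize_profiles(hz, rmHzErrors = TRUE)$id)  # only "a" remains
```

With the default, nothing is dropped silently; the user has to opt in to filtering, which keeps data-getting separate from data-cleaning.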

0 replies

brownag · 2021-02-17T22:24:36Z

brownag
Feb 17, 2021
Maintainer Author

Added two more older issues to above list:

normalize parent material, geomorp, ecosite, flattening strategies #84 opened on Oct 31, 2018 by @dylanbeaudette

NA should be interpreted as FALSE in .diagHzLongtoWide() #59 opened on Feb 23, 2018 by @dylanbeaudette

2 replies

dylanbeaudette Feb 18, 2021
Maintainer

Meta-question: should we be crossing things off in the top-level post as issues are closed?

brownag Feb 18, 2021
Maintainer Author

Yes, can do! I crossed Stephen's issue off from before, but left it on the list

brownag · 2021-07-01T22:49:14Z

brownag
Jul 1, 2021
Maintainer Author

Added #192 to list; thanks @hammerly for pointing out more cases where this discussion is relevant

0 replies

brownag · 2021-07-06T21:33:42Z

brownag
Jul 6, 2021
Maintainer Author

Some more thoughts from #192 on laboratory (field or KSSL) records that get duplicated by (naive) queries that assume a 1:1 relationship between phorizon and its child tables:

Ideally we would be able to distinguish depth repeated measures from the technical replicates of data within strata. If a re-run was due to "bad data" or some error in the process then I would argue that old data should probably be removed from database to eliminate any uncertainty; or we add a flag that allows data to be filtered and marked obsolete while still retaining the "record" that it was measured and re-done.

You raise a very good point on the PSCS. That seems like an example where we wouldn't want to take a weighted average of the values within a specific horizon, but rather across horizons? So, say you have a profile that has a 25-100cm PSCS, and morphologic horizon boundaries at 18, 36, 75, and 100cm. The first PSCS subsample in phlabsample might only be the 25-36 portion of the 18-36cm horizon, and the purpose of collecting that sample was not to produce a weighted average for 18-36 but rather to serve as a component of the 25-100 (which spans 3 morphologic horizons). If a single horizon contained the phsample data for the whole 25-100 interval, that would be another case. Currently there is very little validation done on the sample depths populated in there with respect to parent horizon depths, and that is probably what we would need more of to get more specific here.
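
The arithmetic for that example can be sketched in base R as below. The horizon depths follow the numbers above (assuming the first horizon starts at 0cm), and the clay values are invented purely for illustration; the point is only that each value is weighted by its overlap with the 25-100cm interval, not by full horizon thickness.

```r
# Horizons implied by the example; a 0 cm top for the first horizon is assumed.
hz <- data.frame(
  top    = c(0, 18, 36, 75),
  bottom = c(18, 36, 75, 100),
  clay   = c(18, 24, 30, 28)  # hypothetical clay percentages
)

# Particle size control section (PSCS) limits from the example, in cm.
pscs_top <- 25
pscs_bottom <- 100

# Thickness of each horizon that falls inside the PSCS (0 when no overlap).
thk <- pmax(0, pmin(hz$bottom, pscs_bottom) - pmax(hz$top, pscs_top))
thk  # 0 11 39 25

# Thickness-weighted average across the horizons within the PSCS.
pscs_clay <- sum(thk * hz$clay) / sum(thk)
pscs_clay
```

Only 11cm of the 18-36cm horizon contributes here, which is why a sample taken from 25-36cm belongs to the across-horizon average rather than to a summary of the 18-36cm horizon itself.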

Another related example is subsampling diffuse clay increases to interpolate where the clay increase/upper boundary of the argillic horizon is. I am not very familiar with this approach, or sure of how prevalent it is, but I remember it from correlation training. Soil Taxonomy (p.33 1999 "The top of the argillic horizon") briefly discusses a method for interpolating the upper boundary of the argillic by "fitting a smooth curve." In the cases where this has been done, I am not sure of the conventions for how the field horizons versus the subsampled layers are portrayed in NASIS. I imagine some sort of constant-depth sampling within horizons that would then, in disaggregated form, be used to fit some sort of function.

Perhaps some queries or reports on NASIS side to help identify [potential] data population issues in those tables would help highlight how and where these tables are being used.

0 replies

brownag · 2023-11-26T16:45:07Z

brownag
Nov 26, 2023
Maintainer Author

Since soilDB 2.7.0 (May 2022), rmHzErrors=FALSE is the default for fetchNASIS from pedons, components, and reports (0f57ba5).

0 replies
