Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problem with uncode() #273

Closed
smroecker opened this issue Oct 11, 2022 · 6 comments
Closed

problem with uncode() #273

smroecker opened this issue Oct 11, 2022 · 6 comments

Comments

@smroecker
Copy link
Member

The following line doesn't work when the ChoiceName != ChoiceLabel (e.g. nhiforsoigrp column in the mapunit table), which is the case for ~25% of the metadata lookup table. In these instances, line 118 converts the values to NA because they don't match the levels in the other factor. This is complicated by the fact that SOME but not ALL of factor levels maybe different, thus the preceding if statements logic overwrites SOME but not ALL of them to NA. I haven't fully determined what the best fix would be, but wanted to log this before I logged off for the day. Values from SDA and GDB (need to check LIMS) appear to return values coded as ChoiceLabel, thus it may not be necessary to use an if statement on line 116.

Cheers

@brownag
Copy link
Member

brownag commented Oct 11, 2022

Can you provide a sample dataset / query / function call where this is an issue?

The current version of uncode() should not require ChoiceName == ChoiceLabel, in general. That is why changes were made to allow for both name and label values to be used in coding/uncoding. I thought it was working but if there is an issue with something getting uncoded incorrectly then we can definitely sort that out.

@brownag
Copy link
Member

brownag commented Oct 11, 2022

I think this mostly works... but I see what you mean about if you code() a ChoiceLabel and then uncode() those coded values, you get ChoiceName.

library(soilDB)
x <- SDA_query("SELECT DISTINCT nhiforsoigrp FROM mapunit") 
#> single result set, returning a data.frame
x
#>   nhiforsoigrp
#> 1     Group IB
#> 2    Group IIB
#> 3     Group IA
#> 4         <NA>
#> 5    Group IIA
#> 6     Group IC
#> 7           NC
x2 <- code(x)
x2
#>   nhiforsoigrp
#> 1            2
#> 2            5
#> 3            1
#> 4           NA
#> 5            4
#> 6            3
#> 7            6
x3 <- uncode(x2)
x3
#>   nhiforsoigrp
#> 1           IB
#> 2          IIB
#> 3           IA
#> 4         <NA>
#> 5          IIA
#> 6           IC
#> 7           NC

@smroecker
Copy link
Member Author

test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL")
test2 <- uncode(test)
table(test$nhiforsoigrp)
test2 <- uncode(test[-19])
uncode(test["nhiforsoigrp"])

brownag added a commit that referenced this issue Oct 12, 2022
@brownag
Copy link
Member

brownag commented Oct 12, 2022

I think you are right that the logic should be simplified at least for the invert=FALSE case... and as you show there is an issue when combining two different sets of factor levels that partially overlap. It creates NA except in the simplest cases (depends on the input data)

I took a crack at it and it seems to work for the following case (a mix of mapunits from NH and MA).

Before change I get:

library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit 
                   INNER JOIN legend ON legend.lkey = mapunit.lkey
                   WHERE areasymbol LIKE 'NH%' OR areasymbol LIKE 'MA%'")
#> single result set, returning a data.frame
sum(is.na(test$nhiforsoigrp))
#> [1] 1582
test2 <- uncode(test)
#> Warning in `[<-.factor`(`*tmp*`, is.na(nc), value = structure(c(NA, NA, :
#> invalid factor level, NA generated
table(test2$nhiforsoigrp, useNA = "ifany")
#> 
#>   NC <NA> 
#>  181 3273

After a change (c6f0f56) I get:

library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit 
                   INNER JOIN legend ON legend.lkey = mapunit.lkey
                   WHERE areasymbol LIKE 'NH%' OR areasymbol LIKE 'MA%'")
#> single result set, returning a data.frame
sum(is.na(test$nhiforsoigrp))
#> [1] 1582
test2 <- uncode(test)
table(test2$nhiforsoigrp, useNA = "ifany")
#> 
#>  Group IA  Group IB  Group IC Group IIA Group IIB        NC      <NA> 
#>       857       268       210       192       164       181      1582

I'll leave this issue open, please make/suggest any additional changes you see fit as long as existing tests of uncode/code pass. If you have certain things you want to become the normal expectation for these functions, add to tests/test-uncode.R.

I think the main thing is we wanted out of the recent changes (source of this problem) was for code() to work for heterogeneous inputs (a mixture of ChoiceLabel and ChoiceName) ... but as you point out here the same behavior does not make sense/is not needed for uncode() and can cause problems when the logic in that if statement fails (for example with overlap in names and labels)

brownag added a commit that referenced this issue Oct 12, 2022
 - do not omit NA values
 - properly order on ChoiceSequence when domain is ranked and ordered=TRUE
@brownag
Copy link
Member

brownag commented Oct 12, 2022

This issue also alerted me to a problem with some of the new functions to help with explicitly setting factor levels. Again the overlap between ChoiceName and ChoiceLabel causes issues, thanks for bringing this up.

Before change:

library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL")
#> single result set, returning a data.frame
x1 <- factor(test$nhiforsoigrp, levels = get_NASIS_column_metadata("nhiforsoigrp")$ChoiceLabel)
x2 <- NASISChoiceList(list(nhiforsoigrp=test$nhiforsoigrp), choice = "ChoiceLabel")
#> Error in na.omit(as.numeric(apply(do.call("cbind", lapply(y[c("ChoiceValue", : 'list' object cannot be coerced to type 'double'
all.equal(x1, x2)
#> Error in all.equal.factor(x1, x2): object 'x2' not found

After change (2985b37):

library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL")
#> single result set, returning a data.frame
x1 <- factor(test$nhiforsoigrp, levels = get_NASIS_column_metadata("nhiforsoigrp")$ChoiceLabel)
x2 <- NASISChoiceList(list(nhiforsoigrp=test$nhiforsoigrp), choice = "ChoiceLabel")
all.equal(x1, x2)
#> [1] TRUE

@brownag
Copy link
Member

brownag commented Dec 20, 2023

I am going to close this issue since I think the problems that were found have been fixed for some time. Create a new issue if more bugs or unexpected results are found.

@brownag brownag closed this as completed Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants