-
Notifications
You must be signed in to change notification settings - Fork 9
Description
1. Summary
- dht2() will throw an error when it tries to provide stratified estimates if one (or more) of the strata have no observations but do have effort (ie samples).
- This previously worked fine in Distance 1.0.5 but became broken in 1.0.6. The commit that broke it was 2e18d38 on June 16, 2022.
- This is the same problem as originally reported here on the Distance Sampling google group by Shanti Davis and Caroline Fox.
- I understand that Shanti has emailed @LHMarshall some example code and data, but I haven't seen it entered as an issue here yet. So, in an effort to save you folks some time, I went ahead and did some debugging and found the problem and have a suggested fix - see below.
2. Details
- The error ultimately occurs because
dhtd()(defined withinvarNhat()invarNhat.R) no longer returns a vector of the correct length containing estimated Nhats for ALL strata, including zeros for any strata that had no observations, as it did in version 1.0.5. Instead, since version 1.0.6, it returns a vector which only contains Nhats for strata that had observations, i.e., there are no zeros for the "empty" strata in this vector as of version 1.0.6. Thus the vector returned bydht()is short one element for each stratum that had no observations. dhtdis then used byDeltaMethod()in this code starting on line 56 ofvarNhat.R:
- however, because of the shortened vector returned by
dhtd(), the resulting variance-covariance matrix indm$variancehas missing rows and columns for any strata that had no observations. - the error ultimately occurs on line 62 (
vardat_str$df_var <- diag(dm$variance)because the length ofdiag(dm$variance)is shorter than the number of rows ofvardat_str
3. Reprex
The following reprex reproduces the error. Here I have used the minke dataset and added a third Region ("East") containing two sample lines with no observations.
library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#>
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#>
#> create.bins
library(magrittr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("Distance")
#> [1] '2.0.0'
data(minke)
# Lets add a third Region with two samples having no observations. Adding two samples instead of just one to prevent a warning
# message about being unable to estimate encounter rate variance with just one sample in the East stratum.
new.row1 <- minke[1, ] %>%
mutate(
Region.Label = "East",
Area = 10000,
Sample.Label = max(minke$Sample.Label) + 1,
Effort = 10,
distance = NA,
object = NA
)
new.row2 <- minke[1, ] %>%
mutate(
Region.Label = "East",
Area = 10000,
Sample.Label = max(minke$Sample.Label) + 2,
Effort = 10,
distance = NA,
object = NA
)
minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#>
#> Summary for distance analysis
#> Number of observations : 88
#> Distance range : 0 - 1.5
#>
#> Model : Hazard-rate key function
#> AIC : 48.63688
#> Optimisation: mrds (nlminb)
#>
#> Detection function parameters
#> Scale coefficient(s):
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#>
#> Shape coefficient(s):
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#>
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#>
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 East 10000 60.00 20.00 0 2 0.00000000 0.00000000 0.0000000
#> 2 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4 Total 725316 5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#>
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.000 0.0000 0.0000000 0.000 0.000 0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#>
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000 0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496
# Now try dht2 with stratification - This will cause the error.
try(dht2(
ddf = detfn,
flatfile = minke,
strat_formula = ~ Region.Label,
stratification = "geographical"
))
#> Error in `$<-`(`*tmp*`, "df_var", value = c(1709104.11609035, 153728.436082082 :
#> Assigned data `diag(dm$variance)` must be compatible with existing data.
#> ✖ Existing data has 3 rows.
#> ✖ Assigned data has 2 rows.
#> ℹ Only vectors of size 1 are recycled.
#> Caused by error in `vectbl_recycle_rhs_rows()`:
#> ! Can't recycle input of size 2 to size 3.Created on 2025-02-15 with reprex v2.1.1
4. Proposed fix
My proposed fix is to restore the former (presumably correct??) behaviour of dhtd() by making sure that it includes a 0 in its return vector for any stratum that has no observations. In version 1.0.5, dhtd() accomplished this by initializing its result vector to be the correct length based on the length of the ind argument which was a list with one element per stratum. Since that time, the structure of the code for dhtd() (and for varNhat() in general) has changed a bit and variable ind no longer exists. However, variable grp_dat can provide what is needed as it contains one row for each stratum. My fix involves adding an extra column (n_obs) to grp_dat indicating the number of observations in each stratum. grp_dat is then passed as a new argument to dhtd() where my new column is used to make sure the result vector from dhtd() has zeros for any strata with no observations. This also means that grp_dat needs to be included in the list of 'dhtd()' args in the call to DeltaMethod() on lines 56-57 since that is where dhtd() is eventually used.
These changes are available in commit 7db535e of my fork of Distance at dfifield/Distance.
5. Example showing fixed version
The following reprex demonstrates installing my modified fork of Distance and running the reprex again, this time without the error.
# Install the fixed version from my fork. You will have to change library
# location to suit your system.
remotes::install_github("dfifield/Distance",
lib = "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Downloading GitHub repo dfifield/Distance@HEAD
#>
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ... checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ... ✔ checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' (770ms)
#> ─ preparing 'Distance': (30.1s)
#> checking DESCRIPTION meta-information ... checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information
#> ─ checking for LF line-endings in source and make files and shell scripts (1.7s)
#> ─ checking for empty or unneeded directories
#> ─ looking to see if a 'data/datalist' file should be added
#> ─ building 'Distance_2.0.0.9001.tar.gz'
#>
#>
library(Distance, lib.loc = "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#>
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#>
#> create.bins
library(magrittr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("Distance")
#> [1] '2.0.0.9001'
data(minke)
# Lets add a third Region with two samples having no observations
new.row1 <- minke[1, ] %>%
mutate(
Region.Label = "East",
Area = 10000,
Sample.Label = max(minke$Sample.Label) + 1,
Effort = 10,
distance = NA,
object = NA
)
new.row2 <- minke[1, ] %>%
mutate(
Region.Label = "East",
Area = 10000,
Sample.Label = max(minke$Sample.Label) + 2,
Effort = 10,
distance = NA,
object = NA
)
minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#>
#> Summary for distance analysis
#> Number of observations : 88
#> Distance range : 0 - 1.5
#>
#> Model : Hazard-rate key function
#> AIC : 48.63688
#> Optimisation: mrds (nlminb)
#>
#> Detection function parameters
#> Scale coefficient(s):
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#>
#> Shape coefficient(s):
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#>
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#>
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 East 10000 60.00 20.00 0 2 0.00000000 0.00000000 0.0000000
#> 2 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4 Total 725316 5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#>
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.000 0.0000 0.0000000 0.000 0.000 0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#>
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000 0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496
dht2(
ddf = detfn,
flatfile = minke,
strat_formula = ~ Region.Label,
stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical
#> Variance : R2, n/L
#> Multipliers : none
#> Sample fraction : 1
#>
#>
#> Summary statistics:
#> Region.Label Area CoveredArea Effort n k ER se.ER cv.ER
#> East 10000 60.00 20.00 0 2 0.000 0.000 0.000
#> North 630582 4075.14 1358.38 49 12 0.036 0.013 0.365
#> South 84734 1453.23 484.41 39 13 0.081 0.018 0.225
#> Total 725316 5588.37 1862.79 88 27 0.047 0.011 0.236
#>
#> Abundance estimates:
#> Region.Label Estimate se cv LCI UCI df
#> East 0 0.000 0.000 0 0 1.000
#> North 12181 4638.628 0.381 5500 26980 12.968
#> South 3653 910.097 0.249 2182 6118 17.963
#> Total 15835 NaN NaN NaN NaN 15.255
#>
#> Component percentages of variance:
#> Region.Label Detection ER
#> East NaN NaN
#> North 7.94 92.06
#> South 18.56 81.44
#> Total 17.13 82.87Created on 2025-02-15 with reprex v2.1.1
Now dht2() no longer throws an error and it gives output almost identical to that from summary(detfn).
7. One final niggle?
Note that I said the output is almost identical. The only difference I can see is that the Total row cv.ER from summary(detfcn) is
0.286 whereas the corresponding value from dht2() is 0.236. Note that this difference remains even if I use the "unfixed" version 2.0.0 of Distance and also remove the stratum with no observations. Below, I rerun the analysis with the stock minke dataset using the released Version 2.0.0 of Distance and we see again that the Total cv.ER from summary() and dht2() differ. Is this a known and/or intentional difference?
library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#>
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#>
#> create.bins
packageVersion("Distance")
#> [1] '2.0.0'
data(minke)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#>
#> Summary for distance analysis
#> Number of observations : 88
#> Distance range : 0 - 1.5
#>
#> Model : Hazard-rate key function
#> AIC : 48.63688
#> Optimisation: mrds (nlminb)
#>
#> Detection function parameters
#> Scale coefficient(s):
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#>
#> Shape coefficient(s):
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#>
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#>
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 2 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 3 Total 715316 5528.37 1842.79 88 25 0.04133635 0.01181436 0.2858103
#>
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 2 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 3 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#>
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 2 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 3 Total 0.02213674 0.006758251 0.3052957 0.011726878 0.04178736 15.25496
dht2(
ddf = detfn,
flatfile = minke,
strat_formula = ~ Region.Label,
stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical
#> Variance : R2, n/L
#> Multipliers : none
#> Sample fraction : 1
#>
#>
#> Summary statistics:
#> Region.Label Area CoveredArea Effort n k ER se.ER cv.ER
#> North 630582 4075.14 1358.38 49 12 0.036 0.013 0.365
#> South 84734 1453.23 484.41 39 13 0.081 0.018 0.225
#> Total 715316 5528.37 1842.79 88 25 0.048 0.011 0.237
#>
#> Abundance estimates:
#> Region.Label Estimate se cv LCI UCI df
#> North 12181 4638.628 0.381 5500 26980 12.968
#> South 3653 910.097 0.249 2182 6118 17.963
#> Total 15835 4834.285 0.305 8388 29891 15.255
#>
#> Component percentages of variance:
#> Region.Label Detection ER
#> North 7.94 92.06
#> South 18.56 81.44
#> Total 17.07 82.93Created on 2025-02-15 with reprex v2.1.1
