Skip to content

dht2() with stratification causes error when one or more strata have no observations - includes fix #194

@dfifield

Description

@dfifield

1. Summary

  • dht2() will throw an error when it tries to provide stratified estimates if one (or more) of the strata have no observations but do have effort (ie samples).
  • This previously worked fine in Distance 1.0.5 but became broken in 1.0.6. The commit that broke it was 2e18d38 on June 16, 2022.
  • This is the same problem as originally reported here on the Distance Sampling google group by Shanti Davis and Caroline Fox.
  • I understand that Shanti has emailed @LHMarshall some example code and data, but I haven't seen it entered as an issue here yet. So, in an effort to save you folks some time, I went ahead and did some debugging and found the problem and have a suggested fix - see below.

2. Details

  • The error ultimately occurs because dhtd() (defined within varNhat() in varNhat.R)  no longer returns a vector of the correct length containing estimated Nhats for ALL strata, including zeros for any strata that had no observations, as it did in version 1.0.5. Instead, since version 1.0.6, it returns a vector which only contains Nhats for strata that had observations, i.e., there are no zeros for the "empty" strata in this vector as of version 1.0.6. Thus the vector returned by dht() is short one element for each stratum that had no observations.
  • dhtd is then used by DeltaMethod() in this code starting on line 56 of varNhat.R:

Image

  • however, because of the shortened vector returned by dhtd(), the resulting variance-covariance matrix in dm$variance has missing rows and columns for any strata that had no observations.
  • the error ultimately occurs on line 62 (vardat_str$df_var <- diag(dm$variance) because the length of diag(dm$variance) is shorter than the number of rows of vardat_str

3. Reprex

The following reprex reproduces the error. Here I have used the minke dataset and added a third Region ("East") containing two sample lines with no observations.

library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#>     create.bins
library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
packageVersion("Distance")
#> [1] '2.0.0'

data(minke)

# Lets add a third Region with two samples having no observations. Adding two samples instead of just one to prevent a warning 
# message about being unable to estimate encounter rate variance with just one sample in the East stratum.
new.row1 <- minke[1, ] %>%
  mutate(
    Region.Label = "East",
    Area = 10000,
    Sample.Label = max(minke$Sample.Label) + 1,
    Effort = 10,
    distance = NA,
    object = NA
  )

new.row2 <- minke[1, ] %>%
  mutate(
    Region.Label = "East",
    Area = 10000,
    Sample.Label = max(minke$Sample.Label) + 2,
    Effort = 10,
    distance = NA,
    object = NA
  )

minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations :  88 
#> Distance range         :  0  -  1.5 
#> 
#> Model       : Hazard-rate key function 
#> AIC         :  48.63688 
#> Optimisation:  mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s):  
#>               estimate        se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s):  
#>              estimate        se
#> (Intercept) 0.9648331 0.3605009
#> 
#>                        Estimate         SE        CV
#> Average p             0.6224396  0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#>   Region   Area CoveredArea  Effort  n  k         ER      se.ER     cv.ER
#> 1   East  10000       60.00   20.00  0  2 0.00000000 0.00000000 0.0000000
#> 2  North 630582     4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3  South  84734     1453.23  484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4  Total 725316     5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#> 
#> Abundance:
#>   Label  Estimate        se        cv      lcl       ucl       df
#> 1  East     0.000    0.0000 0.0000000    0.000     0.000  0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South  3653.345  910.0975 0.2491135 2181.595  6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#>   Label   Estimate          se        cv         lcl        ucl       df
#> 1  East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000  0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496

# Now try dht2 with stratification - This will cause the error.
try(dht2(
  ddf = detfn,
  flatfile = minke,
  strat_formula =  ~ Region.Label,
  stratification = "geographical"
))
#> Error in `$<-`(`*tmp*`, "df_var", value = c(1709104.11609035, 153728.436082082 : 
#>   Assigned data `diag(dm$variance)` must be compatible with existing data.
#> ✖ Existing data has 3 rows.
#> ✖ Assigned data has 2 rows.
#> ℹ Only vectors of size 1 are recycled.
#> Caused by error in `vectbl_recycle_rhs_rows()`:
#> ! Can't recycle input of size 2 to size 3.

Created on 2025-02-15 with reprex v2.1.1

4. Proposed fix

My proposed fix is to restore the former (presumably correct??) behaviour of dhtd() by making sure that it includes a 0 in its return vector for any stratum that has no observations. In version 1.0.5, dhtd() accomplished this by initializing its result vector to be the correct length based on the length of the ind argument which was a list with one element per stratum. Since that time, the structure of the code for dhtd() (and for varNhat() in general) has changed a bit and variable ind no longer exists. However, variable grp_dat can provide what is needed as it contains one row for each stratum. My fix involves adding an extra column (n_obs) to grp_dat indicating the number of observations in each stratum. grp_dat is then passed as a new argument to dhtd() where my new column is used to make sure the result vector from dhtd() has zeros for any strata with no observations. This also means that grp_dat needs to be included in the list of 'dhtd()' args in the call to DeltaMethod() on lines 56-57 since that is where dhtd() is eventually used.

These changes are available in commit 7db535e of my fork of Distance at dfifield/Distance.

5. Example showing fixed version

The following reprex demonstrates installing my modified fork of Distance and running the reprex again, this time without the error.

# Install the fixed version from my fork. You will have to change library 
# location to suit your system.
remotes::install_github("dfifield/Distance", 
            lib = "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Downloading GitHub repo dfifield/Distance@HEAD
#> 
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>          checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ...     checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ...   ✔  checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' (770ms)
#>       ─  preparing 'Distance': (30.1s)
#>    checking DESCRIPTION meta-information ...     checking DESCRIPTION meta-information ...   ✔  checking DESCRIPTION meta-information
#>       ─  checking for LF line-endings in source and make files and shell scripts (1.7s)
#>       ─  checking for empty or unneeded directories
#>       ─  looking to see if a 'data/datalist' file should be added
#>       ─  building 'Distance_2.0.0.9001.tar.gz'
#>      
#> 

library(Distance, lib.loc =  "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#>     create.bins
library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

packageVersion("Distance")
#> [1] '2.0.0.9001'

data(minke)

# Lets add a third Region with two samples having no observations
new.row1 <- minke[1, ] %>%
  mutate(
    Region.Label = "East",
    Area = 10000,
    Sample.Label = max(minke$Sample.Label) + 1,
    Effort = 10,
    distance = NA,
    object = NA
  )

new.row2 <- minke[1, ] %>%
  mutate(
    Region.Label = "East",
    Area = 10000,
    Sample.Label = max(minke$Sample.Label) + 2,
    Effort = 10,
    distance = NA,
    object = NA
  )

minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations :  88 
#> Distance range         :  0  -  1.5 
#> 
#> Model       : Hazard-rate key function 
#> AIC         :  48.63688 
#> Optimisation:  mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s):  
#>               estimate        se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s):  
#>              estimate        se
#> (Intercept) 0.9648331 0.3605009
#> 
#>                        Estimate         SE        CV
#> Average p             0.6224396  0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#>   Region   Area CoveredArea  Effort  n  k         ER      se.ER     cv.ER
#> 1   East  10000       60.00   20.00  0  2 0.00000000 0.00000000 0.0000000
#> 2  North 630582     4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3  South  84734     1453.23  484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4  Total 725316     5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#> 
#> Abundance:
#>   Label  Estimate        se        cv      lcl       ucl       df
#> 1  East     0.000    0.0000 0.0000000    0.000     0.000  0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South  3653.345  910.0975 0.2491135 2181.595  6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#>   Label   Estimate          se        cv         lcl        ucl       df
#> 1  East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000  0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496

dht2(
  ddf = detfn,
  flatfile = minke,
  strat_formula =  ~ Region.Label,
  stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical 
#> Variance       : R2, n/L 
#> Multipliers    : none 
#> Sample fraction : 1 
#> 
#> 
#> Summary statistics:
#>  Region.Label   Area CoveredArea  Effort  n  k    ER se.ER cv.ER
#>          East  10000       60.00   20.00  0  2 0.000 0.000 0.000
#>         North 630582     4075.14 1358.38 49 12 0.036 0.013 0.365
#>         South  84734     1453.23  484.41 39 13 0.081 0.018 0.225
#>         Total 725316     5588.37 1862.79 88 27 0.047 0.011 0.236
#> 
#> Abundance estimates:
#>  Region.Label Estimate       se    cv  LCI   UCI     df
#>          East        0    0.000 0.000    0     0  1.000
#>         North    12181 4638.628 0.381 5500 26980 12.968
#>         South     3653  910.097 0.249 2182  6118 17.963
#>         Total    15835      NaN   NaN  NaN   NaN 15.255
#> 
#> Component percentages of variance:
#>  Region.Label Detection    ER
#>          East       NaN   NaN
#>         North      7.94 92.06
#>         South     18.56 81.44
#>         Total     17.13 82.87

Created on 2025-02-15 with reprex v2.1.1

Now dht2() no longer throws an error and it gives output almost identical to that from summary(detfn).

7. One final niggle?

Note that I said the output is almost identical. The only difference I can see is that the Total row cv.ER from summary(detfcn) is
0.286 whereas the corresponding value from dht2() is 0.236. Note that this difference remains even if I use the "unfixed" version 2.0.0 of Distance and also remove the stratum with no observations. Below, I rerun the analysis with the stock minke dataset using the released Version 2.0.0 of Distance and we see again that the Total cv.ER from summary() and dht2() differ. Is this a known and/or intentional difference?

library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#>     create.bins
packageVersion("Distance")
#> [1] '2.0.0'
data(minke)

detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations :  88 
#> Distance range         :  0  -  1.5 
#> 
#> Model       : Hazard-rate key function 
#> AIC         :  48.63688 
#> Optimisation:  mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s):  
#>               estimate        se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s):  
#>              estimate        se
#> (Intercept) 0.9648331 0.3605009
#> 
#>                        Estimate         SE        CV
#> Average p             0.6224396  0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#>   Region   Area CoveredArea  Effort  n  k         ER      se.ER     cv.ER
#> 1  North 630582     4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 2  South  84734     1453.23  484.41 39 13 0.08051031 0.01809954 0.2248102
#> 3  Total 715316     5528.37 1842.79 88 25 0.04133635 0.01181436 0.2858103
#> 
#> Abundance:
#>   Label  Estimate        se        cv      lcl       ucl       df
#> 1 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 2 South  3653.345  910.0975 0.2491135 2181.595  6117.971 17.96263
#> 3 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#>   Label   Estimate          se        cv         lcl        ucl       df
#> 1 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 2 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 3 Total 0.02213674 0.006758251 0.3052957 0.011726878 0.04178736 15.25496

dht2(
  ddf = detfn,
  flatfile = minke,
  strat_formula =  ~ Region.Label,
  stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical 
#> Variance       : R2, n/L 
#> Multipliers    : none 
#> Sample fraction : 1 
#> 
#> 
#> Summary statistics:
#>  Region.Label   Area CoveredArea  Effort  n  k    ER se.ER cv.ER
#>         North 630582     4075.14 1358.38 49 12 0.036 0.013 0.365
#>         South  84734     1453.23  484.41 39 13 0.081 0.018 0.225
#>         Total 715316     5528.37 1842.79 88 25 0.048 0.011 0.237
#> 
#> Abundance estimates:
#>  Region.Label Estimate       se    cv  LCI   UCI     df
#>         North    12181 4638.628 0.381 5500 26980 12.968
#>         South     3653  910.097 0.249 2182  6118 17.963
#>         Total    15835 4834.285 0.305 8388 29891 15.255
#> 
#> Component percentages of variance:
#>  Region.Label Detection    ER
#>         North      7.94 92.06
#>         South     18.56 81.44
#>         Total     17.07 82.93

Created on 2025-02-15 with reprex v2.1.1

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions