dht2() with stratification causes error when one or more strata have no observations - includes fix

### 1. Summary

- dht2() will throw an error when it tries to provide stratified estimates if one (or more) of the strata have no observations but do have effort (ie samples). 
- This previously worked fine in Distance 1.0.5 but became broken in 1.0.6. The commit that broke it was [2e18d38](https://github.com/DistanceDevelopment/Distance/commit/2e18d382e67bb0743428bd68fb34df8cfc094af9) on June 16, 2022.
- This is the same problem as originally reported [here](https://groups.google.com/g/distance-sampling/c/_StxeF_DCd8/m/SVIUAtFsDgAJ) on the Distance Sampling google group by Shanti Davis and Caroline Fox.
- I understand that Shanti has emailed @LHMarshall some example code and data, but I haven't seen it entered as an issue here yet. So, in an effort to save you folks some time, I went ahead and did some debugging and found the problem and have a suggested fix - see below.

### 2. Details
- The error ultimately occurs because `dhtd()` (defined within `varNhat()` in `varNhat.R`)  no longer returns a vector of the correct length containing estimated Nhats for **ALL** strata, including zeros for any strata that had no observations, as it did in version 1.0.5. Instead, since version 1.0.6, it returns a vector which only contains Nhats for strata that had observations, i.e., there are no zeros for the "empty" strata in this vector as of version 1.0.6. Thus the vector returned by `dht()` is short one element for each stratum that had no observations.
- `dhtd` is then used by `DeltaMethod()` in this code starting on line 56 of `varNhat.R`:

![Image](https://github.com/user-attachments/assets/ca5a483c-676e-4871-adfb-ae6f6880097b)

- however, because of the shortened vector returned by `dhtd()`, the resulting variance-covariance matrix in `dm$variance` has missing rows and columns for any strata that had no observations.
- the error ultimately occurs on line 62 (`vardat_str$df_var <- diag(dm$variance)` because the length of `diag(dm$variance)` is shorter than the number of rows of `vardat_str`

### 3. Reprex
The following reprex reproduces the error. Here I have used the minke dataset and added a third Region ("East") containing two sample lines with no observations.

``` r
library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#> create.bins
library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#> filter, lag
#> The following objects are masked from 'package:base':
#> 
#> intersect, setdiff, setequal, union
packageVersion("Distance")
#> [1] '2.0.0'

data(minke)

# Lets add a third Region with two samples having no observations. Adding two samples instead of just one to prevent a warning 
# message about being unable to estimate encounter rate variance with just one sample in the East stratum.
new.row1 <- minke[1, ] %>%
 mutate(
 Region.Label = "East",
 Area = 10000,
 Sample.Label = max(minke$Sample.Label) + 1,
 Effort = 10,
 distance = NA,
 object = NA
 )

new.row2 <- minke[1, ] %>%
 mutate(
 Region.Label = "East",
 Area = 10000,
 Sample.Label = max(minke$Sample.Label) + 2,
 Effort = 10,
 distance = NA,
 object = NA
 )

minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations : 88 
#> Distance range : 0 - 1.5 
#> 
#> Model : Hazard-rate key function 
#> AIC : 48.63688 
#> Optimisation: mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s): 
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s): 
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#> 
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 East 10000 60.00 20.00 0 2 0.00000000 0.00000000 0.0000000
#> 2 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4 Total 725316 5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#> 
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.000 0.0000 0.0000000 0.000 0.000 0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000 0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496

# Now try dht2 with stratification - This will cause the error.
try(dht2(
 ddf = detfn,
 flatfile = minke,
 strat_formula = ~ Region.Label,
 stratification = "geographical"
))
#> Error in `$<-`(`*tmp*`, "df_var", value = c(1709104.11609035, 153728.436082082 : 
#> Assigned data `diag(dm$variance)` must be compatible with existing data.
#> âœ– Existing data has 3 rows.
#> âœ– Assigned data has 2 rows.
#> â„¹ Only vectors of size 1 are recycled.
#> Caused by error in `vectbl_recycle_rhs_rows()`:
#> ! Can't recycle input of size 2 to size 3.
```

Created on 2025-02-15 with [reprex v2.1.1](https://reprex.tidyverse.org)

### 4. Proposed fix
My proposed fix is to restore the former (presumably correct??) behaviour of `dhtd()` by making sure that it includes a 0 in its return vector for any stratum that has no observations. In version 1.0.5, `dhtd()` accomplished this by initializing its result vector to be the correct length based on the length of the `ind` argument which was a list with one element per stratum. Since that time, the structure of the code for `dhtd()` (and for `varNhat()` in general) has changed a bit and variable `ind` no longer exists. However, variable `grp_dat` can provide what is needed as it contains one row for each stratum. My fix involves adding an extra column (`n_obs`) to `grp_dat` indicating the number of observations in each stratum. `grp_dat` is then passed as a new argument to `dhtd()` where my new column is used to make sure the result vector from `dhtd()` has zeros for any strata with no observations. This also means that `grp_dat` needs to be included in the list of 'dhtd()' args in the call to `DeltaMethod()` on lines 56-57 since that is where `dhtd()` is eventually used. 

These changes are available in commit [7db535e](https://github.com/dfifield/Distance/commit/7db535eec2c5c4cb67ec705d541262528a8e8cdf) of my fork of Distance at [dfifield/Distance](https://github.com/dfifield/Distance).

### 5. Example showing fixed version
The following reprex demonstrates installing my modified fork of Distance and running the reprex again, this time without the error.


``` r
# Install the fixed version from my fork. You will have to change library 
# location to suit your system.
remotes::install_github("dfifield/Distance", 
 lib = "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Downloading GitHub repo dfifield/Distance@HEAD
#> 
#> â”€â”€ R CMD build â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
#> checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ... checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' ... âœ” checking for file 'E:\SYS_TEMP\Rtmp6FkyjU\remotes98f465eb401f\dfifield-Distance-b93ec6f/DESCRIPTION' (770ms)
#> â”€ preparing 'Distance': (30.1s)
#> checking DESCRIPTION meta-information ... checking DESCRIPTION meta-information ... âœ” checking DESCRIPTION meta-information
#> â”€ checking for LF line-endings in source and make files and shell scripts (1.7s)
#> â”€ checking for empty or unneeded directories
#> â”€ looking to see if a 'data/datalist' file should be added
#> â”€ building 'Distance_2.0.0.9001.tar.gz'
#> 
#> 

library(Distance, lib.loc = "C:/Users/fifieldd/Documents/Offline/R/Rlibs_fixed")
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#> create.bins
library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#> filter, lag
#> The following objects are masked from 'package:base':
#> 
#> intersect, setdiff, setequal, union

packageVersion("Distance")
#> [1] '2.0.0.9001'

data(minke)

# Lets add a third Region with two samples having no observations
new.row1 <- minke[1, ] %>%
 mutate(
 Region.Label = "East",
 Area = 10000,
 Sample.Label = max(minke$Sample.Label) + 1,
 Effort = 10,
 distance = NA,
 object = NA
 )

new.row2 <- minke[1, ] %>%
 mutate(
 Region.Label = "East",
 Area = 10000,
 Sample.Label = max(minke$Sample.Label) + 2,
 Effort = 10,
 distance = NA,
 object = NA
 )

minke <- rbind(minke, new.row1, new.row2)
detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations : 88 
#> Distance range : 0 - 1.5 
#> 
#> Model : Hazard-rate key function 
#> AIC : 48.63688 
#> Optimisation: mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s): 
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s): 
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#> 
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 East 10000 60.00 20.00 0 2 0.00000000 0.00000000 0.0000000
#> 2 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 3 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 4 Total 725316 5588.37 1862.79 88 27 0.04076644 0.01165147 0.2858103
#> 
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.000 0.0000 0.0000000 0.000 0.000 0.00000
#> 2 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 3 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 4 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 East 0.00000000 0.000000000 0.0000000 0.000000000 0.00000000 0.00000
#> 2 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 3 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 4 Total 0.02183154 0.006665074 0.3052957 0.011565198 0.04121123 15.25496

dht2(
 ddf = detfn,
 flatfile = minke,
 strat_formula = ~ Region.Label,
 stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical 
#> Variance : R2, n/L 
#> Multipliers : none 
#> Sample fraction : 1 
#> 
#> 
#> Summary statistics:
#> Region.Label Area CoveredArea Effort n k ER se.ER cv.ER
#> East 10000 60.00 20.00 0 2 0.000 0.000 0.000
#> North 630582 4075.14 1358.38 49 12 0.036 0.013 0.365
#> South 84734 1453.23 484.41 39 13 0.081 0.018 0.225
#> Total 725316 5588.37 1862.79 88 27 0.047 0.011 0.236
#> 
#> Abundance estimates:
#> Region.Label Estimate se cv LCI UCI df
#> East 0 0.000 0.000 0 0 1.000
#> North 12181 4638.628 0.381 5500 26980 12.968
#> South 3653 910.097 0.249 2182 6118 17.963
#> Total 15835 NaN NaN NaN NaN 15.255
#> 
#> Component percentages of variance:
#> Region.Label Detection ER
#> East NaN NaN
#> North 7.94 92.06
#> South 18.56 81.44
#> Total 17.13 82.87
```

Created on 2025-02-15 with [reprex v2.1.1](https://reprex.tidyverse.org)

Now `dht2()` no longer throws an error and it gives output _almost_ identical to that from `summary(detfn)`.

### 7. One final niggle?

Note that I said the output is _almost_ identical. The only difference I can see is that the Total row cv.ER from `summary(detfcn)` is 
0.286 whereas the corresponding value from `dht2()` is 0.236. Note that this difference remains even if I use the "unfixed" version 2.0.0 of Distance and also remove the stratum with no observations. Below, I rerun the analysis with the stock minke dataset using the released Version 2.0.0 of Distance and we see again that the Total cv.ER from `summary()` and `dht2()` differ. Is this a known and/or intentional difference?

``` r
library(Distance)
#> Loading required package: mrds
#> This is mrds 3.0.0
#> Built: R 4.4.2; ; 2025-02-11 02:34:55 UTC; windows
#> 
#> Attaching package: 'Distance'
#> The following object is masked from 'package:mrds':
#> 
#> create.bins
packageVersion("Distance")
#> [1] '2.0.0'
data(minke)

detfn <- ds(data=minke, truncation=1.5, key="hr", adjustment=NULL)
#> Fitting hazard-rate key function
#> AIC= 48.637
summary(detfn)
#> 
#> Summary for distance analysis 
#> Number of observations : 88 
#> Distance range : 0 - 1.5 
#> 
#> Model : Hazard-rate key function 
#> AIC : 48.63688 
#> Optimisation: mrds (nlminb) 
#> 
#> Detection function parameters
#> Scale coefficient(s): 
#> estimate se
#> (Intercept) -0.2967912 0.1765812
#> 
#> Shape coefficient(s): 
#> estimate se
#> (Intercept) 0.9648331 0.3605009
#> 
#> Estimate SE CV
#> Average p 0.6224396 0.0668011 0.1073214
#> N in covered region 141.3791718 17.7757784 0.1257312
#> 
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 North 630582 4075.14 1358.38 49 12 0.03607238 0.01317937 0.3653591
#> 2 South 84734 1453.23 484.41 39 13 0.08051031 0.01809954 0.2248102
#> 3 Total 715316 5528.37 1842.79 88 25 0.04133635 0.01181436 0.2858103
#> 
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 North 12181.419 4638.6283 0.3807954 5499.950 26979.692 12.96781
#> 2 South 3653.345 910.0975 0.2491135 2181.595 6117.971 17.96263
#> 3 Total 15834.764 4834.2848 0.3052957 8388.423 29891.167 15.25496
#> 
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 North 0.01931774 0.007356106 0.3807954 0.008722023 0.04278538 12.96781
#> 2 South 0.04311546 0.010740641 0.2491135 0.025746391 0.07220207 17.96263
#> 3 Total 0.02213674 0.006758251 0.3052957 0.011726878 0.04178736 15.25496

dht2(
 ddf = detfn,
 flatfile = minke,
 strat_formula = ~ Region.Label,
 stratification = "geographical"
)
#> Abundance estimates from distance sampling
#> Stratification : geographical 
#> Variance : R2, n/L 
#> Multipliers : none 
#> Sample fraction : 1 
#> 
#> 
#> Summary statistics:
#> Region.Label Area CoveredArea Effort n k ER se.ER cv.ER
#> North 630582 4075.14 1358.38 49 12 0.036 0.013 0.365
#> South 84734 1453.23 484.41 39 13 0.081 0.018 0.225
#> Total 715316 5528.37 1842.79 88 25 0.048 0.011 0.237
#> 
#> Abundance estimates:
#> Region.Label Estimate se cv LCI UCI df
#> North 12181 4638.628 0.381 5500 26980 12.968
#> South 3653 910.097 0.249 2182 6118 17.963
#> Total 15835 4834.285 0.305 8388 29891 15.255
#> 
#> Component percentages of variance:
#> Region.Label Detection ER
#> North 7.94 92.06
#> South 18.56 81.44
#> Total 17.07 82.93
```

Created on 2025-02-15 with [reprex v2.1.1](https://reprex.tidyverse.org)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dht2() with stratification causes error when one or more strata have no observations - includes fix #194

1. Summary

2. Details

3. Reprex

4. Proposed fix

5. Example showing fixed version

7. One final niggle?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

dht2() with stratification causes error when one or more strata have no observations - includes fix #194

Description

1. Summary

2. Details

3. Reprex

4. Proposed fix

5. Example showing fixed version

7. One final niggle?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions