-
Notifications
You must be signed in to change notification settings - Fork 0
/
day3_fitlingmodel.Rmd
930 lines (768 loc) · 41.8 KB
/
day3_fitlingmodel.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
---
title: "Fitting a model to data"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=TRUE, autodep=TRUE, warning = FALSE,message = FALSE)
```
# Fitting a model to data
## Introducing likelihood functions
In Gadget, a likelihood function is specified by supplying likelihood components within the likelihood
file, which is referred to by the main file. Individual likelihood component scores are summed together
after being multiplied by weights to form a total likelihood score, and each likelihood score is defined by the type of likelihood component specified.
Component types define both the kind of data and likelihood function being calculated for that component.
Although we refer to 'likelihood scores' throughout this description, it should be kept in mind that these
are more accurately sums of log likelihoods, so that the objective function used for model fitting attempts to minimize the total likelihood score.
## Incorporating data (Ling case study cont.)
In introducing likelihood components, we will first go through those used within the ling example
in detail, but only mention the other types as they follow similar patterns of implementation.
To use Rgadget to set up likelihood files, the first consideration that should be made is what
data are available that would be useful for creating a likelihood component. These have been
provided for you here: in the 'data_provided' folder, you will find catch data (as introduced
with the fleets), as well as survey indices, catch distribution, and 'stock' distribution data.
Catch distributions include numbers at length as well as numbers at age and length by fleet.
These therefore include both biological samples in fishery data as well as survey data.
Each length- or age-bin has been defined as having the lower bound only inclusive, except that
the lower bound of the smallest bin and the upper bound of the largest bin are open-ended to
include all values lower or higher, respectively. These numbers are then used within Gadget
to form proportions among defined bins, which are compared to the same proportions among the
same bins simulated in the model. Stock distribution data are numbers within each stock, which
for ling is immature or mature, but could for example be gender-specific stocks. Stock distribution
numbers are similarly used to form proportions within Gadget.
As we saw when incorporating catch data into fleet files (day 2), certain attributes need to
be present in order for the proper and internally consistent Gadget aggregation files to be
created alongside these data files. For catch and stock distribution data, these include area,
length, and age aggregations. Even when there is only a single bin, these aggregations need
to be specified. For example a single area and a single age group is needed for length
distributions, in addition to the length bins. Note that bins need not be evenly distributed
across the total length range; instead they are highly flexible and can be specified as any
grouping. Therefore the user should be careful to ensure that the definitions make sense and
that bins do not overlap. In our example below, we show how to add those necessary attributes,
although in reality to do this we also use another R package named 'mfdb' that is aimed at
database management and data extraction [more on this in day 5].
```{r likelihood, message=FALSE}
library(tidyverse)
library(Rgadget)
#assuming you are beginning in a fresh session; this information should be in an
#initialization script
base_dir <- 'ling_model'
vers <- c('01-base')
gd <- normalizePath(gadget.variant.dir(sprintf(paste0("%s/",vers),base_dir)))
ling.imm <- gadgetstock('lingimm',gd)
ling.mat <- gadgetstock('lingmat',gd)
area_file <- read.gadget.file(gd, 'Modelfiles/area')
minage <- ling.imm[[1]]$minage
maxage <- ling.mat[[1]]$maxage
maxlength <- ling.mat[[1]]$maxlength
minlength <- ling.imm[[1]]$minlength
dl <- ling.imm[[1]]$dl
create_intervals <- function (prefix, vect) {
x<-
structure(vect[1:(length(vect)-1)], names = paste0(prefix, vect[1:(length(vect)-1)])) %>%
as.list(.) %>%
purrr::map(~structure(seq(.,vect[-1][which(vect[1:(length(vect)-1)]==.)],1)[-length(seq(.,vect[-1][which(vect[1:(length(vect)-1)]==.)],1))],
min = .,
max = vect[-1][which(vect[1:(length(vect)-1)]==.)]))
x[[length(x)]] <- c(x[[length(x)]], attributes(x[[length(x)]])$max) %>%
structure(.,
min = min(.),
max = max(.))
return(x)
}
#just to see what this function does:
create_intervals('len',seq(minlength, maxlength, dl)) %>% .[1:3]
```
```{r}
#catch distribution data available:
#length distributions do not regard age bins
ldist.surv <- structure(read_csv('data_provided/ldist_sur.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('all',seq(minage, maxage, maxage-minage))
)
ldist.lln <- structure(read_csv('data_provided/ldist_lln.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('all',seq(minage, maxage, maxage-minage))
)
ldist.bmt <- structure(read_csv('data_provided/ldist_bmt.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('all',seq(minage, maxage, maxage-minage))
)
ldist.gil <- structure(read_csv('data_provided/ldist_gil.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('all',seq(minage, maxage, maxage-minage))
)
attributes(ldist.surv) %>%
purrr::map(head)
```
```{r}
#age bins needed for age-length distributions
aldist.surv <- structure(read_csv('data_provided/aldist_sur.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('age',c(minage:11, maxage))
)
aldist.lln <- structure(read_csv('data_provided/aldist_lln.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('age',c(minage:11, maxage))
)
aldist.bmt <- structure(read_csv('data_provided/aldist_bmt.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('age',c(minage:11, maxage))
)
aldist.gil <- structure(read_csv('data_provided/aldist_gil.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl)),
age = create_intervals('age',c(minage:11, maxage))
)
attributes(aldist.surv)$age %>% head(3)
```
```{r}
#stock distribution data available:
matp.surv <- structure(read_csv('data_provided/matp_sur.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len',seq(minlength, maxlength, dl*2)),
age = list(mat_ages = ling.mat[[1]]$minage:ling.mat[[1]]$maxage)
)
#attributes(matp.surv)
```
For ling, length-based indices were used by choosing 7 'slices' of abundance indices across the
length range of ling observed. Index bounds are defined as above, generally inclusive of the
lower bound only, but open-ended for the smallest and largest bins. It is important to note
that only area and length attributes are included as attributes on these data frames, rather than
age, as in the distributional components. The length attribute allows `gadget_update` to recognize
these data as length-based indices, rather than age-based or acoustic indices, for example. It is
also important to keep in mind
here that each slice will have its own estimated catchability per survey (single data frame below)
and area combination (multiple areas may be within a data frame).
```{r}
#survey index data available:
slices <-
list('len20' = c(20,52), 'len52' = c(52,60), 'len60' = c(60,72),
'len72' = c(72,80), 'len80' = c(80,92), 'len92' = c(92,100),
'len100' = c(100,160) )
# just to illustrate slices:
ldist.surv %>%
mutate(length = substring(length, 4) %>% as.numeric) %>%
filter(year %in% c(2000,2005,2010,2015)) %>%
group_by(year,length) %>%
summarise(n = sum(number)) %>%
ggplot(aes(x = length, y = n)) + geom_line() +
geom_vline(aes(xintercept = length),
lty = 2, col = 'orange',
data = slices %>%
bind_rows() %>%
t() %>% .[-1,1] %>%
tibble(length = .)) +
facet_wrap(~year)
```
```{r}
# set up aggregation info for length-based survey indices
SI1 <- structure(read_csv('data_provided/SI1.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[1]]))
SI2 <- structure(read_csv('data_provided/SI2.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[2]]))
SI3 <- structure(read_csv('data_provided/SI3.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[3]]))
SI4 <- structure(read_csv('data_provided/SI4.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[4]]))
SI5 <- structure(read_csv('data_provided/SI5.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[5]]))
SI6 <- structure(read_csv('data_provided/SI6.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[6]]))
SI7 <- structure(read_csv('data_provided/SI7.csv'),
area = list(area_file[[1]]$areas) %>% set_names(.),
length = create_intervals('len', slices[[7]]))
```
### Exercise
* Change the aggregation levels of the age-length distribution data so that data are compared to model results within the bins 3 - 6, 7, 8, 9, 10 - 11, and 12 - 15.
## Setting up likelihood components
Using Rgadget, writing the files that specify the likelihood componenets is straightforward.
When calling `gadgetlikelihood` and then writing it with `write.gadget.file`, a likelihood file
is created within R and written to disc. The likelihood file contains a list of likelihood
component specifications, and within each component is where the likelihood function, data, and
weighting of the likelihood component is specified.
Here we start with the distributional components. In all cases, each component receives a distinct
name (to be referenced later so keep them short and distinct), weights (1 to yield equal weights
for now), data as imported above, and the fleet names and stock names to which these data apply.
As a default, Rgadget specifies a sum of squares function to calculate the likelihood score
(`sumofsquares`), although a variety of other functions are possible, including one for stratified
sum of squares (`stratified`), multinomial function (`multinomial`), pearson function (`pearson`),
gamma function (`gamma`), log function (`log`), multivariate normal function (`mvn`) or a
multivariate logistic function (`mvlogistic`). Many of these functions have been tailored for
specific data situations (e.g., including tagging or predation data into a likelihood component).
More on these functions can be read in the [Gadget User Guide](https://hafro.github.io/gadget/docs/userguide). To change the default setting to
the stratified sum of squares, for example, one would include an additional argument `function = 'stratified',` to the `gadget_update` call.
```{r}
lik <-
gadgetlikelihood('likelihood',gd,missingOkay = TRUE) %>%
gadget_update("catchdistribution",
name = "ldist.surv",
weight = 1,
data = ldist.surv,
fleetnames = c("surv"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "aldist.surv",
weight = 1,
data = aldist.surv,
fleetnames = c("surv"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "ldist.lln",
weight = 1,
data = ldist.lln,
fleetnames = c("lln"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "aldist.lln",
weight = 1,
data = aldist.lln,
fleetnames = c("lln"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "ldist.gil",
weight = 1,
data = ldist.gil,
fleetnames = c("gil"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "aldist.gil",
weight = 1,
data = aldist.gil,
fleetnames = c("gil"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "ldist.bmt",
weight = 1,
data = ldist.bmt,
fleetnames = c("bmt"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("catchdistribution",
name = "aldist.bmt",
weight = 1,
data = aldist.bmt,
fleetnames = c("bmt"),
stocknames = c("lingimm","lingmat")) %>%
gadget_update("stockdistribution",
name = "matp.surv",
weight = 1,
data = matp.surv,
fleetnames = c("surv"),
stocknames =c("lingimm","lingmat"))
lik
```
Note here that Rgadget automatically creates references to necessary data files ('Data/...') and
aggregation files ('Agg/...') that define the bin structure as defined by data attributes
(see previous section). When `write.gadget.file` is finally called after we are done adding
all likelihood components, these data and agg files will be written alongside this likelihood
file, and a reference to this likelihood file will also be drawn within the 'main' file.
You can also see how individual aggregation files will be printed by e.g.:
```{r}
lik[[2]]$ageaggfile
```
which gives you the age aggregation for the second likelihood component.
### Setting up survey indices
Incorporating survey index components is very similar except that 1) the fact that the survey
index is length-based is determined by the length attribute of the data frame, which sets the
argument `si_type` to a value of `lengths`, and 2) a catchability function is also specified
under the argument `'fittype'`. In addition, the argument `biomass` has a default setting of 0,
signifying a numbers-based index (rather than a biomass-based index, which has a value of 1).
Other options for `si_type` can be read about in the [Gadget User Guide](https://hafro.github.io/gadget/docs/userguide), and `ages`, `fleets`, `acoustic`, and `effort`.
The catchability function defines the type of linear regression to be used to calculate
the likelihood score. Options include a linear fit on the orignal or log scale (`linearfit`
or `loglinearfit`), linear fit with a fixed slope on the original or log scale (`fixedslopelinearfit`
or `fixedslopeloglinearfit`), linear fit with a fixed intercept on the original or log scale
(`fixedinterceptlinearfit` or `fixedinterceptlinearfit`), or a linear/loglinear function with
both the slope and intercept fixed (`fixedlinearfit` or `fixedloglinearfit`). In the first set
of cases (linear or loglinear fits), two parameters would be estimated; in the next two sets of
cases, only a single parameter is estimated (slope or intercept, whichever is not fixed). In
the last set of cases (fixed liner or loglinear fits), no parameters are estimated. Whenever a
parameter is fixed, its value needs to be included as an additional argument to `gadget_update`.
For example, catchability of the indices of smaller sized fish below are estimated using a
log linear fit with both slopes and intercepts estimated, but catchabilities for indices of
larger sized fish are calculated using log linear fits with fixed slopes. The slopes are
fixed to a value of 1, so the argument `slope = 1` is included in the `gadget_update` call.
Notice that, unlike the distributional likelihood components, a fleet is not specified
for the survey indices, unless `si_type` is set to `fleets`, `effort`, `acoustic` (rather than `length` as in our example).
When indices are not length- or age-based, a single catchability is estimated, rather
than the series of catchabilities estimated for each age or length index series.
Each survey index is
matched to stock size observations at a given time and area; therefore, if there are two
length- or age-based survey index series with the same time and area, they can each be
incorporated as separate likelihood components as long as they have different names.
```{r, results='hide'}
lik <-
lik %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[1]),
weight = 1,
data = SI1,
fittype = 'loglinearfit',
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[2]),
weight = 1,
data = SI2,
fittype = 'loglinearfit',
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[3]),
weight = 1,
data = SI3,
fittype = 'fixedslopeloglinearfit',
slope=1,
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[4]),
weight = 1,
data = SI4,
fittype = 'fixedslopeloglinearfit',
slope=1,
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[5]),
weight = 1,
data = SI5,
fittype = 'fixedslopeloglinearfit',
slope=1,
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[6]),
weight = 1,
data = SI6,
fittype = 'fixedslopeloglinearfit',
slope=1,
stocknames = c("lingimm","lingmat")) %>%
gadget_update("surveyindices",
name = paste0("si.", names(slices)[7]),
weight = 1,
data = SI7,
fittype = 'fixedslopeloglinearfit',
slope=1,
stocknames = c("lingimm","lingmat"))
lik
```
Finally, there are two components to the likelihood that do not incorporate data but
are instead used to ensure that the optimisation search algorithm remains within
reasonable ranges of the defined model. The first component provides penalties to for
exceeding parameter bounds as set by the user in the input parameter file.
In this example, the weight is set to 0.5,
but see the troubleshooting page for cases in which it may be wise to set this weight
higher. The data frame provided to this likelihood component specifies the power and
parameter weights supplied if the parameter value exceeds its upper and lower bounds.
If the likelihood of a parameter exceeds either the upper or lower bound,
the distance between the bound and the parameter value is squared and multiplied by the
parameter weights 10 000; then the sum of these scores across all parameters is
multiplied by the 0.5 weighting before being summed along with the other
likelihood components to form the total.
### Penalty functions
Understocking introduces a penalty when there are insufficent prey to met the requirements
of the predators. For example, if landings of the fleets exceeds available biomass,
then the penalty is applied. The understocking penalty likelihood is then defined as
the sum of the number of prey that are 'missing', raised to a power set by default
to 2, and multiplied by the likelihood weight of 100 as indicated below.
```{r}
lik <-
lik %>%
gadget_update("penalty",
name = "bounds",
weight = "0.5",
data = data.frame(
switch = c("default"),
power = c(2),
upperW=10000,
lowerW=10000,
stringsAsFactors = FALSE)) %>%
gadget_update("understocking",
name = "understocking",
weight = "100")
lik %>%
write.gadget.file(gd)
```
Notice that when the likelihood file is written, so are all the supporting
aggregation files under the 'Aggfiles' folder.
## Fitting a Gadget model to data
Now that all data and likelihood components are incorporated, it is possible to run `gadget_optimise` to run a fitting procedure. Like any other optimisation routine, what is still needed are 1) starting values for the parameters, and 2) optimisation routine settings.
### Create starting values
If we start here with the same simple `gadget_evaluate` simulation as in day 2 of the
ling model, we can see that a file called 'params.out' is automatically generated
(same timestamp as the simple_log). However, if you open this file, is shows that
each parameter was given a value of 1 with bounds set to {-9999, 9999} and no
optimisation (0). Therefore, this simulation was run with very unrealistic parameters.
Obviously this is not a helpful run in itself, but it is at least useful for
conveniently generating an initial parameter file with the correct Gadget
structure containing all the parameters specified throughout the Gadget model files.
```{r}
gadget_evaluate(gd,log = 'ling_log')
read.gadget.parameters(paste0(gd,'/params.out')) %>%
DT::datatable(.) #this last line not necessary - helps with printing to web page
```
Now that the structure is set, it is much easier to change the bounds and starting
values to more realistic ones using `init_guess`. This function uses the `grepl`
function internally to match names, so that large chunks of parameters with similar
names can have their bounds similarly changed.
```{r}
lw_pars <- c(0.00000228, 3.20) #same values as in day 2
## update the input parameters with sane initial guesses
read.gadget.parameters(sprintf('%s/params.out',gd)) %>%
init_guess('rec.[0-9]|init.[0-9]',1,0.001,1000,1) %>%
init_guess('recl',12,4,20,1) %>%
init_guess('rec.sd',5, 4, 20,1) %>%
init_guess('Linf',160, 100, 200,1) %>%
init_guess('k$',90, 40, 100,1) %>%
init_guess('bbin',6, 1e-08, 100, 1) %>%
init_guess('alpha', 0.5, 0.01, 3, 1) %>%
init_guess('l50',50,10,100,1) %>%
init_guess('walpha',lw_pars[1], 1e-10, 1,0) %>%
init_guess('wbeta',lw_pars[2], 2, 4,0) %>%
init_guess('M$',0.15,0.001,1,0) %>%
init_guess('rec.scalar',400,1,500,1) %>%
init_guess('init.scalar',200,1,300,1) %>%
init_guess('mat2',60,0.75*60,1.25*60,1) %>%
init_guess('mat1',70, 10, 200, 1) %>%
init_guess('init.F',0.4,0.1,1,1) %>%
write.gadget.parameters(.,file=sprintf('%s/params.in',gd))
read.gadget.parameters(sprintf('%s/params.in',gd)) %>%
DT::datatable(.) #this last line not necessary - helps with printing to web page
```
Now is a good point to scan the resulting 'params.in' file to make sure that
a parameter was not forgotten and left with default values and bounds.
#### Exercise
* Suppose that age sampling was done so that the same sampling effort was spread across all
ages equally (roughly 30 fish per length range to attain roughly 30 age samples per age).
Change the likelihood component function for age-length data to the appropriate function.
* Change the survey indices to be biomass indices using `gadget_update`.
* Use `init_guess` to change the bounds of only one set of selectivity parameters.
## Set optimisation routine
To run an optimisation via `gadget_optimize` is simple, the user provides an input parameter file with starting values
(below as `params.in = 'params.in'`), and providing a file name to which final parameter
values should be written (below as `params.out = 'params.init'`).
```{r cache=TRUE, eval = FALSE}
timestamp()
gadget_optimize(gd,params.in = 'params.in', params.out = 'params.init')
timestamp()
```
Now that reasonable bounds and starting values have been set for an
optimisation run, default settings can be used to run a short and simple
Gadget optimisation. If you open the `params.init` file, comments can be found
at the top that detail the computer and timing of the run (as in a simulation run),
as well as optimisation information (something like 'Hooke & Jeeves algorithm ran
for 1039 function evaluations'), the likelihood value at the end '1e+10' and a
reason for stopping ('because the maximum number of function evaluations was reached').
You can also see from the timestamps above that this run took roughly 30 seconds,
depending on the computer being used.
Obviously, the default settings for optimisation (1000 function evaluations
using the Hooke and Jeeves algorithm), are insufficient to find a reasonable
optimum. Gadget can also take quite a while to optimise a model (~ 6 hours for
ling on a very fast computer). Therefore, it is best to plan ahead of time on
which computer and when a model will be optimised, as well as which optimisation
methods will be used.
To change from the default settings, an opimisation file needs to be provided
to `gadget_optimize` via the `control` argument. This is the file we provide for
demonstration: it shows that we have implemented 3 consecutive optimisation
routines. The first begins with simulated annealing (and parameter settings
needed for it as well as a seed), followed by a Hooke and Jeeves algorithm
that begins where simulated annealing leaves off, and finally a BFGS algorithm.
The three routines are listed in this manner to consecutively to transition from
a broader global search (simulated annealing) to one that provides faster
(Hooke and Jeeves) and more precise (BFGS) results, in an attempt to make the
search more efficient. These three are the only algorithms currently implemented,
but others are in a development phase. Also note that a seed is set in the
beginning algorithm: it is good practice to both use random starting values
as well as a set seed to check for possible optimisation problems. More
information on parameters specific to each algorithm can be found in the [Gadget User Guide](https://hafro.github.io/gadget/docs/userguide).
Each algorithm also has a low maximum number of iterations set in the
optimisation file below, which is clearly not enough to be realistic: realistic
values used for ling are given as notes within the optinfofile and are not read
below. However, the each model presents new challenges to the search algorithm,
so in some complex cases it may be necessary to rely mostly on simulating annealing,
whereas in other more well-defined cases it may be possible to rely mostly on the
faster algorithms. At this point in development, speeding Gadget runs is mostly
done by using computers with multiple cores and faster processors.
```{r}
optinfo <- read.gadget.file('data_provided', 'optinfofile')
optinfo
```
Here, we have implemented a fitting procedure that begins where the
last default optimisation values left off (in the 'params.init' file).
The 'params.opt' file will therefore be the result of a longer and more
well-defined search as described by the 'optinfofile'.
```{r cache=TRUE, eval = FALSE}
timestamp()
gadget_optimize(gd,params.in = 'params.in', params.out = 'params.init', control = optinfo)
timestamp()
```
#### Exercise
* Try changing the parameter starting values to the params.forsim file within the 'data_provided' folder. Does the optimisation finish any quicker?
## Using the iterative reweighting function to fit a Gadget statistical model
The total objective function used the modeling process combines
equations \ref{eq:SSSI} to \ref{eq:SSalk} using the following formula:
\begin{equation}\label{eq:loglik}
l^{T} = \sum_g w_{gf}^{SI}
l_{g,S}^{SI} + \sum_{f\in \{S,T,G,L\}} \left(
w_{f}^{LD} l_{f}^{LD} + w_{f}^{AL}
l_{f}^{AL}\right) + w^{M}l^{M}
\end{equation}
where $f=S,T,G$ or $L$ denotes the spring survey, trawl, gillnet and
longline fleets respectively and $w$'s are the weights assigned to each likelihood components.
The weights, $w_i$, are necessary for several reasons. First of all it
is used to to prevent some components from dominating the likelihood
function. When one data source dominates, the model fits that one exactly
and ignores information from the other data sources. Another would be to
reduce the effect of low quality
data. It can be used as an a priori estimates of the variance in each
subset of the data.
Assigning likelihood weights is not a trivial matter; it has in the past
been the most time-consuming part of developing a Gadget model. Historicall this was
been done using some form of 'expert judgement'. General heuristics
have recently been developed to estimated these weights
objectively. Here the iterative re--weighting heuristic introduced by
@stefansson2003issues, and subsequently implemented in
@taylor2007simple, is used. Rgadget implements this heuristing using the `gadget.iterative` function.
The general idea behind the iterative re-weighting is to assign the
inverse variance of the fitted residuals as component weights. This pattern
suggests that the more variable the residuals are from a particular data
source, the less weight it should receive. The
variances, and hence the final weights, are calculated according the
following algorithm:
1. Calculate the initial sums of squares (SS) given the initial
parameterization for all likelihood components. Assign the inverse SS
as the initial weight for all likelihood components, resulting in a total initial score of 1 for each component.
2. For each likelihood component, do an optimization run with
the initial weighted SS for that component set to 10 000. This run represents
a model fit where essentially only that component contributes to model
fitting, and therefore that component receives the best fit achievable.
Then estimate the residual variance using the resulting SS of that component divided
by the degrees of freedom ($df^*$), i.e. $\hat{\sigma}^2 = \frac{SS}{df^*}$.
3. After the optimization of each component separately, set the final weight
for each component to be the inverse of the estimated variance from the step above
(weight $=1/\hat{\sigma}^2$).
The number of non-zero data-points ($df^{*}$) is used as a proxy for the degrees
of freedom. While this may be a satisfactory proxy for larger data sets, it could be
a gross overestimate of the degrees of freedom for smaller data sets.
In particular, if the survey indices are weighed on their own while the
annual recruitment is estimated, they could be over-fitted. In general,
problem such as these can be solved with component grouping. Component grouping
is achieved in step 2: likelihood components that should behave similarly, such
as survey indices representing similar age ranges, are weighted and optimized
together.
The `gadget.iterative` function implements the iterative reweighting heuristic and is invoked in the following manner:
```{r}
## In its current version, gadget.iterative needs to be run from within the
## directory that contains the Gadget model
old <- getwd()
file.copy(sprintf('%s/optinfofile','data_provided'),gd)
setwd(gd)
## run gadget iterative
## WARNING: Each time gadget.iterative runs with the given optinfofile,
## it takes 11 mins to run on a reasonably fast computer with 2 cores.
timestamp()
run <- gadget.iterative() # This works for Linux or Mac, but Windows throws an error due to incompatability in the parallel package
#run <- gadget.iterative(run.serial = TRUE) #This will work, but will take several times as long to run. For demonstration purposes, it is highly recommended to reduce the number of iterations in the optinfofile to a very small number for all algorithms.
timestamp()
```
This will start an iterative run that will estimate the weights for all likelihood
components, and run a final optimisation run with the resulting likelihood weights.
Penalising likelihood components, such as `boundlikelihood` and `understocking`, are
left as is. As the iterative reweighting requires several optimisation runs,
`gadget.iterative` tries to spread these runs across the available cores on your
computer. In the case of the number of cores lower than the number of runs, the excess
runs will wait until a core becomes available. The resulting parameter estimates are
then stored in a folder called `WGTS` in your model directory, where the file
`params.final` contains the final parameter estimate.
When running `gadget.iterative` with standard setting you assume that all
data sets have enough data to allow all parameters to be estimated without
overfitting the data. It is not always the case that this is appropriate.
For instance survey indices have in general only one data point per year,
step and size group, and when estimating annual recruitment this usually
leads overfitting of that particular component when heavily emphasized.
Similar problem can arise when only one year, or a few disjoint years,
worth of compositional data is supplied to the likelihood function. To
account for this issue the user can group similar likelihood components
together using the `grouping` argument in the `gadget.iterative` function.
When estimating annual recruitment, for example, this ensures that at least
two data points are available to estimate each of the recruitment parameters.
In the case of Ling, the survey indices were split into two groups, sizes 20
to 72 cm and 72 to 160 cm, and the compositional data from the commercial
trawl and gillnet fleets were grouped due to a low number of data points.
Note that when rerunning gadget.reiterative, it is best for file management
purposes to delete the old 'WGTS' folder before reruinning `gadget.iterative`,
for example by running `unlink("WGTS", recursive=TRUE)`.
Alternatively, if both sets of results should be kept, then the 'WGTS' folder
should be renamed by changing the default argument `wgts = 'WGTS'`. This is
especially important when visualizing model results in the next step using
`gadget.fit`.
```{r eval=FALSE}
run <- gadget.iterative(grouping = list(sind1=c('si.len20','si.len52','si.len60'),
sind2=c('si.len72','si.len80','si.len92',
'si.len100'),
comm=c('ldist.gil','ldist.bmt',
'aldist.gil','aldist.bmt')),
wgts = 'WGTS2') # if working on Windows, can run with argument run.serial = TRUE (see above)
```
Note that if a likelihood component is not present in the defined
grouping list it is assumed that the component is treated individually.
<!-- Sometimes this is not enough to prevent overfitting and therefore the -->
<!-- user can define a maximum weight of sorts by defining the minimum CV -->
<!-- for survey indices used for weight assignment. This is done using the -->
<!-- `cv.floor` argument: -->
<!-- ```{r eval=FALSE} -->
<!-- run <- gadget.iterative(cv.floor = 0.2, -->
<!-- wgts = 'WGTS3') # if working on Windows, can run with argument run.serial = TRUE (see above) -->
<!-- ``` -->
<!-- The `cv.floor` argument is often used after the final optimisation -->
If the final run has crashed it is convenient to be able to pick up the "thread"
where the run crashed and resume the final optimisation. The `resume.final`
switch allows for this and essentially only runs the last optimisation, thereby
assuming that the preceding steps have been completed:
```{r eval=FALSE}
run <- gadget.iterative(resume.final = TRUE,
wgts = 'WGTS') # if working on Windows, can run with argument run.serial = TRUE (see above)
```
Related to `resume.final` there are the `run.final` and `run.serial` arguments
that instruct the function to whether to run the final optimisation from the
beginning and whether all optimisations should be run on a single core respectively.
Note for Windows users: the user needs to explicitly define a cluster for `gadget.iterative` to run in parallel, see [here](https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/) for ways around this problem.
In post-hoc analyses of the effect of certain data sources on model fitting,
differential treatment of data is often investigated. For example, the effect
of a certain dataset can be analysed by omitting it from the estimation or changing
its grouping. `gadget.iterative` allows you to have multiple `WGTS` folders in the
same model directory by setting the `wgts` argument. In addition the user can
explicitly define (or remove) which component are to be used in the optimisation
using a combination of the `comp` and `inverse` arguments:
```{r eval=FALSE}
## only want to fit to ldist.gil, all others discarded
run <- gadget.iterative(comp = 'ldist.gil',
wgts = 'WGTS') # if working on Windows, can run with argument run.serial = TRUE (see above)
## remove ldist.gil, all others kept
run <- gadget.iterative(comp = 'ldist.gil',
inverse = TRUE,
wgts = 'WGTS') # if working on Windows, can run with argument run.serial = TRUE (see above)
```
## Visualize Gadget statistical model fit
To obtain information on the model fit and properties of the model, one can
use the `gadget.fit` function to query the model and return conveniently
structured output. Notice that there is a default argument in `gadget.fit`
that matches `gadget.reiterative`: the `wgts = 'WGTS'` argument. This means that
by default gadget.fit will load the reiterative results stored under 'WGTS',
but this can be changed. Here we load results from the first run of
`gadget.iterative` where no grouping was implemented.
```{r,message=FALSE}
setwd(gd) # Make sure we are in gd first, or put gd as an argument to gadget.fit
fit <- gadget.fit()
```
```{r message=FALSE}
theme_set(theme_light()) ## set the plot theme (optional)
library(patchwork) ## optional packages
scale_fill_crayola <- function(n = 100, ...) {
# taken from RColorBrewer::brewer.pal(12, "Paired")
pal <- c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C",
"#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00",
"#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928")
pal <- rep(pal, n)
ggplot2::scale_fill_manual(values = pal, ...)
}
```
The `fit` object is essentially a list of data.frames that contain the likelihood data merged with the model output.
```{r}
fit %>% names()
```
and one can access those data.frames simply by calling their name:
```{r}
fit$sidat %>%
DT::datatable(.) #this last line not necessary - helps with printing to web page
```
For further information on what the relevant data.frames contain refer to the help page for `gadget.fit`.
In addition a plot routine for the `fit` object is implement in Rgadget. The input to the `plot` function is simply the `gadget.fit` object, the data set one wants to plot and the type. The default plot is a survey index plot:
```{r}
plot(fit)
```
To produce a likelihood summary:
```{r}
plot(fit,data='summary')
```
A weighted summary plot:
```{r}
plot(fit,data='summary',type = 'weighted')
```
and a pie chart of likelihood components:
```{r}
plot(fit,data='summary',type='pie')
```
To plot the fit to catch proportions (either length or age) you simply do:
```{r}
tmp <- plot(fit,data = 'catchdist.fleets')
names(tmp)
```
and then plot them one by one:
```{r}
tmp$aldist.surv
```
```{r}
tmp$ldist.surv
```
One can also produce bubble plots
```{r}
bubbles <- plot(fit,data = 'catchdist.fleets',type='bubble')
names(bubbles)
```
Age bubbles
```{r}
bubbles$aldist
```
Length bubbles
```{r}
bubbles$ldist
```
One can also illustrate the fit to growth in the model:
```{r}
grplot <- plot(fit,data = 'catchdist.fleets',type='growth')
names(grplot)
```
Illstrate the fit to the autumn survey
```{r}
grplot$aldist.surv
```
And the fit to maturity data:
```{r}
plot(fit,data='stockdist')
```
And selection by year and step
```{r}
plot(fit,data="suitability")
```
Age age compostion
```{r}
plot(fit,data='stock.std') + scale_fill_crayola()
```
And the standard ICES plots
```{r}
plot(fit,data='res.by.year',type='total') + theme(legend.position = 'none') +
plot(fit,data='res.by.year',type='F') + theme(legend.position = 'none') +
plot(fit,data = 'res.by.year',type='catch') + theme(legend.position = 'none') +
plot(fit, data='res.by.year',type='rec')
```
#### Exercise
* Try plotting the catchabilities on the normal scale (exponentiated, not log as given) across ages for each survey. What shape are they when plotted?
* What information is stored under fit$SS?
* Try changing the grouping structure above to one that makes more sense to you. Do model results fit any better?
```{r, eval = FALSE}
fit$sidat %>%
select(lower, intercept) %>%
distinct() %>%
mutate(catchability = exp(intercept)) %>%
ggplot(aes(x = lower, y = catchability)) +
geom_line()
fit$SS
```