
Submission - melt: Multiple Empirical Likelihood Tests #550

Closed
12 of 13 tasks
markean opened this issue Jul 15, 2022 · 52 comments

@markean
Member

markean commented Jul 15, 2022

Date accepted: 2022-09-20

Submitting Author Name: Eunseop Kim
Submitting Author Github Handle: @markean
Repository: https://github.com/markean/melt
Version submitted: 1.6.0.9000
Submission type: Stats
Badge grade: silver
Editor: @Paula-Moraga
Reviewers: @pchausse, @awstringer1

Due date for @pchausse: 2022-09-05

Due date for @awstringer1: 2022-08-29
Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Type: Package
Package: melt
Title: Multiple Empirical Likelihood Tests
Version: 1.6.0.9000
Authors@R: c(
    person("Eunseop", "Kim", , "kim.7302@osu.edu", role = c("aut", "cre")),
    person("Steven", "MacEachern", role = c("ctb", "ths")),
    person("Mario", "Peruggia", role = c("ctb", "ths"))
  )
Description: Performs multiple empirical likelihood tests for linear and
    generalized linear models. The core computational routines are
    implemented using the 'Eigen' C++ library and 'RcppEigen' interface,
    with OpenMP for parallel computation. Details of multiple testing
    procedures are given in Kim, MacEachern, and Peruggia (2021)
    <arxiv:2112.09206>.
License: GPL (>= 2)
URL: https://github.com/markean/melt, https://markean.github.io/melt/
BugReports: https://github.com/markean/melt/issues
Depends: 
    R (>= 4.0.0)
Imports: 
    graphics,
    methods,
    Rcpp,
    stats
Suggests: 
    covr,
    ggplot2,
    knitr,
    microbenchmark,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0),
    withr
LinkingTo: 
    BH,
    dqrng,
    Rcpp,
    RcppEigen
VignetteBuilder: 
    knitr
Config/testthat/edition: 3
Encoding: UTF-8
Language: en-US
LazyData: true
NeedsCompilation: yes
Roxygen: list (markdown = TRUE, roclets = c ("namespace", "rd",
    "srr::srr_stats_roclet"))
RoxygenNote: 7.2.0

Pre-submission Inquiry

  • A pre-submission inquiry has been approved in issue #549

General Information

  • Who is the target audience and what are scientific applications of this package?

  • Paste your responses to our General Standard G1.1 here, describing whether your software is:
    The package attempts the first implementation of the nested bilevel optimization approach within R to compute constrained empirical likelihood. The inner layer Newton-Raphson method for empirical likelihood is written in C++, enabling faster computation than other routines written in R.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
    Yes.

Badging

  1. Compliance with a good number of standards beyond those identified as minimally necessary.
  2. Have a demonstrated generality of usage beyond one single envisioned use case.
    In my opinion, the generality comes from the applicability of empirical likelihood methods to (generalized) linear models and the ability to test linear hypotheses of choice.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?
    The package is on CRAN.
  • Do you intend for this package to go on Bioconductor?

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for melt (v1.6.0.9000)

git hash: 1b9bbebd

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 99.8%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: GPL (>= 2)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Regression and Supervised Learning

✔️ All applicable standards [v0.1.0] have been documented in this package (86 complied with; 30 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 146
internal melt 66
imports methods 83
imports stats 66
imports graphics 8
imports Rcpp NA
suggests covr NA
suggests ggplot2 NA
suggests knitr NA
suggests microbenchmark NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
suggests withr NA
linking_to BH NA
linking_to dqrng NA
linking_to Rcpp NA
linking_to RcppEigen NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (15), q (14), attr (13), warning (13), numeric (12), c (7), nrow (7), integer (6), length (5), print (5), as.integer (4), match.call (4), names (4), ncol (4), tryCatch (4), do.call (3), if (3), mean (3), as.vector (2), cbind (2), logical (2), match (2), rownames (2), as.matrix (1), colMeans (1), is.null (1), matrix (1), NROW (1), rbind (1), sqrt (1), sum (1), t (1), table (1)

methods

setGeneric (36), setMethod (36), setClass (11)

melt

error (13), el_control (8), validate_optim (4), calibrate (3), compute_EL (3), validate_weights (3), compute_generic_EL (2), get_max_threads (2), test_GLM (2), test_LM (2), validate_family (2), compute_bootstrap_calibration (1), compute_confidence_intervals (1), compute_confidence_region (1), compute_ELD (1), el_eval (1), el_glm (1), el_lm (1), el_mean (1), el_sd (1), get_rank (1), test_hypothesis (1), test_multiple_hypotheses (1), validate_alpha (1), validate_b (1), validate_calibrate (1), validate_cv (1), validate_hypotheses (1), validate_hypothesis (1), validate_keep_data (1), validate_level (1), validate_lhs (1), validate_x (1)

stats

formula (14), df (8), offset (6), optim (6), na.action (5), weights (5), family (4), pchisq (3), step (3), contrasts (2), model.response (2), model.weights (2), sd (2), glm.fit (1), pf (1), qchisq (1), qf (1)

graphics

par (8)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (53% in 15 files) and R (47% in 29 files)
  • 1 authors
  • 2 vignettes
  • 1 internal data file
  • 4 imported packages
  • 59 exported functions (median 4 lines of code)
  • 120 non-exported functions in R (median 7 lines of code)
  • 91 R functions (median 14 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 29 88.8
files_src 15 95.4
files_vignettes 2 85.7
files_tests 17 95.3
loc_R 1629 80.0
loc_src 1827 74.0
loc_vignettes 129 33.7
loc_tests 1243 89.7
num_vignettes 2 89.2
data_size_total 1311 61.2
data_size_median 1311 65.8
n_fns_r 179 88.1
n_fns_r_exported 59 90.2
n_fns_r_not_exported 120 87.0
n_fns_src 91 78.1
n_fns_per_file_r 3 55.1
n_fns_per_file_src 6 55.6
num_params_per_fn 2 11.9
loc_per_fn_r 7 16.0
loc_per_fn_r_exp 4 4.3 TRUE
loc_per_fn_r_not_exp 7 18.0
loc_per_fn_src 14 46.6
rel_whitespace_R 8 59.8
rel_whitespace_src 5 49.5
rel_whitespace_vignettes 8 6.5
rel_whitespace_tests 8 72.2
doclines_per_fn_exp 34 40.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 253 90.8

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

check-standard.yaml

GitHub Workflow Results

id name conclusion sha run_number date
2673111985 pages build and deployment success 6ac77c 46 2022-07-14
2673079345 pkgcheck failure 1b9bbe 51 2022-07-14
2673079344 pkgdown success 1b9bbe 49 2022-07-14
2673079342 R-CMD-check success 1b9bbe 71 2022-07-14
2673079354 test-coverage success 1b9bbe 153 2022-07-14

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
    installed size is 77.8Mb
    sub-directories of 1Mb or more:
    libs 77.2Mb

R CMD check generated the following check_fail:

  1. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 99.78

Cyclocomplexity with cyclocomp

The following function has cyclocomplexity >= 15:

function cyclocomplexity
el_glm 18

Static code analyses with lintr

lintr found the following 5 potential issues:

message number of times
Avoid library() and require() calls in packages 4
Lines should not be more than 80 characters. 1


Package Versions

package version
pkgstats 0.1.1.3
pkgcheck 0.0.3.76
srr 0.0.1.167


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

@emilyriederer

Hi @markean - I just wanted to provide a brief update here. I apologize for the delay in next steps here; it's not for lack of excitement to kick off this review! I hope to have a handling editor ready to assign next week.

@markean
Member Author

markean commented Jul 22, 2022

Thanks for the update @emilyriederer!

@emilyriederer

@ropensci-review-bot assign @Paula-Moraga as editor

@ropensci-review-bot
Collaborator

Assigned! @Paula-Moraga is now the editor

@emilyriederer

@markean , I'm delighted to share that @Paula-Moraga will serve as editor for this package. She is away this week but is available to commence next week. We will be in more touch soon!

@markean
Member Author

markean commented Jul 28, 2022

@emilyriederer, thanks for the news!

@Paula-Moraga

Hi @markean. I am pleased to handle this package. I will look for reviewers.

@Paula-Moraga

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

I'm sorry @Paula-Moraga, I'm afraid I can't do that. That's something only editors are allowed to do.

@markean
Member Author

markean commented Aug 3, 2022

Hi @Paula-Moraga. Thank you for your time and effort. Please let me know if there is anything I can help with in the review process.
@emilyriederer Could you update the label?

@emilyriederer

Done! I'll ask the bot team on Slack if they know why @Paula-Moraga couldn't do this herself since she is, in fact, the editor!

@Paula-Moraga

@ropensci-review-bot assign @pchausse as reviewer

@ropensci-review-bot
Collaborator

@pchausse added to the reviewers list. Review due date is 2022-08-25. Thanks @pchausse for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@pchausse: If you haven't done so, please fill this form for us to update our reviewers records.

@Paula-Moraga

@ropensci-review-bot set due date for @pchausse to 2022-09-05

@ropensci-review-bot
Collaborator

Review due date for @pchausse is now 05-September-2022

@markean
Member Author

markean commented Aug 5, 2022

Hi @pchausse. Thank you so much for reviewing our package. Please let me know if you have any questions. A draft version of the PDF vignette in the vignettes folder might be of help.

@Paula-Moraga

@ropensci-review-bot assign @awstringer1 as reviewer

@ropensci-review-bot
Collaborator

@awstringer1 added to the reviewers list. Review due date is 2022-08-29. Thanks @awstringer1 for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@awstringer1: If you haven't done so, please fill this form for us to update our reviewers records.

@markean
Member Author

markean commented Aug 9, 2022

Hi @awstringer1! Thank you for reviewing the package. Please let me know if you have any questions. A draft version of the PDF vignette in the vignettes folder might be of help.

@pchausse and @awstringer1: Due to an external request, a newer version of the package will be released to CRAN in August with some minor updates. It will not interfere with the review process, but I will tag it after the release. The reviewer comments will then be reflected in the package in a subsequent release.

@ropensci-review-bot
Collaborator

📆 @awstringer1 you have 2 days left before the due date for your review (2022-08-29).

@Paula-Moraga

Many thanks @awstringer1 for your review! You can of course ignore the previous bot message :)

@Paula-Moraga

@ropensci-review-bot submit review #550 (comment) time 8

@ropensci-review-bot
Collaborator

Logged review for awstringer1 (hours: 8)

@Paula-Moraga

Many thanks @markean for answering @awstringer1's comments so quickly. @awstringer1 can tell you whether he is satisfied with them or has additional comments. @pchausse will submit his review soon. Thanks all!

@pchausse

pchausse commented Sep 2, 2022

melt-review-chausse-exported.pdf

Here is my review. I tried to make some of my points with R code, but I don't think it renders well here, so I will also attach the PDF.

My review is based on Version 1.7.0 from CRAN.

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Compliance with Standards

  • This package complies with a sufficient number of standards for
    a silver badge

  • This grade of badge is the same as what the authors wanted to
    achieve

The following standards currently deemed non-applicable (through tags
of @srrstatsNA) could potentially be applied to future versions of
this software:

G1.6 Software should include code necessary to compare performance
claims with alternative implementations in other R packages.: Other
packages compute empirical likelihood tests and confidence
intervals. We would like to see how the melt package compares with
the others. For example, my momentfit (or gmm) package produces
the same type of confidence intervals or confidence areas for linear
models and the mean. The package emplik now includes regression
models.

  • This package extends beyond minimal compliance with standards in
    the following ways: (please describe)

This is clearly not a package suited to only a single application. The
algorithm for solving restricted empirical likelihood models is
efficient and can be applied to many more models not yet implemented
by the package. For example, the author seems to discard the idea of
applying the method to over-identified models (from the NOT TO DO
section of CONTRIBUTING.md), but it could easily be implemented since
constrained empirical likelihood is in fact over-identified. I could
also see other packages being based on melt for specific applications.

Comments

"@srrstats {RE1.4} Some of the core distributional assumptions are
documented."
: The list of regularity conditions is not that long. We
do see some distribution assumptions in the EL documentation where
the EL method is presented for general estimating equations
$E[g(X,\theta)]=0$, but it would be important to add at least the
assumption on the variance of $g(X,\theta)$. Users using only
el_mean or el_sd should know what distribution assumptions are
required. For these two functions, the detail section is empty. At
least, we should see what the estimating equations are and then be
given a link to the EL documentation for more details.
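
For reference, the standard formulation (not specific to melt) defines the empirical likelihood ratio for a $q$-dimensional estimating function $g$ as

$$R(\theta) = \max\Big\{\prod_{i=1}^n np_i \;:\; p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i\,g(X_i,\theta) = 0\Big\},$$

and the Wilks-type result $-2\log R(\theta_0) \to \chi^2_q$ requires, among the usual regularity conditions, that $\mathrm{Var}[g(X,\theta_0)]$ is finite and positive definite. This is the kind of statement the comment above asks to surface in the function-level documentation.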

"@srrstats {RE4.11} logLik() method extracts the empirical
log-likelihood value which can be interpreted as a measure of
goodness-of-fit"
: It is really not clear how to interpret the
value. Can we have a more detailed explanation? First, it is the
empirical likelihood value of what? For example, if we estimate a
linear model:

library(melt)
set.seed(112233)
x <- rnorm(40); y <- rnorm(40)
fit <- el_lm(y ~ x)
print(logLik(fit), digits = 10)

This is the log-likelihood of what? If I understand correctly (it is
not clear from the documentation), it is not the one from the model
that restricts the slope coefficient to be 0, which is what we have
in the slot logl or what we get if we sum the log of the implied
probabilities:

print(fit@logl, digits=10)
print(sum(fit@logp), digits=10)

After more testing, it seems to be the one from the unrestricted model
for which the implied probabilities are all equal to $1/n$:

print(40*log(1/40), digits=10)

I think this needs to be explained in more detail.

"@srrstats {RE6.0} plot method is available for an ELD object
that is returned by the eld() method."
: The authors may consider
adding some diagnostic-type plot methods for EL (or any other)
class as we have for lm objects. It is great to have a plot for
eld() to detect outliers, but it would be even better I think if
the plot was applied directly to EL class. For example, a Cook
Distance exists separately for lm, but we can plot it directly
from lm objects by selecting which=4:

## clearly not the best method, but you see the idea
setMethod("plot", "EL", function(x, which = 1, ...) {
    if (which==1)
        plot(eld(x))
    else
        cat("other diagnostic plots if any\n")
})

plot(fit, which=1)

General Review

This is a very well designed package. The outputs of the different
estimating or testing methods are easy to read and the algorithms seem
to be quite efficient. The following are mostly suggestions that in my
view could improve the package.

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

I think we should see a more detailed presentation of why one would
need to use the melt package to perform hypothesis tests or to
construct confidence intervals. For example, when does empirical
likelihood start to produce confidence intervals with better coverage
than lm does for linear regression? In the example section of
el_lm, we see an application with iid and normally distributed
variables. In this case, even when the sample size is very small, a
least-squares confidence interval has exact coverage while EL does
not. So why should we use the melt package?

  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING
  • The documentation is sufficient to enable general use of the package beyond one specific use case

There is room for improvement here.

The optim slot

We see that a slot optim exists in the LM object created by el_lm
through the CEL-class. But it is not clear from the documentation what
these values are.

set.seed(5649)
df <- data.frame(y = rnorm(50), x = rnorm(50))
fit <- el_lm(y ~ x, df)
slotNames(fit)

From the CEL-class documentation, it says that par is $\hat{\theta}$
(subject to the constraints which we do not have here) and lambda is
$\hat{\lambda}$. But it is not:

coef(fit)
fit@optim$par
fit@optim$lambda

What is par? What is lambda? Since the model is just-identified,
the latter should be a vector of zeroes when there are no
constraints. Are they the estimates when we restrict all coefficients
(except the intercept) to be zero? Can that be added to the
documentation? If we explore further, it seems that this is the
case:

set.seed(5649)
df2 <- data.frame(y = rnorm(50), x = rnorm(50), z = rnorm(50))
fit2 <- el_lm(y ~ x+z, df2)
fit2@optim$par

I think it is important information that may be useful for some users.

In el_lm and el_glm (and maybe others?), it is written in the
details that the test of the null
$H_0:~\theta_1=\theta_2=\cdots=\theta_{p-1}=0$ is in optim and the
parameter-by-parameter tests are in sigTests, but there is no test
result in optim. It only includes the parameters, the lambdas, the
number of iterations, and the convergence code. Also, why is it
important to return the implied probabilities (the logp slot) and
the lambdas for the joint test but not for the individual ones?

The other EL slots

The slots are documented in EL, but it is not easy to find them for
other objects. For example, in the el_glm documentation, the Value is
"An object of class GLM". If we click on it, we only see that it
inherits from LM. If we click on that, we see some slots, but not
all. It inherits from CEL, so we click on that and only the optim
slot is defined. We finally get the description of all slots by
clicking on EL. It is not an easy process for someone like me who
works only in terminals; it requires calling help() several times. It
would be helpful if the slots were defined for all objects that
inherit from EL.

Also, it is not always clear what the slots are. In EL, we see "A
numeric vector of the log probabilities obtained from empirical
likelihood" for the logp slot. But which probabilities? I figured
out for el_lm that they are for the restricted model in which all
parameters except the intercept are 0. This should be specified. It
is the same for logl, loglr, etc. This is something we see across the
different help files. It is another reason the author should define
all slots for all functions that return an EL object.

The Empirical Likelihood method

I would expect a little more detail about the EL method. It is well
presented in the EL documentation for general estimating equations
$g(X,\theta)$, but I think it would be helpful for the method to also
be explained in el_lm, el_glm, el_sd, el_mean and even
el_eval. The EL method does not need to be repeated, but at least we
should see the specific estimating equation used for each model and a
link to EL for the details. For example, it is not clear to me which
estimating equation is used for the standard deviation or for
generalized linear models.

It is not clear what the purpose of el_eval is. It says that it is
for general estimating equations, but the input is a matrix, so it
is an estimating equation without parameters. The exact same result
can be obtained with el_mean: the following two produce identical
optim, logp, logl, and loglr:

x <- matrix(rnorm(200), 50, 4)
fit <- el_eval(x)
fit2 <- el_mean(x,rep(0,4))

So why does el_eval not simply call el_mean? If the output has to
be a list, it can be returned as such.


Other comments

As a suggestion, the package could be greatly improved by adding more
general nonlinear regression models. The model in el_glm is already
nonlinear, so it should not be too difficult to include nonlinear
regression as in nls, or even more general cases. With a method to
estimate general nonlinear models, it becomes straightforward to test
nonlinear restrictions, at least those that can be written as
$H_0:~\theta_i = f(\theta_{-i})$. This should not be too difficult to
implement because it only requires replacing $\theta_i$ by
$f(\theta_{-i})$ in the estimating equation function $g(X,\theta)$.

Many recent applications rely on nonlinear estimating equations that
are often over-identified. This is the case for all methods recently
developed in the missing-data literature (or the causal effect
literature, which is just a special case).

Algorithms

The confint method

It seems that all methods for the EL, QGLM, and SD classes are
almost identical and could be reduced to a single method applied to a
union class. I can see some benefits from avoiding repeating the same
code in several methods, especially if changes need to be applied to
them.

Why is getMethodEL(object) used to get the type of method for the
EL and QGLM classes but not for the SD class? Is it because the
getMethodEL method does not exist for that class? For the latter, the
type is explicitly added (see line 214 of
confint-methods.R).

The elt method

It would be nice to see the hypothesis being tested when the result
is printed (see, for example, the linearHypothesis function from the
car package).
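
For illustration, this is the kind of echo being suggested; the snippet uses car on an ordinary lm fit and is not melt code:

library(car)
fit0 <- lm(mpg ~ wt + hp, data = mtcars)
linearHypothesis(fit0, c("wt = 0", "hp = 0"))
# The printed result restates the hypothesis ("wt = 0", "hp = 0")
# before showing the test statistic and p-value.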

Testing

I don't see any issues with the different tests.

Visualisation (where appropriate)

No issue here other than the above suggestion about the plot method.

Package Design

  • As I mentioned above, it is a well-designed package. If we are
    looking for EL confidence intervals for traditional models, which is
    exactly the purpose as described by the author, it does a great
    job. It is also one of the most efficient algorithms I have tried
    for estimating restricted empirical likelihood models.
  • In relation to External Design: The objects created by most
    estimating functions have methods that produce output comparable to
    lm, glm, and nls. We can then easily compare the results.

  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 16

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

@Paula-Moraga

Thanks so much @pchausse for your review!

@Paula-Moraga

@ropensci-review-bot submit review #550 (comment) time 16

@ropensci-review-bot
Collaborator

Logged review for pchausse (hours: 16)

@markean
Member Author

markean commented Sep 8, 2022

Hi @pchausse, thank you for your review and feedback. Your comments have greatly improved the documentation. Some notable code updates: a new logProb() function has been added to extract the logp slot; plot() is now directly applicable to the EL class; elt() allows a symbolic description for specifying a hypothesis; and print() shows the tested hypothesis when applied to the ELT class. The repository currently has v1.7.0.9600 of the package. The CRAN version will be updated after the review process. Please find the responses below.

Response to @pchausse

Compliance with Standards

G1.6 Software should include code necessary to compare performance claims with alternative implementations in other R packages.: Other packages compute empirical likelihood tests and confidence intervals. We would like to see how the melt package compares with the others. For example, my momentfit (or gmm) package produces the same type of confidence intervals or confidence areas for linear models and the mean. The package emplik now includes regression models.

Since we do not make performance claims compared to other packages, please check the benchmark result below for reference only. We constructed EL confidence intervals for the mean with 100 observations using the melt, momentfit, and emplik packages. Default settings were used for each package.

library(microbenchmark)
library(melt)
library(emplik)
library(momentfit)
set.seed(134095)
x <- rnorm(100)
fit <- el_mean(x, 0)
out <- microbenchmark(
  ci_melt = confint(fit),
  ci_momentfit = confint(x, gelType = "EL"),
  ci_emplik = findUL(fun = function(theta, x) {
    el.test(x, mu = theta)
  }, MLE = mean(x), x = x),
  times = 1000
)
print(out, unit = "relative")
Unit: relative
         expr      min        lq      mean    median        uq       max neval cld
      ci_melt  1.00000  1.000000  1.000000  1.000000  1.000000  1.000000  1000 a  
 ci_momentfit  6.18093  5.738152  5.672995  5.549923  5.304345  5.693226  1000  b 
    ci_emplik 33.49088 29.969031 31.332941 30.790802 31.826105 69.604799  1000   c



Comments

"@srrstats {RE1.4} Some of the core distributional assumptions are documented.": The list of regularity conditions is not that long. We do see some distribution assumptions in the EL documentation where the EL method is presented for general estimating equations $E[g(X, \theta)] = 0$, but it would be important to add at least the assumption on the variance of $g(X, \theta)$. Users using only el_mean or el_sd should know what distribution assumptions are required. For these two functions, the detail section is empty. At least, we should see what the estimating equations are and then be given a link to the EL documentation for more details.

The updated documentation of el_mean() and el_sd() states the optimization problems and the estimating functions in the Details section. Assumptions on the variance are stated as well. The See Also section gives a link to the EL documentation. Links to relevant documentation have been added to all the other user-facing exported functions as well.
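
For instance, for the mean the setup is the standard one (shown here for reference rather than quoted from the package documentation): el_mean() with a hypothesized value $\theta$ uses the estimating function $g(X_i, \theta) = X_i - \theta$ and solves

$$\max_{p_1,\dots,p_n} \sum_{i=1}^n \log(np_i) \quad \text{subject to} \quad p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i (X_i - \theta) = 0,$$

with the additional assumption that the variance of $X$ is finite and nonzero.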

"@srrstats {RE4.11} logLik() method extracts the empirical log-likelihood value which can be interpreted as a measure of goodness-of-fit": It is really not clear how to interpret the value. Can we have a more detailed explanation?

logLik() only gives $-n\log n$ with $p_i = 1/n$ for all $i$. We have expanded the Details section of logLik() (line 436 of AllGenerics.R) with more details. logLik() was originally designed to offer other functionality, but its role has been superseded by logL(). The srr tag has also been moved to logL-methods.R.
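
A quick check along the lines of the reviewer's example (assuming melt is installed):

library(melt)
set.seed(112233)
x <- rnorm(40); y <- rnorm(40)
fit <- el_lm(y ~ x)
print(logLik(fit), digits = 10)     # equals 40 * log(1 / 40), i.e., -n * log(n)
print(40 * log(1 / 40), digits = 10)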

"@srrstats {RE6.0} plot method is available for an ELD object that is returned by the eld() method.": The authors may consider adding some diagnostic-type plot methods for EL (or any other) class as we have for lm objects. It is great to have a plot for eld() to detect outliers, but it would be even better I think if the plot was applied directly to EL class. For example, a Cook Distance exists separately for lm, but we can plot it directly from lm objects by selecting which=4:

plot() is now applicable to EL objects directly (line 35 of plot-methods.R). It implicitly calls eld() since the package currently provides no other diagnostic plot. More options like which can be added in a future version.
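
A minimal usage sketch of the new behavior (assuming the development version v1.7.0.9600 is installed):

library(melt)
set.seed(5649)
df <- data.frame(y = rnorm(50), x = rnorm(50))
fit <- el_lm(y ~ x, df)
plot(fit)   # equivalent to plot(eld(fit)); shows the empirical likelihood displacement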


General Review

Documentation

I think we should see a more detailed presentation of why one would need to use the melt package to perform hypothesis tests or to construct confidence intervals. For example, when does empirical likelihood start to produce confidence intervals with better coverage than lm does for linear regression? In the example section of el_lm, we see an application with iid and normally distributed variables. In this case, even when the sample size is very small, a least-squares confidence interval has exact coverage while EL does not. So why should we use the melt package?

Please check the updated README file. We have added more details in the Overview section. The new Main functions section introduces some available functions with short descriptions. More examples have also been added in the Usage section. The examples of el_lm() now rely on a new internal dataset thiamethoxam.

The optim slot

What is par? What is lambda? Since the model is just-identified, the latter should be a vector of zeroes when there are no constraints. Are they the estimates when we restrict all coefficients (except the intercept) to be zero? Can that be added to the documentation?

The updated Details sections of LM, GLM, and QGLM contain more information on the optim slot and its components. The following slots are all related to the overall test: optim, logp, logl, loglr, statistic, df, and pval.

In el_lm and el_glm (and maybe others?), it is written in the details that the test of the null $H_0 : \theta_1 = \theta_2 = \cdots = \theta_{p - 1} = 0$ is in optim and the parameter-by-parameter tests are in sigTests, but there is no test result in optim. It only includes the parameters, the lambdas, the number of iterations, and the convergence code. Also, why is it important to return the implied probabilities (the logp slot) and the lambdas for the joint test but not for the individual ones?

Besides optim, the overall test result needs to involve logp, logl, loglr, statistic, df, and pval. Since they are all derived from optim, we stated that the overall test result is in optim. In order to avoid confusion, we have removed the original sentence "The test results are returned as optim and sigTests, respectively." from el_lm() and el_glm(). The updated documentation of el_lm() and el_glm() now focuses on what the functions do rather than on specific slots of the return value that are not explained there.

We could include the full results (optim, logp, logl, loglr, statistic, df, and pval) for each significance test, but this would give the sigTests slot a nested structure with a possibly large size. Currently, sigTest() applied to SummaryLM returns the estimates, the test statistics, and the $p$-values. elt() can be used if the user wants more information on a specific test.

Additionally, we have added a new accessor method logProb() that extracts logp.

The other EL slots

The slots are documented in EL, but it is not easy to find them for other objects. For example, in the el_glm documentation, the Value is "An object of class GLM". If we click on it, we only see that it inherits from LM. If we click on that, we see some slots, but not all. It inherits from CEL, so we click on that and only the optim slot is defined. We finally get the description of all slots by clicking on EL. It is not an easy process for someone like me who works only in terminals; it requires calling help() several times. It would be helpful if the slots were defined for all objects that inherit from EL.

Also, it is not always clear what the slots are. In EL, we see "A numeric vector of the log probabilities obtained from empirical likelihood" for the logp slot. But which probabilities? I figured out for el_lm that it is for the restricted model when all parameters are 0 except the intercept. It should be specified. It is the same for logl, loglr, etc. This is something we see across the different help files. It is another reason the author should define all slots for all functions that return an EL object.

The updated documentation (in AllClasses.R) lists all slots for all classes with some class-specific details.

The Empirical Likelihood method

I would expect a little more detail about the EL method. It is well presented in the EL documentation for general estimating equations $g(X, \theta)$, but I think it would be helpful for the method to also be explained in el_lm, el_glm, el_sd, el_mean and even el_eval. The EL method does not need to be repeated, but at least we should see the specific estimating equation used for each model and a link to EL for the details. For example, it is not clear to me which estimating equation is used for the standard deviation or for generalized linear models.

Each model-building function (including el_eval()) now states estimating equations and the associated EL problem in the Details section.

el_eval() is a function that can be used to compute EL when the user has a specific estimating function not covered by the package. For a just-identified estimating function $g$ with a specified parameter $\theta$, the user needs to prepare the input matrix g computed with the data. el_eval() only solves the optimization problem and returns the result. Since the main functions, such as elt() and elmt(), cannot be used with an arbitrary estimating equation, the return value is a list with the same components as EL.
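
A sketch of that workflow for the mean (assuming melt is installed; the component names follow those mentioned above):

library(melt)
set.seed(24601)
x <- rnorm(100)
theta <- 0                       # hypothesized mean
g <- as.matrix(x - theta)        # n x 1 matrix of estimating function values g(X_i, theta)
fit <- el_eval(g)                # returns a list, not an EL object
fit$loglr                        # empirical log-likelihood ratio at theta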

Other comments

As a suggestion, the package could be greatly improved by adding more general nonlinear regression models. The model in el_glm is already nonlinear, so it should not be too difficult to include nonlinear regression as in nls, or even more general cases. With a method to estimate general nonlinear models, it becomes straightforward to test nonlinear restrictions, at least those that can be written as $H_0: \theta_i = f(\theta_{-i})$. This should not be too difficult to implement because it only requires replacing $\theta_i$ by $f(\theta_{-i})$ in the estimating equation function $g(X, \theta)$.

Although we agree that nls is probably the most natural way to extend the package, development is currently focused on adding families and link functions for el_glm(). The next release (v1.8.0) will ship with an identity link option for the quasipoisson family, which was recently added in the C++ files. So it will take several more releases for el_glm() to provide most of the options on our roadmap.

Based on your comments above, nls methods could be implemented by improving el_eval() or making new functions such as el_nls() or el_g(). These functions should take a formula or function argument for the estimating functions and an additional argument for the gradient functions. Then, in principle, we could construct confidence intervals or perform hypothesis testing. The main concern would be performance. The package relies exclusively on the Eigen C++ library and OpenMP directives for parallel computing, and there is no simple thread-safe way for these compiled routines to evaluate user-defined R functions. We will revisit this issue when the development of el_glm() is finished.
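
As an illustration of the el_eval() route for a nonlinear mean function, evaluated at a fixed hypothesized parameter value (a sketch, not an official interface):

library(melt)
set.seed(314)
x <- runif(100)
y <- 2 * exp(-1.5 * x) + rnorm(100, sd = 0.1)
theta <- c(2, -1.5)                          # hypothesized (a, b) in y = a * exp(b * x) + e
mu <- theta[1] * exp(theta[2] * x)
# Least-squares estimating equations: (y - mu) * d mu / d theta
g <- cbind((y - mu) * exp(theta[2] * x),
           (y - mu) * theta[1] * x * exp(theta[2] * x))
el_eval(g)$loglr                             # empirical log-likelihood ratio at theta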

Algorithms

The confint method

It seems that all methods for the EL, QGLM, and SD classes are
almost identical and could be reduced to a single method applied to a
union class. I can see some benefits from avoiding repeating the same
code in several methods, especially if changes need to be applied to
them.

Why is getMethodEL(object) used to get the type of method for the
EL and QGLM classes but not for the SD class? Is it because the
getMethodEL method does not exist for that class? For the latter, the
type is explicitly added (see line 214 of
confint-methods.R).

confint() methods for QGLM and SD classes have been removed. One method for EL class can be applied to all other classes using a new internal generic getEstimates() (line 663 of AllGenerics.R). We have applied the same procedure to confreg() and elmt() methods and reduced the amount of duplicate code.

The elt method

It would be nice to see the hypothesis being tested when the result
is printed (see, for example, the linearHypothesis function from the
car package).

The print() method now prints the tested hypothesis. A new internal function describe_hypothesis() in describe_hypothesis.R translates the hypothesis matrix into a character vector for printing. Additionally, elt() now accepts a character vector for the argument lhs, allowing a symbolic description of a hypothesis. The updated documentation of elt.R shows different ways to specify a hypothesis with new examples.
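
A usage sketch of the two ways to specify the same hypothesis (illustrative only; the exact symbolic syntax follows the examples in elt.R):

library(melt)
set.seed(5649)
df2 <- data.frame(y = rnorm(50), x = rnorm(50), z = rnorm(50))
fit2 <- el_lm(y ~ x + z, df2)
# Matrix form: jointly test H0: x = 0 and z = 0
elt(fit2, lhs = rbind(c(0, 1, 0), c(0, 0, 1)))
# Symbolic form of the same hypothesis
elt(fit2, lhs = c("x", "z"))
# print() on the returned ELT object now restates the tested hypothesis.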

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Please check the updated DESCRIPTION file. We have also added you and @awstringer1 to the Acknowledgments section of the PDF vignette in vignettes/article.pdf.

@Paula-Moraga

Many thanks @markean for working on the comments so quickly! @awstringer1 and @pchausse, can you please take a look and say whether your comments were well addressed or you have additional questions? Thanks!

@awstringer1

Thanks to @markean for the response. I was waiting until both reviews and responses were in, to avoid redundancy. All my major comments have been addressed.

My only minor comment is about the optimization. I don't think it is true that "the objective function is strictly concave without any constraint. Thus, the Newton–Raphson method converges quadratically to the unique solution." – see Nocedal and Wright's Numerical Optimization, Theorem 3.5, page 46. There are non-trivial initial conditions, namely that the starting value has to be "close enough" to the optimum, or else it may diverge. If Newton with step-halving has worked so far, then that's good enough for me (and perhaps there is some additional theory here about why it works in this context that I do not know), but if it ever becomes a practical problem in the future, the authors could consider trust-region optimization. See Chapter 4 in that same book, or the trust or trustOptim R packages. I have found this very useful in cases where Newton was not working satisfactorily.
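
For concreteness, a minimal toy illustration of the trust-region interface in the trust package (a simple quadratic objective, unrelated to melt's internals):

library(trust)
# The objective function must return its value, gradient, and Hessian at x.
objfun <- function(x) {
  list(value    = sum((x - c(1, 2))^2),
       gradient = 2 * (x - c(1, 2)),
       hessian  = 2 * diag(2))
}
trust(objfun, parinit = c(0, 0), rinit = 1, rmax = 5, parscale = c(1, 1))$argument  # converges to c(1, 2)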

@pchausse

Thanks @markean for your response. All my comments have also been addressed.

@markean
Member Author

markean commented Sep 13, 2022

Thanks again, @pchausse and @awstringer1.

As for the optimization, it can be seen from Art Owen's Empirical Likelihood (2001, page 63) that the objective function is twice continuously differentiable with a Lipschitz continuous Hessian when a pseudo-logarithm function is used. The initial $\lambda$ is set to the zero vector. This is related to the theoretical result that $\lambda(\theta_0) = o_p(n^{-1/2})$, where $\theta_0$ is the true value. Although there have been some suggestions for using a data-dependent initial value for $\lambda$ (e.g., Hjort, McKeague, and van Keilegom 2009, page 1104), we have not found non-convergence issues during the unit tests when the convex hull constraint is satisfied. We will keep track of the issue and consider other alternatives, including the trust-region procedure, as we add more estimating functions and methods.
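
For reference, the pseudo-logarithm described in Owen (2001) replaces $\log z$ in the dual objective with a quadratic extension below a threshold $\varepsilon$ (typically $1/n$), so the objective is defined and smooth for all $\lambda$:

$$\log_\star(z) = \begin{cases} \log z, & z \ge \varepsilon, \\ \log\varepsilon - \dfrac{3}{2} + \dfrac{2z}{\varepsilon} - \dfrac{z^2}{2\varepsilon^2}, & z < \varepsilon, \end{cases}$$

which matches $\log z$ in value and in its first two derivatives at $z = \varepsilon$.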

@markean
Member Author

markean commented Sep 19, 2022

Hi @Paula-Moraga. Can we proceed with this at your discretion if the reviewers have no further comments?

@mpadge mpadge added the stats label Sep 20, 2022
@Paula-Moraga

Yes! Thanks so much @markean @awstringer1 @pchausse for your time and work. Very happy to approve the melt package!

@Paula-Moraga

@ropensci-review-bot approve melt

@ropensci-review-bot
Collaborator

Approved! Thanks @markean for submitting and @pchausse, @awstringer1 for your reviews! 😁

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account.
    This invitation will expire after one week. If that happens, write a comment @ropensci-review-bot invite me to ropensci/<package-name>, which will re-send the invitation.
  • After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
  • If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
  • Fix any links in badges for CI and coverage to point to the new repository URL.
  • Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
  • We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative; see https://docs.ropensci.org/codemetar/ for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the 3rd section, which covers guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer by filling out a short form.

@markean
Member Author

markean commented Sep 20, 2022

Thank you @Paula-Moraga!

@markean
Member Author

markean commented Sep 20, 2022

@ropensci-review-bot finalize transfer of melt

@ropensci-review-bot
Collaborator

Transfer completed.
The melt team is now owner of the repository and the author has been invited to the team
