During my review, I tested the package's functionality on a few different simulated conditional densities (gamma/beta, etc.) with different estimator hyperparameter settings. All outputs appeared reasonable -- great job implementing a relatively complex statistical procedure in R. However, I do have one somewhat major criticism regarding the functionality of the software: it is relatively difficult to obtain conditional density predictions for each unit of a given dataset.
To elaborate, based on the code examples, it seems as though the API is oriented towards visualization of an entire conditional density for a fixed $x$ value. For instance, consider the following simulation setup:
It is very easy to fit the model across a small grid of $y$ values with just two lines of code:
The grid-based approach is nice for visualization if estimating the entire density curve is the end goal. However, many practical applications require conditional density values for each unit of a dataset as an input to some downstream analysis. For example, many counterfactual mean estimators for a continuous treatment in causal inference first estimate the conditional density of the treatment given covariates for each unit, and then reweight an outcome based on that density (for example, Diaz and van der Laan (2012), Haneuse and Rotnitzky (2013) or even something like Schindl, Shen, and Kennedy (2024)).
Based on the API, one would think this could be accomplished by replacing `y_grid` with `y_data` and `x` with `x_data`. However, doing so runs into multiple issues. First, `x` does not seem to be able to take in a vector input -- I get this result:
```
> model1 = lpcde::lpcde(x_data=x_data, y_data=y_data, y_grid=y_grid, x=matrix(c(0.3, 0.6), ncol = 2), bw = 0.5)
> Error in s_mat %*% e_vec : non-conformable arguments
In addition: Warning messages:
1: In sweep(x_data, 2, x) :
  STATS is longer than the extent of 'dim(x)[MARGIN]'
2: In sweep(x_sorted, 2, x) :
  STATS is longer than the extent of 'dim(x)[MARGIN]'
```
Maybe I am providing the input incorrectly, but either way, it is not obvious how to compute estimates for multiple `x` values. Second, consider the following line, which takes about 30 seconds to run on my machine:
It yields the exact same predictions as the following, which runs in under 2 seconds:
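```r
# Evaluate the density at one y value per call instead of passing the whole vector
preds = numeric(100)  # preallocate a numeric (not logical) vector
for (i in 1:100) {
  model2 = lpcde::lpcde(x_data=x_data, y_data=y_data, y_grid=y_data[i], x=0.5, bw = 0.5)
  preds[i] = model2$Estimate[3] # or extract whatever statistics you like...
}
```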
which leads me to believe something redundant is going on under the hood that I don't quite understand. Third, there seems to be no way to apply a model, once it is fit, to obtain estimates on a new set of data. In any case, I would recommend implementing something like a `predict` method to perform inference on an arbitrary dataset, in order to separate these two use cases and allow obtaining estimates on new data.
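To make the request concrete, here is a rough sketch of the kind of interface I have in mind. `predict_density` is a hypothetical helper, not something `lpcde` provides. It assumes a univariate `x`, that `lpcde::lpcde` accepts a scalar `x` and a single-value `y_grid` (as in the loop above), and that the third entry of `$Estimate` is the point estimate. It simply calls the estimator once per unit, so it is a workaround rather than a proper implementation:

```r
# Hypothetical helper (not part of lpcde): estimate f(y_i | x_i) for each unit i.
# Assumptions: x is univariate, and `fit` has the same interface as lpcde::lpcde
# and returns a list whose $Estimate holds the point estimate in position 3.
predict_density <- function(x_train, y_train, x_new, y_new, bw,
                            fit = lpcde::lpcde) {
  stopifnot(length(x_new) == length(y_new))
  vapply(seq_along(y_new), function(i) {
    # One call per unit: density of y at y_new[i], conditional on x = x_new[i]
    m <- fit(x_data = x_train, y_data = y_train,
             y_grid = y_new[i], x = x_new[i], bw = bw)
    m$Estimate[3]
  }, numeric(1))
}
```

With this, something like `predict_density(x_data, y_data, x_data, y_data, bw = 0.5)` would return one estimate per unit, and passing held-out vectors as `x_new` and `y_new` would cover the new-data case. A built-in method could presumably avoid repeating the shared setup work on every call.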
Raising as part of JOSS review openjournals/joss-reviews#7241