Background
The example showcases two amazing abilities of PyMC-BART:

- the ability to have non-normal error distributions for BART RVs, like the pm.Gamma distribution that is used, and
- the ability to model functions other than the mean, like the modeling of non-constant variance.
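To make "modeling non-constant variance" concrete: the point is that the spread of the data, not just its center, changes with the covariate. A NumPy-only sketch (the linear mean and standard-deviation forms here are my own made-up example, not the data from the linked notebook):

```python
import numpy as np

rng = np.random.default_rng(42)

# Covariate with a mean AND a standard deviation that both depend on x
# (the linear forms are assumptions, for illustration only)
x = np.linspace(0.0, 1.0, 2000)
mu = 10.0 + 5.0 * x          # mean function
sigma = 1.0 + 3.0 * x        # non-constant standard deviation

y = rng.normal(mu, sigma)

# The spread of y grows with x; a single constant noise term cannot
# capture this, which is what the second BART output models
sd_low = y[x < 0.5].std()
sd_high = y[x >= 0.5].std()
```

A homoscedastic model would be forced to average these two regimes into one noise level, over-covering on the left and under-covering on the right.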
While the first ability is easily understood and a small step from existing work, the latter ability is both novel and amazing. I would love to use this ability in my work; I just think the implementation feels a little wonky, for lack of a better word. Additionally, the documentation in pymc-bart's paper [Quiroga et al., 2022] does not completely explain how this works.
The Issue
It is my understanding that when growing trees for computing estimated variance, every proposed tree has leaf values drawn from $N(\mu_{pred},\epsilon^2)$, where $\mu_{pred}$ is computed as the mean of the current sum of trees divided by the number of trees $m$. The problem I see is that $\mu_{pred}$ is (at least initially) based on $Y$ values that seem inappropriate as initial values; the $Y$ values in the example are initial guesses at "sales" and are likely to be very BAD initial guesses for "standard deviation of sales".
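Here is a NumPy sketch of my reading of that proposal mechanism (this mirrors my understanding, not pymc-bart's actual code; the names `propose_leaf_value` and `sum_of_trees` are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_leaf_value(sum_of_trees, m, eps, rng):
    """Draw a proposed leaf value from N(mu_pred, eps**2), where mu_pred
    is the mean of the current sum-of-trees prediction divided by the
    number of trees m. Sketch of my reading of the sampler, not its code."""
    mu_pred = sum_of_trees.mean() / m
    return rng.normal(mu_pred, eps)

# If Y (and hence the initial sum of trees) lives on the scale of "sales",
# say around 100, then proposals for a *standard deviation* tree also start
# near 100 / m -- likely a poor starting point for a std-dev function.
sum_of_trees = np.full(200, 100.0)   # initial prediction, centered on Y's scale
leaf = propose_leaf_value(sum_of_trees, m=50, eps=0.1, rng=rng)
```

With these (assumed) numbers, every tree's initial leaf value sits near 100 / 50 = 2 regardless of what the variance function actually looks like.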
For the example, these initial guesses of "sales" are much higher than anything one would guess for the standard deviation of sales. Hence, in this picture:
it is not surprising that the 94% HDI is too wide. We should expect around 12 observations falling outside of the 94% HDI band, but in reality only 2 or 3 observations fall outside of the band. I am guessing the estimate of standard deviation is systematically too high because of the initial conditions.
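The "around 12" figure is just 6% of the observations: a well-calibrated 94% HDI should miss roughly 6% of the points, and seeing only 2 or 3 outside would be very unlikely if coverage were correct. A quick check (the n = 200 sample size is my assumption, matching the example's data):

```python
from scipy.stats import binom

n = 200            # number of observations (assumed, matching the example)
p_outside = 0.06   # a calibrated 94% HDI should exclude ~6% of points

expected = n * p_outside                  # ~12 points outside the band
p_at_most_3 = binom.cdf(3, n, p_outside)  # P(3 or fewer outside) if calibrated
```

That tail probability is well under 1%, so observing only 2-3 exceedances is strong evidence the band is systematically too wide.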
Thoughts on Implementation
So, with all of the above, I am requesting:

- Allow a PyMC RV prior (or multiple priors when size > 1) to be specified on leaf-node values, and/or
- Better document the existing functionality on how leaf-node values are computed, so at least the mechanism is more transparent.
Thanks for all the hard work; plugging BART models in as components of larger probabilistic programs is very much a winning idea and should dominate applied workflows where uncertainty quantification is important. I would love to see this cleaned up a little bit, and I am happy to help with documentation or code changes, but I need more direction on the math/sampling side of things. Thanks again.
Thanks for the comments and interest in using and improving PyMC-BART, your feedback is very useful.
I have been thinking about this for some time, and I am still not sure about a good solution. Having to pass Y is already a source of confusion, and as you mention, the mean and std of Y are not always good guesses for the initial values of the mean function and the standard deviation of the proposal distribution at the leaf nodes. Having said that, I remember the results being relatively robust to those initial guesses.
Maybe a not-too-difficult-to-implement alternative could be to replace the "Y" argument with two arguments, init_mean and init_sd, which can be scalars or vectors.
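To make that idea concrete, the two-argument alternative could look something like this. This is a hypothetical sketch only: `init_mean`, `init_sd`, and the helper below do not exist in pymc-bart.

```python
import numpy as np

def initial_leaf_values(m, init_mean, init_sd=None):
    """Hypothetical helper: derive per-tree initial leaf values from
    user-supplied guesses instead of from Y. init_mean / init_sd may be
    scalars or vectors (one entry per BART output when size > 1)."""
    init_mean = np.atleast_1d(np.asarray(init_mean, dtype=float))
    leaf_init = init_mean / m            # each of the m trees starts at mean / m
    if init_sd is not None:
        init_sd = np.atleast_1d(np.asarray(init_sd, dtype=float))
        return leaf_init, init_sd        # sd would set the proposal scale
    return leaf_init, None

# e.g. for the sales example: mean function near 100, std-dev function near 2
leaf_init, sd = initial_leaf_values(m=50, init_mean=[100.0, 2.0],
                                    init_sd=[10.0, 1.0])
```

The benefit is that the std-dev output would start near 2 / 50 rather than inheriting the scale of Y.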
I also agree that we need to improve the documentation. Any help you can offer will be more than welcome.
In the meantime, and if you have time, you can explore passing a made-up vector instead of Y, something like np.log(pz.Gamma(mu=3, sigma=1.5).rvs(200)) instead of np.log(Y).
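If `preliz` (`pz`) is not installed, the same made-up vector can be built with SciPy instead (my swap-in: for a Gamma with mean mu and standard deviation sigma, the shape is a = (mu/sigma)**2 and the scale is sigma**2/mu):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1234)

# Convert the (mu, sigma) parameterization to scipy's (shape, scale)
mu, sigma = 3.0, 1.5
a = (mu / sigma) ** 2      # shape = 4
scale = sigma ** 2 / mu    # scale = 0.75, so mean = a * scale = 3

# Made-up stand-in for Y, analogous to np.log(pz.Gamma(mu=3, sigma=1.5).rvs(200))
fake_Y = np.log(gamma.rvs(a, scale=scale, size=200, random_state=rng))
```

`fake_Y` can then be passed where the example passes `np.log(Y)` to see how sensitive the fit is to those initial values.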
Cool. Thanks for the reply. I will experiment with passing other values of Y to see what happens. I will also see if I can make a contribution to the documentation. Will keep you posted!