Background
The example showcases two amazing abilities of PyMC-BART:

- the ability to have non-normal error distributions for BART RVs, like the pm.Gamma distribution that is used, and
- the ability to model functions other than the mean, like the modeling of non-constant variance.
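To make "modeling non-constant variance" concrete: the point is that the spread of the data, not just its center, changes with the covariate. A NumPy-only sketch (the linear mean and standard-deviation forms here are my own made-up example, not the data from the linked notebook):

```python
import numpy as np

rng = np.random.default_rng(42)

# Covariate with a mean AND a standard deviation that both depend on x
# (the linear forms are assumptions, for illustration only)
x = np.linspace(0.0, 1.0, 2000)
mu = 10.0 + 5.0 * x          # mean function
sigma = 1.0 + 3.0 * x        # non-constant standard deviation

y = rng.normal(mu, sigma)

# The spread of y grows with x; a single constant noise term cannot
# capture this, which is what the second BART output models
sd_low = y[x < 0.5].std()
sd_high = y[x >= 0.5].std()
```

A homoscedastic model would be forced to average these two regimes into one noise level, over-covering on the left and under-covering on the right.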
While the first ability is easily understood and a small step from existing work, the latter ability is both novel and amazing. I would love to use this ability in my work; I just think the implementation feels a little wonky, for lack of a better word. Additionally, the documentation in pymc-bart's paper [Quiroga et al., 2022] does not completely explain how this works.
The Issue
It is my understanding that when growing trees for computing estimated variance, every proposed tree has leaf values drawn from $N(\mu_{pred},\epsilon^2)$, where $\mu_{pred}$ is computed as the mean of the current sum of trees divided by the number of trees $m$. The problem I see is that $\mu_{pred}$ is (at least initially) based on $Y$ values that seem inappropriate as initial values; the $Y$ values in the example are initial guesses at "sales" and are likely to be very BAD initial guesses for "standard deviation of sales".
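Here is a NumPy sketch of my reading of that proposal mechanism (this mirrors my understanding, not pymc-bart's actual code; the names `propose_leaf_value` and `sum_of_trees` are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_leaf_value(sum_of_trees, m, eps, rng):
    """Draw a proposed leaf value from N(mu_pred, eps**2), where mu_pred
    is the mean of the current sum-of-trees prediction divided by the
    number of trees m. Sketch of my reading of the sampler, not its code."""
    mu_pred = sum_of_trees.mean() / m
    return rng.normal(mu_pred, eps)

# If Y (and hence the initial sum of trees) lives on the scale of "sales",
# say around 100, then proposals for a *standard deviation* tree also start
# near 100 / m -- likely a poor starting point for a std-dev function.
sum_of_trees = np.full(200, 100.0)   # initial prediction, centered on Y's scale
leaf = propose_leaf_value(sum_of_trees, m=50, eps=0.1, rng=rng)
```

With these (assumed) numbers, every tree's initial leaf value sits near 100 / 50 = 2 regardless of what the variance function actually looks like.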
For the example, these initial guesses of "sales" are much higher than anything one would guess for the standard deviation of sales. Hence, in this picture:
it is not surprising that the 94% HDI is too wide. We should expect around 12 observations falling outside of the 94% HDI band, but in reality only 2 or 3 observations fall outside of the band. I am guessing the estimate of standard deviation is systematically too high because of the initial conditions.
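The "around 12" figure is just 6% of the observations: a well-calibrated 94% HDI should miss roughly 6% of the points, and seeing only 2 or 3 outside would be very unlikely if coverage were correct. A quick check (the n = 200 sample size is my assumption, matching the example's data):

```python
from scipy.stats import binom

n = 200            # number of observations (assumed, matching the example)
p_outside = 0.06   # a calibrated 94% HDI should exclude ~6% of points

expected = n * p_outside                  # ~12 points outside the band
p_at_most_3 = binom.cdf(3, n, p_outside)  # P(3 or fewer outside) if calibrated
```

That tail probability is well under 1%, so observing only 2-3 exceedances is strong evidence the band is systematically too wide.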
Thoughts on Implementation
So, with all of the above, I am requesting:

- Allow a PyMC RV prior (or multiple priors when size > 1) to be specified on leaf-node values, and/or
- Better document the existing functionality on how leaf-node values are computed, so at least the mechanism is more transparent.
Thanks for all the hard work; plugging BART models in as components of larger probabilistic programs is very much a winning idea and should dominate applied workflows where uncertainty quantification is important. I would love to see this cleaned up a little bit, and I am happy to help with documentation or code changes, but I need more direction on the math/sampling side of things. Thanks again.
Thanks for the comments and interest in using and improving PyMC-BART, your feedback is very useful.
I have been thinking about this for some time, and I am still not sure about a good solution. Having to pass Y is already a source of confusion, and as you mention, the mean and std of Y are not always good guesses for the initial values of the mean function and the standard deviation of the proposal distribution at the leaf nodes. Having said that, I remember the results being relatively robust to those initial guesses.
Maybe a not-too-difficult-to-implement alternative could be to replace the "Y" argument with two arguments, init_mean and init_sd, which can be scalars or vectors.
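To make that idea concrete, the two-argument alternative could look something like this. This is a hypothetical sketch only: `init_mean`, `init_sd`, and the helper below do not exist in pymc-bart.

```python
import numpy as np

def initial_leaf_values(m, init_mean, init_sd=None):
    """Hypothetical helper: derive per-tree initial leaf values from
    user-supplied guesses instead of from Y. init_mean / init_sd may be
    scalars or vectors (one entry per BART output when size > 1)."""
    init_mean = np.atleast_1d(np.asarray(init_mean, dtype=float))
    leaf_init = init_mean / m            # each of the m trees starts at mean / m
    if init_sd is not None:
        init_sd = np.atleast_1d(np.asarray(init_sd, dtype=float))
        return leaf_init, init_sd        # sd would set the proposal scale
    return leaf_init, None

# e.g. for the sales example: mean function near 100, std-dev function near 2
leaf_init, sd = initial_leaf_values(m=50, init_mean=[100.0, 2.0],
                                    init_sd=[10.0, 1.0])
```

The benefit is that the std-dev output would start near 2 / 50 rather than inheriting the scale of Y.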
I also agree that we need to improve the documentation. Any help you can offer will be more than welcome.
In the meantime, and if you have time, you can explore passing a made-up vector instead of Y, something like np.log(pz.Gamma(mu=3, sigma=1.5).rvs(200)) instead of np.log(Y).
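If `preliz` (`pz`) is not installed, the same made-up vector can be built with SciPy instead (my swap-in: for a Gamma with mean mu and standard deviation sigma, the shape is a = (mu/sigma)**2 and the scale is sigma**2/mu):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1234)

# Convert the (mu, sigma) parameterization to scipy's (shape, scale)
mu, sigma = 3.0, 1.5
a = (mu / sigma) ** 2      # shape = 4
scale = sigma ** 2 / mu    # scale = 0.75, so mean = a * scale = 3

# Made-up stand-in for Y, analogous to np.log(pz.Gamma(mu=3, sigma=1.5).rvs(200))
fake_Y = np.log(gamma.rvs(a, scale=scale, size=200, random_state=rng))
```

`fake_Y` can then be passed where the example passes `np.log(Y)` to see how sensitive the fit is to those initial values.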
Cool. Thanks for the reply. I will experiment with passing other values of Y to see what happens. I will also see if I can make a contribution to the documentation. Will keep you posted!