missing data I/O and imputation / partial data structures #36
Comments
Comment by kkmann Any comments/progress on this one? Personally, the lack of an easy way to use data with arbitrary missingness patterns is the single most important thing holding me back from using Stan as my first-choice analysis tool. After all, Bayesian analysis is most useful in situations with a lot of missing data, is it not?
Comment by ariddell I'm about to start a model which has a rather large amount of missingness. I was going to take the approach bob suggested (without knowing, until now, that he suggested it):
Comment by robertgrant That's the approach I've been taking. Honestly, I prefer Stan's explicit specification of what to do with the missing/coarse data to the BUGS black box. Even with a fill_missing_data function, I'd probably carry on writing it out, and I don't know how many people are discouraged by it. If anything, I'd say it's because Stan devs tend to be quite hard on Stan when giving presentations, saying it can't do missing data. Here's an excerpt from a real-life coarsened example where student marks are either a percentage (Nmark), or just recorded as failed (N0) or capped at 40% on resubmission (N40):
So I like Bob's database form because it would still make the user slow down a little and think about what's going on.
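The excerpt referred to above did not survive in this copy of the thread. As a rough illustration of what a coarsened-data likelihood of this kind can look like, here is a minimal Python sketch. It is not the original model: the normal distribution for marks, the 40% pass mark, and the name coarsened_log_lik are all assumptions made for the example. Exactly observed marks contribute a density term, marks recorded only as failed contribute the probability of falling below the pass mark, and capped marks contribute the probability of falling at or above it.

```python
import math
from statistics import NormalDist

PASS_MARK = 40.0  # assumption: a mark below 40% counts as a fail


def coarsened_log_lik(marks, n_failed, n_capped, mu, sigma):
    """Log-likelihood for marks that are partly observed exactly and partly coarsened.

    marks     -- exactly observed percentages (the Nmark cases)
    n_failed  -- number of marks recorded only as "failed" (the N0 cases)
    n_capped  -- number of marks capped at the pass mark (the N40 cases)
    mu, sigma -- parameters of the assumed normal distribution of marks
    """
    dist = NormalDist(mu, sigma)
    # Exactly observed marks contribute the usual log density.
    log_lik = sum(math.log(dist.pdf(y)) for y in marks)
    # A failed mark is only known to lie below the pass mark.
    p_fail = dist.cdf(PASS_MARK)
    log_lik += n_failed * math.log(p_fail)
    # A capped mark is only known to lie at or above the pass mark.
    log_lik += n_capped * math.log(1.0 - p_fail)
    return log_lik
```

Writing the three contributions out by hand like this is the "explicit specification" style described above: each kind of coarsening gets its own visible term rather than being handled behind the scenes.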
Comment by betanalpha Bayesian inference gives you a fantastic framework for modeling missingness.
On Feb 19, 2016, at 12:24 PM, kkmann notifications@github.com wrote:
Comment by kkmann Yeah, fully agreed. But there is also a canonical way of treating missing-at-random variables in a Bayesian model (one of the great advantages if you ask me). So why not combine the best of both worlds by treating missing data as missing at random by default if not specified explicitly? I think of a large clinical database with few patients and lots of variables. Often missing at random is not too far from being a realistic assumption, and hand-coding this in Stan right now is really a downer.
Comment by betanalpha No, there is not a canonical way of treating missingness! Missing at random isn’t even
On Feb 19, 2016, at 1:37 PM, kkmann notifications@github.com wrote:
Comment by syclik +1. I was in the middle of writing the same thing.
Comment by kkmann So why is MAR + prior for each independent variable insufficient? It specifies a full and consistent joint distribution (given the rest of the model is not inconsistent). Anyway, any kind of tool to handle missing data more conveniently inside Stan would be much appreciated from my side!
Comment by betanalpha
Comment by andrewgelman We should have some missing-data imputation in Stan. By which I mean, not an “mi” function, at least not right away, but some examples in the manual that people can then imitate for their own work.
Comment by bob-carpenter Chapter 8. Missing Data & Partially Known Parameters.
It's not that Stan can't do it, it's that it's syntactically
Suppose you have the following BUGS model (with dot_product
sigma ~ inv_gamma(1, 1)
If some of the predictors x[n, k] are missing, the model
But, if you add this:
for (n in 1:N)
then you're good to go in BUGS and the values for missing x[n, k]
and of course give mu and tau priors. You can get exactly the same behavior in Stan (as explained in
Comment by kkmann +1, the core of the problem seems to be that Stan distinguishes between parameters and data. From a theoretical perspective there is no need to do so under the Bayesian paradigm, and it would be nice to have a sleeker interface to get this JAGS-like behaviour for those who want it and know what they are doing.
Comment by bob-carpenter Yes, it'd be nice to have such an interface. We'd
Then we need to
And of course, we need
We're always open to volunteers. It's not on any of our existing
Comment by syclik With the compound declare and define available, @mbrubake's original suggestion should be do-able. We'd need two functions and we could do this:
I guess we could have done that before too. Questions: is
Issue by bob-carpenter
Monday May 12, 2014 at 09:00 GMT
Originally opened as stan-dev/stan#646
From Marcus:
Roughly speaking, I was thinking of being able to write models like this:
Here find_missing_values looks for NaN or something like that. If we didn't want to make it such a specific value, we could call it something like find_nans so that users are clear about what's going on here. Similar functions for the case where observed values are passed simply as index/value pairs would also be straightforward. [We also need to] fix the dump file parsing to handle nan/inf.
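To make the idea concrete outside of Stan, here is a minimal Python sketch of the two pieces such a proposal seems to need: locating the NaN placeholders, and writing values back into those positions. The name find_nans comes from the text above; fill_missing and both signatures are hypothetical, chosen only for this illustration. In a Stan model the values written back would be parameters rather than fixed numbers.

```python
import math


def find_nans(x):
    """Return the (row, col) positions of NaN entries in a list-of-lists matrix.

    Positions are 0-based and listed in row-major order.
    """
    return [
        (n, k)
        for n, row in enumerate(x)
        for k, value in enumerate(row)
        if math.isnan(value)
    ]


def fill_missing(x, positions, values):
    """Return a copy of x with values[i] written at positions[i].

    Hypothetical helper: it stands in for combining the observed data
    with one unknown per missing entry.
    """
    if len(positions) != len(values):
        raise ValueError("need exactly one value per missing position")
    filled = [list(row) for row in x]  # copy so the input is left untouched
    for (n, k), value in zip(positions, values):
        filled[n][k] = value
    return filled


x = [[1.0, float("nan")], [float("nan"), 4.0]]
missing = find_nans(x)                       # positions of the two NaNs
x_full = fill_missing(x, missing, [2.5, 3.5])
```

Keeping the search and the fill as separate steps means the rest of the model only ever sees the completed matrix, whichever entries happened to be missing.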
My followup:
One attractive feature of this proposal is that we don't have to change the model for different models of missingness by column (as long as the constraints don't vary).
The one thing I was waffling back and forth on in my head was the issue of whether to take a full vector/matrix of predictors with NA-like placeholders or whether to have just the existing data passed in in "database" form:
There are tighter data structures for anything but the sparsest matrices, but this representation seems easiest --- it's what you need for a "partial" matrix data structure.
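The "database" form can be sketched in Python as a conversion from a full matrix with NaN placeholders to parallel lists of row index, column index, and value. This is only an illustration under assumptions: the function name to_database_form and the field names N, K, N_obs, row, col, and value are hypothetical, and the 1-based indices are chosen to match Stan's indexing.

```python
import math


def to_database_form(x):
    """Convert a matrix with NaN placeholders to index/value triples.

    Only observed entries are kept; indices are 1-based.
    """
    rows, cols, values = [], [], []
    for n, row in enumerate(x):
        for k, value in enumerate(row):
            if not math.isnan(value):
                rows.append(n + 1)
                cols.append(k + 1)
                values.append(value)
    return {
        "N": len(x),                   # number of rows in the full matrix
        "K": len(x[0]) if x else 0,    # number of columns in the full matrix
        "N_obs": len(values),          # number of observed entries
        "row": rows,
        "col": cols,
        "value": values,
    }


data = to_database_form([[1.0, float("nan")], [float("nan"), 4.0]])
```

Every entry not listed in the triples is implicitly missing, so no NaN-like placeholder ever has to be passed to the model.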