
missing data I/O and imputation / partial data structures #36

Open
alashworth opened this issue Mar 12, 2019 · 14 comments

Comments


Issue by bob-carpenter
Monday May 12, 2014 at 09:00 GMT
Originally opened as stan-dev/stan#646


From Marcus:

Roughly speaking, I was thinking of being able to write models like this:

data {
  matrix[N, M] Data;
}
transformed data {
  int missing_inds[];
  missing_inds <- find_missing_values(Data);
}
parameters {
  real missing_Data[length(missing_inds)];
}
transformed parameters {
  matrix[N, M] full_Data;

  full_Data <- fill_missing_values(Data, missing_inds, missing_Data);
}
model {
  ...
}

Here find_missing_values looks for NaN or something like that. If we didn't want to tie it to such a specific value, we could call it something like find_nans so that users are clear about what's going on here. Similar functions for the case where observed values are passed simply as index/value pairs would also be straightforward.

[We also need to] fix the dump file parsing to handle nan/inf.

My followup:

One attractive feature of this proposal is that we don't have to change the model for different models of missingness by column (as long as the constraints don't vary).

The one thing I was waffling back and forth on was whether to take a full vector/matrix of predictors with NA-like placeholders, or whether to pass in just the existing data in "database" form:

  int n_rows;
  int n_cols;
  int n_observed;
  int observed_row[n_observed];
  int observed_column[n_observed];
  real observed_value[n_observed];

There are tighter data structures for anything but the sparsest matrices, but this representation seems easiest --- it's what you need for a "partial" matrix data structure.
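As an illustration of how the "database" form would be produced outside Stan (this is a sketch, not part of any Stan interface; the helper name to_long_form is hypothetical):

```python
import math

def to_long_form(data):
    """Split a 2D list with NaN placeholders into the long 'database' form:
    parallel lists of 1-based row index, column index, and observed value."""
    observed_row, observed_column, observed_value = [], [], []
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            if not math.isnan(x):
                observed_row.append(i + 1)      # Stan indexes from 1
                observed_column.append(j + 1)
                observed_value.append(x)
    return observed_row, observed_column, observed_value

# A 2x3 matrix with two missing entries:
data = [[1.0, float("nan"), 3.0],
        [float("nan"), 5.0, 6.0]]
rows, cols, vals = to_long_form(data)
# rows == [1, 1, 2, 2], cols == [1, 3, 2, 3], vals == [1.0, 3.0, 5.0, 6.0]
```

Here n_observed is just len(vals), and the three lists map directly onto observed_row, observed_column, and observed_value above.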


Comment by kkmann
Friday Feb 19, 2016 at 12:24 GMT


Any comments/progress on this one? Personally, an easy way of using data with arbitrary missingness patterns is the single most important thing holding me back from using Stan as my first-choice analysis tool. After all, Bayesian analysis is most useful in situations with a lot of missing data, is it not?


Comment by ariddell
Friday Feb 19, 2016 at 12:44 GMT


I'm about to start a model which has a rather large amount of missingness. I was going to take the approach Bob suggested (without knowing, until now, that he suggested it):

  int n_rows;
  int n_cols;
  int n_observed;
  int observed_row[n_observed];
  int observed_column[n_observed];
  real observed_value[n_observed];


Comment by robertgrant
Friday Feb 19, 2016 at 13:25 GMT


That's the approach I've been taking. Honestly, I prefer Stan's explicit specification of what to do with the missing/coarse data to the BUGS black box. Even with a fill_missing_data function, I'd probably carry on writing it out, and I don't know how many people are discouraged by it. If anything, I'd say it's because Stan devs tend to be quite hard on Stan when giving presentations, saying it can't do missing data.

Here's an excerpt from a real-life coarsened example where student marks are either a percentage (Nmark), or just recorded as failed (N0) or capped at 40% on resubmission (N40):

data {
  int Nmark;
  int N0;
  int N40;
  int clinical1[Nmark];
  int clinical0[N0];
  int clinical40[N40];
  real mark[Nmark];
}
parameters {
  real beta[2];
  real<lower=0> sigma;
  real<lower=0, upper=39.5> latentmark0[N0];
  real<lower=39.5, upper=100> latentmark40[N40];
}
model {
  real mu[Nmark];
  real mu0[N0];
  real mu40[N40];
  sigma ~ normal(0, 30);
  beta ~ normal(0, 20);
  for (i in 1:Nmark) {
    mu[i] <- beta[1] + beta[2] * clinical1[i];
    mark[i] ~ normal(mu[i], sigma);
  }
  for (i in 1:N0) {
    mu0[i] <- beta[1] + beta[2] * clinical0[i];
    latentmark0[i] ~ normal(mu0[i], sigma);
  }
  for (i in 1:N40) {
    mu40[i] <- beta[1] + beta[2] * clinical40[i];
    latentmark40[i] ~ normal(mu40[i], sigma);
  }
}

So I like Bob's database form because it would still make the user slow down a little and think about what's going on.
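The data-splitting this model expects happens outside Stan; here is a hedged Python sketch of that preprocessing step. The status codes "exact"/"fail"/"capped" and the record layout are illustrative assumptions, not from the original example:

```python
def split_coarsened(records):
    """Partition (status, clinical, mark) records into the three groups the
    coarsened-marks model expects: exact marks, fails, and capped resits.
    The status codes 'exact'/'fail'/'capped' are illustrative, not standard."""
    exact, fails, capped = [], [], []
    for status, clinical, mark in records:
        if status == "exact":
            exact.append((clinical, mark))   # goes to clinical1 / mark
        elif status == "fail":
            fails.append(clinical)           # mark only known to be below 39.5
        else:                                # "capped": recorded as 40 on resit
            capped.append(clinical)
    return exact, fails, capped

records = [("exact", 1, 72.0), ("fail", 0, None), ("capped", 1, None)]
exact, fails, capped = split_coarsened(records)
# Nmark == len(exact), N0 == len(fails), N40 == len(capped)
```

The latent marks for the "fail" and "capped" groups then come out as constrained parameters, exactly as in the model block above.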


Comment by betanalpha
Friday Feb 19, 2016 at 13:26 GMT


Bayesian inference gives you a fantastic framework for modeling missingness.
A priori we have no idea how the missingness is patterned in your model, and
the only way you can define that pattern in a self-consistent fashion is to split
up your data into the different patterns and model an explicit likelihood. Anything
else is an approximation that may not even be self-consistent.



Comment by kkmann
Friday Feb 19, 2016 at 13:37 GMT


Yeah, fully agreed. But there is also a canonical way of treating missing-at-random variables in a Bayesian model (one of the great advantages, if you ask me). So why not combine the best of both worlds by treating missing data as missing at random by default if not specified explicitly? I'm thinking of a large clinical database with few patients and lots of variables. Often missing at random is not too far from being a realistic assumption, and hand-coding this in Stan right now is really a downer.


Comment by betanalpha
Friday Feb 19, 2016 at 13:48 GMT


No, there is not a canonical way of treating missingness! Missing at random isn't even
enough to fully define a model without specifying the distribution of the observations in
the first place. Ultimately there are all kinds of subtle problems that arise as soon as
you get into even the simplest details, and there's no way for us to robustly automate
this process. There may be components we can offer that simplify the modeling
process, but we haven't figured those out yet.



Comment by syclik
Friday Feb 19, 2016 at 13:49 GMT


+1. I was in the middle of writing the same thing.


Comment by kkmann
Friday Feb 19, 2016 at 13:59 GMT


So why is MAR + a prior for each independent variable insufficient? It specifies a full and consistent joint distribution (given the rest of the model is not inconsistent). Anyway, any kind of tool to handle missing data more conveniently inside Stan would be much appreciated from my side!


Comment by betanalpha
Friday Feb 19, 2016 at 14:50 GMT


So why is MAR + a prior for each independent variable insufficient? It specifies a full and consistent joint distribution (given the rest of the model is not inconsistent).

Yes, MAR + independence can be treated somewhat easily. Although you might
as well throw the missing observations away because under MAR they are not
informative at all, you can readily code this up using a loop and a vector of
observed indicators, which is exactly what Allen recommended! Imputing the
missing observations is another story because the model and generated quantities
blocks are completely independent and do not affect each other, so there’s no
way to automatically generate quantities from any specification in the model block.
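The loop-with-indicators pattern described above can be sketched outside of Stan as a log-likelihood accumulation over observed entries only (a Python illustration under assumed names; in a real Stan model this would be a loop in the model block):

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def observed_only_loglik(y, observed, mu, sigma):
    """Accumulate the likelihood over observed entries only, skipping missing
    ones via a parallel 0/1 indicator array -- the pattern described above."""
    return sum(normal_lpdf(y[i], mu, sigma)
               for i in range(len(y)) if observed[i] == 1)

y = [1.0, 0.0, 2.0]   # 0.0 is a placeholder for the missing middle value
obs = [1, 0, 1]       # indicator: 1 = observed, 0 = missing
ll = observed_only_loglik(y, obs, 1.5, 1.0)
# only y[0] and y[2] contribute; the placeholder never enters the likelihood
```

The point is that the placeholder value is never touched, so under MAR the missing entries simply drop out of the target density.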


Comment by andrewgelman
Friday Feb 19, 2016 at 22:06 GMT


We should have some missing-data imputation in Stan. By which I mean, not an “mi” function, at least not right away, but some examples in the manual that people can then imitate for their own work.



Comment by bob-carpenter
Friday Feb 19, 2016 at 22:24 GMT


Chapter 8. Missing Data & Partially Known Parameters.

It's not that Stan can't do it, it's that it's syntactically
ugly. What users want (I presume) is something like what
BUGS/JAGS do.

Suppose you have the following BUGS model (with dot_product
replaced by whatever you have to do in BUGS to multiply vectors
and get them into the location parameter of a normal):

sigma ~ inv_gamma(1, 1)
for (k in 1:K) beta[k] ~ normal(0, 0.01)
for (n in 1:N) y[n] ~ normal(dot_product(x[n], beta), sigma)

If some of the predictors x[n, k] are missing, the model
won't compile because you haven't defined the missing x[n, k]
as either deterministic or stochastic nodes.

But, if you add this:

for (n in 1:N)
for (k in 1:K)
x[n, k] ~ normal(0, 0.01);

then you're good to go in BUGS and the values for missing x[n, k]
will be imputed like any other parameter. You could even model those
x[n, k] by predictor with something like:

 x[n, k] ~ normal(mu[k], tau[k]);

and of course give mu and tau priors.

You can get exactly the same behavior in Stan (as explained in
manual Chapter 8), but the code's rather a mess. We also don't
have any kind of NA input, so you have to indicate the known and
unknown values in some other way (parallel array of boolean indicators,
or just inputting the known ones in long form and then putting
everything back together).

  • Bob
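The "putting everything back together" step is mechanical bookkeeping; here is a Python sketch of merging the observed long-form entries with imputed values for the remaining cells (the function name and the row-major fill order are assumptions for illustration):

```python
def fill_matrix(n_rows, n_cols, obs_row, obs_col, obs_val, imputed):
    """Rebuild a full matrix from long-form observed entries plus a list of
    imputed values for the missing cells, filled in row-major order."""
    full = [[None] * n_cols for _ in range(n_rows)]
    for r, c, v in zip(obs_row, obs_col, obs_val):
        full[r - 1][c - 1] = v              # long form is 1-based
    it = iter(imputed)
    for i in range(n_rows):
        for j in range(n_cols):
            if full[i][j] is None:
                full[i][j] = next(it)       # consume imputations in order
    return full

# Observed diagonal of a 2x2 matrix, plus two imputed off-diagonal values:
full = fill_matrix(2, 2, [1, 2], [1, 2], [1.0, 4.0], [9.0, 8.0])
# full == [[1.0, 9.0], [8.0, 4.0]]
```

This mirrors what a built-in fill_missing_values would have to do, with the imputed list playing the role of the parameter array declared over the missing cells.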



Comment by kkmann
Saturday Feb 20, 2016 at 11:28 GMT


+1, the core of the problem seems to be Stan's distinction between parameters and data. From a theoretical perspective there is no need for it under the Bayesian paradigm, and it would be nice to have a sleeker interface to get this JAGS-like behaviour for those who want it and know what they are doing.


Comment by bob-carpenter
Saturday Feb 20, 2016 at 18:01 GMT


Yes, it'd be nice to have such an interface. We'd

  • need a new language in which we could specify a model that's agnostic to
    parameters vs. modeled data distinction (the data has to be modeled
    in the sense of having a distribution --- it can't just be constants
    or predictors; that is, a stochastic node in BUGS terms). This could be
    the existing language with some preprocessing based on data.

Then we need to

  • generate a C++ model class like we currently generate, because
    from a computational perspective, we (and everyone else) do need
    to distinguish between parameters and data.

And of course, we need

  • some way to deal with the I/O, ideally cross-platform rather than
    something specific to just R or some other interface.

We're always open to volunteers. It's not on any of our existing
developers' agendas as far as I know.

  • Bob



Comment by syclik
Wednesday Nov 30, 2016 at 15:31 GMT


With compound declare-and-define now available, @mbrubake's original suggestion should be doable. We'd need two functions and we could do this:

data {
  matrix[N, M] Data;
}
transformed data {
  int missing_inds[length_missing_values(Data)] = find_missing_values(Data);
}
parameters {
  real missing_Data[length(missing_inds)];
}
transformed parameters {
  matrix[N, M] full_Data;

  full_Data = fill_missing_values(Data, missing_inds, missing_Data);
}
model {
  ...
}

I guess we could have done that before too.

Question: is find_missing_values() an appropriate name? This will depend on #637.
