Stan - Why is imputation of bernoulli response variables in Stan resul - Enhance your coding expertise with Benny Borremans on @onlycoders.net

1 year ago

#304505

Benny Borremans

Stan - Why is imputation of bernoulli response variables in Stan resulting in biased inference?

I am trying to impute a binary response variable using a hierarchical model in Stan, but am getting biased estimates.

The data consist of a number of sessions that each have a different prevalence (proportion viral shedding in this case) that is being estimated in the model. These estimates are then used to impute shedding status of individuals that were measured in a given session but for which shedding status could not be verified. [E.g. individual 1 was sampled in session 1, and session 1 has a prevalence of 0.65, so the probability for individual 1 to receive a positive shedding status is 0.65.]

If I understand correctly, this imputation needs to be done by marginalizing over the imputation values, and the recommended approach for this is through a mixture of the two possibilities (e.g. here and here).

When implementing this however, I get biased prevalence results. I don't know whether this is because of my coding (I'm fairly new to stan) or due to the approach (is the imputation somehow affecting the overall likelihood and biasing the estimation of prevalence?), so am looking for help on this.

(Note: the model setup is now extremely simple, but the goal is to set up this hierarchical model to allow regression of different covariates on individual shedding status, and this is one step towards testing the modeling framework).

Here is simulated data, and my code (I'm using R and rstan), to illustrate my problem:

Simulate data:

# number of sessions
n.sess = 10

# number of individuals per session   
n.ind = 20

# create dataframe to store simulated data
shed.sim = data.frame(session = rep(1:n.sess,each = n.ind),
                      ind = rep(1:n.ind,n.sess),
                      prev = NA,
                      shed = NA)

# generate UR prevalences, any value between 0 and 1    
set.seed(1982)  # set random seed so results can be duplicated   
q_t = runif(n = n.sess, min = 0, max = 1)
shed.sim$prev = rep(q_t, each = n.sess)

# for each session, generate a shedding status for each individual    

for(i in 1:nrow(shed.sim)){
       shed.sim$shed[i] = rbinom(n = 1, size = 1, prob = shed.sim$prev[i])
}

==> Session prevalences are 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.

Model code:

mod1 = 
       "
       data {
              int<lower=1> N;             // number of individuals
              int<lower=1> N_sess;        // number of sessions
              int<lower=1, upper=N_sess> S[N];      // session number for each individual
              int<lower=0> N_pos_sess[N_sess];   // number positive in each session
              int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
       }
       
       parameters {
              real<lower=0,upper=1> p_sess[N_sess];     // prevalence in each session
       }
       
       model {
              
              for(n_sess in 1:N_sess) {
                     p_sess[n_sess] ~ beta(1,1);   // uninformative prior for uniform distribution range (0,1)
                     N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]);   // estimate prevalence p_sess for each session, assuming a binomial distribution
              }
              

              for(n in 1:N) {
                     target += log_mix(p_sess[S[n]], bernoulli_lpmf(1 | p_sess[S[n]]), bernoulli_lpmf(0 | p_sess[S[n]]));
              }
       }
       
       generated quantities {
              int y_miss[N];    // missing observations to be imputed

              for(n in 1:N){
                     y_miss[n] = bernoulli_rng(p_sess[S[n]]);  // bernoulli_rng(alpha) generates a bernoulli variate with chance of success alpha
              }
       }
       "

model.fit = stan(model_code = mod1,
                 data = list(N = nrow(shed.sim),
                             N_sess = n.sess,
                             S = shed.sim$session,
                             N_pos_sess = round(q_t * 30),   # using a sample size of 30 for each session
                             N_total_sess = rep(30,n.sess)),
                             chains = 2,
                             warmup = 5000,
                             iter = 10000,
                             cores = 1)

This results in biased estimates of prevalence:

Simulated ("observed"): 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.

Estimated: 0.88, 0.76, 0.50, 0.79, 0.12, 0.37, 0.97, 0.93, 0.21, 0.91.

Given these biased estimates, individual shedding status for each session is imputed correctly, so that part of the model at least works fine.

Could this bias be related to the issue raised here?

Any help would be greatly appreciated!

Thank you for reading and have a wonderful day.

Benny

PS: A similar approach but without the mixture method works well, but doesn't enable adding a regression model so is not useful for the purposes of this project. I include it for completeness:

"
       data {
              int<lower=1> N;             // number of individuals
              int<lower=1> N_sess;        // number of sessions
              int<lower=1, upper=N_sess> S[N];      // session number for each individual
              int<lower=0> N_pos_sess[N_sess];   // number positive in each session
              int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
       }
       
       parameters {
              real<lower=0,upper=1> p_sess[N_sess];     // prevalence in each session
       }
       
       model {
       
              for(n_sess in 1:N_sess) {
                     p_sess[n_sess] ~ beta(1,1);   // uninformative prior for uniform distribution range (0,1)
                     N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]);   // estimate prevalence p_sess for each session, assuming a binomial distribution
                     
              }
       }
       
       generated quantities {
              int y_miss[N];

              for(n in 1:N){
                     y_miss[n] = bernoulli_rng(p_sess[S[n]]);  // bernoulli_rng(alpha) 'generates a bernoulli variate with chance of success alpha' (stan manual) 
              }
       }
       "

imputation

stan

rstan

multilevel-analysis

0 Answers

Your Answer

Posts

Questions

Blogs

Jobs