1 year ago
#304505
Benny Borremans
Stan - Why is imputation of bernoulli response variables in Stan resulting in biased inference?
I am trying to impute a binary response variable using a hierarchical model in Stan, but am getting biased estimates.
The data consist of a number of sessions that each have a different prevalence (proportion viral shedding in this case) that is being estimated in the model. These estimates are then used to impute shedding status of individuals that were measured in a given session but for which shedding status could not be verified. [E.g. individual 1 was sampled in session 1, and session 1 has a prevalence of 0.65, so the probability for individual 1 to receive a positive shedding status is 0.65.]
If I understand correctly, this imputation needs to be done by marginalizing over the imputation values, and the recommended approach for this is through a mixture of the two possibilities (e.g. here and here).
When implementing this however, I get biased prevalence results. I don't know whether this is because of my coding (I'm fairly new to stan) or due to the approach (is the imputation somehow affecting the overall likelihood and biasing the estimation of prevalence?), so am looking for help on this.
(Note: the model setup is now extremely simple, but the goal is to set up this hierarchical model to allow regression of different covariates on individual shedding status, and this is one step towards testing the modeling framework).
Here is simulated data, and my code (I'm using R and rstan), to illustrate my problem:
Simulate data:
# number of sessions
n.sess = 10
# number of individuals per session
n.ind = 20
# create dataframe to store simulated data
shed.sim = data.frame(session = rep(1:n.sess,each = n.ind),
ind = rep(1:n.ind,n.sess),
prev = NA,
shed = NA)
# generate UR prevalences, any value between 0 and 1
set.seed(1982) # set random seed so results can be duplicated
q_t = runif(n = n.sess, min = 0, max = 1)
shed.sim$prev = rep(q_t, each = n.sess)
# for each session, generate a shedding status for each individual
for(i in 1:nrow(shed.sim)){
shed.sim$shed[i] = rbinom(n = 1, size = 1, prob = shed.sim$prev[i])
}
==> Session prevalences are 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.
Model code:
mod1 =
"
data {
int<lower=1> N; // number of individuals
int<lower=1> N_sess; // number of sessions
int<lower=1, upper=N_sess> S[N]; // session number for each individual
int<lower=0> N_pos_sess[N_sess]; // number positive in each session
int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
}
parameters {
real<lower=0,upper=1> p_sess[N_sess]; // prevalence in each session
}
model {
for(n_sess in 1:N_sess) {
p_sess[n_sess] ~ beta(1,1); // uninformative prior for uniform distribution range (0,1)
N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]); // estimate prevalence p_sess for each session, assuming a binomial distribution
}
for(n in 1:N) {
target += log_mix(p_sess[S[n]], bernoulli_lpmf(1 | p_sess[S[n]]), bernoulli_lpmf(0 | p_sess[S[n]]));
}
}
generated quantities {
int y_miss[N]; // missing observations to be imputed
for(n in 1:N){
y_miss[n] = bernoulli_rng(p_sess[S[n]]); // bernoulli_rng(alpha) generates a bernoulli variate with chance of success alpha
}
}
"
model.fit = stan(model_code = mod1,
data = list(N = nrow(shed.sim),
N_sess = n.sess,
S = shed.sim$session,
N_pos_sess = round(q_t * 30), # using a sample size of 30 for each session
N_total_sess = rep(30,n.sess)),
chains = 2,
warmup = 5000,
iter = 10000,
cores = 1)
This results in biased estimates of prevalence:
Simulated ("observed"): 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.
Estimated: 0.88, 0.76, 0.50, 0.79, 0.12, 0.37, 0.97, 0.93, 0.21, 0.91.
Given these biased estimates, individual shedding status for each session is imputed correctly, so that part of the model at least works fine.
Could this bias be related to the issue raised here?
Any help would be greatly appreciated!
Thank you for reading and have a wonderful day.
Benny
PS: A similar approach but without the mixture method works well, but doesn't enable adding a regression model so is not useful for the purposes of this project. I include it for completeness:
"
data {
int<lower=1> N; // number of individuals
int<lower=1> N_sess; // number of sessions
int<lower=1, upper=N_sess> S[N]; // session number for each individual
int<lower=0> N_pos_sess[N_sess]; // number positive in each session
int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
}
parameters {
real<lower=0,upper=1> p_sess[N_sess]; // prevalence in each session
}
model {
for(n_sess in 1:N_sess) {
p_sess[n_sess] ~ beta(1,1); // uninformative prior for uniform distribution range (0,1)
N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]); // estimate prevalence p_sess for each session, assuming a binomial distribution
}
}
generated quantities {
int y_miss[N];
for(n in 1:N){
y_miss[n] = bernoulli_rng(p_sess[S[n]]); // bernoulli_rng(alpha) 'generates a bernoulli variate with chance of success alpha' (stan manual)
}
}
"
r
imputation
stan
rstan
multilevel-analysis
0 Answers
Your Answer