Bayesian estimation for ecology

đŸ§‘đŸ»â€đŸ’» Masatoshi Katabuchi @ XTBG, CAS

  • mattocci27@gmail.com
  • @mattocci
  • github.com/mattocci27/bayes-afec
  • https://mattocci27.github.io



November 10, 2024 XTBG AFEC

Objectives

  • We Learn

    • Why Bayesian estimation is useful

    • Why multilevel models are important

  • We Do Not Learn

    • How to use and code Bayesian models in Stan and R

Likelihood

Assuming everyone knows the concept of likelihood

Likelihood and probability density function

\(f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\)

  • A likelihood function gives the probability (density) of observing the data under a specific set of parameters.

  • e.g., when your model is \(x \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\) (normal distribution with mean 0 and variance 1), what is the probability density at x = 1.96?

1 / sqrt(2 * pi * 1) * exp(-1/2 * (1.96 - 0)^2 / 1)
  [1] 0.05844094
  • \(P(x = 1.96 \mid \mu = 0, \sigma = 1)\)
dnorm(1.96, mean = 0, sd = 1)
  [1] 0.05844094

Likelihood

When your data is x = {-1.5, 0, 1.5} and your model is \(x \sim \mathcal{N}(0, 1)\), what is the probability of observing x?

  • \(L = P(-1.5 \mid 0, 1) \times P(0 \mid 0, 1) \times P(1.5 \mid 0, 1)\)
  • \(\mathrm{ln}\;L = \mathrm{ln}\;P(-1.5 \mid 0, 1) + \mathrm{ln}\;P(0 \mid 0, 1) + \mathrm{ln}\;P(1.5 \mid 0, 1)\)
dnorm(-1.5, mean = 0, sd = 1, log = TRUE) + dnorm(0, mean = 0, sd = 1, log = TRUE) + dnorm(1.5, mean = 0, sd = 1, log = TRUE)
  [1] -5.006816
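The same log-likelihood in one vectorized call (a minimal sketch; dnorm accepts a vector):

x <- c(-1.5, 0, 1.5)
sum(dnorm(x, mean = 0, sd = 1, log = TRUE))
  [1] -5.006816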

Likelihood

  • Because we usually don’t know the true parameters (\(\mu\) and \(\sigma\)), we need to find the parameters that maximize the likelihood function (Maximum Likelihood Estimation).

    • e.g., for a linear model \(y = ax + b\), we usually assume that \(y \sim \mathcal{N}(\mu = ax + b, \sigma^2)\), and we want to find the parameters \(a\), \(b\), and \(\sigma\) that maximize the likelihood function: \(P(y \mid ax + b, \sigma)\)

Maximum Likelihood Estimation (MLE)

  • 2 survivors out of 5 seedlings: What is the survival probability of seedlings?

  • \(p\): survival rate, \(1-p\): mortality rate

  • \(L = {}_5C_2 p^2(1-p)^3\) (Binomial distribution)

  • \(\mathrm{ln}\;L = \mathrm{ln}\;{}_5C_2 + 2\mathrm{ln}\;p + 3\mathrm{ln}(1-p)\)

  • \(\frac{d\,\mathrm{ln}\;L}{dp} = \frac{2}{p} - \frac{3}{1-p} = 0\)

  • \(p = \frac{2}{5}\)
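A quick numerical check of this result (a sketch using stats::optimize to maximize the binomial log-likelihood over p):

# log-likelihood of 2 survivors out of 5 seedlings
log_lik <- function(p) dbinom(2, 5, p, log = TRUE)
optimize(log_lik, interval = c(0, 1), maximum = TRUE)$maximum  # ~ 0.4 = 2/5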

Outline

  • Multilevel Model

  • Conditional Probability and Bayes’s Theorem

  • Prior

  • Multilevel Model Revisited

Multilevel Models

Negative density dependence (NDD)

“rare species suffered more from the presence of conspecific neighbors than common species did, suggesting that conspecific density dependence shapes species abundances in diverse communities.”

- Comita et al. 2010 -

Multilevel model (NDD: verbal model)

There is negative density dependence (NDD) of seedling survival rate, and the strength of NDD varies among species. The strength of NDD depends on species abundance.

  • Model survival rates as a function of conspecific seedling density (individual-level).
  • Model the strength of NDD (i.e., slopes) as a function of species abundance (group-level).

Multilevel model (tree allometry: verbal model)

There is a power-law relationship (\(y = ax^b\)) between tree diameter (DBH) and crown area, and the power-law exponent varies among species. Those relationships depend on wood density.

  • Model crown area as a function of DBH (individual-level).

  • Model the coefficient b as a function of wood density (group-level).

Multilevel model (MLE: non-verbal ‘dummy’ model 1)

Individual-level model: GLMM

fit_ind <- lme4::glmer(
  cbind(suv, n - suv) ~  cons + (1 + cons | sp),
  data = dummy, family = binomial)

We need to extract the slope coefficients for each species from fit_ind and make new_data.
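A minimal sketch of that step (assuming dummy carries one abund value per species; coef() returns the per-species coefficients of the fitted GLMM):

sp_coef <- coef(fit_ind)$sp  # per-species intercepts and slopes
new_data <- data.frame(
  sp = rownames(sp_coef),
  slope_coef = sp_coef[, "cons"])
new_data$abund <- dummy$abund[match(new_data$sp, dummy$sp)]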

Group-level model: LM

fit_gr <- lm(slope_coef ~  abund, data = new_data)
  • This two-step method has low statistical power because the group-level regression ignores the uncertainty in the estimated slopes

Multilevel model (MLE: non-verbal ‘dummy’ model 2)

fit_all <- lme4::glmer(
  cbind(suv, n - suv) ~  cons + abund + cons:abund + (1 + cons | sp),
  data = dummy, family = binomial)
  • This doesn’t work for complicated models

Multilevel model (Bayes: non-verbal model) đŸ€ą

data{
  int<lower=1> N; // number of samples
  int<lower=1> J; // number of sp
  int<lower=1> K; // number of tree-level predictors (e.g., CONS, HETS, ...)
  int<lower=1> L; // number of sp-level predictors (e.g., intercept, SLA, ...)
  matrix[N, K] x; // tree-level predictor
  matrix[L, J] u; // sp-level predictor
  array[N] int<lower=0, upper=1> suv; // 1 or 0
  array[N] int<lower=1, upper=J> sp; // species index
}

parameters{
  matrix[K, L] gamma;
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0, upper=pi() / 2>[K] tau_unif;
}

transformed parameters{
  matrix[K, J] beta;
  vector<lower=0>[K] tau;
  for (k in 1:K) tau[k] = 2.5 * tan(tau_unif[k]);
  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}

model {
  vector[N] mu;
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 2.5);
  for (n in 1:N) {
    mu[n] = x[n, ] * beta[, sp[n]];
  }
  suv ~ bernoulli_logit(mu);
}

generated quantities {
  vector[N] log_lik;
  corr_matrix[K] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(suv[n] | x[n, ] * beta[, sp[n]]);
  }
}
  • \(s_{i,j} \sim \mathcal{B}(p_{i, j})\): likelihood

  • \(\mathrm{logit}(p_{i,j}) = \boldsymbol{x_{i}} \cdot \boldsymbol{\beta_{j}}\): individual-level model

  • \(\boldsymbol{\beta_j} = \boldsymbol{\gamma_k} \cdot \boldsymbol{u_j} + \mathrm{diag}(\boldsymbol{\sigma})\cdot \boldsymbol{L} \cdot \boldsymbol{z}\): species-level model

  • \(L \sim \mathrm{LkjCholesky}(\eta)\): prior (?)

  • \(z_j \sim \mathcal{N}(0, 1)\): prior (?)

  • \(\tau \sim \mathrm{Cauchy}(0, 2.5)\): prior (?)

  • \(\gamma_k \sim \mathcal{N}(0, 2.5)\): prior (?)
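Fitting this Stan program from R might look like the following sketch (assuming the code above is saved as model.stan and stan_data is a list holding the declared data: N, J, K, L, x, u, suv, sp):

library(cmdstanr)
mod <- cmdstan_model("model.stan")
fit <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4)
fit$summary("gamma")  # species-level regression coefficients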

Conditional Probability and Bayes’s Theorem

Conditional Probability and Bayes’s Theorem

\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]

  • Conditional probability

  • Bayes’s Theorem

  • Forward / inverse problems

  • Bayes revisited

Probability

Probability of A:

\[ P(A) = \frac{|A|}{|U|} \]

e.g., the probability of rolling a die and getting an odd number is 3/6 = 1/2

Conditional Probability

Probability of A occurring given B has already occurred:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

e.g.,

  • P(hangover) = 4%
  • P(hangover | beer) = 6%
  • P(hangover | baijiu) = 85%

Bayes’s Theorem

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} \]

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

\[ P(A \mid B) \times P(B) = P(B \mid A) \times P(A) \]

\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]

Why is this useful?

Forward and Inverse Problems

  • X: 3 red balls, 5 white balls

  • Y: 1 red ball, 3 white balls

  • Randomly choose a bag X or Y

  • \(P(A)\): Choosing X

  • \(P(B)\): Drawing a red ball

  • \(P(B \mid A)\): Getting a red ball from X

  • \(P(A \cap B)\): Choosing X and drawing a red ball

  • \(P(A \mid B)\): Probability that you chose X, given that you drew a red ball (the inverse problem)

  • \(P(A) = 1/2\)

  • \(P(B \mid A) = 3/8\)

  • \(P(B) = 1/2 \times 3/8 + 1/2 \times 1/4 = 5/16\)

  • \(P(A \cap B) = 1/2 \times 3/8 = 3/16\)

  • \(P(A \mid B) = P(A \cap B) / P(B) = (3/16) / (5/16) = 3/5\)
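A quick Monte Carlo check of this result (a minimal sketch: simulate the bag choice and the draw, then condition on drawing a red ball):

set.seed(123)
n <- 1e5
bag <- sample(c("X", "Y"), n, replace = TRUE)  # P(A) = 1/2
p_red <- ifelse(bag == "X", 3 / 8, 1 / 4)      # P(B | bag)
red <- runif(n) < p_red                        # did we draw a red ball?
mean(bag[red] == "X")                          # close to 3/5 = 0.6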

\[ P(\mathrm{Parameter} \mid \mathrm{Data}) = \frac{P(\mathrm{Data} \mid \mathrm{Parameter}) \times P(\mathrm{Parameter})}{P(\mathrm{Data})} \]

  • \(P(\mathrm{Parameter} \mid \mathrm{Data})\)

    • When you got your data, what were the parameters behind the data (e.g., coefficients of regressions)?
  • \(P(\mathrm{Data} \mid \mathrm{Parameter})\)

    • When you know your parameters, what is the probability of getting your data? (i.e., likelihood)
  • \(P(\mathrm{Parameter})\)

    • The probability of your parameters: what you assume about them before you see any data (i.e., the prior).
  • \(P(\mathrm{Data})\)

    • Independent of the parameters (i.e., a normalizing constant)

Prior

Prior (coins)

MLE

  • A: 2 heads out of 3 tosses -> 2/3 ≈ 0.667
  • B: 60 heads out of 100 tosses -> 60/100 = 0.6

Bayesian

  • \(L_A = {}_3C_2 p^2 (1-p)^1\)

  • \(L_B = {}_{100}C_{60} p^{60} (1-p)^{40}\)

  • \(\mathrm{Prior} \propto p^{50} (1-p)^{50}\)

    • Beta distribution with mean 0.5 and small variance

Prior (coins)

  • \(Post_A \propto p^2 (1-p)^1 \times p^{50} (1-p)^{50}\)

    • \(Post_A \propto p^{52} (1-p)^{51}\)

    • \(Post_A' = 52p^{51}(1-p)^{51} - 51p^{52}(1-p)^{50} = 0\)

    • p = 52/103 ≈ 0.5048

  • \(Post_B \propto p^{60} (1-p)^{40} \times p^{50} (1-p)^{50}\)

    • \(Post_B \propto p^{110} (1-p)^{90}\)

    • p = 110/200 = 0.55

  • \(Like_B\): 60 heads out of 100 tosses
  • \(Prior_B\): P(H) is about 50%, a reasonable assumption based on existing knowledge of coins
  • \(Post_B\): P(H) is likely between 50% and 60% (55% in this example)
  • We usually have some sense of the scale of parameters, and we can legitimately use that information
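These posterior modes are easy to check in R (a sketch; the prior above is a Beta(51, 51) density, so the posteriors are Beta(53, 52) for A and Beta(111, 91) for B):

# mode of Beta(a, b) is (a - 1) / (a + b - 2)
(53 - 1) / (53 + 52 - 2)    # A: 52/103 ~ 0.5048
(111 - 1) / (111 + 91 - 2)  # B: 110/200 = 0.55
# or numerically:
optimize(function(p) dbeta(p, 111, 91), c(0, 1), maximum = TRUE)$maximum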

Priors and ecology (simple linear model; y = ax + b)

  • Consider variables x = {-3, 
, 3} and y = {-6, 
, 4}. At this point, we don’t know if there is a correlation between x and y.
  • However, given the similar scales of x and y, it’s reasonable to guess that a falls within a narrow range (-1000 < a < 1000 or -5 < a < 5 ?).
  • For example, y = 100x + 2 doesn’t work (blue line).
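Such a weakly informative guess can be written down explicitly as a prior; a minimal sketch with brms (d is a hypothetical data frame with columns x and y):

library(brms)
fit <- brm(
  y ~ x, data = d,
  prior = set_prior("normal(0, 5)", class = "b"))  # slope a ~ N(0, 5) rules out |a| >> 5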


Priors and ecology (multilevel model: group-level differences)

  • \(y_i = ax_i + b_j\)
    • If the parameter \(b_j\) takes similar (but not identical) values across groups (e.g., species differences, site differences):
    • Likelihood: \(y_i \sim \mathcal{N}(ax_i + b_j, \sigma)\)
    • Prior: \(b_j \sim \mathcal{N}(\mu_b, \tau)\)
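This prior is exactly what a varying-intercept term encodes in lme4/brms syntax; a minimal sketch (d is a hypothetical data frame with y, x, and a grouping factor sp):

# b_j ~ N(mu_b, tau) is the random-intercept assumption
fit <- lme4::lmer(y ~ x + (1 | sp), data = d)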

Priors and ecology (multilevel model: autocorrelation)

  • If the data \(y_i\) is similar to the surrounding samples (e.g., spatial autocorrelation):
  • Likelihood: \(y_i \sim \mathcal{N}(\mu + \tilde{r_i}, \sigma)\)
  • Prior: \(\tilde{r_i} = r_{m, n} \sim \mathcal{N}(\phi_{m,n}, \tau)\)
  • \(\phi_{m,n} = (r_{m-1, n-1} + r_{m, n-1} + r_{m+1, n-1} + r_{m-1, n} + r_{m+1, n} + r_{m-1, n+1} + r_{m, n+1} + r_{m+1, n+1})/8\)

Multilevel model

Baseball statistics

Eight schools problem

Eight species problem

What is the survival rate?

sp    n  suv
A    41    9
B    45   18
C    32    6
D    18    5
E    33    8
F    26    8
G    46   11
H    16    8
  • suv \(\sim\) B(n, p) (Binomial distribution)

  • p = suv / n (Maximum likelihood estimation: MLE)

sum(dummy_simple$suv) / sum(dummy_simple$n)
  [1] 0.2840467
  • If we pool all the data, the survival rate will be about 0.284

Eight species problem (separate estimates)

What is the survival rate?

sp    n  suv  p_like
A    41    9    0.22
B    45   18    0.40
C    32    6    0.19
D    18    5    0.28
E    33    8    0.24
F    26    8    0.31
G    46   11    0.24
H    16    8    0.50
  • If we estimate each species separately, survival rates will be p_like

Eight species problem (separate estimates)

What is the survival rate?

sp    n  suv  p_like  p_true
A    41    9    0.22    0.34
B    45   18    0.40    0.30
C    32    6    0.19    0.34
D    18    5    0.28    0.36
E    33    8    0.24    0.25
F    26    8    0.31    0.23
G    46   11    0.24    0.25
H    16    8    0.50    0.34
  • p_true ranges [0.23, 0.36]

  • p_like ranges [0.19, 0.50]

  • The separate estimates (p_like) show larger variation than the true values

  • Because of the small sample size (common in ecological studies)

Two extreme cases (pooled estimates)

  • \(S_i \sim \mathcal{B}(N_i, p)\)

  • This model doesn’t consider the variation among species

Two extreme cases (separate estimates)

  • \(S_i \sim \mathcal{B}(N_i, p_i)\)

  • This model assumes that survival rates are 100% independent among species

More realistic estimates (multilevel models)

  • \(S_i \sim \mathcal{B}(N_i, p_i)\)

  • \(z_i \sim \mathcal{N}(\mu, \sigma)\) where \(z_i = \mathrm{logit}(p_i) = \mathrm{log}\frac{p_i}{1 - p_i}\)

\(\sigma\) determines species variation

The overall survival rate is 0.5 in this example. We have some sense of a scale for \(\sigma\).

Stan code for a simple multilevel logistic model (non-verbal model)

data {
  int<lower=1> I; // number of species
  array[I] int<lower=0> suv; // number of survivors
  array[I] int<lower=0> N; // number of individuals
  array[I] int<lower=1, upper=I> ii; // species index
}

parameters {
  real mu;
  real<lower=0> sigma;
  vector[I] z_tilde;
}

transformed parameters {
  vector[I] z;
  z = mu + sigma * z_tilde;
}

model {
  mu ~ normal(0, 5);
  z_tilde ~ std_normal();
  sigma ~ std_normal();
  for (i in 1:I)
    suv[i] ~ binomial_logit(N[i], z[ii[i]]);
}

generated quantities {
  vector[I] log_lik;
  vector[I] p = inv_logit(z);
  for (i in 1:I)
    log_lik[i] = binomial_logit_lpmf(suv[i] | N[i], z[ii[i]]);
}

Centered parameterization

  • \(S_i \sim \mathcal{B}_{logit}(N_i, z_i)\): likelihood

  • \(z_i \sim \mathcal{N}(\mu, \sigma)\) : prior

  • \(\sigma \sim \mathcal{N}^{+}(0, 1)\): prior (half-normal, since \(\sigma > 0\))

  • \(\mu \sim \mathcal{N}(0, 5)\): prior

Non-Centered parameterization

  • \(z_i = \mu + \sigma \cdot \tilde{z_i}\)

  • \(\tilde{z_i} \sim \mathcal{N}(0, 1)\): prior

  • Sampling \(\tilde{z_i}\) instead of \(z_i\) avoids the “funnel” geometry that makes the centered parameterization hard to sample when data are sparse
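The two parameterizations describe the same distribution; a quick sketch in R:

mu <- 0.5
sigma <- 1.2
z_centered <- rnorm(1e5, mean = mu, sd = sigma)
z_noncentered <- mu + sigma * rnorm(1e5)  # z = mu + sigma * z_tilde
c(sd(z_centered), sd(z_noncentered))      # both close to 1.2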

Multilevel models yield better estimates

sp    n  suv  p_like  p_true  p_bayes
A    41    9    0.22    0.34     0.26
B    45   18    0.40    0.30     0.33
C    32    6    0.19    0.34     0.26
D    18    5    0.28    0.36     0.28
E    33    8    0.24    0.25     0.27
F    26    8    0.31    0.23     0.29
G    46   11    0.24    0.25     0.27
H    16    8    0.50    0.34     0.32

  • Closed symbols (p_bayes) align more closely with the 1:1 line, indicating more accurate estimates.

  • This model compensates for limited data by using the prior knowledge that species responses are somewhat similar (partial pooling).

MLE vs. Bayesian estimation

Bayesian estimation (Stan)

data {
  int<lower=1> I; // number of species
  array[I] int<lower=0> suv; // number of survivors
  array[I] int<lower=0> N; // number of individuals
  array[I] int<lower=1, upper=I> ii; // species index
}

parameters {
  real mu;
  real<lower=0> sigma;
  vector[I] z_tilde;
}

transformed parameters {
  vector[I] z;
  z = mu + sigma * z_tilde;
}

model {
  mu ~ normal(0, 5);
  z_tilde ~ std_normal();
  sigma ~ std_normal();
  for (i in 1:I)
    suv[i] ~ binomial_logit(N[i], z[ii[i]]);
}

generated quantities {
  vector[I] log_lik;
  vector[I] p = inv_logit(z);
  for (i in 1:I)
    log_lik[i] = binomial_logit_lpmf(suv[i] | N[i], z[ii[i]]);
}

MLE (lme4::glmer)

fit <- lme4::glmer(
   cbind(suv, n - suv) ~  (1 | sp),
   data = dummy_simple,
   family = binomial)
  • We can use MLE for this kind of simple model.
  • When a model is complicated, MLE often does not work well.
  • Model flexibility for MLE is limited.

MLE vs. Bayesian estimation

Bayesian estimation (brms)

fit <- brms::brm(
  suv | trials(n) ~ 1 + (1 | sp),
  data = dummy_simple,
  family = binomial())

MLE (lme4::glmer)

fit <- lme4::glmer(
   cbind(suv, n - suv) ~  (1 | sp),
   data = dummy_simple,
   family = binomial)
  • Several packages, like brms and rstanarm, offer Bayesian estimation using syntax similar to lme4.

MLE vs. Bayesian estimation

MLE (e.g., lme4)

\(L(\mu, \sigma) = \prod_i \int_{-\infty}^{\infty} \mathcal{B}(S_i \mid N_i, p_i) \times \mathcal{N}(\mathrm{logit}(p_i) \mid \mu, \sigma) dp_i\)

  • Find \(\mu\) and \(\sigma\) that maximize \(L\)

  • An analytical solution is often not available (this example is easy though)

Bayesian estimation

\(P(\mu, \sigma \mid S_i, N_i) \propto \prod_i \mathcal{B}(S_i \mid N_i, p_i) \times \prod_i \mathcal{N}(\mathrm{logit}(p_i) \mid \mu, \sigma) \times \mathcal{N}(\mu \mid 0, 5) \times \mathcal{N}(\sigma \mid 0, 1)\)

  • Numerically sample \(\mu\) and \(\sigma\) from \(P\) (via MCMC)

  • MCMC works even if an analytical solution is not available

  • Bayes’s theorem supports the use of MCMC
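To make “numerically” concrete, here is a toy random-walk Metropolis sampler for the coin example above (a sketch only; Stan uses Hamiltonian Monte Carlo, not this algorithm):

set.seed(123)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  # log-likelihood (60 heads / 100 tosses) + log Beta(51, 51) prior
  dbinom(60, 100, p, log = TRUE) + dbeta(p, 51, 51, log = TRUE)
}
p <- numeric(5000)
p[1] <- 0.5
for (i in 2:5000) {
  prop <- p[i - 1] + rnorm(1, 0, 0.05)  # propose a small random step
  if (log(runif(1)) < log_post(prop) - log_post(p[i - 1])) {
    p[i] <- prop        # accept the proposal
  } else {
    p[i] <- p[i - 1]    # reject: stay at the current value
  }
}
mean(p)  # posterior mean, close to 111/202 ~ 0.55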

Summary

Why Bayesian estimation is useful

  • We can use a priori information about parameters in our model

  • Models are flexible

  • MCMC works even if models are complicated

Why multilevel models are important

  • Multilevel models have a good balance between pooled estimates and separate estimates, which is useful for practical sample sizes

  • Multilevel models handle nested or hierarchical data structures, a common and important scenario in ecological research (e.g., trees within species, communities within sites).

References
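Comita, L. S., Muller-Landau, H. C., Aguilar, S., and Hubbell, S. P. (2010). Asymmetric density dependence shapes species abundances in a tropical tree community. Science 329:330–332.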