
Bayesian estimation for ecology
Masatoshi Katabuchi @ XTBG, CAS
November 3, 2025 Experimental Design and Statistics in R, AFEC, XTBG
We Learn
Why Bayesian estimation is useful
Why multilevel models are important
How to use and code Bayesian models in Stan and R
Likelihood
Multilevel Models (Conceptual Motivation)
Conditional Probability and Bayes' Theorem
Prior
Multilevel Model Revisited
Tools for Bayesian Modeling
Stan Coding Session (separate HTML material)
Assuming everyone knows the concept of likelihood

\(f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\)
P(data | parameters): Probability of observing the data given the parameters.
e.g., when your model is \(x_i \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\) (normal distribution with mean 0 and variance 1), what is the probability density at x = 1.96?
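As a quick check, this density can be computed directly in R:

```r
# Density of N(mean = 0, sd = 1) at x = 1.96
dnorm(1.96, mean = 0, sd = 1)  # ~0.0584
```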

\(f(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}\)
P(data | parameters): Probability of observing the data given the parameters.
e.g., when your model is \(k_i \sim \mathcal{Pois}(\lambda = 5.4)\) (Poisson distribution with mean and variance = 5.4), what is the probability mass at k = 8?
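Again, this can be checked directly in R:

```r
# Probability mass of Pois(lambda = 5.4) at k = 8
dpois(8, lambda = 5.4)  # ~0.081
```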
When your data is x = {-1.5, 0, 1.5} and your model is \(x \sim \mathcal{N}(0, 1)\), what is the probability of observing x?
Because we usually don't know the true parameters (\(\mu\) and \(\sigma\)), we need to find the parameters that maximize the likelihood function (MLE: Maximum Likelihood Estimation).
e.g., for a linear model \(y_i = ax_i + b\), we usually assume that \(y_i \sim \mathcal{N}(\mu_i = ax_i + b, \sigma^2)\), and we want to find the parameters \(a\), \(b\), and \(\sigma\) that maximize the likelihood function: \(P(y_i \mid ax_i + b, \sigma)\)
In the previous example, \(\mu\) = 0 makes the likelihood maximum.
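A minimal R sketch of this idea, using the data and the \(\mathcal{N}(0, 1)\) model from the example above:

```r
# Joint density (likelihood) of x = {-1.5, 0, 1.5} under N(0, 1)
x <- c(-1.5, 0, 1.5)
prod(dnorm(x, mean = 0, sd = 1))

# Log-likelihood as a function of mu (sd fixed at 1)
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# Numerical MLE: the maximum is at mu = mean(x) = 0
optimize(loglik, interval = c(-5, 5), maximum = TRUE)$maximum
```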


\(p\): survival rate, \(1-p\): mortality rate
\(L = {}_5C_2 p^2(1-p)^3\) (Binomial distribution)
\(\mathrm{ln}\;L = \mathrm{ln}\;{}_5C_2 + 2\mathrm{ln}\;p + 3\mathrm{ln}(1-p)\)
\(\frac{d}{dp}\mathrm{ln}\;L = \frac{2}{p} - \frac{3}{1-p} = 0\)
Solve: \(p = \frac{2}{5}\)
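A quick numerical check of this analytical result in R:

```r
# Log-likelihood for 2 survivors out of 5 seedlings
loglik <- function(p) dbinom(2, size = 5, prob = p, log = TRUE)

# Numerical maximization agrees with the analytical solution p = 2/5
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
```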

"…rare species suffered more from the presence of conspecific neighbors than common species did, suggesting that conspecific density dependence shapes species abundances in diverse communities."
- Comita et al. 2010 -
NDD can maintain diversity in communities by preventing common species from dominating the community.
Example inspired by Comita et al. 2010 Science and Song & Katabuchi et al. 2024 Ecology
Conspecific density and species abundance are scaled (mean = 0, sd = 1).
H: Rare species tend to be rare because they suffer stronger conspecific NDD than common species.
There is negative density dependence (NDD) of seedling survival rate, and the strength of NDD varies among species. The strength of NDD depends on species abundance.
Example inspired by Nguyen & Katabuchi 2025 Journal of Forestry Research
Wood density is scaled (mean = 0, sd = 1).
H: Dense wood can support taller trees than less dense wood.
There is a power-law relationship (\(y =ax^b\)) between tree diameter (DBH) and tree maximum height, and the power-law exponent varies among species. Those relationships depend on wood density.
Tree height y ~ DBH x (individual-level)
Slope b ~ wood density (species-level)
(1 + cons | sp) → random intercept and slope across species.
We need to extract the slope coefficients for each species from fit_ind and make new_data.
e.g., Chen et al. 2019 Science
cons:abund → species-level predictor (abundance) modifying NDD strength.
(1 + cons | sp) → random intercept and slope across species.
The same formula syntax works in brms and in lme4::glmer().
e.g., Comita et al. 2010 Science; Song & Katabuchi et al. 2024 Ecology; Nguyen & Katabuchi 2025 Journal of Forestry Research
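As a hedged sketch, a brms model along these lines might look as follows; the data frame and column names (seedlings, suv, cons, abund, sp) are assumptions, not taken from the actual analyses:

```r
library(brms)

# Sketch of a multilevel seedling-survival model with an abundance-by-NDD interaction.
# The data frame `seedlings` and its columns (suv, cons, abund, sp) are hypothetical.
fit_nd <- brm(
  suv ~ cons * abund + (1 + cons | sp),  # cons:abund lets NDD strength vary with abundance
  family = bernoulli(),
  data = seedlings
)
summary(fit_nd)
```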
data {
  int<lower=1> N; // number of samples
  int<lower=1> J; // number of species
  int<lower=1> K; // number of tree-level predictors (e.g., CONS, HETS, ...)
  int<lower=1> L; // number of sp-level predictors (e.g., intercept, SLA, ...)
  matrix[N, K] x; // tree-level predictors
  matrix[L, J] u; // sp-level predictors
  array[N] int<lower=0, upper=1> suv; // 1 or 0
  array[N] int<lower=1, upper=J> sp; // species index
}
parameters {
  matrix[K, L] gamma; // coefficients of the sp-level predictors
  matrix[K, J] z; // standardized sp-level deviations
  cholesky_factor_corr[K] L_Omega; // Cholesky factor of the correlation matrix
  vector<lower=0, upper=pi() / 2>[K] tau_unif; // reparameterized scales
}
transformed parameters {
  matrix[K, J] beta; // tree-level coefficients for each species
  vector<lower=0>[K] tau; // scales of the sp-level deviations
  for (k in 1:K) tau[k] = 2.5 * tan(tau_unif[k]); // implies tau ~ half-Cauchy(0, 2.5)
  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z; // non-centered parameterization
}
model {
  vector[N] mu;
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 2.5);
  for (n in 1:N) {
    mu[n] = x[n, ] * beta[, sp[n]];
  }
  suv ~ bernoulli_logit(mu);
}
generated quantities {
  vector[N] log_lik;
  corr_matrix[K] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(suv[n] | x[n, ] * beta[, sp[n]]);
  }
}

\(s_{i,j} \sim \mathcal{B}(p_{i, j})\): likelihood
\(\mathrm{logit}(p_{i,j}) = \boldsymbol{x_{i}} \cdot \boldsymbol{\beta_{j}}\): individual-level model
\(\boldsymbol{\beta_j} = \boldsymbol{\gamma} \cdot \boldsymbol{u_j} + \mathrm{diag}(\boldsymbol{\tau})\cdot \boldsymbol{L} \cdot \boldsymbol{z_j}\): species-level model
\(L \sim \mathrm{LkjCholesky}(\eta)\): prior (?)
\(z_j \sim \mathcal{N}(0, 1)\): prior (?)
\(\tau \sim \mathrm{Cauchy}(0, 2.5)\): prior (?)
\(\gamma_k \sim \mathcal{N}(0, 2.5)\): prior (?)
I won't explain this now, but this is the kind of model we'll build after learning Bayes' theorem, priors, and a bit of linear algebra.
\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]
Conditional probability
Bayes' Theorem
Forward / inverse problems
Revisiting Bayes

Probability of A:
\[ P(A) = \frac{|A|}{|U|} \]
e.g., the probability of rolling a die and getting an odd number is 3/6 = 1/2

Probability of A occurring given B has already occurred:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} \]
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
Both expressions equal \(P(A \cap B)\) after rearranging, so
\[ P(A \mid B) \times P(B) = P(B \mid A) \times P(A) \]
Dividing both sides by \(P(B)\) gives Bayes' theorem:
\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]






Forward problem: If you are infected, what is the probability that the PCR test is positive?
\(P(\text{Positive} \mid \text{Infected})\): 95%
This is what the lab test is designed to estimate: how likely we are to observe a positive signal given the known infection status (e.g., 95 out of 100 infected samples test positive).
Inverse problem (Bayesian): If the PCR test is positive, what is the probability that you are infected?
This isn't intuitive, but it's the probability we actually care about: how likely someone is to be infected, given a positive test result.
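To see why the two probabilities differ, here is a small numerical sketch of Bayes' theorem; only the 95% sensitivity comes from the example above, while the prevalence and specificity are assumed values:

```r
sensitivity <- 0.95  # P(Positive | Infected), from the example above
specificity <- 0.99  # P(Negative | Not infected), assumed
prevalence  <- 0.01  # P(Infected), assumed

# Bayes' theorem: P(Infected | Positive)
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
sensitivity * prevalence / p_positive  # ~0.49, despite the 95% sensitivity
```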

\[ \textcolor{#d95f02}{ P(\mathrm{Parameter} \mid \mathrm{Data}) } = \frac{ \textcolor{#1b9e77} { P(\mathrm{Data} \mid \mathrm{Parameter}) } \times \textcolor{#7570b3} { P(\mathrm{Parameter})} } {P(\mathrm{Data})} \]
\(\textcolor{#d95f02}{P(\mathrm{Parameter} \mid \mathrm{Data})}\): posterior
\(\textcolor{#1b9e77}{P(\mathrm{Data} \mid \mathrm{Parameter})}\): likelihood
\(\textcolor{#7570b3}{P(\mathrm{Parameter})}\): prior
\(P(\mathrm{Data})\): marginal likelihood (normalizing constant)

We compare two coins, A and B, with the same prior but different likelihoods (due to different data sizes) to illustrate how prior beliefs influence posterior estimates.
\(\textcolor{#1b9e77}{L_A = {}_3C_2 p^2 (1-p)^1}\)
\(\textcolor{#1b9e77}{L_B = {}_{100}C_{60} p^{60} (1-p)^{40}}\)
\(\textcolor{#7570b3}{\mathrm{Prior} \propto p^{50} (1-p)^{50}}\)

Posterior \(\propto\) Likelihood \(\times\) Prior
Parameter p that maximizes the posterior of coin A. \(\textcolor{#d95f02}{Post_A}\propto\textcolor{#1b9e77}{p^2 (1-p)^1} \times\textcolor{#7570b3}{p^{50} (1-p)^{50}}\) \(p = 52/103 \approx 0.505\)
Parameter p that maximizes the posterior of coin B. \(\textcolor{#d95f02}{Post_B} \propto \textcolor{#1b9e77}{p^{60} (1-p)^{40}} \times \textcolor{#7570b3}{p^{50} (1-p)^{50}}\) \(p = 110/200 = 0.55\)
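A quick numerical check of both posterior modes in R (the posteriors are unnormalized, so only their shapes matter):

```r
# Unnormalized log-posteriors for coins A and B with the shared prior p^50 (1 - p)^50
log_post_A <- function(p) (2 + 50) * log(p) + (1 + 50) * log(1 - p)
log_post_B <- function(p) (60 + 50) * log(p) + (40 + 50) * log(1 - p)

# Posterior modes match the analytical results 52/103 and 110/200
optimize(log_post_A, interval = c(0, 1), maximum = TRUE)$maximum  # ~0.505
optimize(log_post_B, interval = c(0, 1), maximum = TRUE)$maximum  # ~0.55
```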
We often use weakly informative priors in ecology.


i indexes samples (e.g., individuals), and j indexes groups (e.g., species, sites).
Each observation i is associated with a grid cell (\(m,n\)), so \(\tilde{r}_i = r_{m,n}\).

\[ \begin{aligned} \phi_{m,n} = \frac{1}{8} (& r_{m-1,n-1} + r_{m-1,n} + r_{m-1,n+1} + \\ & r_{m,n-1} + r_{m,n+1} + \\ & r_{m+1,n-1} + r_{m+1,n} + r_{m+1,n+1}) \end{aligned} \]
Using weakly informative priors, we don't have to explore the entire parameter space.
Bayesian methods are inherently designed for hierarchical (multilevel) modeling through priors.
Coaching effects in eight schools.
[1] Rubin, D. B. Estimation in parallel randomized experiments. Journal of Educational Statistics 6, 377–401 (1981).
[2] Gelman, A. et al. Bayesian Data Analysis, Third Edition. (Chapman & Hall/CRC, 2013).
What is the survival rate?
| id | n | suv | p_like |
|---|---|---|---|
| A | 41 | 9 | 0.22 |
| B | 45 | 18 | 0.40 |
| C | 32 | 6 | 0.19 |
| D | 18 | 5 | 0.28 |
| E | 33 | 8 | 0.24 |
| F | 26 | 8 | 0.31 |
| G | 46 | 11 | 0.24 |
| H | 16 | 8 | 0.50 |
p_like is the maximum likelihood estimate of the survival rate (suv / n).
| id | n | suv | p_like | p_true |
|---|---|---|---|---|
| A | 41 | 9 | 0.22 | 0.34 |
| B | 45 | 18 | 0.40 | 0.30 |
| C | 32 | 6 | 0.19 | 0.34 |
| D | 18 | 5 | 0.28 | 0.36 |
| E | 33 | 8 | 0.24 | 0.25 |
| F | 26 | 8 | 0.31 | 0.23 |
| G | 46 | 11 | 0.24 | 0.25 |
| H | 16 | 8 | 0.50 | 0.34 |
p_true ranges [0.23, 0.36]
p_like ranges [0.19, 0.5]
The ML estimates (p_like) show larger variation than the true values because of the small sample sizes (common in ecological studies).

We know p_true because we generated these data using p_true as a known parameter. In practical scenarios, we don't know p_true.


\(S_i \sim \mathcal{B}(N_i, p_i)\): Likelihood
\(\mathrm{logit}(p_i) \sim \mathcal{N}(\mu, \sigma)\): Group-level model (hierarchical prior)
\(\mu \sim \mathcal{N}(0, 2.5)\): Hyperprior
\(\sigma \sim \text{Half-Cauchy}(0, 1)\): Hyperprior
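As a hedged sketch, this partial-pooling model could be written with brms roughly as follows; the data frame `d` and its columns (suv, n, id) are assumed to match the table above:

```r
library(brms)

# Partial-pooling sketch: binomial likelihood, group-level normal on logit(p), hyperpriors.
# Assumes a data frame `d` with columns suv (survivors), n (trials), and id (group).
fit_pool <- brm(
  suv | trials(n) ~ 1 + (1 | id),  # logit(p_i) varies around a common mean
  family = binomial(),
  data = d,
  prior = c(
    prior(normal(0, 2.5), class = Intercept),  # hyperprior on mu
    prior(cauchy(0, 1), class = sd)            # half-Cauchy hyperprior on sigma
  )
)
summary(fit_pool)
```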

The overall survival rate is 0.5 in this figure. We have some sense of a scale for \(\sigma\).


| sp | n | suv | p_like | p_true | p_bayes |
|---|---|---|---|---|---|
| A | 41 | 9 | 0.22 | 0.34 | 0.26 |
| B | 45 | 18 | 0.40 | 0.30 | 0.32 |
| C | 32 | 6 | 0.19 | 0.34 | 0.25 |
| D | 18 | 5 | 0.28 | 0.36 | 0.28 |
| E | 33 | 8 | 0.24 | 0.25 | 0.27 |
| F | 26 | 8 | 0.31 | 0.23 | 0.29 |
| G | 46 | 11 | 0.24 | 0.25 | 0.27 |
| H | 16 | 8 | 0.50 | 0.34 | 0.32 |

Closed symbols (p_bayes) align more closely with the 1:1 line, indicating more accurate estimates.
This model compensates for limited data by using the prior knowledge that tree responses are somewhat similar.
MLE approach (e.g., lme4): \(L(\mu, \sigma \mid S_i, N_i) = \prod_i \int_{-\infty}^{\infty} \mathcal{B}(S_i \mid N_i, p_i) \times \mathcal{N}(\mathrm{logit}(p_i) \mid \mu, \sigma)\, dp_i\)
Analytically (or numerically) find \(\mu\) and \(\sigma\) that maximize \(L\).
MLE often requires numerical integration, and this becomes difficult for complex models.
No prior; estimates can be unstable or biased with small data.
Bayesian approach: \(P(\mu, \sigma \mid S_i, N_i) \propto \prod_i \mathcal{B}(S_i \mid N_i, p_i) \times \prod_i \mathcal{N}(\mathrm{logit}(p_i) \mid \mu, \sigma) \times \mathcal{N}(\mu \mid 0, 2.5) \times \text{Half-Cauchy}(\sigma \mid 0, 1)\)
Use simulations (MCMC) to approximate \(P\).
Bayesian methods work better for complex models because they explore the full distribution instead of just finding a single best value.
Priors help stabilize estimates, especially with limited data.
Why Bayesian estimation is useful.
Why multilevel models are important.
Goal: Approximate the posterior distribution of parameters given the data instead of calculating it directly.
For simple models, we can analytically find parameters that maximize the posterior distribution.
Coin B example again:
\(\textcolor{#d95f02}{Post_B}\propto\textcolor{#1b9e77}{p^{60} (1-p)^{40}} \times\textcolor{#7570b3}{p^{50} (1-p)^{50}}\)
\(p = 110/200 = 0.55\)
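For models without such a closed form, we sample instead. Here is a minimal random-walk Metropolis sketch in R for the coin B posterior; it is an illustration of the idea, not the sampler used later in Stan:

```r
# Unnormalized log-posterior for coin B: p^110 * (1 - p)^90
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)  # outside (0, 1): impossible
  110 * log(p) + 90 * log(1 - p)
}

set.seed(123)
n_iter <- 5000
p_draws <- numeric(n_iter)
p_cur <- 0.5  # starting value

for (t in 1:n_iter) {
  p_prop <- p_cur + rnorm(1, sd = 0.05)            # random-walk proposal
  log_ratio <- log_post(p_prop) - log_post(p_cur)  # acceptance ratio on the log scale
  if (log(runif(1)) < log_ratio) p_cur <- p_prop   # accept, otherwise keep current value
  p_draws[t] <- p_cur
}

mean(p_draws[-(1:1000)])  # posterior mean after burn-in, close to 111/202 ~ 0.55
```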


Sampling \(\theta_1\), \(\theta_2\) jointly is hard: \(p(\theta_1, \theta_2 \mid \text{data}) \propto p(\text{data} \mid \theta_1, \theta_2) \times p(\theta_1, \theta_2)\)
Instead, we sample \(\theta_1\) and \(\theta_2\) sequentially:
\(p(\theta_1 \mid \theta_2, \text{data}) \propto p(\text{data} \mid \theta_1, \theta_2) \times p(\theta_1 \mid \theta_2)\)
\(p(\theta_2 \mid \theta_1, \text{data}) \propto p(\text{data} \mid \theta_1, \theta_2) \times p(\theta_2 \mid \theta_1)\)
\(\theta_1^{(t)} \sim p(\theta_1 \mid \theta^{(t-1)}_2, \text{data})\)
\(\theta_2^{(t)} \sim p(\theta_2 \mid \theta^{(t)}_1, \text{data})\)
Gibbs sampling approximates the joint posterior by sampling from these full conditionals (see the toy sketch below).
Implemented in BUGS and JAGS software.
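As a toy illustration (not from the slides), here is a Gibbs sampler in R for a bivariate normal with correlation rho, where both full conditionals are known normal distributions:

```r
# Toy Gibbs sampler: target is a standard bivariate normal with correlation rho.
# Full conditionals: theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and vice versa.
set.seed(123)
rho    <- 0.8
n_iter <- 5000
theta  <- matrix(NA_real_, nrow = n_iter, ncol = 2)
cur    <- c(0, 0)  # starting values

for (t in 1:n_iter) {
  cur[1] <- rnorm(1, mean = rho * cur[2], sd = sqrt(1 - rho^2))  # draw theta1 | theta2
  cur[2] <- rnorm(1, mean = rho * cur[1], sd = sqrt(1 - rho^2))  # draw theta2 | theta1
  theta[t, ] <- cur
}

cor(theta[, 1], theta[, 2])  # should be close to rho = 0.8
```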


\(H(\theta, r) = U(\theta) + r^2/2\) (Hamiltonian: potential energy + kinetic energy = constant)
Instead of sampling from \(p(\theta \mid \text{data})\), HMC samples from a joint distribution over both \(\theta\) and \(r\).
It's equivalent to sampling from \(p(\theta \mid \text{data}) \propto \text{exp}(-U(\theta))\).
HMC explores very different \(\theta\) along the same trajectory (i.e., the leapfrog path), which is much more efficient than random-walk sampling.
True \(\theta \approx\) 0.77.


Within each iteration, HMC simulates a trajectory using leapfrog steps.
The number of leapfrog steps per trajectory is limited (max_treedepth in Stan). The Hamiltonian stays almost constant along a trajectory, but the momentum \(r\) is resampled each iteration.
Only the final parameter \(\theta\) of each trajectory is kept as a posterior sample.
True \(\theta \approx\) 0.77.

| Feature | Metropolis(-Hastings) | Gibbs Sampling | Hamiltonian Monte Carlo (HMC) |
|---|---|---|---|
| How it works | Try random moves, accept/reject | Update one parameter at a time | Simulate smooth movement using gradients |
| Efficient in high dimensions? | Often slow | Yes, if the math is simple | Very efficient |
| Flexible? | Yes | No (requires conjugate priors) | Yes |
| Typical software | JAGS, custom R | JAGS, BUGS | Stan |

Bayesian Modeling
Flexible and Scalable
Multiplatform
CmdStan: Command line interface (you write Stan code and run it via terminal).
rstan, cmdstanr: R interface to Stan/CmdStan (you write Stan code and run it within R).
brms, rstanarm: Higher-level R packages that depend on rstan (you write R code and run models within R, without writing Stan code).
In this course, we will use cmdstanr and brms.
CmdStan is based on the latest version of Stan, whereas rstan is based on an older version.
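As a hedged sketch of the cmdstanr workflow (the Stan file name "survival.stan" and the data list are hypothetical placeholders):

```r
library(cmdstanr)

# Compile a Stan program and sample from it; file name and data list are hypothetical.
mod <- cmdstan_model("survival.stan")

fit <- mod$sample(
  data = list(J = nrow(d), N = d$n, suv = d$suv),  # must match the model's data block
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  seed = 123
)

fit$summary()  # posterior summaries (means, quantiles, R-hat, ESS)
```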
