---
title: "Nested Sampling with ernest"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Nested Sampling with ernest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r srr-tags, eval = FALSE, echo = FALSE}
#' roxygen_block_name
#' 
#' @srrstats {G1.3} Vignette presents consistent statistical terminology to be
#' used throughout ernest.
#' @srrstats {BS1.2, BS1.2b} Contains examples for specifying priors.
#' @srrstats {BS1.3, BS1.3a} Describes how to save and continue previous runs
#' with `generate()`.
```

This vignette introduces the nested sampling algorithm and demonstrates how to use the ernest R package to estimate model evidence for a Bayesian model. After reading this, you will know more about the foundations of nested sampling, and you will be able to use ernest to perform a basic nested sampling run over a regression model and review the results.

We focus on providing a basic overview of the theory behind nested sampling: For more information, consider the more in-depth explorations offered in @buchner2023 and @ashton2022.

```{r, message=FALSE, echo=FALSE}
library(ernest)
library(brms)
library(posterior)
```

## Bayes' Theorem and Model Evidence

Bayesian inference uses probability to quantify how our knowledge of a statistical model changes as new data become available. Consider a model $M$ with $d$ unknown parameters, $\theta$. The prior distribution, $\Pr(\theta|M) = \pi(\theta)$, encodes beliefs about $\theta$ before seeing the data. Bayes' theorem describes how to update the prior after observing data $D$ [@skilling2006]:

$$
\begin{aligned}
  \Pr(D|M,\theta) \times Pr(\theta|M) &= \Pr(\theta|M,D) \times \Pr(D|M) \\
  \text{Likelihood} \times \text{Prior} &= \text{Posterior} \times \text{Evidence} \\
  \mathbf{L}(\theta) \times \pi(\theta) &= P(\theta) \times \mathbf{Z}
\end{aligned}
$$

The likelihood function, $\mathbf{L}(\theta)$, measures how probable the data $D$ are, given the model and parameter value. Bayesian computation uses $P(\theta)$ and $\mathbf{L}(\theta)$ to estimate the posterior, $P(\theta)$, which reflects our updated knowledge of $\theta$ after observing $D$, and the model evidence, $\mathbf{Z}$.

On its own, $\mathbf{Z}$ is the normalizing constant for the posterior, ensuring $P(\theta)$ integrates to 1. However, evidence also plays a central role in Bayesian model selection: By rearranging Bayes' theorem, we can express $\mathbf{Z}$ as the marginal likelihood, found by integrating $\mathbf{L}(\theta)$ over all possible values in the parameter space $\theta \in \Theta$:

$$
  \mathbf{Z} = \int_{\theta \in \Theta} \mathbf{L}(\theta) \pi(\theta) d\theta
$$

$\mathbf{Z}$ represents the probability of observing the data, averaged across all possible parameter values. This provides a powerful method to compare models: For two models $M_1$ and $M_2$ explaining the same data, their relative plausibility is expressed by their *posterior odds*, which factor into the prior odds and the ratio of their evidences:

$$
  \frac{\Pr(\theta|M_1, D)}{\Pr(\theta|M_2, D)} = \frac{\Pr(\theta|M_1)}{\Pr(\theta|M_2)} \times \frac{\Pr(D|M_1)}{\Pr(D|M_2)}
$$

This ratio, called the *Bayes factor* [@jeffreys1998], quantifies support for one model over another. Bayes factors and related tools, such as posterior model probabilities [@gronau], are central to model selection.

A key challenge in calculating or estimating model evidence comes from our need to perform an integration over a highly-dimensional parameter space: even basic models used by statisticians can have tens of parameters, while hierarchical models can have hundreds or thousands. This makes finding $\mathbf{Z}$ through direct integration computationally unfeasible.

This computational complexity explains why evidence is often neglected in Bayesian computation, with most methods instead focusing on estimating the posterior. Posterior estimates generated with methods such as Markov chain Monte Carlo (MCMC) [@metropolis1953] can be used to estimate $\mathbf{Z}$ through techniques such as bridge sampling [@gronau]. However, these techniques are typically highly sensitive to sampling variation in the posterior estimate, requiring one to provide very dense samples and conduct sensitivity analyses on the produced evidence estimates [@brms].

## Estimating Evidence with Nested Sampling

Nested sampling is a Bayesian computational algorithm for estimating the model evidence integral, introduced by John Skilling [@skilling2004, @skilling2006, @skilling2007]. The goal is divide the parameter space $\Theta$ into a series of nested contours or shells defined by their likelihood values. If we take a fixed $N$ number of samples from $\Theta$ and ordered them such that $L(\theta_1)$ was the point in the sample with the worst likelihood, we can express the volume $V$ of $\Theta$ that contains points with likelihood greater than $\mathbf{L}(\theta_1) = L_1$ as [@skilling2007, @buchner2023]:

$$
\begin{aligned}
    V(L_1) &= \Pr(\mathbf{L}(\theta) > L_1) \\
    &= \int_{\mathbf{L}(\theta) > L_1} \pi(\theta) d\theta
\end{aligned}
$$

We can define the other shell's volumes similarly. Should we create a great many shells across the parameter space, we can rewrite the evidence integral as [@skilling2006]:

\begin{equation}
    \mathbf{Z} = \int_{0}^{1} L(V_i) d V
    \label{eq:evid_ns}
\end{equation}

To estimate the volume of each shell $V_i = V(L_i)$, we can note that the sequence $V_1, V_2, \ldots, V_N$ is strictly decreasing. By the probability integral transform of the survival function $\Pr(\mathbf{L}(\theta) > L_1)$, these volumes follow the standard uniform distribution. From this, we infer that our $N$ samples we used to construct the nested shells divide the parameter space into uniformly distributed volumes along the likelihood function. This allows us to use the uniform order statistics to estimate the volume belonging to the worst sampled point as $V_i \sim \text{Beta}(N, 1)$ [@skilling2004].

## Nested Sampling with `ernest`

Nested sampling creates a series of shells by generating an i.i.d. sample from the prior, then repeatedly replacing the worst point in the sample with a new point from a *likelihood-restricted prior sampler* (LRPS). At each iteration $i$, the worst point is replaced by a new, independently sampled point, such that $\mathbf{L}(\theta_{new}) > \mathbf{L}(\theta_{min})$. Each replacement slightly contracts the parameter space, which we estimate as $V_{i+1}/V_i \sim \text{Beta}(N,1)$ with $V_0 = 1$. Once we have sampled through the bulk of the posterior, we can estimate $\mathbf{Z}$ analytically.

To demonstrate nested sampling in ernest, we fit a well-known model of how seizure counts in patients with epilepsy change when exposed to an anti-convulsant therapy [@thall1990]. This data, available in the brms [@brms] R package, is fit using a Poisson model with a log link:

``` r
> fit1 <- brm(
+   count ~ zAge + zBase * Trt,
+   prior = set_prior("normal(0, 2.5)", class = "b") +
+     set_prior("normal(0, 2.5)", class = "Intercept"),
+   data = epilepsy,
+   family = poisson()
+ )
> fit1
Family: poisson 
Links: mu = log 
Formula: count ~ zAge + zBase * Trt 
  Data: epilepsy (Number of observations: 236) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      1.94      0.04     1.86     2.01 1.00     3221     2717
zAge           0.15      0.03     0.10     0.20 1.00     3405     2480
zBase          0.57      0.02     0.52     0.62 1.00     2541     2374
Trt1          -0.20      0.05    -0.31    -0.09 1.00     3065     2858
zBase:Trt1     0.05      0.03    -0.01     0.11 1.00     2482     2248
```

### Initialization

As with most Bayesian computation, nested sampling requires a likelihood function and prior for the model parameters. In ernest, this is done through the `create_likelihood` and `create_prior` functions. We also need to select an LRPS method to be used during the run.

#### Likelihood

In ernest, model likelihoods are provided as log-density functions, which take a vector of parameter values and return the corresponding log-likelihood. For details on extracting or creating likelihood functions for your models, see the documentation in the enrichwith [@enrichwith] or the tfprobability [@tfprobability] packages.

The following code loads the epilepsy dataset and prepares the design matrix and response vector for use in the likelihood function.

```{r}
data("epilepsy")
frame <- model.frame(count ~ zAge + zBase * Trt, epilepsy)
X <- model.matrix(count ~ zAge + zBase * Trt, epilepsy)
Y <- model.response(frame)
```

We first build a simple function for calculating the log-density of the model. The `poisson_log_lik` function factory takes the design matrix and response vector and returns a function with one argument. This function expects a vector of five regression coefficients and uses them to compute the linear predictor, apply the inverse link, and return the sum of log-likelihoods given the observed data.

```{r}
poisson_log_lik <- function(predictors, response, link = "log") {
  force(predictors)
  force(response)
  link <- make.link(link)
  
  function(theta) {
    eta <- predictors %*% theta
    mu <- link$linkinv(eta)
    sum(dpois(response, lambda = mu, log = TRUE))
  }
}

epilepsy_log_lik <- create_likelihood(poisson_log_lik(X, Y))
epilepsy_log_lik

epilepsy_log_lik(c(1.94, 0.15, 0.57, -0.20, 0.05))
```

`create_likelihood` wraps user provided log-likelihood functions, ensuring that they always return a finite double value or `-Inf` when they are used within a run. By default, ernest will warn the user about any non-compliant values produced by a log-likelihood function, before replacing them with `-Inf`.

```{r}
epilepsy_log_lik(c(1.94, 0.15, 0.57, -0.20, Inf))
```

For improved performance, especially in high dimensions, ernest allows you to provide a vectorized log-likelihood function. This function accepts a matrix of parameters (where each row is a sample) and returns a vector of log-likelihoods. In the future, ernest will take advantage of these functions to improve the efficiency of specific LRPS methods.

```{r}
poisson_vec_lik <- function(predictors, response, link = "log") {
  force(predictors)
  force(response)
  link <- make.link(link)

  function(theta_mat) {
    eta_mat <- predictors %*% t(theta_mat)
    mu_mat <- link$linkinv(eta_mat)
    colSums(apply(mu_mat, 2, \(col) dpois(response, lambda = col, log = TRUE)))
  }
}

epilepsy_vec_lik <- create_likelihood(vectorized_fn = poisson_vec_lik(X, Y))
epilepsy_vec_lik
```

Regardless of whether we provide a vectorized or scalar likelihood function, `create_likelihood` allows us to provide parameter matrices to functions to evaluate multiple log-likelihoods simultaneously.

```{r}
theta_mat <- matrix(
  c(1.94, 0.15, 0.57, -0.20, 0.05, 1.94, 0.0, 0.0, 0.0, 0.0),
  byrow = TRUE,
  nrow = 2
)
epilepsy_vec_lik(theta_mat)
epilepsy_log_lik(theta_mat)
```

#### Prior Distributions

Like other nested sampling software, ernest performs its sampling within a unit hypercube, then transforms these points to the original parameter space using a *prior transformation function*. In ernest, we use the `create_prior` function or one of its specializations to specify our prior transformation as well as the dimensionality of our model.

Priors can be constructed from a custom transformation function and a vector of unique parameter names. Functions should take in a vector of values within the interval $(0, 1)$ and return a same-length vector in the scale of the original parameter space. If the marginals are independent, we can use quantile functions to construct the transformation. The following example defines a custom prior where each element is independently distributed as normal with standard deviation of 2.5. The resulting object is an `ernest_prior`, which contains a tested transformation function and metadata for the sampler.

```{r}
coef_names <- c("Intercept", "zAge", "zBase", "Trt1", "zBase:Trt1")

norm_transform <- function(unit) {
  qnorm(unit, sd = 2.5)
}
custom_prior <- create_prior(norm_transform, names = coef_names)
custom_prior
```

Like `create_likelihood`, you can use the `vectorized_fn` argument to provide a vectorized transformation function. This accepts a matrix of unit hypercube samples and returns a matrix of points in the original parameter space.

ernest also provides built-in helpers for common priors, such as parameter spaces defined by marginally independent (and possibly truncated) normal distributions. We can use this to produce a prior for our model, while taking advantage of its vectorized prior transformation function:

```{r}
model_prior <- create_normal_prior(names = coef_names, sd = 2.5)
model_prior
```

#### Likelihood-Restricted Prior Sampler

A likelihood-restricted prior sampler (LRPS) proposes new points from the prior, subject to the constraint that their likelihood exceeds a given threshold. The choice of LRPS is critical for the efficiency and robustness of nested sampling, as it determines how well the algorithm can explore complex or multimodal likelihood surfaces.

ernest provides several built-in LRPS implementations, each with different strategies for exploring the constrained prior region. These samplers are S3 objects inheriting from `ernest_lrps`, and are specified along with the likelihood and prior objects when initializing a run. Briefly, these options are:

-   `unif_cube()`: Uniform rejection sampling from the entire prior. This is highly inefficient for most problems, but useful for testing and debugging.
-   `unif_ellipsoid(enlarge = ...)`: Uniform sampling within a bounding ellipsoid around the live points, optionally enlarged. This is efficient for unimodal, roughly elliptical posteriors.
-   `multi_ellipsoid(enlarge = ...)`: Uniform sampling within a union of multiple ellipsoids, which can better handle multimodal or non-convex regions.
-   `slice_rectangle(enlarge = ...)`: Slice sampling within a bounding rectangle, with optional enlargement. This can be effective for high-dimensional or correlated posteriors.
-   `rwmh_cube(steps = ..., target_acceptance = ...)`: Random-walk Metropolis-Hastings within the unit hypercube, with configurable step count and acceptance target.

Select and configure an LRPS based on the geometry and complexity of your model's likelihood surface. For most problems, `multi_ellipsoid` or `rwmh_cube` are recommended starting points. We can configure these before providing them to a sampler by adjusting their arguments:

```{r}
multi_ellipsoid()
multi_ellipsoid(enlarge = 1.5)

rwmh_cube()
rwmh_cube(steps = 30, target_acceptance = 0.4)
```

The LRPS interface is extensible, allowing advanced users to implement custom samplers by following the S3 conventions described in the package's internal documentation.

### Generating Samples

To initialize a sampler, call `ernest_sampler` with the likelihood, prior, and LRPS objects. This creates an `ernest_sampler` object, which contains (among other metadata) an initial live set of points ordered by their likelihood values. Calling this function also tests your likelihood and prior transformation functions and reports unexpected behaviour.

```{r}
sampler <- ernest_sampler(
  epilepsy_log_lik,
  model_prior,
  sampler = rwmh_cube(),
  nlive = 300,
  seed = 42
)
sampler
```

The `generate` function runs the nested sampling loop and controls when a run will stop. For example, you can perform 1000 sampling iterations by setting `max_iterations`:

```{r}
run_1k <- generate(sampler, max_iterations = 1000, show_progress = FALSE)
run_1k
```

`generate` produces an `ernest_run` object, which inherits from `ernest_sampler` and contains data describing the run. It is not recommended to overwrite values within this object, but you can explore its internals. For example, ernest calculates the contribution of each shell to the final log-evidence estimate as

$$
w_i = f(L_i) * \Delta V_i
$$

where $\Delta V_i = V_i - V_{i-1}$ and $f(L_i) = (L_{i-1} + L_i)/2$. Summing these unnormalized log-weights gives the model evidence estimate, which we can view by accessing them within the `run_1k` object:

```{r}
run_1k$samples$log_weight |> summary()
```

## Properties of Nested Sampling Runs

Beyond estimating model evidence, nested sampling is substantially different from other Bayesian methods like MCMC. Understanding these differences is important for knowing when and how to use nested sampling in your analyses. This section additionally demonstrates how we can use `ernest_run`'s S3 methods to better understand a run's results.

### Robustness

Traversing and integrating over an arbitrary parameter space $\Theta$ is challenging, especially due to high dimensionality—the main motivation for nested sampling [@skilling2004]. Even in low dimensions, $\Theta$ may have features such as non-convexity or multimodality that hinder exploration [@buchner2023]. Such pathologies often confound MCMC and related Monte Carlo methods [@freeman2013].

Nested sampling avoids many of these problems: by generating samples from the entire prior region using LRPS, it is less sensitive to local behaviour and can globally explore the parameter space, rather than getting stuck on features in the likelihood, such as local maxima.

### Complexity

Nested sampling is computationally intensive, mainly due to its need to explore the entire parameter space. The exact complexity of a run is problem-specific, depending on the LRPS method and the shape of the likelihood-restricted prior at each step; however, @skilling2009 notes that the number of iterations needed to successfully integrate the posterior is proportional to $HN$, where $H$ is the KL divergence between the posterior and prior:

$$
H = D_\text{KL}(P(\theta) | \pi(\theta)) = \sum_{ \Theta } P(\theta) \, \log \frac{ P(\theta) }{ \pi(\theta) }
$$

This gives a rough estimate of complexity; other work refines this for specific likelihoods and priors [see: @allison2014, @skilling2009]. In practice, uninformative or weakly informative priors greatly increase run time, as more iterations are spent traversing uninformative regions.

### Stopping Criteria

We can use naive methods to stop a nested sampling run, such as by setting `max_iterations` or `max_evaluations` when calling `generate`. Fortunately, nested sampling provides a more nuanced method for terminating a run. At each iteration, we can estimate the amount of evidence that remains un-integrated as [@speagle2020]:

\begin{equation}
    \Delta \hat{\mathbf{Z}_{i}} = L_{max} V_i
\end{equation}

This estimate assumes the remaining space can be represented as a single shell with likelihood $L_{max}$ and volume $V_i$. As the run continues, this contribution shrinks. Once it is small relative to the current evidence estimate, the run can be considered to have successfully traversed the informative region of the parameter space. Numerically, this is represented as a log-ratio [@skilling2006]:

\begin{equation}
    \Delta \ln{\epsilon_i} = \ln{(\hat{\mathbf{Z}_i} - \Delta \hat{\mathbf{Z}_i})} - \ln{\hat{\mathbf{Z}_i}}
\end{equation}

Since $\Delta \hat{\mathbf{Z}_i}$ tends to overestimate the remaining contribution, this criterion results in more iterations than strictly necessary, allowing the sampler to better detect irregularities in the likelihood surface.

In ernest, we set this criterion with the `min_logz` argument in `generate`. By default, `generate` will produce samples until `min_logz` falls below $0.05$. We can call `generate` on the earlier-produced `run_1k` object to continue sampling until the run satisfies $\Delta \ln{\epsilon_i} < 0.05$.

```{r}
run <- generate(run_1k, show_progress = FALSE)
run
```

### Uncertainty

Evidence estimates produced by nested sampling have two main sources of error: (a) error in $\Delta V_i$, due to uncertainty in assigning volumes to shells over $P(\theta)$; and (b) error in $f(L_i)$, from using a point estimate for the likelihood across a shell. A key advantage of nested sampling is that both sources of error can be estimated without repeating the sampling procedure.

ernest provides methods for estimating uncertainty due to $\Delta V_i$ with both experimental and analytic means. For the former, ernest reports a "cheap'n'cheerful" [@skilling2009] approximation for the variance in a log-evidence estimate:

$$
\mathbf{V}(\hat{\ln \mathbf{Z}}) = \sqrt{\frac{\hat{H}}{N}}
$$

where $\hat{H}$ is the estimated information, calculated from a nested sampling run:

$$
\hat{H} = -\ln{\mathbf{Z}} + \frac{1}{\mathbf{Z}} \int_0^1 \mathbf{L}(\theta) \ln{\mathbf{L}(\theta)}dx
$$

You can view this uncertainty by calling `summary` on an `ernest_run` object. We can additionally visualize how the log weights at each iteration changed across the run by using the `plot` method.

```{r}
summary(run)
plot(run, which = c("weight", "likelihood"))
```

Additionally, since the shrinkage in $V_i$ between iterations follows $\text{Beta}(N, 1)$, it is relatively inexpensive to simulate different shrinkage sequences to create a bootstrapped statistic. In ernest, we can do this using `calculate`, which has a similar plot method.

```{r}
sim_run <- calculate(run, ndraws = 1000)
sim_run
plot(sim_run, which = c("weight", "likelihood"))
```

Uncertainty within $L_i$ is harder to quantify, as it may arise from systematic errors in LRPS. In the future, ernest will allow for creating simulated runs through bootstrapping: a run is split into $N$ threads (each with one live point), then regrouped into synthetic runs of size $N$ with replacement [@higson2018]. Currently, users interested in bootstrapping `ernest_run` objects are referred to the nestcheck python package [@higson2019].

### Posterior Distributions

Nested sampling, while targeting model evidence, can also produce an estimated posterior distribution as a side-effect. As each point and its log-weight are discarded from the live set, we can estimate the posterior using this dead set of points, weighted by their importance weights:

\begin{align}
    p_i &\approx \frac{w_i}{\sum_{\forall i}{w_i}} \\
    &\approx \frac{w_i}{\hat{Z}}
\end{align}

In ernest, the posterior can be extracted using the `as_draws` object. After reweighting this sample by its importance weights, we can summarize the posterior using the `posterior` package, or produce visualizations.

```{r}
as_draws(run) |>
  resample_draws() |>
  summarise_draws()
visualize(run, .which = "trace")
visualize(run, -Intercept, .which = "density")
```

## Conclusion

Nested sampling is a powerful method for estimating model evidence ($Z$) directly from a proposed model, rather than by treating it as a by-product of posterior estimation. This procedure yields several features that may be more appealing than MCMC in certain scenarios: Runs can handle complexities with the likelihood surface, can halt at clearly defined criteria, and can produce error estimates without replicating the original sampling process.

ernest provides an R-based implementation of the nested sampling, along with methods for visualizing and processing its results. More information on ernest is available in its documentation.

------------------------------------------------------------------------