Tutorial 4 — Statistical Tests and Model Selection#

What you’ll learn:

How to check whether your nested-sampling run has converged
How to compute the effective sample size (ESS) from importance weights
How to calculate goodness-of-fit (reduced χ²) per observation
How to compute AIC and BIC information criteria
How to compare two ForMoSA fits using Bayes factors (log Bayes factor from logZ)
Practical worked examples: fixed vs free parameter, and different atmospheric grids

No fitting required. This tutorial loads pre-computed results from the results/ folder committed alongside this notebook.

Estimated runtime: < 1 minute (no nested sampling).

Prerequisites: ForMoSA v2.0 installed. Tutorial 5 or 7 recommended for context, but not required.

A note on autocorrelation and nested sampling#

Many MCMC tutorials warn you to check chain autocorrelation length as a convergence diagnostic. Do not do this with nested sampling. NS samples are not a Markov chain in the relevant sense — they are nested-shell draws, each weighted by an importance weight that reflects how much posterior probability it carries. Computing autocorrelation on NS samples is conceptually wrong, and the number it produces has no meaningful interpretation for NS convergence.

Use NS-appropriate diagnostics instead: the logZ uncertainty, the effective sample size (ESS), and optionally repeating the run with a different random seed.

Section 0: Setup#

[1]:

import sys
try:
    import ForMoSA
    print(f"ForMoSA {ForMoSA.__version__} — OK")
except ImportError:
    raise ImportError("pip install ForMoSA && conda install dask netCDF4 bottleneck")

import numpy as np
print(f"Python {sys.version.split()[0]}, numpy {np.__version__}")

ForMoSA 2.0.0 — OK
Python 3.11.13, numpy 1.26.4

Section 1: Loading results#

We load the pre-computed results included with this tutorial. The same NSResults object is used throughout all sections below.

If you want to use your own fit, replace RESULTS_JSON with the path to your result_path/NS_results/results_pymultinest.json.

nestle users: replace results_pymultinest.json with results_nestle.json.

[2]:

import json
from pathlib import Path
from ForMoSA.nested_sampling.results import NSResults

RESULTS_JSON = Path(".").resolve() / "results" / "NS_results" / "results_pymultinest.json"

if not RESULTS_JSON.exists():
    raise FileNotFoundError(
        f"Results file not found: {RESULTS_JSON}\n"
        "Make sure you are running this notebook from the plotting/ directory,\n"
        "or update RESULTS_JSON to point to your own result_path."
    )

with open(RESULTS_JSON) as f:
    data = json.load(f)

results = NSResults.from_dict(data)

print(f"Results loaded: {results.samples.shape[0]} total samples")
print(f"Free parameters: {results.free_parameters}")
print(f"Burn-in index: {results.burn_in}")
print(f"Post-burn-in samples: {results.samples.shape[0] - results.burn_in}")

Results loaded: 1063 total samples
Free parameters: ['Teff', 'log(g)', 'rv']
Burn-in index: 0
Post-burn-in samples: 1063

Section 2: Nested-sampling convergence diagnostics#

Before trusting your posterior, check that the sampler has converged. Two numbers tell you most of what you need: the log-evidence uncertainty and the effective sample size.

Section 2.1: Log-evidence and its uncertainty#

results.logz is a two-element list [logZ, logZ_err] produced directly by the NS algorithm (PyMultiNest or UltraNest). logZ is the log of the Bayesian evidence — the integral of the likelihood over the entire prior. logZ_err is the numerical integration uncertainty.

Physically, the evidence answers: “averaging over all physically plausible parameter combinations (the prior), how well does this model explain the data?” A higher logZ means a better-fitting model accounting for its complexity.

Rule of thumb: logZ_err ≲ 0.5 is typical for a well-converged PyMultiNest run with default npoints=500. If it is substantially larger, re-run with more live points (try npoints=1000 or npoints=2000).

[3]:

logZ, logZ_err = results.logz
print(f"log Z = {logZ:.3f} \u00b1 {logZ_err:.3f}")

if logZ_err > 0.5:
    print("WARNING: logZ_err > 0.5 — consider re-running with more live points.")
else:
    print("logZ_err looks good (\u2264 0.5).")

log Z = -780.562 ± 0.366
logZ_err looks good (≤ 0.5).

Section 2.2: Effective sample size (ESS)#

The posterior is a weighted sample — each draw has an importance weight that reflects how much probability it carries. A few high-weight samples can dominate the posterior even if you technically have tens of thousands of raw draws.

The effective sample size corrects for this. It answers: how many independent, equally-weighted samples would give the same statistical resolution as my weighted set? The formula is:

\[\mathrm{ESS} = \frac{1}{\sum_i \tilde{w}_i^2}\]

where \(\tilde{w}_i = w_i / \sum_j w_j\) are the normalised weights. A sample with one weight equal to 1 and all others zero has ESS = 1 (useless). A perfectly uniform set of \(N\) weights has ESS = N (ideal).

Interpretation:

ESS > 1000 → posterior quantiles are well-resolved at the 1σ level
ESS 100–1000 → medians are reliable; 2σ tails are noisy
ESS < 100 → re-run with more live points; your posterior is dominated by a few samples

[4]:

w      = results.weights[results.burn_in:]
w_norm = w / w.sum()

ess        = 1.0 / np.sum(w_norm**2)
efficiency = ess / len(w)

print(f"Raw post-burn-in samples: {len(w)}")
print(f"ESS:                      {ess:.0f}")
print(f"Sampling efficiency:      {efficiency:.2%}")

if ess < 100:
    print("WARNING: ESS < 100. Re-run with more live points.")
elif ess < 1000:
    print("NOTE: ESS in 100\u20131000 range. Medians are reliable; tail quantiles may be noisy.")
else:
    print("ESS looks good (> 1000).")

Raw post-burn-in samples: 1063
ESS:                      284
Sampling efficiency:      26.71%
NOTE: ESS in 100–1000 range. Medians are reliable; tail quantiles may be noisy.

Section 2.3: The most honest convergence check — re-run with a different seed#

The most rigorous convergence test is to run the same fit twice with different random-number-generator seeds and compare the results. If logZ values agree within their stated uncertainties, and the posterior medians agree within ~ 0.5σ, the run has converged. There is no programmatic shortcut — just run the fit twice and compare.

To change the PyMultiNest seed, set the seed argument in your ConfigPyMultiNest dataclass (check the ForMoSA API docs for the exact field name). For UltraNest, use seed in ConfigUltraNest.

Section 3: Goodness of fit — reduced χ²#

Reduced χ² measures how well the best-fit model explains the data, accounting for the number of free parameters.

\[\chi^2_\nu = \frac{\chi^2}{N - k}\]

where \(N\) is the total number of data points across all observations, \(k\) is the number of free parameters, and \(\chi^2 = \sum_i \left(\frac{d_i - m_i}{\sigma_i}\right)^2\).

Interpretation:

\(\chi^2_\nu \approx 1\) — fit is consistent with the stated error bars
\(\chi^2_\nu \gg 1\) — model underfits (bad model or underestimated errors)
\(\chi^2_\nu \ll 1\) — errors are overestimated, or model has too many parameters

Caveat: χ² is only meaningful for likelihoods of the form chi2 or chi2_noisescaling. For CCF-based likelihoods (e.g. HiRISE in MOSAIC mode), residuals do not reduce to standard χ² — skip or restrict this section to your χ²-likelihood observations only.

[5]:

# Standalone version: compute chi2_red from the weighted posterior log-likelihood.
# This works with just NSResults (no Analysis object needed).
#
# This is an approximation: it uses the *weighted average* log-likelihood from the
# posterior, not the true maximum-likelihood point. It is fast and useful for a
# quick sanity check. The per-observation breakdown below (which needs Analysis)
# is more informative for diagnosing multi-instrument fits.

n_data = 450    # approximate number of data points — update this to your actual count
                # (number of spectral bins after masking/cropping your observation)
n_free = len(results.free_parameters)

logl_post = results.logl[results.burn_in:]
w_post    = results.weights[results.burn_in:]
best_logL = np.average(logl_post, weights=w_post)

chi2      = -2 * best_logL
chi2_red  = chi2 / (n_data - n_free)

print(f"n_data   = {n_data}  (update this to your actual number of spectral bins)")
print(f"n_free   = {n_free}")
print(f"-2 log L = {chi2:.2f}")
print(f"\u03c7\u00b2_red   = {chi2_red:.3f}")

n_data   = 450  (update this to your actual number of spectral bins)
n_free   = 3
-2 log L = 1535.07
χ²_red   = 3.434

[6]:

# Full per-observation breakdown — requires a loaded Analysis object.
# See Tutorial 7, Section 1, Method C for how to load one.
# Uncomment after loading analysis and ns_analysis.

# chi2_total = 0.0
# ndata      = 0

# for i, obs in enumerate(analysis.ns.restricted_observations):
#     model_flux = ns_analysis.best_fit[i].flux
#     residuals  = (obs.flux - model_flux) / obs.err
#     chi2_i     = np.sum(residuals**2)
#     n_i        = len(obs.flux)

#     chi2_total += chi2_i
#     ndata      += n_i

#     print(f"{obs.name:20s}  chi2={chi2_i:10.2f}  N={n_i:5d}  chi2/N={chi2_i/n_i:.3f}")

# n_free   = len(results.free_parameters)
# dof      = ndata - n_free
# chi2_red = chi2_total / dof
# print()
# print(f"Total chi2 = {chi2_total:.2f}")
# print(f"DOF        = {dof}")
# print(f"chi2_red   = {chi2_red:.3f}")
print("Per-observation chi2: uncomment after loading analysis via Method C (Tutorial 7).")

Per-observation chi2: uncomment after loading analysis via Method C (Tutorial 7).

Section 4: Information criteria — AIC and BIC#

Information criteria penalise model complexity to prevent overfitting. Lower values are better for both. Use them to compare two fits to the same dataset with different numbers of free parameters (e.g. fixing vs freeing logg).

\[\mathrm{AIC} = 2k - 2\ln\hat{L}\]

\[\mathrm{BIC} = k\ln n - 2\ln\hat{L}\]

where \(k\) is the number of free parameters, \(n\) is the number of data points, and \(\hat{L}\) is the maximum likelihood.

AIC vs BIC:

AIC penalises complexity weakly. It tends to prefer slightly richer models.
BIC penalises complexity proportionally to \(\ln n\). For large datasets (\(n \gtrsim 8\)), BIC penalises more harshly than AIC.
In practice, report both and note if they agree.

ΔAIC interpretation (Burnham & Anderson 2002, Model Selection and Multimodel Inference, 2nd ed., Springer):

ΔAIC	Support for the higher-AIC model
0 – 2	Substantial (models nearly equivalent)
4 – 7	Considerably less
> 10	Essentially none

[7]:

# Best log-likelihood: use the maximum over the post-burn-in chain.
# (Using the max rather than the weighted average is standard for AIC/BIC.)
best_logL = results.logl[results.burn_in:].max()

k = len(results.free_parameters)
n = n_data   # from Section 3 above — update to your actual value

AIC = 2*k - 2*best_logL
BIC = k * np.log(n) - 2*best_logL

print(f"best log L = {best_logL:.3f}")
print(f"k = {k},  n = {n}")
print(f"AIC = {AIC:.2f}")
print(f"BIC = {BIC:.2f}")
print()
print("To compare two fits, compute \u0394AIC = AIC_A - AIC_B (positive means B is preferred).")

best log L = -766.312
k = 3,  n = 450
AIC = 1538.62
BIC = 1550.95

To compare two fits, compute ΔAIC = AIC_A - AIC_B (positive means B is preferred).

Section 5: Bayesian model comparison — Bayes factors#

The Bayes factor compares two models by their marginal likelihoods (= the Bayesian evidence, Z). Nested sampling gives you logZ directly, so the log Bayes factor falls straight out:

\[\ln B_{12} = \ln Z_1 - \ln Z_2\]

A positive value means model 1 is favoured. The uncertainty propagates in quadrature:

\[\sigma_{\ln B} = \sqrt{\sigma_{\ln Z_1}^2 + \sigma_{\ln Z_2}^2}\]

This is a major advantage of nested sampling over standard MCMC — evidence estimation is a first-class output of the algorithm, not a post-hoc approximation.

Jeffreys / Kass-Raftery interpretation scale:

\(\ln B_{12}\)	\(2\ln B_{12}\)	Evidence for model 1
0 – 1	0 – 2	Not worth more than a mention
1 – 3	2 – 6	Positive
3 – 5	6 – 10	Strong
> 5	> 10	Very strong

Sources (cite whichever convention you adopt):

Kass & Raftery (1995), JASA 90, 773. DOI: 10.1080/01621459.1995.10476572
Jeffreys (1961), Theory of Probability, Oxford Univ. Press, 3rd ed.
Trotta (2008), Contemp. Phys. 49, 71 (astronomy focus). DOI: 10.1080/00107510802066753

The exact thresholds vary slightly across sources. Pick one convention and cite it consistently.

[8]:

# Template for computing log B from two completed fits.
# Replace the paths with your own result_path locations.

# results_json_1 = Path(".").resolve() / "path/to/model_1/NS_results/results_pymultinest.json"
# results_json_2 = Path(".").resolve() / "path/to/model_2/NS_results/results_pymultinest.json"

# results_1 = NSResults.from_dict(json.load(open(results_json_1)))
# results_2 = NSResults.from_dict(json.load(open(results_json_2)))

# logB     = results_1.logz[0] - results_2.logz[0]
# logB_err = np.hypot(results_1.logz[1], results_2.logz[1])
# print(f"ln B_12 = {logB:.2f} \u00b1 {logB_err:.2f}")
# if logB > 5:
#     print("Very strong evidence for model 1.")
# elif logB > 3:
#     print("Strong evidence for model 1.")
# elif logB > 1:
#     print("Positive evidence for model 1.")
# elif logB > -1:
#     print("Inconclusive.")
# else:
#     print("Evidence favors model 2.")
print("Bayes factor template: fill in your paths and uncomment.")

Bayes factor template: fill in your paths and uncomment.

Section 5.1: Bayes factors are prior-dependent — always sanity-check#

logZ integrates the likelihood over the entire prior volume. If two models have very different prior ranges (or one model has extra parameters with wide, unconstrained priors), the Bayes factor reflects the Occam penalty from the unused prior volume — not necessarily a real fit-quality difference.

A concrete example: suppose you run the same fit twice, but model 2 has a logg prior of [0, 10] while model 1 has [2.5, 5.5]. If logg converges to the same value in both fits, model 1 will have a higher logZ simply because it “bet” on a smaller region that turned out to be correct. That is mathematically valid Occam’s razor, but it’s worth being explicit about it in your paper.

Always cross-check against χ² and information criteria. If logB_12 says “model 1 strongly preferred” but chi2_red and AIC say the two are nearly identical, the Bayes factor is penalising model 2’s broader prior — worth noting.

For a clean model comparison:

Use the same free-parameter set with the same prior ranges across both models (differ only in the thing you’re testing).
Use identical wavelength coverage and data points.
Report logZ, logZ_err, chi2_red, and AIC/BIC together.

Section 6: Practical use cases#

The two subsections below show the complete workflow for the most common model-comparison scenarios in atmospheric retrieval.

Section 6.1: Fixed vs free parameter (e.g. surface gravity `logg`)#

A common question in atmospheric retrieval: does the spectrum actually constrain logg, or are we just fitting noise? Answer it by running two ForMoSA fits on the same data:

Fixed fit: logg set as a constant parameter at a literature value.
Free fit: logg given a uniform prior over a physically reasonable range.

Then compare logZ values.

Reading the result:

logB > 3 and the logg posterior is well-constrained (narrow, away from prior edges) → the data constrain logg; keep it free.
logB ≈ 0 and the logg posterior spans most of the prior range → the data don’t constrain logg; fixing it is justified and the simpler model is preferred on Occam grounds.
logB < -3 (fixed strongly preferred) → check your prior range on the free fit; a too-wide prior penalises the free model unfairly and may not be physically motivated.

[9]:

# Template. Fill in your result paths.

# results_json_fixed = Path(".").resolve() / "path/to/fixed_logg/NS_results/results_pymultinest.json"
# results_json_free  = Path(".").resolve() / "path/to/free_logg/NS_results/results_pymultinest.json"

# results_fixed = NSResults.from_dict(json.load(open(results_json_fixed)))
# results_free  = NSResults.from_dict(json.load(open(results_json_free)))

# logB     = results_free.logz[0] - results_fixed.logz[0]
# logB_err = np.hypot(results_free.logz[1], results_fixed.logz[1])
# print(f"ln B(free vs fixed logg) = {logB:.2f} \u00b1 {logB_err:.2f}")

# # Show the logg posterior from the free fit to check if it's constrained
# if 'logg' in results_free.free_parameters:
#     j      = results_free.free_parameters.index('logg')
#     s      = results_free.samples[results_free.burn_in:, j]
#     w      = results_free.weights[results_free.burn_in:]
#     w_norm = w / w.sum()
#     med    = np.average(s, weights=w_norm)
#     lo     = np.percentile(s, 16)
#     hi     = np.percentile(s, 84)
#     print(f"logg (weighted median):    {med:.2f}")
#     print(f"logg (16th\u201384th perc.):    [{lo:.2f}, {hi:.2f}]")
print("Section 6.1: fill in your result paths and uncomment.")

Section 6.1: fill in your result paths and uncomment.

Section 6.2: Comparing atmospheric model grids (e.g. BT-Settl vs Sonora-Bobcat)#

Run the same observations through two different atmospheric grids and compare their logZ values. This tests which grid’s physical assumptions better explain your data.

Important caveats before comparing grids:

Same free-parameter set, same prior ranges. If grid A uses [M/H] and grid B uses [Fe/H] with a different prior range, the evidence difference reflects prior volume, not fit quality. Standardise the parameter names and prior bounds.
Identical wavelength coverage. logZ scales with the number of data points. Fitting different wavelength windows makes logZ values incomparable — always confirm that both fits used the same cropped/masked spectrum.
Also compare χ²_red. Grid differences often show up most clearly in how well the model reproduces individual spectral lines — that is easier to see in χ² than in logZ, which integrates over the whole prior volume.

[10]:

# Template. Fill in your result paths.

# results_json_A = Path(".").resolve() / "path/to/grid_A/NS_results/results_pymultinest.json"
# results_json_B = Path(".").resolve() / "path/to/grid_B/NS_results/results_pymultinest.json"

# results_A = NSResults.from_dict(json.load(open(results_json_A)))
# results_B = NSResults.from_dict(json.load(open(results_json_B)))

# logZ_A, errA = results_A.logz
# logZ_B, errB = results_B.logz

# print(f"log Z (grid A) = {logZ_A:.2f} \u00b1 {errA:.2f}")
# print(f"log Z (grid B) = {logZ_B:.2f} \u00b1 {errB:.2f}")
# print(f"ln B_AB        = {logZ_A - logZ_B:.2f}  (positive favors A)")
print("Section 6.2: fill in your result paths and uncomment.")

Section 6.2: fill in your result paths and uncomment.

Section 7: Reporting checklist#

When reporting model selection in a paper, give all four numbers together. They tell different stories, and a referee will ask for any you omit:

Quantity	What it tells you
`logZ ± logZ_err`	Bayesian evidence; accounts for prior complexity
`chi2_red`	Goodness of fit; intuitive and comparable across papers
`AIC`	Fit quality penalised weakly for complexity
`BIC`	Fit quality penalised strongly (preferred when n is large)

Also report:

The prior ranges you used (logZ is prior-dependent; readers need to reproduce it)
The number of live points (npoints) and the algorithm (PyMultiNest / UltraNest / nestle)
The ESS, so readers know how well the posterior is resolved

Section 8: Next steps#

Tutorial 7 — Advanced Plotting: For customising corner, radar, chains, and best-fit plots with ForMoSA’s config system and direct matplotlib post-processing.
API docs: docs/api/ for NSResults, NSAnalysis, and all config dataclasses.

Tutorial 4 — Statistical Tests and Model Selection

Contents