Probability
Probability theory provides the mathematical foundation for quantifying uncertainty. While classical frequentist approaches treat probability as the long-run frequency of repeatable events, Bayesian probability reframes it as a dynamic measure of belief. Through Bayes’ Theorem, initial prior assumptions are systematically updated with incoming evidence to compute a posterior distribution, enabling rigorous inference even with sparse or evolving data.
Standard probabilistic models map correlations rather than causality. While observational probability can determine the likelihood of variables co-occurring, causal inference — often formalized through structural causal models — is required to understand directional influence. This distinction is critical: calculating the likelihood of observing a specific system state requires entirely different mathematical machinery than predicting the outcome of an intervention upon that system.
The source code for this chapter is in the directory source-code/probability.
Words of Warning
Professor Carissa Véliz says in her book “Prophecy” that when you read a percentage you should first ask yourself if you are being told a fact or a prediction. If a percentage is a prediction, consciously tag it as “not a fact.”
The danger of conflating the two lies in the illusion of precision that numbers naturally provide. A percentage representing a historical measurement is a grounded, verifiable reality. A predictive percentage is fundamentally an artifact of a specific model — heavily dependent on the chosen priors, the limits of the training data, and the structural assumptions baked into the algorithm. Consciously tagging a predictive percentage as “not a fact” forces a shift from passive acceptance to active critique: What variables is the model blind to? How fragile is this prediction to out-of-distribution events?
Glossary of Terms
Before diving into the library and the worked examples, here is a reference for the statistical vocabulary used throughout this chapter.
Prior (prior probability) — Your initial belief about how likely a hypothesis is before you observe any new evidence. In the medical example below, the prior probability of disease is the prevalence rate (0.1 %). Priors can be informative (based on domain knowledge) or uninformative (deliberately vague).
Posterior (posterior probability) — Your updated belief about a hypothesis after incorporating observed evidence via Bayes’ Theorem: P(Hypothesis | Data). In the medical example, the posterior probability of disease given a positive test is approximately 1.9 %.
Likelihood — The probability of observing the evidence assuming a specific hypothesis is true: P(Evidence | Hypothesis). In the medical example, the likelihood of a positive test result given disease is 0.99 (the sensitivity).
Marginal likelihood (evidence) — The total probability of the observed evidence across all hypotheses: Σ P(Evidence | H) · P(H). It acts as the normalising constant in Bayes’ Theorem.
Bayes’ Theorem — The mathematical rule connecting prior, likelihood, and posterior: P(H | E) = P(E | H) · P(H) / P(E).
Maximum a posteriori (MAP) — The hypothesis with the highest posterior probability — the Bayesian “best guess.”
Prevalence (base rate) — The proportion of a population that has a particular condition. When the base rate is very low, even a highly accurate test produces many false positives relative to true positives.
Sensitivity (true-positive rate) — The probability that a test correctly identifies a positive case: P(Test+ | Condition+).
Specificity (true-negative rate) — The probability that a test correctly identifies a negative case: P(Test− | Condition−).
False-positive rate — The probability that a test incorrectly flags a healthy individual as positive: P(Test+ | Condition−).
Positive predictive value (PPV) — Among everyone who tested positive, the fraction who actually have the condition: TP / (TP + FP).
Z-score (standard score) — The number of standard deviations a data point lies from the mean: z = (x − μ) / σ.
P-value — The probability of observing data at least as extreme as what was measured, assuming the null hypothesis is true. Crucially, the p-value is not the probability that the hypothesis is true or false.
Null hypothesis (H₀) — The default assumption of “no effect” that a frequentist test tries to reject.
Chi-squared test — A test that compares observed counts against expected counts: Σ (O − E)² / E.
Confidence interval (CI) — A frequentist range estimate. A 95 % CI means: if you repeated the experiment many times, 95 % of the computed intervals would contain the true parameter.
Wilson score interval — A method for computing a confidence interval for a binomial proportion that is more accurate than the simple Wald interval, especially for small samples or proportions near 0 or 1.
Pearson correlation coefficient (r) — A measure of linear association between two variables, ranging from −1 to +1. It measures association, not causation.
Spearman rank correlation (ρ) — A non-parametric measure of monotonic association. More robust to outliers than Pearson-r.
A SWI-Prolog Library to Explore Probability
Design
The library has four parts spanning both Bayesian and frequentist reasoning:
- Bayesian Inference (bayes.pl): model construction from hypothesis-probability pairs, Bayes’ Theorem updates via update/4, posterior queries, and MAP estimation.
- Correlation helpers (correlation.pl): Pearson-r, Spearman rank correlation, and correlation matrices. These functions explicitly measure association, not causation.
- Frequentist Statistics (frequentist.pl): z-tests, chi-squared tests, and Wilson confidence intervals for classical hypothesis testing.
- Worked examples: medical_example.pl demonstrates Bayesian reasoning on a screening test; frequentist_demo.pl revisits the same scenario from the frequentist standpoint.
File layout
    probability/
    ├── prolog/
    │   ├── bayes.pl              # Bayesian toolkit
    │   ├── correlation.pl        # Correlation toolkit
    │   ├── frequentist.pl        # Frequentist toolkit
    │   ├── medical_example.pl    # Bayesian worked example
    │   └── frequentist_demo.pl   # Frequentist worked example
    ├── tests/
    │   └── test_probability.pl
    ├── load.pl
    ├── Makefile
    └── README.md
Running the examples
    $ cd source-code/probability
    $ make run
    === Running Bayesian medical screening example ===

    === Bayesian Analysis: Medical Screening Test ===
    Prior probabilities:
    P(disease) = 0.0010 (0.10 %)
    P(healthy) = 0.9990 (99.90 %)

    After a POSITIVE test result:
    P(disease) = 0.0194 (1.94 %)
    P(healthy) = 0.9806 (98.06 %)

    MAP hypothesis: healthy

    Key insight: despite 99% sensitivity, a positive test
    only yields about 1.9% probability of disease because the
    disease is so rare (0.1% prevalence).

    === Correlation Analysis (N = 100000) ===
    Pearson r(test-result, disease) = 0.1349

    === Done. ===
Walking Through the Bayesian Code
The Bayes Model
The core data structure is a normalised list of Hypothesis-Probability pairs. The constructor ensures priors sum to one:
    make_bayes_model(PriorPairs, Model) :-
        maplist(pair_value, PriorPairs, Priors),
        sumlist(Priors, Total),
        (   Total =:= 0
        ->  throw(error(all_priors_zero,
                        'All priors are zero — cannot normalise.'))
        ;   maplist(normalise_pair(Total), PriorPairs, Model)
        ).

    pair_value(_-V, V).

    normalise_pair(Total, H-P, H-NP) :-
        NP is P / Total.
Prolog’s - operator creates pairs naturally: disease-0.001 is the term -(disease, 0.001). The maplist/normalise_pair pattern relies on partial application: normalise_pair(Total) is a goal with one argument already bound, and maplist supplies the remaining two.
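As a quick illustration of the normalisation step, the sketch below repeats the two helpers from the listing and feeds them unnormalised weights. The wrapper name normalise_weights/2 and the weights rain-3.0 and dry-1.0 are made up for this example; they are not part of the chapter's library.

```prolog
:- use_module(library(apply)).   % maplist/3
:- use_module(library(lists)).   % sum_list/2

% Helpers repeated from the listing above so the snippet loads on its own.
pair_value(_-V, V).

normalise_pair(Total, H-P, H-NP) :-
    NP is P / Total.

% normalise_weights(+Pairs, -Model): hypothetical wrapper for illustration.
% maplist/3 appends one input pair and one output pair to the partial goal
% normalise_pair(Total), so each call is normalise_pair(Total, In, Out).
normalise_weights(Pairs, Model) :-
    maplist(pair_value, Pairs, Weights),
    sum_list(Weights, Total),
    Total =\= 0,
    maplist(normalise_pair(Total), Pairs, Model).

% ?- normalise_weights([rain-3.0, dry-1.0], Model).
% Model = [rain-0.75, dry-0.25].
```

The weights do not need to sum to one on input; only their ratios matter.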
Updating with Evidence
The update/4 predicate applies Bayes’ Theorem. It uses Prolog’s meta_predicate facility to accept a likelihood predicate as a higher-order argument:
    :- meta_predicate update(+, +, 2, -).

    %% update(+Model, +Evidence, :LikelihoodPred, -Updated)
    %% LikelihoodPred is a predicate of arity 2: LikelihoodPred(Hypothesis, P)
    %% that binds P to P(Evidence | Hypothesis) when called.
    %% Evidence is passed for documentation but not used directly.
    %% Example: update(Model, positive, my_lik, Updated)
    %% where my_lik(disease, 0.99) and my_lik(healthy, 0.05) are defined.
    update(Model, _Evidence, LikelihoodPred, Updated) :-
        maplist(unnormalised_posterior(LikelihoodPred), Model, Unnormalised),
        maplist(pair_value, Unnormalised, UnnormProbs),
        sumlist(UnnormProbs, Marginal),
        (   Marginal =:= 0.0
        ->  throw(error(zero_marginal,
                        'Marginal likelihood is zero — evidence impossible under all hypotheses.'))
        ;   maplist(normalise_pair(Marginal), Unnormalised, Updated)
        ).

    unnormalised_posterior(LikelihoodPred, H-Prior, H-UPost) :-
        call(LikelihoodPred, H, Lik),
        UPost is Lik * Prior.
The call/3 invocation is the key: call(LikelihoodPred, H, Lik) calls the likelihood predicate with the hypothesis and binds the likelihood value. This is Prolog’s idiomatic higher-order pattern — the meta_predicate declaration ensures proper module resolution when the likelihood predicate is defined in a different module.
The Marginal variable is the denominator in Bayes’ Theorem — dividing by it gives proper posterior probabilities that sum to one.
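Because the updated model has the same shape as the input model, a posterior can serve as the prior for the next piece of evidence. The sketch below is a stripped-down, single-file stand-in for update/4 (no module handling, no zero-marginal check). It assumes two positive results that are conditionally independent given disease status, and the names bayes_step/3, lik/2, weigh/3, scale/3 and two_positives/1 are hypothetical.

```prolog
:- use_module(library(apply)).   % maplist/3
:- use_module(library(lists)).   % sum_list/2

% lik(?Hypothesis, -P): P(positive test | Hypothesis), values from the chapter.
lik(disease, 0.99).
lik(healthy, 0.05).

pair_value(_-V, V).

% Multiply each prior by its likelihood (numerator of Bayes' Theorem).
weigh(Lik, H-Prior, H-U) :-
    call(Lik, H, L),
    U is L * Prior.

% Divide by the marginal so the posteriors sum to one.
scale(Marginal, H-U, H-P) :-
    P is U / Marginal.

% bayes_step(+Model, +Lik, -Updated): one update, assuming Marginal > 0.
bayes_step(Model, Lik, Updated) :-
    maplist(weigh(Lik), Model, Unnorm),
    maplist(pair_value, Unnorm, Us),
    sum_list(Us, Marginal),
    maplist(scale(Marginal), Unnorm, Updated).

% Chain two updates: the first posterior becomes the second prior.
two_positives(Posterior) :-
    Prior = [disease-0.001, healthy-0.999],
    bayes_step(Prior, lik, AfterFirst),
    bayes_step(AfterFirst, lik, Posterior).

% ?- two_positives(P).
% P(disease) is about 0.019 after one positive and about 0.28 after two.
```

Repeat tests on the same patient may well be correlated, in which case the second update overstates the evidence; the independence assumption is there only to keep the sketch short.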
The Medical Screening Example
The worked example makes Bayes’ Theorem concrete. A rare disease affects 0.1 % of the population. A screening test has 99 % sensitivity and a 5 % false-positive rate. A patient tests positive — what is the probability they are actually sick?
    prevalence(0.001).
    sensitivity(0.99).
    false_positive_rate(0.05).

    likelihood(disease, P) :- sensitivity(P).
    likelihood(healthy, P) :- false_positive_rate(P).

    run_bayesian_analysis :-
        prevalence(Prev),
        Healthy is 1.0 - Prev,
        make_bayes_model([disease-Prev, healthy-Healthy], Prior),
        update(Prior, positive_test, likelihood, Updated),
        ...
The likelihood predicate is a clean two-clause definition — one clause per hypothesis. Prolog’s pattern matching dispatches to the correct clause automatically. Passing likelihood (the predicate name) to update/4 lets the Bayes engine call it via call/3 for each hypothesis.
The answer is approximately 1.9 %. Despite 99 % sensitivity, the disease is so rare that the vast majority of positive results come from the 5 % false-positive rate applied to the enormous healthy population.
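Plugging the chapter's numbers into Bayes’ Theorem makes the arithmetic explicit, writing D for disease and + for a positive test:

```latex
\begin{aligned}
P(D \mid +) &= \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} \\[4pt]
            &= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \\[4pt]
            &= \frac{0.00099}{0.00099 + 0.04995}
             = \frac{0.00099}{0.05094} \approx 0.0194
\end{aligned}
```

In a population of 100,000 this corresponds to an expected 99 true positives against 4,995 false positives.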
Correlation Analysis
The example also generates a synthetic population and computes the Pearson correlation between test results and disease status:
    generate_synthetic_population(N, Tests, Diagnoses) :-
        prevalence(Prev), sensitivity(Sens), false_positive_rate(FPR),
        length(Tests, N), length(Diagnoses, N),
        maplist(simulate_individual(Prev, Sens, FPR), Tests, Diagnoses).

    simulate_individual(Prev, Sens, FPR, Test, Diag) :-
        random(R1),
        (   R1 < Prev
        ->  Diag = 1.0, random(R2), (R2 < Sens -> Test = 1.0 ; Test = 0.0)
        ;   Diag = 0.0, random(R3), (R3 < FPR -> Test = 1.0 ; Test = 0.0)
        ).
The Pearson-r is positive but modest (~0.13). This illustrates a crucial point: a statistically real association does not translate into reliable individual prediction. You need Bayesian reasoning with the base rate for that.
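The size of this correlation can also be checked without simulation. For two binary variables, Pearson-r reduces to the phi coefficient, which depends only on the joint and marginal probabilities. Writing T for a positive test and D for disease, with P(T) = 0.05094 and P(T, D) = 0.99 × 0.001 = 0.00099 from the screening example:

```latex
\begin{aligned}
r &= \frac{P(T, D) - P(T)\,P(D)}
          {\sqrt{P(T)\,\bigl(1 - P(T)\bigr)\,P(D)\,\bigl(1 - P(D)\bigr)}} \\[4pt]
  &= \frac{0.00099 - 0.05094 \times 0.001}
          {\sqrt{0.05094 \times 0.94906 \times 0.001 \times 0.999}} \\[4pt]
  &\approx \frac{0.000939}{0.006950} \approx 0.135
\end{aligned}
```

The simulated value of 0.1349 sits close to this theoretical figure, as expected for a sample of 100,000.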
The Correlation Module
The Pearson correlation coefficient measures linear association. The implementation uses Prolog’s maplist for the element-wise cross-deviation products:
    pearson_r(Xs, Ys, R) :-
        length(Xs, N),
        length(Ys, N),              % fails unless both lists have the same length
        list_mean(Xs, MX),
        list_mean(Ys, MY),
        list_std_dev(Xs, SX),
        list_std_dev(Ys, SY),
        (   (SX =:= 0 ; SY =:= 0)
        ->  R = 0.0                 % a constant list has no defined correlation; report 0.0
        ;   maplist(cross_dev(MX, MY), Xs, Ys, Prods),
            sumlist(Prods, SumProd),
            % dividing by N matches a population (divide-by-N) standard deviation,
            % which is assumed here for list_std_dev/2
            R is SumProd / (N * SX * SY)
        ).

    cross_dev(MX, MY, X, Y, P) :-
        P is (X - MX) * (Y - MY).
The Spearman rank correlation converts values to ranks (handling ties by averaging) and then computes Pearson-r on those ranks. This makes it robust to outliers and non-linear but monotonic relationships.
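The ranking step can be sketched in a few lines. This is an illustrative version, not the code in correlation.pl: the names average_ranks/2 and rank_in/3 are hypothetical, the approach is quadratic in the list length, and it assumes the values are all integers or all floats (1 and 1.0 would not be treated as a tie).

```prolog
:- use_module(library(apply)).   % maplist/3
:- use_module(library(lists)).   % nth1/3, sum_list/2

% average_ranks(+Values, -Ranks)
% Ranks are 1-based positions in sorted order; tied values share the
% mean of the positions they occupy.
average_ranks(Values, Ranks) :-
    msort(Values, Sorted),                 % msort/2 keeps duplicates
    maplist(rank_in(Sorted), Values, Ranks).

% rank_in(+Sorted, +X, -Rank): average of every position where X occurs.
rank_in(Sorted, X, Rank) :-
    findall(I, nth1(I, Sorted, X), Positions),
    sum_list(Positions, Sum),
    length(Positions, Count),
    Rank is Sum / float(Count).

% ?- average_ranks([10, 20, 20, 30], Rs).
% Rs = [1.0, 2.5, 2.5, 4.0].
%
% ?- average_ranks([1, 2, 3, 1000], Rs).
% Rs = [1.0, 2.0, 3.0, 4.0].      % the outlier 1000 is just "rank 4"
```

Passing two such rank lists to pearson_r/3 then yields the Spearman coefficient, which is why a single extreme value has limited influence on it.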
Frequentists vs. Bayesians
The deepest fault-line in probability runs between two camps that disagree on what a probability is.
What probability means
| | Frequentist | Bayesian |
|---|---|---|
| Definition | Long-run frequency over infinite trials | Degree of belief, updated with evidence |
| Parameters | Fixed but unknown constants | Random variables with distributions |
| Data | Random sample from infinite population | Fixed once observed |
| Core question | P(Data \| H): “how likely is this data?” | P(H \| Data): “how likely is this hypothesis?” |
Strengths and weaknesses
Frequentist strengths: No subjective prior required. Standardised and widely accepted in regulatory contexts. Computationally cheap for large datasets.
Frequentist weaknesses: p-values are chronically misinterpreted. Cannot directly state the probability that a hypothesis is true. Struggles with rare-event problems.
Bayesian strengths: Directly answers “how probable is my hypothesis?” Naturally incorporates prior knowledge. Produces a full posterior distribution for richer uncertainty quantification.
Bayesian weaknesses: Choice of prior is subjective. Posterior computation can be expensive for complex models. Less standardised across studies.
The modern pragmatic view: With large datasets and uninformative priors, the two frameworks converge. Most practitioners use whichever tool fits the problem.
Experimenting with Frequentist Methods
Frequentist module API
- phi_approx(+Z, -CDF): standard normal CDF using the Abramowitz & Stegun 26.2.17 rational approximation.
- z_score(+Observed, +Expected, +StdDev, -Z): compute the standard z-score.
- z_test_proportion(+Successes, +N, +HypP, -Result): one-sample z-test for a proportion. Returns result(Z, PValue).
- chi_squared_test(+Observed, +Expected, _, -Result): Pearson’s chi-squared goodness-of-fit test. Returns result(ChiSq, DF, PValue).
- confidence_interval_proportion(+Successes, +N, +Confidence, -Result): Wilson score interval. Returns result(Lower, Upper).
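To make the z-test entry concrete, here is a minimal sketch of a one-sample test for a proportion. It is not the library's code: it assumes a two-sided test, it uses SWI-Prolog's built-in erfc arithmetic function in place of phi_approx/2, and the predicate name z_test_sketch/4 is hypothetical.

```prolog
% z_test_sketch(+Successes, +N, +HypP, -result(Z, PValue))
% Z compares the observed proportion with HypP, using the standard error
% under the null hypothesis. PValue is two-sided (an assumption; the
% library predicate may be one-sided).
z_test_sketch(Successes, N, HypP, result(Z, PValue)) :-
    N > 0, HypP > 0.0, HypP < 1.0,
    PHat is Successes / float(N),
    SE is sqrt(HypP * (1.0 - HypP) / N),
    Z is (PHat - HypP) / SE,
    % For a standard normal, P(|X| >= z) = erfc(z / sqrt(2)).
    PValue is erfc(abs(Z) / sqrt(2.0)).

% ?- z_test_sketch(60, 100, 0.5, R).
% Z is about 2.0 and the two-sided p-value is about 0.0455.
```

Returning a single result(Z, PValue) term mirrors the convention listed above, so callers can pattern-match on the parts they need.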
Walking Through the Frequentist Code
The normal CDF approximation is the most mathematically dense piece:
    phi_approx(Z, CDF) :-
        P  = 0.2316419,
        B1 = 0.319381530,
        B2 = -0.356563782,
        B3 = 1.781477937,
        B4 = -1.821255978,
        B5 = 1.330274429,
        AZ is abs(Z),
        TVal is 1.0 / (1.0 + P * AZ),
        PDF is exp(-0.5 * AZ * AZ) / sqrt(2.0 * pi),
        CDF0 is 1.0 - PDF * (  B1*TVal
                             + B2*TVal^2
                             + B3*TVal^3
                             + B4*TVal^4
                             + B5*TVal^5),
        (   Z >= 0.0
        ->  CDF = CDF0
        ;   CDF is 1.0 - CDF0
        ).
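Written as a formula, the listing evaluates the Abramowitz & Stegun 26.2.17 polynomial for non-negative z and uses the symmetry of the normal distribution for negative z. Here p and b₁ to b₅ are the constants P and B1 to B5 in the code:

```latex
\begin{aligned}
t &= \frac{1}{1 + p\,|z|}, \qquad
\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2} \\[4pt]
\Phi(|z|) &\approx 1 - \varphi(z)\,\bigl(b_1 t + b_2 t^{2} + b_3 t^{3} + b_4 t^{4} + b_5 t^{5}\bigr) \\[4pt]
\Phi(z) &= 1 - \Phi(|z|) \quad \text{for } z < 0
\end{aligned}
```

The handbook quotes an absolute error below 7.5 × 10⁻⁸ for this approximation, which is more than enough for the p-values used in this chapter.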
The Wilson score confidence interval is more accurate than the simple Wald interval for extreme proportions:
    confidence_interval_proportion(Successes, N, Confidence, result(Lower, Upper)) :-
        NF is float(N),
        P is float(Successes) / NF,
        z_critical(Confidence, ZC),
        Z2 is ZC * ZC,
        Denom is 1.0 + Z2 / NF,
        Centre is (P + Z2 / (2.0 * NF)) / Denom,
        Margin is (ZC * sqrt(P * (1.0 - P) / NF + Z2 / (4.0 * NF * NF))) / Denom,
        Lower is max(0.0, Centre - Margin),
        Upper is min(1.0, Centre + Margin).
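In formula form, the listing computes the Wilson interval as a shifted centre plus or minus a margin. The worked numbers below use the PPV counts from the frequentist demo (110 successes out of 5,048 positives) and assume that z_critical/2, which is not shown here, returns about 1.96 for 95 % confidence:

```latex
\begin{aligned}
\text{centre} &= \frac{\hat{p} + \dfrac{z^{2}}{2n}}{1 + \dfrac{z^{2}}{n}}, \qquad
\text{margin} = \frac{z\,\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}{1 + \dfrac{z^{2}}{n}} \\[6pt]
\hat{p} &= \frac{110}{5048} \approx 0.02179, \quad n = 5048, \quad z = 1.96 \\[4pt]
\text{centre} &\approx 0.02215, \qquad \text{margin} \approx 0.00404 \\[4pt]
\text{interval} &\approx [\,0.0181,\; 0.0262\,]
\end{aligned}
```

The centre sits slightly above the raw proportion because the Wilson method pulls estimates toward 0.5, an effect that shrinks as n grows.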
Worked Example — Frequentist Medical Screening
The frequentist demo revisits the same scenario:
- Simulated clinical trial: 100,000 individuals screened.
- Chi-squared test: rejects independence (p < 10⁻¹⁵), but this tells you nothing about individual risk.
- Wilson CI for PPV: the 95 % interval puts the PPV at roughly 2 % (1.81 % to 2.62 % in the run shown below).
- Side-by-side comparison: the Bayesian posterior (1.94 %) and the simulated frequentist PPV (2.18 %) agree to within sampling error.
    $ make freq
    === Running Frequentist medical screening example ===

    ================================================================
    FREQUENTIST ANALYSIS: Medical Screening Test
    ================================================================

    --- 1. Simulated Clinical Trial (N = 100000) ---
    True Positives (TP): 110
    False Positives (FP): 4938
    True Negatives (TN): 94951
    False Negatives (FN): 1

    --- 2. Chi-Squared Test of Independence ---
    chi-squared = 2050.74 df = 3 p-value < 1e-15

    --- 3. Positive Predictive Value (PPV) ---
    PPV = 110 / 5048 = 0.0218 (2.18 %)
    95% Wilson CI for PPV: [0.0181, 0.0262] (1.81% - 2.62%)

    --- 5. Bayesian vs. Frequentist Side-by-Side ---
    Bayesian posterior P(disease | positive) = 0.0194 (1.94%)
    Frequentist PPV from simulation = 0.0218 (2.18%)

    Both frameworks agree: about 2% probability of illness.
    ================================================================
    Key lesson: statistical significance /= practical significance.
    ================================================================
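The chi-squared statistic in this output can be reproduced from the four counts. The sketch below computes Pearson's statistic for a 2×2 table, with expected counts derived from the row and column totals under independence. It is not the library's chi_squared_test/4; the names chi_squared_2x2/5 and cell_term/4 are hypothetical, and the code assumes no expected count is zero.

```prolog
:- use_module(library(apply)).   % maplist/4
:- use_module(library(lists)).   % sum_list/2

% chi_squared_2x2(+TP, +FP, +TN, +FN, -ChiSq)
chi_squared_2x2(TP, FP, TN, FN, ChiSq) :-
    N is TP + FP + TN + FN,
    Sick is TP + FN,  Well is FP + TN,     % row totals (disease status)
    Pos  is TP + FP,  Neg  is TN + FN,     % column totals (test result)
    maplist(cell_term(N),
            [TP,       FN,       FP,       TN],
            [Sick-Pos, Sick-Neg, Well-Pos, Well-Neg],
            Terms),
    sum_list(Terms, ChiSq).

% One cell's contribution: (O - E)^2 / E with E = row total * column total / N.
cell_term(N, Observed, Row-Col, Term) :-
    Expected is Row * Col / float(N),
    Term is (Observed - Expected)^2 / Expected.

% ?- chi_squared_2x2(110, 4938, 94951, 1, X).
% X is about 2050.7, in line with the demo output.
```

For a 2×2 test of independence the conventional degrees of freedom are (2 − 1)(2 − 1) = 1, whereas the df = 3 in the output corresponds to a four-cell goodness-of-fit test. With a statistic this large, the p-value is vanishingly small either way.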
Running the tests
    $ make test
    % All 15 tests passed in 0.011 seconds (0.007 cpu)
Prolog-Specific Design Decisions
Higher-order predicates. The update/4 predicate accepts a likelihood predicate name and invokes it via call/3. The :- meta_predicate update(+, +, 2, -) declaration ensures SWI-Prolog resolves the predicate in the caller’s module context — essential when the likelihood is defined in a different module than the Bayes engine.
Pair representation. Prolog’s - operator provides a lightweight pair syntax: disease-0.001. Combined with maplist and partial application (normalise_pair(Total)), this gives a functional-programming flavour without external libraries.
Result terms. The frequentist predicates return structured result(...) terms rather than multiple output arguments. This bundles related values into a single unifiable term — cleaner than threading three separate variables through the caller.
Determinism. The simulation loop in frequentist_demo.pl uses explicit recursion with a cut in the base case (simulate_loop(0, ...) :- !.) to prevent choicepoint accumulation over 100,000 iterations.
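The cut-in-the-base-case idiom can be sketched with a small counting loop. The predicate count_hits/4 below is hypothetical and exists only to show the pattern; it is not the simulate_loop predicate from frequentist_demo.pl, whose arguments are not shown in this chapter.

```prolog
:- use_module(library(random)).   % random/1

% count_hits(+N, +Threshold, +Acc0, -Count)
% Draws N uniform random floats and counts those below Threshold.
count_hits(0, _, Count, Count) :- !.       % cut: commit to the base case
count_hits(N, Threshold, Acc0, Count) :-
    N > 0,
    random(R),
    (   R < Threshold
    ->  Acc is Acc0 + 1
    ;   Acc = Acc0
    ),
    N1 is N - 1,
    count_hits(N1, Threshold, Acc, Count).  % last call, so the loop runs in constant stack

% ?- count_hits(100000, 0.05, 0, Count).
% Count lands near 5000; the exact value varies from run to run.
```

Without the cut, the call with N = 0 would also match the second clause head and leave a choicepoint behind. The accumulator keeps the recursive call in tail position.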
Wrap Up
This chapter explored probability from both the Bayesian and frequentist perspectives, using a medical screening scenario to illustrate a result that surprises almost everyone: a highly accurate test applied to a rare condition produces a dismally low positive predictive value. The Bayesian framework makes this transparent by forcing you to account for the base rate; the frequentist framework confirms it through confidence intervals on the PPV, even as its chi-squared test screams “significant!”
The key takeaways are:
- Always consider the base rate. A test with 99 % sensitivity means little when the condition is rare.
- Statistical significance is not practical significance. A tiny p-value tells you an association exists; it does not tell you the association is large or useful.
- Correlation does not equal causation, and even correlation does not equal reliable individual prediction. The Pearson-r between test results and disease is real but insufficient for clinical decision-making.
- Both frameworks have their place. Bayesian methods shine when prior information matters; frequentist methods dominate regulatory and large-sample settings. Pragmatic practitioners use both.