Chapter 23: From Bayesian Inference to Neural Computation

23.1 Introduction — From Probability to Neurons

Chapter 20 introduced Bayesian BM25, transforming unbounded relevance scores into calibrated probabilities through a sigmoid likelihood model. Chapter 22 extended the same calibration framework to vector similarity signals. With multiple calibrated probability signals in hand, a natural question arises: what is the probability that a document is relevant given all available evidence?

This chapter answers that question — and discovers something unexpected. When we derive the computation for combining multiple calibrated probability signals through principled Bayesian reasoning, the resulting mathematical structure is not merely analogous to a neural network. It is one.

23.1.1 The Probabilistic Relevance Gap Revisited

Recall from Chapter 20 that Robertson's Probability Ranking Principle (1977) established that optimal document retrieval requires ranking by probability of relevance. BM25 was derived from this probabilistic foundation, yet its scores are not probabilities — they are unbounded real numbers in $[0, +\infty)$. Bayesian BM25 closed this gap by returning scores to probability space through sigmoid calibration:

$$P(R = 1 \mid s) = \sigma(\alpha(s - \beta))$$

With this calibration in place, we can ask the multi-signal question: given $n$ calibrated probability signals $P_1, P_2, \ldots, P_n$ — from BM25, vector similarity, and potentially other sources — what is $P(R = 1 \mid P_1, P_2, \ldots, P_n)$?

23.1.2 Reversal of Explanatory Direction

The standard direction in neural network theory proceeds from architecture to probability: one constructs a neural network and then analyzes it probabilistically. Bayesian neural networks, variational inference, and probabilistic deep learning all follow this direction.

This chapter reverses the direction entirely. We begin with probability and arrive at neurons, activations, attention, and depth. At no point in the derivation is any architectural decision made with neural computation in mind. The structure emerges as a consequence of the mathematics — a theorem of probabilistic inference rather than an engineering design.

23.2 The Conjunction Shrinkage Problem

23.2.1 Naive Probabilistic Conjunction

The most direct approach to combining $n$ independent relevance signals is the product rule. If $P_1, P_2, \ldots, P_n$ are independent calibrated probabilities, the probability that all signals simultaneously indicate relevance is:

$$P_{\text{AND}} = \prod_{i=1}^{n} P_i$$

This formula, introduced in Chapter 20 (Section 20.9.1), is mathematically correct under independence. However, it suffers from a fundamental deficiency when applied to evidence accumulation.

23.2.2 The Shrinkage Theorem

Theorem 23.1 (Conjunction Shrinkage). For $n$ independent signals each reporting probability $p \in (0, 1)$:

$$\prod_{i=1}^{n} p = p^n \xrightarrow{n \to \infty} 0$$

Proof. Since $0 < p < 1$, we have $\log p < 0$, so $n \log p \to -\infty$ as $n \to \infty$, and $\exp(n \log p) \to 0$. $\square$

The conjunction probability is strictly decreasing in $n$: adding another agreeing signal always reduces the combined probability. Consider a concrete example with two signals:

| $P_{\text{text}}$ | $P_{\text{vec}}$ | Product $P_{\text{text}} \cdot P_{\text{vec}}$ |
|---|---|---|
| 0.9 | 0.9 | 0.81 |
| 0.8 | 0.8 | 0.64 |
| 0.7 | 0.7 | 0.49 |
| 0.6 | 0.6 | 0.36 |

Two signals, each moderately confident at 0.7, produce a combined result below 0.5 — suggesting irrelevance when both sources agree on relevance.

23.2.3 The Semantic Mismatch

The product rule answers: "What is the probability that all conditions are simultaneously satisfied?" But the question a search system poses is: "How confident should we be given that multiple signals concur?" These are semantically distinct questions.

When $n$ independent signals all report high relevance, the product rule yields a probability that decreases with $n$. This violates the fundamental intuition of evidence accumulation: agreement among independent sources should increase confidence, not decrease it.

The product treats signals as independent filters — each one further narrowing the set of qualifying documents. But calibrated relevance signals are not filters. They are witnesses providing corroborating testimony about the same hypothesis. The appropriate mathematical framework must treat them as such.

23.2.4 Information-Theoretic View

The product rule discards mutual agreement information. The product $\prod_{i=1}^n P_i$ is symmetric in the $P_i$ and contains no interaction terms — whether all signals report similar values or wildly different values is not represented. The agreement structure is lost.

23.3 Log-Odds Conjunction Framework

We now present a conjunction framework that resolves the shrinkage problem while preserving probabilistic soundness. The key insight is to work in the natural parameter space of the Bernoulli distribution: log-odds.

23.3.1 Log-Odds Mean Aggregation

Recall from Chapter 20 that the logit function maps probabilities to log-odds:

$$\text{logit}(p) = \log \frac{p}{1 - p}$$

Rather than multiplying probabilities (which causes shrinkage), we average their log-odds:

Definition 23.1 (Log-Odds Mean). The log-odds mean of $n$ calibrated probabilities $P_1, \ldots, P_n$ is:

$$\bar{\ell} = \frac{1}{n}\sum_{i=1}^{n} \text{logit}(P_i) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{P_i}{1 - P_i}$$

Theorem 23.2 (Scale Neutrality). If $P_i = p$ for all $i$, then $\bar{\ell} = \text{logit}(p)$ and $\sigma(\bar{\ell}) = p$, regardless of $n$.

Proof. $\bar{\ell} = \frac{1}{n}\sum_{i=1}^{n} \text{logit}(p) = \text{logit}(p)$. By logit-sigmoid duality, $\sigma(\text{logit}(p)) = p$. $\square$

This is the critical property: when all signals agree at level $p$, the combined result remains $p$ — not $p^n$. The log-odds mean neutralizes the dependence on signal count that causes conjunction shrinkage.

23.3.2 Connection to Logarithmic Opinion Pooling

The log-odds mean is not an ad hoc construction. It is the exact normalized form of the Logarithmic Opinion Pool (Log-OP), also known as the Product of Experts (PoE) introduced by Hinton (2002).

Theorem 23.3 (Equivalence to Normalized Log-OP). Given uniform weights $w_i = 1/n$, the normalized Logarithmic Opinion Pool is:

$$P_{\text{Log-OP}} = \frac{\prod_{i} P_i^{1/n}}{\prod_{i} P_i^{1/n} + \prod_{i} (1 - P_i)^{1/n}}$$

Taking the logit of both sides yields:

$$\text{logit}(P_{\text{Log-OP}}) = \frac{1}{n}\sum_{i=1}^{n} \text{logit}(P_i) = \bar{\ell}$$

The log-odds mean is therefore the logit of the normalized Log-OP — the exact representation of Product-of-Experts aggregation in the natural parameter space of Bernoulli random variables. The aggregation is exactly linear in the logit domain, with no approximation at any operating point.
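Theorem 23.3 can be checked numerically: the sigmoid of the log-odds mean coincides with the normalized Log-OP for any configuration of probabilities. A minimal sketch (the helper names are ours, not from any codebase the chapter references):

```python
import math

def logit(p: float) -> float:
    """Map a probability to log-odds, the Bernoulli natural parameter."""
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Inverse logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def log_odds_mean(probs):
    """Average the signals' log-odds (Definition 23.1)."""
    return sum(logit(p) for p in probs) / len(probs)

def normalized_log_op(probs):
    """Normalized Logarithmic Opinion Pool with uniform weights 1/n."""
    n = len(probs)
    num = math.prod(p ** (1.0 / n) for p in probs)
    den = num + math.prod((1.0 - p) ** (1.0 / n) for p in probs)
    return num / den

# The two formulations agree up to floating-point error (Theorem 23.3).
probs = [0.9, 0.7, 0.55]
assert abs(sigmoid(log_odds_mean(probs)) - normalized_log_op(probs)) < 1e-9
```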

23.3.3 Multiplicative Confidence Scaling

Scale neutrality alone is insufficient. When multiple independent signals agree, confidence should increase. The log-odds mean preserves the average evidence level but does not amplify it. We introduce a multiplicative confidence scaling:

Definition 23.2 (Confidence-Scaled Log-Odds Conjunction).

$$\ell_{\text{adjusted}} = \bar{\ell} \cdot n^{\alpha} = \frac{1}{n^{1-\alpha}}\sum_{i=1}^{n} \text{logit}(P_i)$$

where $\alpha \geq 0$ is a scaling constant.

The multiplicative form is essential. An additive bonus $\bar{\ell} + c$ would add a positive constant to the log-odds regardless of the sign of $\bar{\ell}$. For sufficiently large $n$, this constant could dominate a negative $\bar{\ell}$, making irrelevant documents appear relevant — a catastrophic violation. The multiplicative form $\bar{\ell} \cdot n^{\alpha}$ amplifies the magnitude while preserving the direction:

Theorem 23.4 (Sign Preservation). The multiplicative scaling preserves the sign of the log-odds mean: $\text{sgn}(\ell_{\text{adjusted}}) = \text{sgn}(\bar{\ell})$.

Corollary. If all signals report irrelevance ($P_i < 0.5$ for all $i$), then $P_{\text{final}} < 0.5$ for all $n$ and all $\alpha \geq 0$. Agreement among irrelevant signals cannot produce a relevance judgment.

23.3.4 The n\sqrt{n} Scaling Law

Setting $\alpha = 0.5$ yields a particularly natural result:

$$\ell_{\text{adjusted}} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \text{logit}(P_i)$$

This embeds the classical $\sqrt{n}$ confidence scaling law. In classical statistics, when combining $n$ independent measurements, the standard error of the mean decreases as $1/\sqrt{n}$ and the test statistic grows as $\sqrt{n}$. Each signal's log-odds evidence is weighted at $1/\sqrt{n}$, producing a total that grows as $\sqrt{n} \cdot \bar{\ell}$ — exactly the rate at which confidence should grow under independent observations.

| $n$ (signals) | Weight per signal $1/\sqrt{n}$ | Total weight $\sqrt{n}$ |
|---|---|---|
| 1 | 1.000 | 1.00 |
| 2 | 0.707 | 1.41 |
| 3 | 0.577 | 1.73 |
| 5 | 0.447 | 2.24 |
| 10 | 0.316 | 3.16 |

23.3.5 The Final Posterior

Applying the inverse logit (sigmoid) to the scaled log-odds returns us to probability space:

$$P_{\text{final}} = \sigma(\ell_{\text{adjusted}}) = \sigma\!\left(\frac{1}{n^{1-\alpha}}\sum_{i=1}^{n} \text{logit}(P_i)\right)$$

For a single signal ($n = 1$), the transformation is transparent: $P_{\text{final}} = \sigma(\text{logit}(P_1)) = P_1$.

23.3.6 Behavioral Properties

The log-odds conjunction satisfies four key properties:

  1. Agreement amplification: If $P_i > 0.5$ for all $i$, then $P_{\text{final}} > \sigma(\bar{\ell})$ for $n \geq 2$.
  2. Disagreement moderation: If signals disagree symmetrically, $P_{\text{final}} \approx 0.5$.
  3. Irrelevance preservation: If $P_i < 0.5$ for all $i$, then $P_{\text{final}} < 0.5$.
  4. Relevance preservation: If $P_i > 0.5$ for all $i$, then $P_{\text{final}} > 0.5$.

Properties (3) and (4) are structural guarantees of the multiplicative formulation — they hold for all $n$, all $\alpha \geq 0$, and all configurations of $P_i$.

The following table compares the product rule and log-odds conjunction ($n = 2$, $\alpha = 0.5$):

| $P_{\text{text}}$ | $P_{\text{vec}}$ | Product | Log-Odds Conjunction | Interpretation |
|---|---|---|---|---|
| 0.9 | 0.9 | 0.81 | 0.96 | Strong agreement amplified |
| 0.7 | 0.7 | 0.49 | 0.77 | Moderate agreement preserved |
| 0.7 | 0.3 | 0.21 | 0.50 | Exact neutrality (logits cancel) |
| 0.3 | 0.3 | 0.09 | 0.23 | Irrelevance preserved |
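The two aggregation rules can be compared with a short sketch (uniform signals, $\alpha = 0.5$; function names are illustrative):

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Log-odds -> probability."""
    return 1 / (1 + math.exp(-x))

def product_rule(probs):
    """Naive conjunction: shrinks toward 0 as agreeing signals are added."""
    return math.prod(probs)

def log_odds_conjunction(probs, alpha=0.5):
    """Confidence-scaled log-odds conjunction (Definition 23.2):
    sigma(n^alpha * mean of the logits)."""
    n = len(probs)
    return sigmoid(sum(logit(p) for p in probs) / n ** (1 - alpha))

# Two moderately confident signals: product shrinks below 0.5,
# log-odds conjunction amplifies above the agreement level.
assert product_rule([0.7, 0.7]) < 0.5 < log_odds_conjunction([0.7, 0.7])
```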

23.4 Emergence of Neural Network Structure

We now arrive at the central result: the computation derived in the previous sections is a feedforward neural network.

23.4.1 The Computational Pipeline

The full computation for estimating the relevance probability of a document given nn scoring signals proceeds in three stages:

Stage 1 — Calibration. Each scoring signal produces a calibrated probability $P_i \in (0, 1)$. The calibration method may differ across signals:

  • BM25 (sigmoid calibration): $P_i = \sigma(\alpha_i(s_i - \beta_i))$ — as derived in Chapter 20
  • Vector similarity (linear calibration): $P_i = (1 + s_i)/2$ — mapping cosine similarity from $[-1, 1]$ to $[0, 1]$, as described in Chapter 22
  • External models: $P_i$ is the model's output probability, calibrated by its own training

After Stage 1, all signals share a common representation: probabilities in $(0, 1)$.

Stage 2 — Log-Odds Aggregation. The calibrated probabilities are mapped to log-odds and averaged:

$$\bar{\ell} = \frac{1}{n}\sum_{i=1}^{n} \text{logit}(P_i)$$

Stage 3 — Confidence Scaling and Posterior. The mean log-odds is scaled and passed through a sigmoid:

Pfinal=σ ⁣(1n1αi=1nlogit(Pi))P_{\text{final}} = \sigma\!\left(\frac{1}{n^{1-\alpha}} \sum_{i=1}^{n} \text{logit}(P_i)\right)
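The three stages can be sketched end to end for a BM25 + vector-similarity pair. The calibration constants `a` and `b` below are hypothetical placeholders; in practice they come from the fitting procedure of Chapter 20:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def relevance_posterior(bm25_score, cosine_sim, a=1.0, b=5.0, alpha=0.5):
    """Three-stage pipeline sketch; a, b are illustrative calibration
    parameters (the alpha_i, beta_i of the text), not fitted values."""
    # Stage 1 -- calibration to probabilities.
    p_text = sigmoid(a * (bm25_score - b))   # sigmoid calibration (BM25)
    p_vec = (1 + cosine_sim) / 2             # linear calibration (cosine)
    probs = [p_text, p_vec]
    # Stage 2 -- log-odds aggregation.
    n = len(probs)
    ell_bar = sum(logit(p) for p in probs) / n
    # Stage 3 -- confidence scaling and sigmoid posterior.
    return sigmoid(ell_bar * n ** alpha)

p = relevance_posterior(bm25_score=7.2, cosine_sim=0.6)
assert 0.0 < p < 1.0
```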

23.4.2 The Neural Structure Theorem

Theorem 23.5 (Neural Network Structure). The computation described in Section 23.4.1 has the structure of a two-layer feedforward network:

  1. Input layer: Scoring signals producing calibrated probabilities $P_1, \ldots, P_n$
  2. Hidden nonlinearity: Each $P_i$ is mapped through $\text{logit}(P_i) = \log \frac{P_i}{1 - P_i}$
  3. Linear aggregation: $\ell_{\text{adjusted}} = \sum_{i=1}^{n} w_i \cdot \text{logit}(P_i)$ where $w_i = 1/n^{1-\alpha}$
  4. Output activation: $P_{\text{final}} = \sigma(\ell_{\text{adjusted}})$

The full computation is:

Pfinal=σ ⁣(1n1αi=1nlogit(Pi))P_{\text{final}} = \sigma\!\left(\frac{1}{n^{1-\alpha}} \sum_{i=1}^{n} \text{logit}(P_i)\right)

The hidden nonlinearity is the logit function — the canonical link of the Bernoulli exponential family. It was not selected from a design space of candidate activations. It was derived as the unique function that maps probabilities to the natural parameter space where evidence combination is additive.

23.4.3 The Sigmoid-Calibrated Special Case

When all signals use sigmoid calibration — $P_i = \sigma(\alpha_i s_i + \beta_i')$ — a remarkable simplification occurs.

Theorem 23.6 (Logistic Regression Equivalence). If all calibrated probabilities are sigmoid functions of raw scores, then the hidden nonlinearity collapses via the logit-sigmoid identity, and the full pipeline reduces to logistic regression:

Pfinal=σ ⁣(i=1nwisi+b)P_{\text{final}} = \sigma\!\left(\sum_{i=1}^{n} w_i' \cdot s_i + b\right)

where $w_i' = \alpha_i / n^{1-\alpha}$ and $b = \sum_i \beta_i' / n^{1-\alpha}$.

Proof. When $P_i = \sigma(\alpha_i s_i + \beta_i')$, the logit-sigmoid duality yields $\text{logit}(P_i) = \alpha_i s_i + \beta_i'$ — the logit acts as the identity on sigmoid-calibrated inputs. Substituting:

Pfinal=σ ⁣(1n1αi=1n(αisi+βi))=σ ⁣(i=1nwisi+b)P_{\text{final}} = \sigma\!\left(\frac{1}{n^{1-\alpha}} \sum_{i=1}^{n} (\alpha_i s_i + \beta_i')\right) = \sigma\!\left(\sum_{i=1}^{n} w_i' \cdot s_i + b\right) \quad \square

This equivalence between Bayesian posterior computation and logistic regression has been known since Cox (1958). Our derivation arrives at it from a different starting point — evidence accumulation for multi-signal retrieval — confirming that logistic regression is not an approximation to Bayesian inference but is Bayesian inference when signals share the same exponential family calibration.
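Theorem 23.6 can be verified numerically: running the full calibrate-logit-scale-sigmoid pipeline and the collapsed logistic-regression form give the same posterior. The calibration parameters below are hypothetical:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical sigmoid-calibration parameters for two raw scores.
alphas, betas = [1.2, 0.8], [0.3, -0.5]
scores = [2.0, 1.5]
n, alpha = len(scores), 0.5

# Pipeline form: calibrate, take logits, scale, apply sigmoid.
probs = [sigmoid(a * s + b) for a, b, s in zip(alphas, betas, scores)]
p_pipeline = sigmoid(sum(logit(p) for p in probs) / n ** (1 - alpha))

# Collapsed form (Theorem 23.6): w_i' = alpha_i / n^{1-alpha},
# bias b = sum_i beta_i' / n^{1-alpha}.
w = [a / n ** (1 - alpha) for a in alphas]
bias = sum(betas) / n ** (1 - alpha)
p_logreg = sigmoid(sum(wi * s for wi, s in zip(w, scores)) + bias)

assert abs(p_pipeline - p_logreg) < 1e-9  # identical up to rounding
```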

23.4.4 The Heterogeneous Case: A Genuine Two-Layer Network

The logit-sigmoid cancellation of Theorem 23.6 occurs only when every signal's calibration belongs to the same parametric family as the output activation. In practice, hybrid search combines signals with different calibrations:

  • BM25: $P_1 = \sigma(\alpha_1 s_1 + \beta_1')$ — sigmoid calibration (logit recovers the linear pre-activation)
  • Vector similarity: $P_2 = (1 + \cos\theta)/2$ — linear calibration

For the vector signal:

logit(P2)=logit ⁣(1+cosθ2)=log1+cosθ1cosθ\text{logit}(P_2) = \text{logit}\!\left(\frac{1 + \cos\theta}{2}\right) = \log\frac{1 + \cos\theta}{1 - \cos\theta}

This is a nonlinear function of $\cos\theta$: the logit of a linear map is not linear. The hidden layer performs a genuine nonlinear transformation on the vector signal, while acting as the identity on the BM25 signal. The network has:

  • A linear pathway for sigmoid-calibrated signals (logit-sigmoid cancellation)
  • A nonlinear pathway for differently-calibrated signals (logit as genuine nonlinearity)

This mixed structure — identity skip connections for some inputs, nonlinear transformation for others — arises naturally from the heterogeneity of calibration methods.

23.4.5 Parameter Correspondence

The following table maps the probabilistic components to their neural network counterparts:

| Probabilistic Component | Neural Network Component |
|---|---|
| Calibrated probabilities $P_i$ | First-layer outputs |
| $\text{logit}(P_i)$ | Hidden-layer activations (logit nonlinearity) |
| Weights $w_i = 1/n^{1-\alpha}$ | Hidden-to-output weights |
| Final $\sigma(\cdot)$ | Output-layer sigmoid activation |
| $P_{\text{final}}$ | Network output |

In the sigmoid-calibrated special case, the effective weights are $w_i' = \alpha_i / n^{1-\alpha}$ and the effective bias is $b = \sum_i \beta_i' / n^{1-\alpha}$, reducing to a single logistic regression neuron. When weights become learnable parameters (replacing the fixed $1/n^{1-\alpha}$), the network can capture signal dependencies through training, completing the correspondence to a fully parameterized neural network.

23.5 Inevitability of Activation Functions

The neural structure theorem reveals that the sigmoid appears twice in the derivation — once in calibration (Stage 1) and once in the final posterior (Stage 3). Neither appearance is an architectural choice. Both are mathematical necessities rooted in the exponential family structure of binary outcomes.

23.5.1 Why Sigmoid Appears Twice

The Bernoulli distribution belongs to the exponential family with natural parameter $\eta = \text{logit}(p)$. The canonical link function is the logit, and its inverse — the function mapping natural parameters back to means — is the sigmoid:

$$p = \sigma(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}$$

The sigmoid appears in Stage 1 because calibrating a score to a probability under the Bernoulli model requires the natural-parameter-to-mean mapping. It appears in Stage 3 because returning from aggregated log-odds to a probability requires the same mapping. Both are instances of $\sigma: \eta \mapsto p$ in the Bernoulli exponential family.

Theorem 23.7 (Uniqueness of the Sigmoid). The logistic sigmoid is the unique function satisfying all five of the following constraints simultaneously:

  • (C1) $\sigma: \mathbb{R} \to (0, 1)$ — maps inputs to valid probabilities
  • (C2) $\sigma(x) = [\text{logit}]^{-1}(x)$ — canonical inverse link for the Bernoulli family
  • (C3) $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ — self-referential derivative
  • (C4) $\sigma(-x) = 1 - \sigma(x)$ — evidence symmetry
  • (C5) $\sigma$ arises as the maximum entropy distribution for binary outcomes under first-moment constraints

No commonly proposed alternative satisfies all five. The following table summarizes the violations:

| Function | (C1) | (C2) | (C3) | (C4) | (C5) |
|---|---|---|---|---|---|
| Sigmoid $\sigma$ | Yes | Yes | Yes | Yes | Yes |
| $\tanh$ | No (range $(-1,1)$) | Reduces to $\sigma$ | | | |
| Softplus | No (unbounded) | No | No | No | No |
| ReLU | No (unbounded) | No | No | No | No |
| Probit $\Phi$ | Yes | No | No | Yes | No |

The probit (Gaussian CDF) is the closest competitor, satisfying (C1) and (C4). But it fails on the exponential family constraints (C2), (C3), and (C5) that are essential for Bayesian inference over Bernoulli outcomes.
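Constraints (C3) and (C4) are directly checkable. A minimal sketch, testing the self-referential derivative by finite differences and the evidence symmetry at several points:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    # (C3) sigma'(x) = sigma(x)(1 - sigma(x)), via central differences.
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x) * (1 - sigmoid(x))) < 1e-8
    # (C4) sigma(-x) = 1 - sigma(x): evidence symmetry.
    assert abs(sigmoid(-x) - (1 - sigmoid(x))) < 1e-12
```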

23.5.2 ReLU as MAP Estimator Under Sparse Priors

The sigmoid answers "How probable is the hypothesis?" A different probabilistic question yields a different activation with equal inevitability.

Consider: "How much of a latent feature hh is present in the observation?" This asks for a quantity, not a probability. The answer is non-negative (feature presence cannot be negative) and unbounded (there is no maximum amount of evidence).

Definition 23.3 (Sparse Feature Model). Let $h \geq 0$ be a latent activation level with:

  • Sparse prior (Exponential): $P(h) = \lambda e^{-\lambda h}$ for $h \geq 0$
  • Gaussian likelihood: $P(x \mid h) \propto \exp\!\left(-\frac{(x - wh)^2}{2\tau^2}\right)$

The exponential prior encodes sparsity: most features are absent most of the time — the continuous analog of the term frequency distribution in information retrieval, where most terms are absent from most documents.

Theorem 23.8 (ReLU from MAP Estimation). The MAP estimate of $h$ under the sparse feature model is:

$$h^* = \max(0,\; z - \theta)$$

where $z = x/w$ is the normalized input and $\theta = \lambda\tau^2/w^2$ is a threshold. This is the ReLU activation with bias $b = -\theta$.

Proof. The log-posterior is:

$$\mathcal{L}(h) = -\frac{(x - wh)^2}{2\tau^2} - \lambda h + \text{const}$$

Differentiating: $\frac{\partial \mathcal{L}}{\partial h} = \frac{w(x - wh)}{\tau^2} - \lambda$. Setting to zero yields $h_{\text{unc}} = \frac{wx - \lambda\tau^2}{w^2}$. Applying the non-negativity constraint $h \geq 0$:

h=max ⁣(0,  wxλτ2w2)=max(0,  zθ)h^* = \max\!\left(0,\; \frac{wx - \lambda\tau^2}{w^2}\right) = \max(0,\; z - \theta) \quad \square

The ReLU form is the unique MAP estimator satisfying: non-negativity ($h^* \geq 0$), sparsity ($h^* = 0$ for a positive-measure set of inputs), linearity above threshold, and hard thresholding (exactly zero below threshold, not approximately zero).
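Theorem 23.8 can be sanity-checked by comparing the closed form against a brute-force grid search over the log-posterior (the parameter values $w = \tau = 1$, $\lambda = 2$ are illustrative):

```python
def log_posterior(h, x, w=1.0, tau=1.0, lam=2.0):
    """Log-posterior of the sparse feature model, up to a constant."""
    return -((x - w * h) ** 2) / (2 * tau ** 2) - lam * h

def map_closed_form(x, w=1.0, tau=1.0, lam=2.0):
    """Theorem 23.8: ReLU with threshold theta = lam * tau^2 / w^2."""
    z, theta = x / w, lam * tau ** 2 / w ** 2
    return max(0.0, z - theta)

# Dense grid search over h in [0, 10] agrees with the closed form.
for x in [-1.0, 0.5, 2.0, 5.0]:
    grid = [i * 1e-3 for i in range(10001)]
    h_grid = max(grid, key=lambda h: log_posterior(h, x))
    assert abs(h_grid - map_closed_form(x)) < 2e-3
```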

23.5.3 Swish as Bayesian Expected Value

ReLU provides the MAP estimate — the mode of the posterior. What about the mean?

Consider the question: "Given input xx, what is the expected output when the signal may be either relevant (passed through) or irrelevant (suppressed)?" This simultaneously involves quantity and probability.

Definition 23.4 (Self-Gated Relevance Model). Let $R \in \{0, 1\}$ be a binary relevance variable. The output is:

$$Y = \begin{cases} x & \text{if } R = 1 \\ 0 & \text{if } R = 0 \end{cases}$$

The relevance probability, by the sigmoid posterior (Chapter 20), is $P(R = 1 \mid x) = \sigma(x)$.

Theorem 23.9 (Swish as Bayesian Expected Relevant Signal). Under the self-gated relevance model, the posterior expected value is the Swish activation:

$$\mathbb{E}[Y \mid x] = x \cdot P(R = 1 \mid x) + 0 \cdot P(R = 0 \mid x) = x \cdot \sigma(x) = \text{Swish}(x)$$

The self-gating property — that $x$ serves as both the signal value and the evidence for its own relevance — mirrors a foundational principle of information retrieval: higher BM25 scores correspond to higher relevance probabilities. Magnitude implies reliability.

Theorem 23.10 (ReLU-Swish Duality). ReLU and Swish arise from the same sparse gating structure under different estimation principles:

| | ReLU | Swish |
|---|---|---|
| Estimator | MAP (posterior mode) | Bayes (posterior mean) |
| Gate | Hard: $\mathbf{1}[x > 0]$ | Soft: $\sigma(x)$ |
| Formula | $x \cdot \mathbf{1}[x > 0]$ | $x \cdot \sigma(x)$ |
$$\text{ReLU}(x) = x \cdot \mathbf{1}[x > 0] \quad \xrightarrow{\text{MAP} \to \text{Bayes}} \quad \text{Swish}(x) = x \cdot \sigma(x)$$

This is the activation-function manifestation of the most fundamental duality in statistical estimation: MAP (mode of the posterior) versus Bayes estimator (mean of the posterior).

The generalized Swish $\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)$ parametrizes a continuous spectrum:

  • $\beta \to 0$: $x \cdot \sigma(\beta x) \to x/2$ — uniform prior, maximum ignorance
  • $\beta = 1$: $x \cdot \sigma(x) = \text{Swish}(x)$ — canonical Bayesian posterior
  • $\beta \to \infty$: $x \cdot \sigma(\beta x) \to \text{ReLU}(x)$ — deterministic MAP

The parameter $\beta$ controls Bayesian certainty. Setting $\beta = 1$ means the evidence scale and the likelihood scale are matched: one unit of pre-activation corresponds to one unit of log-odds evidence.

23.5.4 GELU as Gaussian Approximation of Swish

The GELU activation arises from the same expected-value framework, replacing the Bernoulli canonical posterior $\sigma(x)$ with a Gaussian (probit) relevance model $\Phi(x)$:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi$ is the standard Gaussian CDF. The well-known approximation $\Phi(x) \approx \sigma(1.702x)$ implies:

$$\text{GELU}(x) \approx x \cdot \sigma(1.702x) = \text{Swish}_{1.702}(x)$$

GELU is a specific instance of generalized Swish at $\beta \approx 1.702$, corresponding to a Gaussian noise model rather than the canonical Bernoulli model. The empirical near-equivalence of GELU and Swish in deep learning is explained by the probit and logistic CDFs being nearly indistinguishable after scaling.
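The generalized-Swish limits and the GELU approximation can be checked directly; a minimal sketch using exact GELU via the Gaussian CDF:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def swish(x, beta=1.0):
    """Generalized Swish: x * sigma(beta * x)."""
    return x * sigmoid(beta * x)

def gelu(x):
    """Exact GELU via the standard Gaussian CDF Phi."""
    return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

# GELU stays close to Swish_{1.702} across a range of inputs.
max_gap = max(abs(gelu(x) - swish(x, beta=1.702))
              for x in [i / 10 for i in range(-50, 51)])
assert max_gap < 0.03

# Limiting cases of generalized Swish.
assert abs(swish(2.0, beta=1e-9) - 1.0) < 1e-6    # beta -> 0: x/2
assert abs(swish(2.0, beta=100.0) - 2.0) < 1e-6   # beta -> inf: ReLU
assert abs(swish(-2.0, beta=100.0)) < 1e-6
```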

23.5.5 Three Questions, Three Activations

The three dominant activation functions answer complementary probabilistic questions:

| Activation | Question | Output | Derivation |
|---|---|---|---|
| Sigmoid | "How probable?" | Bounded $(0, 1)$ | Canonical link (exponential family) |
| ReLU | "How much?" | Unbounded $[0, +\infty)$ | MAP estimate (sparse prior) |
| Swish | "Expected relevant amount?" | Bounded below $[\approx -0.278, +\infty)$ | Bayes estimate (posterior mean) |

The standard practice of using ReLU or Swish in hidden layers and sigmoid (or softmax) at the output corresponds to a two-phase probabilistic inference:

  1. Hidden layers: "Which features are present, and how strongly?" — sparse feature detection
  2. Output layer: "Given detected features, what is the posterior probability?" — Bayesian posterior

This mirrors the information retrieval pipeline: inverted index lookup (sparse feature detection) followed by relevance scoring (probability estimation).

23.6 WAND/BMW as Exact Neural Pruning

The sigmoid activation, derived from probabilistic reasoning, has a property with profound computational consequences: it is bounded. This boundedness enables a class of pruning algorithms from information retrieval to serve as exact neural inference optimizations.

23.6.1 Neural Translation of IR Pruning

Chapter 25 introduces WAND and Block-Max WAND (BMW) as algorithms for efficient top-$k$ retrieval. In the neural interpretation of Section 23.4, these algorithms translate directly:

  • WAND: If the maximum possible activation of a neuron — given a computable upper bound on its input — is below the current threshold, the neuron's computation is skipped entirely.
  • BMW: If no input in an entire block can produce an activation above threshold, the entire block is skipped.

Theorem 23.11 (Exactness of Neural Pruning). The pruning is exact: the top-$k$ outputs are identical to those produced by exhaustive computation. No relevant documents are lost.

Proof. The sigmoid is strictly monotone, so $s \leq \text{ub}$ implies $\sigma(\alpha(s - \beta)) \leq \sigma(\alpha(\text{ub} - \beta))$. If $\sigma(\alpha(\text{ub} - \beta)) < \theta$, the neuron's output cannot exceed the threshold regardless of the actual input. $\square$

23.6.2 Requirements for Exact Pruning

Theorem 23.12 (Necessary Conditions). Exact WAND-style pruning of an activation function $f$ requires:

  1. Boundedness: $f: \mathbb{R} \to [a, b]$ for finite $a, b$
  2. Monotonicity: $f$ is strictly monotone

Boundedness is required for computable output upper bounds. Monotonicity is required for input upper bounds to yield valid output upper bounds.

Corollary (Incompatibility with ReLU). ReLU satisfies monotonicity but not boundedness ($f: \mathbb{R} \to [0, +\infty)$). Tight output upper bounds cannot be computed without knowledge of the input range, which is generally unavailable during inference.

This incompatibility is not a defect of ReLU — it is a consequence of its probabilistic origin. "How much?" has no upper limit, while "How probable?" is inherently bounded in $(0, 1)$. The two activations provide complementary capabilities:

  • Sigmoid: Bounded activations for safe pruning (computable upper bounds)
  • ReLU: Structural sparsity for efficient indexing (exact zeros for absent features)

A system exploiting both — ReLU sparsity for index construction, sigmoid boundedness for query-time pruning — mirrors the IR pipeline exactly. See Chapter 25 for the full WAND/BMW implementation.
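The skip condition of Theorem 23.11 admits a toy sketch, under the simplifying assumption that each document carries a precomputed score upper bound (document order, bounds, and calibration constants below are illustrative, not the Chapter 25 algorithm):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def top_k_with_pruning(docs, k, a=1.0, b=5.0):
    """WAND-style exact pruning sketch. Each doc is (score, upper_bound)
    with score <= upper_bound; a, b are illustrative calibration params."""
    activation = lambda s: sigmoid(a * (s - b))
    top, skipped = [], 0
    for score, ub in docs:
        threshold = min(top) if len(top) == k else -math.inf
        if activation(ub) <= threshold:   # bounded + monotone => safe skip
            skipped += 1
            continue
        top.append(activation(score))     # evaluate only when it could matter
        top = sorted(top, reverse=True)[:k]
    return sorted(top, reverse=True), skipped

docs = [(7.0, 8.0), (2.0, 3.0), (6.5, 7.0), (1.0, 2.0), (5.8, 6.0)]
pruned, skipped = top_k_with_pruning(docs, k=2)
exhaustive = sorted((sigmoid(1.0 * (s - 5.0)) for s, _ in docs), reverse=True)[:2]
assert pruned == exhaustive   # pruning is exact (Theorem 23.11)
assert skipped > 0            # and some documents were never evaluated
```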

23.6.3 Empirical Skip Rates

From experimental evaluation of Bayesian BM25 with WAND/BMW pruning:

| Query Type | Documents Skipped | Top-$k$ Accuracy |
|---|---|---|
| Rare terms (IDF > 5) | 90–99% | Exact |
| Mixed queries | 50–80% | Exact |
| Common terms (IDF < 2) | 10–30% | Exact |

23.7 From Static Weights to Attention

23.7.1 Relaxing the Uniform Reliability Assumption

In the network derived in Section 23.4, all aggregation weights are uniform: $w_i = 1/n^{1-\alpha}$. This reflects equal reliability across scoring functions. We now relax this single constraint — allowing weights to depend on the input — and show that the result is the attention mechanism.

Definition 23.5 (Query-Dependent Weights). Suppose weights depend on the query-signal interaction:

$$w_i = w_i(q, s_i) \quad \text{subject to} \quad \sum_{i=1}^{n} w_i = 1, \quad w_i \geq 0$$

The aggregation becomes:

$$S = \sum_{i=1}^{n} w_i(q, s_i) \cdot \text{logit}(P_i)$$

This is the attention mechanism: a query-dependent weighted aggregation of value vectors.

23.7.2 Attention as Logarithmic Opinion Pooling

In standard attention (Vaswani et al., 2017), weights are computed as:

$$w_i = \frac{\exp(f(q, k_i))}{\sum_j \exp(f(q, k_j))}$$

where $f(q, k_i)$ is a query-key compatibility function. The softmax ensures $\sum_i w_i = 1$ and $w_i \geq 0$.

Theorem 23.13 (Attention as Product of Experts). The attention-weighted aggregation in log-odds space is equivalent to a Logarithmic Opinion Pool (Product of Experts) with context-dependent reliability:

PLog-OP=σ ⁣(i=1nwilogit(Pi))P_{\text{Log-OP}} = \sigma\!\left(\sum_{i=1}^{n} w_i \, \text{logit}(P_i)\right)

The attention weights $w_i(q, s_i)$ are the context-dependent exponents in a PoE ensemble — determining how strongly each expert's opinion is weighted in the product.

This provides the missing justification for why attention computes a weighted sum. Logarithmic Opinion Pooling in the logit domain is additive. The additive structure of log-odds conjunction mandates a weighted sum. Any other aggregation — element-wise maximum, concatenation followed by projection — would violate the multiplicative structure of Product-of-Experts evidence combination.
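The correspondence can be sketched concretely: softmax weights over hypothetical query-key compatibilities, applied as exponents to the experts' log-odds (all numbers below are illustrative):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def softmax(xs):
    """Numerically stable softmax over compatibility scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_log_op(compat, probs):
    """Attention-weighted Log-OP (Theorem 23.13): softmax weights
    act as context-dependent PoE exponents in the logit domain."""
    weights = softmax(compat)
    return sigmoid(sum(w * logit(p) for w, p in zip(weights, probs)))

# Hypothetical compatibilities and calibrated expert probabilities.
p = attention_log_op([2.0, 0.5, -1.0], [0.9, 0.7, 0.4])
assert 0.0 < p < 1.0
# Uniform compatibilities recover the plain log-odds mean.
p_uniform = attention_log_op([0.0, 0.0, 0.0], [0.9, 0.7, 0.4])
assert abs(logit(p_uniform) - sum(logit(q) for q in [0.9, 0.7, 0.4]) / 3) < 1e-9
```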

23.7.3 Architectural Continuity

The progression from the derived architecture to modern Transformers is a sequence of probabilistic generalizations:

| Step | Architecture | Probabilistic Interpretation |
|---|---|---|
| Derived (Section 23.4) | Logit-linear-sigmoid | Bayesian conjunction, uniform reliability |
| + Learnable weights | Weighted network | Bayesian conjunction, learned reliability |
| + Query dependence | Attention | Log-OP (PoE) with context-dependent reliability |
| + Multi-head | Multi-head attention | Ensemble of parallel PoE aggregators |

Each step corresponds to relaxing a constraint in the probabilistic model, not to an architectural invention.

23.7.4 Exact Attention Pruning

The combination of exact pruning (Section 23.6) and the Log-OP interpretation of attention yields a result with no precedent in the sparse attention literature: provably exact attention pruning.

Theorem 23.14 (Token-Level Exact Pruning in Attention). Consider the attention output $a = \sum_{i=1}^{n} w_i v_i$ where $v_i = \text{logit}(P_i)$. If each value admits a computable upper bound $\text{ub}(v_i) \geq v_i$, then token $i$ can be exactly pruned when:

$$\sum_{j \in \mathcal{A}} w_j v_j + \sum_{j \notin \mathcal{A}} w_j \cdot \text{ub}(v_j) < \theta$$

where $\mathcal{A}$ is the set of already-evaluated tokens and $\theta$ is the current $k$-th highest score. This is the WAND pruning condition applied to attention.

Corollary (Head-Level Pruning). In multi-head attention, each head can be treated as a BMW block. If the maximum possible contribution of head $j$ is insufficient to change the top-$k$ ranking, the entire head is skipped.

This contrasts with existing sparse attention methods — Longformer's sliding window, BigBird's random attention, top-$k$ selection — which achieve efficiency through heuristic or learned masks and are inherently approximate. Theorem 23.14 provides an exactness guarantee: pruned tokens are those whose maximum possible contribution is provably insufficient.

23.8 Depth as Recursive Bayesian Inference

23.8.1 Why Depth is Necessary

The derivation in Section 23.4 assumes that calibrated evidence signals are given. In practice, these signals must themselves be inferred from raw data through intermediate latent variables:

P(y \mid x) = \sum_{z^{(L)}} \cdots \sum_{z^{(1)}} P(y \mid z^{(L)}) \prod_{\ell=1}^{L} P(z^{(\ell)} \mid z^{(\ell-1)})

where $z^{(0)} = x$ is the raw input and $z^{(\ell)}$ are latent variables at depth $\ell$. Each factor $P(z^{(\ell)} \mid z^{(\ell-1)})$ is an instance of the inference unit from Section 23.4: it takes the previous layer's outputs as evidence and produces calibrated probability estimates.

Depth is necessary because the evidence required for high-level judgments does not exist in the raw data. Consider image classification:

  • Layer 1 (ReLU): "Do edges exist at each spatial location?" — raw pixels contain no explicit concept of "edge"
  • Layer 2 (ReLU): "Do these edges form shapes?" — edges alone do not encode "circle" or "triangle"
  • Layer $L$ (Sigmoid/Softmax): "Given all constructed features, what is the posterior probability?"

Each layer applies the same probabilistic operation — evidence combination via log-odds aggregation — but on progressively more abstract evidence constructed by preceding layers.

23.8.2 The Inference Unit as Recursive Building Block

The unit derived in Section 23.4 — calibration, log-odds aggregation, sigmoid posterior — is a complete single-stage Bayesian inference module. A deep network is a stack of such modules:

\underbrace{P(z^{(1)} \mid x)}_{\text{Layer 1: evidence from raw data}} \;\to\; \underbrace{P(z^{(2)} \mid z^{(1)})}_{\text{Layer 2: evidence from evidence}} \;\to\; \cdots \;\to\; \underbrace{P(y \mid z^{(L)})}_{\text{Output: judgment from constructed evidence}}

This is the recursive structure of hierarchical Bayesian models, where inference proceeds from observed variables through layers of latent variables to the final hypothesis.
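A minimal sketch of this stacking, under the assumption that each layer's units compute a weighted log-odds aggregate of the previous layer's calibrated probabilities (the weight matrix here is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

auto logit(double p) -> double { return std::log(p / (1.0 - p)); }
auto sigmoid(double x) -> double { return 1.0 / (1.0 + std::exp(-x)); }

// One inference unit layer: output unit i is
// sigmoid(sum_j w[i][j] * logit(p[j])), i.e. the calibrate-logit-
// aggregate-sigmoid module of Section 23.4 applied to the previous
// layer's probabilities as evidence.
auto inference_layer(const std::vector<double>& probs,
                     const std::vector<std::vector<double>>& weights)
    -> std::vector<double> {
  std::vector<double> out;
  for (const auto& row : weights) {
    double acc = 0.0;
    for (std::size_t j = 0; j < probs.size(); ++j) {
      acc += row[j] * logit(probs[j]);
    }
    out.push_back(sigmoid(acc));
  }
  return out;
}
```

Stacking is function composition: the output probabilities of one call become the evidence vector of the next, mirroring the chain of factors $P(z^{(\ell)} \mid z^{(\ell-1)})$ above.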

23.8.3 Question Sequencing for Architecture Design

The correspondence between activation functions and probabilistic questions (Section 23.5.5) implies that choosing an activation for a layer is equivalent to choosing the probabilistic question that layer asks. Architecture design becomes question sequencing — specifying the order of questions posed to the data:

Architecture | Question Sequence
ResNet | "How much feature?" \to ... \to "How much feature?" \to "Which class?"
Transformer | "Expected relevant signal?" \to "Which is relevant?" \to ... \to "Which token?"
Classic MLP | "How probable?" \to ... \to "How probable?"

Replacing one activation with another changes the type of question the layer asks. Replacing ReLU with GELU changes the question from "how much feature is present?" (hard thresholding, MAP estimate) to "what is the expected relevant signal under Gaussian noise?" (soft gating, Bayesian estimate). Performance improvements from such swaps correspond to choosing a question better suited to the data distribution.
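The ReLU-to-GELU swap described above can be seen numerically. The sketch below (illustrative, self-contained) contrasts the hard threshold with the soft gate $x \cdot \sigma(1.702x)$, and includes Swish with its gate slope $\beta$ for comparison; away from zero the three answers nearly coincide, and the difference is concentrated where the evidence is ambiguous.

```cpp
#include <cassert>
#include <cmath>

// Hard thresholding: MAP estimate under a sparse non-negative prior.
auto relu(double x) -> double { return x > 0.0 ? x : 0.0; }

// Soft gating: sigmoid form of GELU, x * sigma(1.702 x), the
// Gaussian-noise "expected relevant signal" answer.
auto gelu_approx(double x) -> double {
  return x / (1.0 + std::exp(-1.702 * x));
}

// Swish: x * sigma(beta * x); beta = 1 is the canonical form.
auto swish(double x, double beta = 1.0) -> double {
  return x / (1.0 + std::exp(-beta * x));
}
```

Near $x = 0$, ReLU answers "nothing is present" while GELU and Swish answer "a small expected amount is present"; for strongly positive inputs all three agree, which is why the swap changes the question without changing what the layer does on confident evidence.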

23.8.4 Reverse Interpretability

The forward direction uses the framework to design networks. The reverse direction uses it to interpret existing networks by reading the probabilistic question each layer asks:

  • Sigmoid hidden layers: Every layer asks "how probable?" — iterated Bayesian inference, stacked logistic regressions
  • ReLU hidden layers: "How much of each feature is present?" — hierarchical sparse feature detection
  • GELU hidden layers: "What is the expected relevant signal under Gaussian noise?" — Bayesian soft-gated feature extraction
  • Swish hidden layers: "What is the expected relevant signal?" — canonical Bayesian expected value, posterior mean of the relevant signal
  • Softmax attention layers: "Which features are relevant to the current context?" — context-dependent Logarithmic Opinion Pooling

Standard interpretability methods inspect the values that flow through a network. The question-sequencing framework interprets the type of computation each layer performs, based solely on its activation function. The two approaches are complementary: one reads the answers, the other reads the questions.

23.9 Implementation in Cognica

The theoretical framework of this chapter maps directly to Cognica's hybrid search implementation. The following illustrates the core computation:

#include <algorithm>  // std::clamp
#include <cmath>      // std::exp, std::log, std::sqrt
#include <vector>

// Log-odds conjunction for multi-signal fusion.
//
// Given n calibrated probability signals, compute the
// combined posterior using the log-odds mean with
// sqrt(n) confidence scaling (alpha = 0.5).
auto compute_log_odds_conjunction(
    const std::vector<double>& calibrated_probs) -> double {
  auto n = calibrated_probs.size();
  if (n == 0) {
    return 0.5;
  }
  if (n == 1) {
    return calibrated_probs[0];
  }

  // Stage 2: Map to log-odds and aggregate
  auto log_odds_sum = 0.0;
  for (const auto& p : calibrated_probs) {
    // logit(p) = log(p / (1 - p))
    auto clamped = std::clamp(p, 1e-10, 1.0 - 1e-10);
    log_odds_sum += std::log(clamped / (1.0 - clamped));
  }

  // Log-odds mean with sqrt(n) confidence scaling
  auto n_double = static_cast<double>(n);
  auto adjusted = log_odds_sum / std::sqrt(n_double);

  // Stage 3: Return to probability space via sigmoid
  return 1.0 / (1.0 + std::exp(-adjusted));
}

The calibration stage (Stage 1) is handled by signal-specific calibrators. For BM25, the sigmoid calibrator from Chapter 20 produces $P_i = \sigma(\alpha_i(s_i - \beta_i))$. For vector similarity, the linear calibrator from Chapter 22 produces $P_i = (1 + s_i)/2$.

The WAND/BMW pruning described in Chapter 25 applies directly to the sigmoid-calibrated scores: monotonicity of the sigmoid ensures that BM25 upper bounds transfer to probability space, enabling exact pruning with zero accuracy loss.
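The monotonicity argument can be sketched directly. The function names and parameter values below are illustrative; the point is that because the sigmoid calibrator is strictly increasing for $\alpha > 0$, a score-space bound $s \leq \text{ub}(s)$ maps to a valid probability-space bound $\sigma(\alpha(s - \beta)) \leq \sigma(\alpha(\text{ub}(s) - \beta))$.

```cpp
#include <cassert>
#include <cmath>

// Sigmoid calibrator from Chapter 20: strictly increasing in the
// score whenever alpha > 0.
auto sigmoid_calibrate(double score, double alpha, double beta)
    -> double {
  return 1.0 / (1.0 + std::exp(-alpha * (score - beta)));
}

// Probability-space upper bound for a block whose BM25 scores are
// bounded above by score_ub: calibrating the score bound is enough,
// because a monotone map preserves order.
auto calibrated_upper_bound(double score_ub, double alpha, double beta)
    -> double {
  return sigmoid_calibrate(score_ub, alpha, beta);
}
```

Any block skipped against `calibrated_upper_bound` would also have been skipped in score space, so the WAND/BMW exactness guarantee carries over to the calibrated setting unchanged.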

23.10 Summary

This chapter demonstrated that feedforward neural network structure emerges analytically from Bayesian inference over multiple relevance signals. The key concepts are:

Conjunction Shrinkage Problem: The naive product rule $P_{\text{AND}} = \prod P_i$ causes combined probabilities to shrink toward zero as signals are added, even when all signals agree on relevance. This is a semantic mismatch between joint satisfaction and evidence accumulation.

Log-Odds Conjunction: Averaging in the logit domain resolves shrinkage while preserving probabilistic soundness. The log-odds mean is the exact normalized form of Logarithmic Opinion Pooling (Product of Experts), and multiplicative confidence scaling with the $\sqrt{n}$ law amplifies agreement without inverting the direction of evidence.

Neural Structure Theorem: The end-to-end computation — calibrate, logit, aggregate, sigmoid — is a two-layer feedforward neural network. When all signals share sigmoid calibration, the hidden layer collapses to logistic regression. When signals have heterogeneous calibrations, the logit performs a genuine nonlinear transformation.

Inevitability of Activation Functions: Sigmoid is the unique canonical link for Bernoulli binary outcomes. ReLU is the MAP estimator under sparse non-negative priors. Swish is the Bayesian expected value (posterior mean), related to ReLU by the fundamental MAP-to-Bayes duality. GELU is the Gaussian approximation of Swish: $\text{GELU}(x) \approx x \cdot \sigma(1.702x)$.

Exact Neural Pruning: WAND and BMW from information retrieval constitute provably exact pruning methods for sigmoid-activated networks, enabled by the sigmoid's boundedness. ReLU's unboundedness makes exact pruning unattainable — a consequence of the different probabilistic questions they answer.

Attention as Log-OP: Relaxing the uniform reliability assumption extends the derived structure to the attention mechanism — Logarithmic Opinion Pooling with context-dependent expert weights. This explains why attention computes a weighted sum: Log-OP in the logit domain is additive.

Depth as Recursive Inference: Each layer constructs the evidence required by the next through iterated marginalization over latent variables. Architecture design becomes question sequencing, and activation functions identify the type of inference each layer performs.

The mathematics does not care what we call things. Whether we say "Bayesian posterior" or "sigmoid neuron," "sparse feature detector" or "ReLU unit," "evidence accumulation" or "attention" — the same structures appear wherever information is processed under uncertainty.

Chapter 24 explores how Cognica's hybrid search architecture applies these principles in practice, combining BM25 and vector signals through the log-odds conjunction framework.

References

  1. Robertson, S. E. (1977). The Probability Ranking Principle in IR. Journal of Documentation, 33(4), 294--304.
  2. Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333--389.
  3. Hinton, G. E. (2002). Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8), 1771--1800.
  4. Platt, J. (1999). Probabilistic Outputs for Support Vector Machines. Advances in Large Margin Classifiers, 10(3), 61--74.
  5. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
  6. Broder, A. Z., et al. (2003). Efficient Query Evaluation Using a Two-Level Retrieval Process. CIKM, 426--434.
  7. Ding, S., & Suel, T. (2011). Faster Top-k Document Retrieval Using Block-Max Indexes. SIGIR, 993--1002.
  8. Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML, 807--814.
  9. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
  10. Ramachandran, P., Zoph, B., & Le, Q. V. (2018). Searching for Activation Functions. ICLR (Workshop).
  11. Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
  12. Jeong, J. (2026). From Bayesian Inference to Neural Computation. Zenodo preprint.
  13. Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
  14. Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation. ICML, 1050--1059.
  15. Cox, D. R. (1958). The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society, 20(2), 215--242.

Copyright (c) 2023-2026 Cognica, Inc.