Bayesian Modelling
How to reason under uncertainty — updating beliefs with evidence, quantifying what you don't know, and making better decisions when data is scarce.
Why Bayesian?

Football analytics has a data problem. Not too little data in general — but too little data per question you're trying to answer. A striker takes maybe 100 shots per season. A newly promoted team has zero Premier League matches. A formation change might have been tried three times. With small samples, point estimates are dangerously misleading.

The Problem: Small Samples Lie

A striker scores 3 goals from 4 shots in the opening two matches. His conversion rate is 75%! Obviously, he won't sustain that. But a naïve model doesn't know to be sceptical — it has no mechanism to say "75% is possible but I'm not very sure, and I'd expect it to regress." Bayesian modelling gives you exactly this.

The Insight: Prior Knowledge Is Data Too

You know things before looking at the numbers. You know the average Premier League conversion rate is around 12%. You know a team's defence doesn't change overnight. Bayesian modelling lets you formally encode this prior knowledge and then update it with evidence. The result tells you not just "what's the best guess" but "how confident should I be."

Uncertainty Quantification

Every estimate comes with a full probability distribution, not just a point and an error bar. You know how wrong you might be.

Natural Regularisation

Priors pull extreme small-sample estimates toward sensible values. A striker's 3/4 gets shrunk toward 12%, not taken at face value.

Decision-Ready

Posterior distributions plug directly into decision theory. "What's the probability this bet has positive expected value?" — that's a posterior integral.

Bayes' Theorem: The Master Equation
One equation to rule them all

Everything in Bayesian inference flows from a single equation. It tells you how to update your beliefs when you see new evidence:

P(θ | data) = P(data | θ) × P(θ) / P(data)
P(θ | data) — the posterior: your updated belief.
P(data | θ) — the likelihood: how probable is the data given θ?
P(θ) — the prior: your belief before the data.
P(data) — the evidence: a normalising constant.

In practice, we often write this proportionally (ignoring the normalising constant):

Posterior ∝ Likelihood × Prior
"Your updated belief is proportional to how well the data fits, weighted by your prior belief"
Football Analogy

You're a scout evaluating a striker from a lower league. Your prior: "Most strikers from this league convert about 10% of shots." You then watch 20 matches (data) and see them convert 6/30 (20%). Your posterior belief is somewhere between 10% and 20% — exactly where depends on how confident your prior was and how much data you've collected.
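The scout's update can be run as a two-line Beta-Binomial calculation. A minimal sketch, assuming the 10% prior is encoded as Beta(2, 18), an illustrative choice; a tighter prior would shrink harder:

```python
# Beta-Binomial update for the scouting example (illustrative prior).
# Prior Beta(2, 18): mean 2/20 = 0.10, matching "about 10% conversion".
# Data: 6 goals from 30 shots (20% observed rate).
def beta_binomial_update(alpha, beta, goals, shots):
    """Return posterior Beta parameters after observing goals/shots."""
    return alpha + goals, beta + (shots - goals)

a_post, b_post = beta_binomial_update(2, 18, goals=6, shots=30)
print(a_post, b_post)                      # -> 8 42
print(a_post / (a_post + b_post))          # -> 0.16, between 10% and 20%
```

The posterior mean of 16% sits between the prior (10%) and the data (20%), closer to the prior because 30 shots is still a small sample.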

Prior Distributions: What You Knew Before
Encoding domain knowledge mathematically

The prior P(θ) is your belief about the parameter before seeing any data. Choosing it is where domain expertise meets mathematics. It's also the part that makes frequentists uncomfortable — but it's precisely what makes Bayesian inference powerful for small-sample football problems.

Types of Priors

Uninformative (Flat) Prior

"I have no idea" — all values equally likely. Useful when you genuinely have no prior knowledge, but rarely the case in football.

θ ~ Uniform(0, 1) or θ ~ Normal(0, 100²)
Example: conversion rate of a newly invented shot type with zero historical data
Weakly Informative Prior

"I know the rough range" — concentrates probability in sensible regions without being too specific. The sweet spot for most football applications.

θ ~ Beta(2, 15) → mean 0.12, 95% CI: [0.02, 0.30]
Example: conversion rate for a Premier League striker — probably 5–25%, centred near 12%
Informative Prior

"I know this well" — tight distribution based on extensive prior data. Powerful but requires justification — you're saying "I'm already fairly sure."

θ ~ Beta(24, 176) → mean 0.12, 95% CI: [0.08, 0.17]
Example: league-average conversion rate based on 200+ historical seasons of data
Prior Sensitivity

With lots of data, the prior barely matters — the likelihood dominates. With little data, the prior matters a lot. This is a feature, not a bug: when you have 5 data points, you should rely heavily on prior knowledge. Always do a sensitivity analysis — try different reasonable priors and check if conclusions change.
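A sensitivity analysis is just the same update run under each candidate prior. A quick sketch with illustrative priors and the 6-from-30 data from the scouting example:

```python
# Prior sensitivity check: same data (6 goals / 30 shots), different
# reasonable priors. If posterior means disagree wildly, the data is
# too thin to override the prior choice. Priors below are illustrative.
data_goals, data_misses = 6, 24

priors = {
    "flat":   (1, 1),      # Uniform(0, 1)
    "weak":   (2, 15),     # mean ~0.12, loose
    "strong": (24, 176),   # mean 0.12, tight
}

for name, (a, b) in priors.items():
    mean = (a + data_goals) / (a + b + data_goals + data_misses)
    print(f"{name:7s} posterior mean = {mean:.3f}")
```

Here the means range from roughly 0.13 (strong prior) to 0.22 (flat prior): 30 shots is not enough to make the prior irrelevant, which is itself a useful finding.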

Common Distributions for Football

Beta(α, β) — probabilities/rates (conversion rate, win probability). Bounded [0, 1].
Normal(μ, σ²) — strengths/ratings (team attack, player skill). Unbounded, symmetric.
Gamma(α, β) — rates/intensities (goals per match, fouls per 90). Positive, right-skewed.
Poisson(λ) — count data (goals in a match, passes in a half). Discrete, non-negative.
The Update: Prior Meets Data
How evidence reshapes your beliefs

The likelihood P(data | θ) is the probability of seeing your observed data, given a specific parameter value. It answers: "If the true conversion rate were 30%, how likely would it be to see exactly 4 goals in 8 shots?" Multiplying prior × likelihood (and normalising) gives the posterior.

[Figure: Bayesian updating, Prior × Likelihood → Posterior. Prior: Beta(2, 5), "I think ~25% conversion"; Likelihood: 4 goals in 8 shots, "data says ~50%"; Posterior: Beta(6, 9), "updated belief: ~40%".]

How the update works (Beta-Binomial conjugacy):
  • Prior: Beta(α=2, β=5) → "2 pseudo-goals, 5 pseudo-misses"
  • Data: 4 goals, 4 misses observed
  • Posterior: Beta(2+4, 5+4) = Beta(6, 9) → mean = 6/15 = 0.40
  • The prior pulled the estimate from 50% down to 40% (regularisation toward the prior belief)

Conjugate Priors: Closed-Form Updates

For some prior-likelihood pairs, the posterior has the same family as the prior — you just update the parameters. These are called conjugate priors, and they're beautiful because no numerical computation is needed.

Key Conjugate Pairs
Beta-Binomial — Prior: Beta(α, β); Data: k successes, n−k failures; Posterior: Beta(α+k, β+n−k). Uses: shot conversion, save %, win rate.
Normal-Normal — Prior: N(μ₀, σ₀²); Data: x̄ from n observations; Posterior: N(μₙ, σₙ²). Uses: team strength, player ratings.
Gamma-Poisson — Prior: Gamma(α, β); Data: sum of counts from n observations; Posterior: Gamma(α+Σx, β+n). Uses: goals per match, fouls per 90.

The Normal-Normal Update (Details)

Prior belief about parameter μ:
μ ~ N(μ₀, σ₀²)
Observe n data points with known variance σ²:
x̄ = sample mean
Posterior:
μₙ = (μ₀/σ₀² + n·x̄/σ²) / (1/σ₀² + n/σ²)
σₙ² = 1 / (1/σ₀² + n/σ²)
Interpretation: The posterior mean is a precision-weighted average of the prior mean and the sample mean. More data → more weight on data. Tighter prior → more weight on prior. The posterior variance is always smaller than both the prior variance and the data variance alone — you're always more certain after seeing data.
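A numeric check of the precision-weighting claim, with made-up team-rating numbers:

```python
# Normal-Normal update: posterior mean is a precision-weighted average.
# Illustrative numbers: prior belief about a team's attack rating.
mu0, var0 = 1.3, 0.3**2         # prior: near league average, fairly loose
xbar, var, n = 1.8, 0.8**2, 5   # 5 matches of noisy per-match estimates

prec0, prec_data = 1 / var0, n / var
mu_n  = (prec0 * mu0 + prec_data * xbar) / (prec0 + prec_data)
var_n = 1 / (prec0 + prec_data)

print(round(mu_n, 3))                    # between 1.3 (prior) and 1.8 (data)
print(var_n < var0 and var_n < var / n)  # True: always more certain after data
```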
This Is Just Regularisation

If you've used L2 regularisation (ridge regression), you've done Bayesian inference with a Gaussian prior — you just didn't call it that. The regularisation strength λ corresponds to the prior precision 1/σ₀². "Don't let coefficients get too large" = "my prior belief is that coefficients are near zero." Same maths, different language.

The Posterior: Everything You Need
A full probability distribution as your answer

The posterior distribution P(θ | data) is the complete answer. It's not a single number — it's a full probability distribution over all possible values of θ. From it, you can extract anything you need:

Point Estimates
  • Posterior mean: E[θ | data] — the "average" belief
  • Posterior median: 50th percentile
  • MAP: Mode (most probable single value)
Uncertainty Intervals
  • 95% credible interval: θ is in [a, b] with 95% probability
  • HDI: Highest density interval (narrowest interval)
  • Full distribution: Visualise the whole shape
Probability Queries
  • P(θ > 0.15 | data) — "probability conversion rate exceeds 15%"
  • P(θ_A > θ_B | data) — "probability team A is better than B"
  • P(goals > 2.5 | data) — "probability of over 2.5 goals"
Predictions
  • Posterior predictive: P(new data | observed data)
  • Integrates over parameter uncertainty
  • Wider than plug-in predictions (honest about uncertainty)
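All of these queries reduce to operations on posterior samples. A stdlib-only Monte Carlo sketch of a probability query, using an illustrative Beta(8, 42) posterior (the scouting example's posterior):

```python
# Posterior probability query by Monte Carlo, using only the stdlib.
import random

random.seed(42)
samples = [random.betavariate(8, 42) for _ in range(100_000)]

p_above_15 = sum(s > 0.15 for s in samples) / len(samples)
print(f"P(theta > 0.15 | data) ~ {p_above_15:.2f}")
```

The same sample list gives the posterior mean (`sum(samples)/len(samples)`), credible intervals (sort and take percentiles), and any other query in the panel above.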
[Figure: the posterior sharpens with more data. Estimating a striker's true conversion rate (true value 0.30), starting from a flat Beta(1, 1) prior, with posteriors shown after 10, 40, and 100 shots.]
The Posterior Sharpens With Data

After 10 shots, we're unsure — the posterior is wide. After 100 shots, we're much more confident — the posterior narrows around the true value. This is automatic calibration: the model tells you when to trust the estimate and when not to. No ad hoc sample size rules needed.

Credible vs. Confidence Intervals
The most misunderstood distinction in statistics

This is worth pausing on because it's the most practically important difference between Bayesian and frequentist approaches. They sound similar but mean fundamentally different things.

[Figure: credible interval vs. confidence interval, side by side.]

Bayesian 95% credible interval: "There is a 95% probability that θ lies in this interval." The interval covers 95% of the posterior mass. It is a direct probability statement: θ is random, the interval is fixed.

Frequentist 95% confidence interval: "If I repeated this experiment infinitely many times, 95% of the intervals constructed this way would contain θ." About 5% of such intervals miss θ entirely. It is not a probability statement about θ: θ is fixed, the interval is random.
Why This Matters for Betting

When deciding whether to place a bet, you want to ask: "What's the probability the true win rate is above the bookmaker's implied probability?" That's a posterior probability — a direct answer from Bayesian inference. Frequentist confidence intervals simply cannot answer that question without a Bayesian reinterpretation.

Modern Inference: When Conjugates Aren't Enough
MCMC and probabilistic programming

Conjugate priors are elegant but limited. Real football models — like Dixon-Coles match prediction or hierarchical team ratings — have posteriors that can't be written in closed form. The normalising constant P(data) becomes an intractable integral. Enter Markov Chain Monte Carlo (MCMC).

The MCMC Idea

Instead of computing the posterior analytically, draw samples from it. Start at a random parameter value, then randomly walk through parameter space, spending more time in regions of high posterior probability. After enough samples, the histogram of visited values approximates the posterior distribution. You don't need to know P(data) — the ratio of posteriors at two points is enough.
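A random-walk Metropolis sampler fits in a dozen lines. This sketch targets a posterior shaped like Beta(6, 9) but only ever evaluates the unnormalised log-density, which is exactly the point: P(data) never appears.

```python
# Metropolis sampler for an unnormalised Beta(6, 9)-shaped posterior.
import math, random

def log_unnorm_posterior(theta):   # log[ theta^5 * (1 - theta)^8 ]
    if not 0 < theta < 1:
        return -math.inf           # zero density outside (0, 1)
    return 5 * math.log(theta) + 8 * math.log(1 - theta)

random.seed(0)
theta, samples = 0.5, []
for _ in range(50_000):
    proposal = theta + random.gauss(0, 0.1)            # random-walk step
    log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if math.log(random.random()) < log_ratio:          # accept/reject
        theta = proposal
    samples.append(theta)

burned = samples[5000:]                                # drop burn-in
print(round(sum(burned) / len(burned), 2))             # near 0.40, the Beta(6, 9) mean
```

Note the acceptance rule uses only a ratio of densities, so the unknown normalising constant cancels; that cancellation is what makes MCMC feasible for intractable posteriors.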

MCMC Algorithms

Metropolis-Hastings

The original. Propose a random move, accept if it improves the posterior (or with some probability if it doesn't). Simple but slow for high-dimensional models.

Hamiltonian MC (HMC)

Uses gradient information to make smarter proposals. Much more efficient for models with many parameters. The engine behind Stan and PyMC.

NUTS

No-U-Turn Sampler. Auto-tunes HMC's step size and path length. The modern default — fast, reliable, minimal hand-tuning.

Probabilistic Programming Tools

Stan (via PyStan / CmdStanPy)
The gold standard. Fast HMC/NUTS. Used heavily in sports analytics; Dixon-Coles models are often implemented here.
PyMC
Pythonic and approachable. Great for prototyping. Uses NUTS under the hood. Good visualisation with ArviZ.
NumPyro / JAX
Blazing fast on GPU. Good for large-scale models or when you need to fit thousands of posteriors quickly.
Variational Inference (VI)
Approximate the posterior with a simpler distribution. Much faster than MCMC but less accurate. Good for huge models.
Hierarchical Models: The Bayesian Superpower
Partial pooling across teams, players, and seasons

The most powerful application of Bayesian inference in football is hierarchical modelling (also called multilevel modelling). Instead of treating each team or player independently, you model them as coming from a shared population — then let the data determine how much to pool.

[Figure: hierarchical Bayesian model of team strengths. League-level hyperpriors μ, σ² capture how strong teams are on average. Each of the 20 teams' attack strengths is drawn from the league: αᵢ ~ N(μ, σ²). Match goals follow yᵢ ~ Poisson(λᵢ), with λᵢⱼ = exp(αᵢ − δⱼ + γ), i.e. attack − defence + home advantage. Partial pooling: teams with few matches borrow strength from the league, so a newly promoted team's estimate shrinks toward the league average (sensible!).]

Three Approaches to Estimation

No Pooling

Estimate each team separately. Works for big clubs with lots of data, but disastrous for newly promoted teams — wildly uncertain estimates from 3 matches.

Problem: unstable with small samples
Complete Pooling

Assume all teams are the same. Estimate a single league-wide parameter. Stable but ignores that Man City and Luton Town are obviously different.

Problem: ignores real differences
Partial Pooling ✓

The hierarchical Bayesian approach. Teams with lots of data get estimates close to their own data. Teams with little data get shrunk toward the league average. Automatic and optimal.

Best of both worlds — data-driven regularisation
Why Partial Pooling Is Magic

Consider a newly promoted team after 3 matches. They've scored 5 goals. No-pooling says "1.67 goals/game!" — dangerously overfit. Complete pooling says "1.3 goals/game" (league average) — ignores their data entirely. Partial pooling says "~1.4 goals/game" — closer to the league average because the sample is small, but nudged toward their data. After 30 matches, partial pooling converges to their actual rate.
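The promoted-team arithmetic is a one-line Gamma-Poisson update. A sketch assuming the league average is encoded as Gamma(13, 10), mean 1.3 goals/game; the choice of pseudo-counts is illustrative:

```python
# Gamma-Poisson shrinkage for the promoted-team example.
# Prior Gamma(13, 10): mean 13/10 = 1.3 goals/game, roughly league average.
def goals_rate_posterior(alpha, beta, goals, matches):
    a, b = alpha + goals, beta + matches
    return a / b   # posterior mean scoring rate

# 5 goals in 3 matches (raw rate 1.67): shrunk most of the way back.
print(round(goals_rate_posterior(13, 10, goals=5, matches=3), 2))   # -> 1.38
# 50 goals in 30 matches at the same raw rate: the data now dominates.
print(round(goals_rate_posterior(13, 10, goals=50, matches=30), 2)) # -> 1.58
```

With 3 matches the posterior sits near the league average; with 30 it has moved most of the way to the team's own rate, exactly the partial-pooling behaviour described above.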

Football Applications
Bayesian models on the pitch

Bayesian methods are particularly well-suited to football analytics because of the small-sample, high-uncertainty nature of the sport. Here are the key applications:

1. Match Outcome Prediction (Dixon-Coles)

The foundational model in football analytics. Each team has an attack strength (αᵢ) and defence strength (δᵢ). Goals follow a bivariate Poisson distribution with rates determined by attack vs. defence matchups. Bayesian inference estimates all team parameters simultaneously, with natural shrinkage for teams with fewer observations.

Model: goals_home ~ Poisson(exp(αᵢ - δⱼ + γ)) | Priors: αᵢ, δᵢ ~ N(0, σ²) | Inference: MCMC via Stan or PyMC
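A sketch of the rate equation with made-up parameter values, using independent Poissons and omitting Dixon-Coles' low-score correlation correction:

```python
# Simplified Dixon-Coles skeleton: scoring rates and a scoreline probability.
# Parameter values are invented for illustration.
import math

def scoring_rate(attack, defence, home_adv=0.25):
    """lambda = exp(attack - defence + home advantage)."""
    return math.exp(attack - defence + home_adv)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

lam_home = scoring_rate(attack=0.3, defence=-0.1)             # home attack vs away defence
lam_away = scoring_rate(attack=0.1, defence=0.2, home_adv=0.0)
p_2_1 = poisson_pmf(2, lam_home) * poisson_pmf(1, lam_away)   # P(2-1) under independence
print(round(lam_home, 2), round(lam_away, 2), round(p_2_1, 3))
```

In the full model the attack/defence parameters get Normal priors and are estimated jointly by MCMC; this fragment only shows how fitted parameters turn into scoreline probabilities.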
2. Expected Goals (xG) with Uncertainty

Standard xG models output a point estimate (this shot has 0.12 probability of scoring). A Bayesian xG model outputs a posterior distribution: "this shot has a 95% credible interval of [0.06, 0.22]." This matters for cumulative xG — uncertainty compounds over a match, and a Bayesian model correctly propagates it.

Model: Bayesian logistic regression | Key benefit: Posterior predictive intervals on match-level xG, not just shot-level
3. Player Rating Systems

Hierarchical Bayesian models naturally handle the varying amounts of data per player. A 35-year-old veteran with 500 appearances gets a tight posterior. A 19-year-old debutant with 10 appearances gets a wide one, shrunk toward the population mean. This is exactly what you want for scouting.

Example: Bayesian version of VAEP/OBSO | Key benefit: Uncertainty bounds tell scouts "we need 20 more appearances before this rating is reliable"
4. Betting Value Detection

You have a posterior for the home team's win probability P(home_win). The bookmaker's implied probability is 40%. The posterior tells you: P(true_prob > 0.40 | data) = 0.73. That's a 73% chance this bet has positive expected value — far more useful than a point estimate of 42%.

Key formula: P(true_prob > implied_prob | data) → if consistently above 0.5, bet has edge | Also: Kelly criterion uses the full posterior
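The edge probability is one Monte Carlo pass over posterior samples. A stdlib sketch with an illustrative Beta(42, 58) posterior (mean 0.42, as in the example above):

```python
# Betting edge from the posterior: P(true win prob > implied prob | data).
import random

random.seed(1)
implied = 0.40   # bookmaker's implied probability
samples = [random.betavariate(42, 58) for _ in range(100_000)]

p_edge = sum(s > implied for s in samples) / len(samples)
print(f"P(positive-EV bet) ~ {p_edge:.2f}")
```

The same posterior samples feed directly into a Kelly staking calculation, since Kelly fractions can be averaged over the posterior rather than plugged in at a point estimate.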
5. In-Game Win Probability

Update the prior (pre-match win probability from Dixon-Coles) with live match events as they happen. Goal scored → large update. Red card → moderate update. Passage of time with no goals → small update toward draw. The posterior at any moment gives the live win probability.

Method: Sequential Bayesian updating | Each event: new posterior becomes the next prior | Output: Real-time win probability curves
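The posterior-becomes-prior mechanic can be shown with a toy Beta chain; a real in-game model would update a full match model event by event, but the bookkeeping is the same. All numbers are illustrative:

```python
# Sequential updating sketch: each posterior becomes the next prior.
belief = (3, 3)   # weak prior, centred on 0.5

observations = [1, 1, 0, 1, 0, 1, 1]     # 1 = outcome of interest observed
for won in observations:
    a, b = belief
    belief = (a + won, b + (1 - won))    # one-observation Bayes update

a, b = belief
print(a, b, round(a / (a + b), 2))       # -> 8 5 0.62
```

Because Bayesian updating is order-independent for exchangeable data, processing the observations one at a time gives exactly the same posterior as processing them in a single batch.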
6. Injury & Fatigue Modelling

Each player has a latent "injury hazard rate" that you estimate hierarchically — borrowing strength across similar players (same position, age, workload). The posterior updates as the season progresses, incorporating rest days, minutes played, and high-intensity sprints. Wide posteriors early in the season → narrow as data accumulates.

Model: Bayesian survival analysis (Cox or Weibull) | Prior: Hierarchical by position/age | Output: P(injury in next 5 matches)
Bayesian vs. Frequentist: When to Use Which
It's not a religion — it's a toolkit
Use Bayesian When
  • Small samples (few matches, new players)
  • You have genuine prior knowledge
  • You need uncertainty quantification
  • Decisions depend on posterior probabilities
  • Hierarchical structure (teams, leagues, seasons)
  • Sequential updating (in-game, week-by-week)
Frequentist Is Fine When
  • Large datasets (millions of events)
  • You just need predictions, not uncertainty
  • Computational budget is very tight
  • The model is simple (logistic regression, XGBoost)
  • Prior specification is genuinely controversial
  • You need fast iteration speed
The Pragmatic View

With enough data, Bayesian and frequentist methods converge to the same answers. The prior washes out. The difference matters most in exactly the situations football analytics faces: small samples, structured data, and decisions that require probability statements. If you're betting, you need posteriors. If you're training a neural network on millions of tracking frames, MLE is fine.

Summary
What You Learned
  • ✓ Why small football samples need Bayesian thinking
  • ✓ Bayes' theorem: Posterior ∝ Likelihood × Prior
  • ✓ Prior distributions and encoding domain knowledge
  • ✓ Conjugate updates (Beta-Binomial, Normal-Normal)
  • ✓ Credible intervals vs. confidence intervals
  • ✓ MCMC for complex models
  • ✓ Hierarchical models and partial pooling
  • ✓ Six football applications
Key Equations
P(θ|D) ∝ P(D|θ) × P(θ)
Beta(α,β) + k wins, n-k losses → Beta(α+k, β+n-k)
μₙ = (μ₀/σ₀² + nx̄/σ²) / (1/σ₀² + n/σ²)
σₙ² = 1 / (1/σ₀² + n/σ²)
λᵢⱼ = exp(αᵢ - δⱼ + γ)
Key Takeaway

Bayesian inference isn't about being "more correct" than frequentist statistics — it's about getting the right answers to the right questions. When you have small samples, prior knowledge, hierarchical structure, and you need probability statements about parameters (which is most of football analytics and all of betting), Bayesian modelling isn't optional — it's the natural framework. Start with Beta-Binomial for rates, add hierarchical structure for teams, and reach for MCMC when models get complex.