Football analytics has a data problem. Not too little data in general — but too little data per question you're trying to answer. A striker takes maybe 100 shots per season. A newly promoted team has zero Premier League matches. A formation change might have been tried three times. With small samples, point estimates are dangerously misleading.
A striker scores 3 goals from 4 shots in the opening two matches. His conversion rate is 75%! Obviously, he won't sustain that. But a naïve model doesn't know to be sceptical — it has no mechanism to say "75% is possible but I'm not very sure, and I'd expect it to regress." Bayesian modelling gives you exactly this.
You know things before looking at the numbers. You know the average Premier League conversion rate is around 12%. You know a team's defence doesn't change overnight. Bayesian modelling lets you formally encode this prior knowledge and then update it with evidence. The result tells you not just "what's the best guess" but "how confident should I be."
Every estimate comes with a full probability distribution, not just a point and an error bar. You know how wrong you might be.
Priors pull extreme small-sample estimates toward sensible values. A striker's 3/4 gets shrunk toward 12%, not taken at face value.
Posterior distributions plug directly into decision theory. "What's the probability this bet has positive expected value?" — that's a posterior integral.
Everything in Bayesian inference flows from a single equation — Bayes' theorem. It tells you how to update your beliefs when you see new evidence:

P(θ | data) = P(data | θ) × P(θ) / P(data)

where P(θ) is the prior, P(data | θ) is the likelihood, P(θ | data) is the posterior, and P(data) is the normalising constant. In practice, we often write this proportionally (ignoring the normalising constant):

P(θ | data) ∝ P(data | θ) × P(θ)

Posterior ∝ Likelihood × Prior.
You're a scout evaluating a striker from a lower league. Your prior: "Most strikers from this league convert about 10% of shots." You then watch 20 matches (data) and see them convert 6/30 (20%). Your posterior belief is somewhere between 10% and 20% — exactly where depends on how confident your prior was and how much data you've collected.
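The scout's update is a textbook Beta-Binomial calculation. A minimal sketch, where the Beta(2, 18) prior is an illustrative choice with mean 10%:

```python
# Beta-Binomial update for the scouting example.
# Prior: Beta(2, 18) has mean 2/20 = 10% -- "most strikers here convert ~10%".
# Its pseudo-counts (2 + 18 = 20 "shots") encode how confident the prior is.
a_prior, b_prior = 2, 18
goals, shots = 6, 30                 # observed: 6 goals from 30 shots (20%)

# Conjugate update: add successes to a, failures to b.
a_post = a_prior + goals             # 8
b_post = b_prior + (shots - goals)   # 42
post_mean = a_post / (a_post + b_post)
print(f"posterior mean: {post_mean:.3f}")  # 0.160 -- between 10% and 20%
```

With a stronger prior (say Beta(10, 90)) the posterior mean would sit closer to 10%; with more shots it would drift toward 20%.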
The prior P(θ) is your belief about the parameter before seeing any data. Choosing it is where domain expertise meets mathematics. It's also the part that makes frequentists uncomfortable — but it's precisely what makes Bayesian inference powerful for small-sample football problems.
Types of Priors
Uninformative: "I have no idea" — all values equally likely. Useful when you genuinely have no prior knowledge, which is rarely the case in football.
Weakly informative: "I know the rough range" — concentrates probability in sensible regions without being too specific. The sweet spot for most football applications.
Informative: "I know this well" — a tight distribution based on extensive prior data. Powerful, but it requires justification — you're saying "I'm already fairly sure."
With lots of data, the prior barely matters — the likelihood dominates. With little data, the prior matters a lot. This is a feature, not a bug: when you have 5 data points, you should rely heavily on prior knowledge. Always do a sensitivity analysis — try different reasonable priors and check if conclusions change.
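A sensitivity analysis can be as simple as rerunning the same update under several priors; the three priors below are illustrative choices, not recommendations:

```python
# Prior sensitivity check for a conversion rate, via Beta-Binomial updates.
goals, shots = 4, 8   # small sample: the prior should matter here

priors = {
    "uninformative Beta(1, 1)": (1, 1),
    "weak Beta(2, 15)":         (2, 15),
    "strong Beta(12, 88)":      (12, 88),   # tightly centred near 12%
}
results = {}
for name, (a, b) in priors.items():
    a_post, b_post = a + goals, b + (shots - goals)
    results[name] = a_post / (a_post + b_post)
    print(f"{name}: posterior mean = {results[name]:.3f}")

# With only 8 shots the estimates range widely across priors;
# rerun with goals, shots = 40, 80 and they converge.
```

If the conclusion you care about flips between reasonable priors, the data alone isn't settling the question yet.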
Common Distributions for Football

A handful of distributions cover most football quantities: Beta for rates and probabilities (shot conversion, win probability), Poisson for counts (goals per match), and Normal for continuous quantities (team strength, ratings).
The likelihood P(data | θ) is the probability of seeing your observed data, given a specific parameter value. It answers: "If the true conversion rate were 30%, how likely would it be to see exactly 4 goals in 8 shots?" Multiplying prior × likelihood (and normalising) gives the posterior.
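For the binomial case this is a one-liner; the 4-in-8 numbers come from the question just posed:

```python
from math import comb

# Binomial likelihood: "If the true conversion rate were 30%,
# how likely would it be to see exactly 4 goals in 8 shots?"
def binomial_likelihood(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

lik = binomial_likelihood(4, 8, 0.30)
print(f"P(4 goals in 8 shots | theta=0.30) = {lik:.4f}")  # ~0.136
```

Evaluating this for every candidate θ traces out the likelihood function; multiplying it by the prior and normalising gives the posterior.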
Conjugate Priors: Closed-Form Updates
For some prior-likelihood pairs, the posterior is in the same family as the prior — you just update the parameters. These are called conjugate priors, and they're beautiful because no numerical computation is needed. The classic pair: a Beta prior with a Binomial likelihood gives a Beta posterior — add the goals to one parameter and the misses to the other.
The Normal-Normal Update (Details)
If you've used L2 regularisation (ridge regression), you've done Bayesian inference with a Gaussian prior — you just didn't call it that. The regularisation strength λ corresponds to the prior precision 1/σ₀². "Don't let coefficients get too large" = "my prior belief is that coefficients are near zero." Same maths, different language.
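The Normal-Normal update itself is two lines of arithmetic: precisions add, and the posterior mean is a precision-weighted average of the prior mean and the data mean. A sketch with illustrative numbers:

```python
# Normal-Normal conjugate update: Normal prior on a mean theta,
# Normal likelihood with known observation noise sigma.
def normal_update(mu0, sigma0, data_mean, sigma, n):
    prec0 = 1 / sigma0**2           # prior precision
    prec_data = n / sigma**2        # data precision
    prec_post = prec0 + prec_data   # precisions add
    mu_post = (prec0 * mu0 + prec_data * data_mean) / prec_post
    return mu_post, (1 / prec_post) ** 0.5

# Illustrative numbers: prior belief "this team scores ~1.3 goals/game"
# (sd 0.4), then 5 matches averaging 2.0 goals/game with per-match sd 1.1.
mu, sd = normal_update(1.3, 0.4, 2.0, 1.1, 5)
print(f"posterior: mean={mu:.2f}, sd={sd:.2f}")  # mean lands between 1.3 and 2.0
```

Note the ridge connection: a larger prior precision 1/σ₀² (stronger regularisation) pulls the posterior mean harder toward the prior.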
The posterior distribution P(θ | data) is the complete answer. It's not a single number — it's a full probability distribution over all possible values of θ. From it, you can extract anything you need:
- Posterior mean: E[θ | data] — the "average" belief
- Posterior median: the 50th percentile
- MAP: the mode (most probable single value)
- 95% credible interval: θ is in [a, b] with 95% probability
- HDI: highest density interval (the narrowest interval with the stated probability)
- Full distribution: visualise the whole shape
- P(θ > 0.15 | data) — "probability the conversion rate exceeds 15%"
- P(θ_A > θ_B | data) — "probability team A is better than team B"
- P(goals > 2.5 | data) — "probability of over 2.5 goals"
- Posterior predictive: P(new data | observed data) — integrates over parameter uncertainty, so it's wider than plug-in predictions (honest about uncertainty)
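All of these summaries fall out of posterior samples. A sketch using the scouting numbers (6 goals from 30 shots; assuming a Beta(2, 18) prior, the posterior is Beta(8, 42)):

```python
import random
import statistics

random.seed(42)
# Draw samples from the Beta(8, 42) posterior and summarise them.
samples = sorted(random.betavariate(8, 42) for _ in range(100_000))
n = len(samples)

mean = statistics.fmean(samples)
median = samples[n // 2]
lo, hi = samples[int(0.025 * n)], samples[int(0.975 * n)]  # 95% credible interval
p_above_15 = sum(s > 0.15 for s in samples) / n            # posterior probability

print(f"mean={mean:.3f}, median={median:.3f}")
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
print(f"P(theta > 0.15) = {p_above_15:.2f}")
```

The same pattern — sort, slice, count — works for samples from MCMC, which is why sample-based posteriors are so convenient.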
After 10 shots, we're unsure — the posterior is wide. After 100 shots, we're much more confident — the posterior narrows around the true value. This is automatic calibration: the model tells you when to trust the estimate and when not to. No ad hoc sample size rules needed.
The distinction between credible intervals and confidence intervals is worth pausing on, because it's the most practically important difference between Bayesian and frequentist approaches. They sound similar but mean fundamentally different things: a 95% credible interval says "given the data, θ lies in [a, b] with 95% probability"; a 95% confidence interval says "if we repeated the experiment many times, 95% of intervals constructed this way would contain the true θ" — a statement about the procedure, not about θ.
When deciding whether to place a bet, you want to ask: "What's the probability the true win rate is above the bookmaker's implied probability?" That's a posterior probability — a direct answer from Bayesian inference. Frequentist confidence intervals simply cannot answer that question without a Bayesian reinterpretation.
Conjugate priors are elegant but limited. Real football models — like Dixon-Coles match prediction or hierarchical team ratings — have posteriors that can't be written in closed form. The normalising constant P(data) becomes an intractable integral. Enter Markov Chain Monte Carlo (MCMC).
Instead of computing the posterior analytically, draw samples from it. Start at a random parameter value, then randomly walk through parameter space, spending more time in regions of high posterior probability. After enough samples, the histogram of visited values approximates the posterior distribution. You don't need to know P(data) — the ratio of posteriors at two points is enough.
MCMC Algorithms
Metropolis-Hastings: The original. Propose a random move, accept if it improves the posterior (or with some probability if it doesn't). Simple but slow for high-dimensional models.
Hamiltonian Monte Carlo (HMC): Uses gradient information to make smarter proposals. Much more efficient for models with many parameters. The engine behind Stan and PyMC.
NUTS: The No-U-Turn Sampler. Auto-tunes HMC's step size and path length. The modern default — fast, reliable, minimal hand-tuning.
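The random-walk idea fits in a few lines. A minimal Metropolis sketch targeting a Beta-Binomial posterior (illustrative numbers: 6 goals from 30 shots with a Beta(2, 18) prior), showing that only the unnormalised posterior is needed:

```python
import math
import random
import statistics

random.seed(0)

def log_post(theta, goals=6, shots=30, a=2, b=18):
    # Unnormalised log-posterior: Beta(a, b) prior x Binomial likelihood.
    # The Binomial coefficient and P(data) are constants that cancel below.
    if not 0 < theta < 1:
        return float("-inf")
    return ((a - 1 + goals) * math.log(theta)
            + (b - 1 + shots - goals) * math.log(1 - theta))

theta, samples = 0.5, []
for _ in range(50_000):
    proposal = theta + random.gauss(0, 0.05)          # random-walk step
    log_ratio = log_post(proposal) - log_post(theta)  # normaliser cancels here
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        theta = proposal                              # accept the move
    samples.append(theta)

est = statistics.fmean(samples[5_000:])               # discard burn-in
print(f"MCMC posterior mean ~ {est:.3f} (conjugate answer: 0.160)")
```

In practice you'd let Stan or PyMC run NUTS instead, but the acceptance-ratio trick — comparing unnormalised posteriors at two points — is the same.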
Probabilistic Programming Tools

Tools like Stan and PyMC let you write down the model — priors plus likelihood — and leave the sampling (typically NUTS) to the library.
The most powerful application of Bayesian inference in football is hierarchical modelling (also called multilevel modelling). Instead of treating each team or player independently, you model them as coming from a shared population — then let the data determine how much to pool.
Three Approaches to Estimation
No pooling: Estimate each team separately. Works for big clubs with lots of data. Disastrous for newly promoted teams — wildly uncertain estimates from 3 matches.
Complete pooling: Assume all teams are the same. Estimate a single league-wide parameter. Stable, but it ignores that Man City and Luton Town are obviously different.
Partial pooling: The hierarchical Bayesian approach. Teams with lots of data get estimates close to their own data. Teams with little data get shrunk toward the league average. Automatic and optimal.
Consider a newly promoted team after 3 matches. They've scored 5 goals. No-pooling says "1.67 goals/game!" — dangerously overfit. Complete pooling says "1.3 goals/game" (league average) — ignores their data entirely. Partial pooling says "~1.4 goals/game" — closer to the league average because the sample is small, but nudged toward their data. After 30 matches, partial pooling converges to their actual rate.
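The arithmetic of partial pooling can be illustrated with a Gamma-Poisson conjugate update — a simple stand-in for a full hierarchical model. The prior strength of 10 pseudo-matches is an illustrative assumption:

```python
# Gamma-Poisson shrinkage for a goals-per-game rate.
# Prior Gamma(13, 10) has mean 13/10 = 1.3 goals/game (the league average),
# worth 10 "pseudo-matches" of evidence -- an illustrative choice.
def pooled_rate(goals, matches, a0=13.0, b0=10.0):
    # Posterior mean of Gamma(a0 + goals, b0 + matches)
    return (a0 + goals) / (b0 + matches)

print(f"{pooled_rate(5, 3):.2f}")    # 3 matches, 5 goals: pulled near 1.3
print(f"{pooled_rate(50, 30):.2f}")  # 30 matches, 50 goals: near raw 1.67
```

In a true hierarchical model the league-average prior (a0, b0) is itself estimated from all teams at once, rather than fixed by hand as here.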
Bayesian methods are particularly well-suited to football analytics because of the small-sample, high-uncertainty nature of the sport. Here are the key applications:
The foundational model in football analytics. Each team has an attack strength (αᵢ) and defence strength (δᵢ). Goals follow a bivariate Poisson distribution with rates determined by attack vs. defence matchups. Bayesian inference estimates all team parameters simultaneously, with natural shrinkage for teams with fewer observations.
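The rate structure can be sketched as below. Parameter values are invented for illustration, and the real Dixon-Coles model additionally applies a dependence correction for low-scoring results, which is omitted here:

```python
import math

# Dixon-Coles-style scoring rates: home goals ~ Poisson(lam_home),
# away goals ~ Poisson(lam_away), with rates set by attack vs. defence
# on a log scale plus a home-advantage term.
def scoring_rates(attack_h, defence_h, attack_a, defence_a, home_adv=0.25):
    lam_home = math.exp(attack_h - defence_a + home_adv)
    lam_away = math.exp(attack_a - defence_h)
    return lam_home, lam_away

# Strong home attack against a weak away defence (invented values):
lam_h, lam_a = scoring_rates(0.4, 0.1, -0.1, -0.2, home_adv=0.25)
print(f"expected goals: home {lam_h:.2f}, away {lam_a:.2f}")
```

A Bayesian fit places priors on every attack and defence parameter and samples them jointly, which is where the shrinkage for low-data teams comes from.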
Standard xG models output a point estimate (this shot has 0.12 probability of scoring). A Bayesian xG model outputs a posterior distribution: "this shot has a 95% credible interval of [0.06, 0.22]." This matters for cumulative xG — uncertainty compounds over a match, and a Bayesian model correctly propagates it.
Hierarchical Bayesian models naturally handle the varying amounts of data per player. A 35-year-old veteran with 500 appearances gets a tight posterior. A 19-year-old debutant with 10 appearances gets a wide one, shrunk toward the population mean. This is exactly what you want for scouting.
You have a posterior for the home team's win probability P(home_win). The bookmaker's implied probability is 40%. The posterior tells you: P(true_prob > 0.40 | data) = 0.73. That's a 73% chance this bet has positive expected value — far more useful than a point estimate of 42%.
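Computing that probability from posterior samples is direct. A sketch with an invented Beta(25, 32) posterior standing in for a fitted model's output:

```python
import random

random.seed(1)
# Posterior over the home win probability -- Beta(25, 32) is an
# illustrative stand-in (mean ~0.44), not output from a real model.
samples = [random.betavariate(25, 32) for _ in range(100_000)]

implied = 0.40   # bookmaker's implied probability
p_value_bet = sum(s > implied for s in samples) / len(samples)
print(f"P(true win prob > {implied}) = {p_value_bet:.2f}")
```

With these invented numbers the answer comes out around 0.7: a direct probability that the bet has positive expected value, which no point estimate can provide.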
Update the prior (pre-match win probability from Dixon-Coles) with live match events as they happen. Goal scored → large update. Red card → moderate update. Passage of time with no goals → small update toward draw. The posterior at any moment gives the live win probability.
Each player has a latent "injury hazard rate" that you estimate hierarchically — borrowing strength across similar players (same position, age, workload). The posterior updates as the season progresses, incorporating rest days, minutes played, and high-intensity sprints. Wide posteriors early in the season → narrow as data accumulates.
Reach for Bayesian methods when:

- Small samples (few matches, new players)
- You have genuine prior knowledge
- You need uncertainty quantification
- Decisions depend on posterior probabilities
- Hierarchical structure (teams, leagues, seasons)
- Sequential updating (in-game, week-by-week)

Plain maximum-likelihood or machine-learning approaches are fine when:

- Large datasets (millions of events)
- You just need predictions, not uncertainty
- Computational budget is very tight
- The model is simple (logistic regression, XGBoost)
- Prior specification is genuinely controversial
- You need fast iteration speed
With enough data, Bayesian and frequentist methods converge to the same answers. The prior washes out. The difference matters most in exactly the situations football analytics faces: small samples, structured data, and decisions that require probability statements. If you're betting, you need posteriors. If you're training a neural network on millions of tracking frames, MLE is fine.
- ✓ Why small football samples need Bayesian thinking
- ✓ Bayes' theorem: Posterior ∝ Likelihood × Prior
- ✓ Prior distributions and encoding domain knowledge
- ✓ Conjugate updates (Beta-Binomial, Normal-Normal)
- ✓ Credible intervals vs. confidence intervals
- ✓ MCMC for complex models
- ✓ Hierarchical models and partial pooling
- ✓ Six football applications
Bayesian inference isn't about being "more correct" than frequentist statistics — it's about getting the right answers to the right questions. When you have small samples, prior knowledge, hierarchical structure, and you need probability statements about parameters (which is most of football analytics and all of betting), Bayesian modelling isn't optional — it's the natural framework. Start with Beta-Binomial for rates, add hierarchical structure for teams, and reach for MCMC when models get complex.