Why Generalised Linear Models?
Football is full of data that isn't normally distributed. Goals are non-negative integers (not continuous). Shots are either on target or not (binary). Pass completion rates are bounded between 0 and 1. A team's fouls in a match are counts that can't go below zero.
Standard linear regression assumes your outcome is continuous, unbounded, and has normally distributed errors. Force it onto count data and you get nonsense — predictions of -0.3 goals, confidence intervals stretching into negative territory, and standard errors that are systematically wrong.
Generalised Linear Models (GLMs) solve this by extending regression to handle any outcome distribution in the exponential family. Instead of one special technique per data type, GLMs provide a unified framework: choose your distribution, choose your link function, and the same estimation machinery handles everything.
If you've used linear regression, logistic regression, or Poisson regression, you've already used GLMs — they're all special cases. A GLM isn't a new model; it's the generalisation that reveals these are the same algorithm with different settings.
The Three Components
Every GLM has exactly three components. Once you understand these, you can construct a model for any type of outcome:
1. **Random component:** The probability distribution of your outcome variable Y. Must be from the exponential family: Normal, Poisson, Binomial, Gamma, Negative Binomial, Inverse Gaussian, etc. This determines the variance structure — Poisson has variance = mean, Binomial has variance = np(1-p), and so on.
2. **Systematic component:** The linear predictor η = β₀ + β₁x₁ + β₂x₂ + .... This is always a linear combination of your predictors — it's why the model is called "linear" even though the relationship between predictors and the outcome can be highly nonlinear. The linearity is in the parameters, not the response.
3. **Link function:** The function g() that connects the expected value of Y to the linear predictor: g(E[Y]) = η. The link transforms E[Y] from its natural range (e.g., [0,1] for probabilities, [0,∞) for counts) to the entire real line (-∞, +∞) where the linear predictor lives.
Link Functions
The link function is the key innovation of GLMs. It's the "bridge" between the linear predictor (which can be any real number) and the expected value of Y (which is constrained by the distribution).
| Distribution | Canonical Link | g(μ) | g⁻¹(η) | Football Use |
|---|---|---|---|---|
| Normal | Identity | μ | η | xG, player ratings |
| Binomial | Logit | log(μ/(1-μ)) | 1/(1+e^(−η)) | Shot conversion, win/loss |
| Poisson | Log | log(μ) | e^η | Goals, shots, fouls |
| Gamma | Inverse | 1/μ | 1/η | Time between events |
| Neg. Binomial | Log | log(μ) | e^η | Overdispersed counts |
Each distribution has a canonical link — the one that makes the maths simplest (sufficient statistics are linear in parameters). But you can use any link with any distribution. For example, Poisson with a log link is standard for goals, but Poisson with an identity link is used in some xG models where you want additive rather than multiplicative effects.
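To make the link idea concrete, here is a minimal numpy sketch (illustrative only) of the two most common inverse links from the table. It shows that they map the unconstrained linear predictor into each distribution's natural range, and that the log link makes effects multiplicative:

```python
import numpy as np

# Inverse link functions: map the linear predictor (any real number)
# back into the mean's natural range.
def inv_logit(eta):
    """Binomial canonical inverse link: real line -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

def inv_log(eta):
    """Poisson canonical inverse link: real line -> (0, inf)."""
    return np.exp(eta)

eta = np.linspace(-5, 5, 101)   # linear predictor values
p = inv_logit(eta)              # always valid probabilities, no clipping
lam = inv_log(eta)              # always positive Poisson rates

# The log link makes effects multiplicative: adding 0.1 to eta
# multiplies the rate by exp(0.1), whatever the baseline rate is.
ratio = inv_log(eta + 0.1) / inv_log(eta)
```

This is why a log-link coefficient reads as a rate ratio: the same additive change on the η scale is the same multiplicative change on the μ scale.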
The Mathematics
Exponential Family Form
All GLM distributions can be written in the exponential family form:

f(y; θ, φ) = exp[ (yθ − b(θ)) / a(φ) + c(y, φ) ]

where θ is the natural parameter (related to the mean), φ is the dispersion parameter (related to variance), and b(θ) determines the specific distribution. The key property: E[Y] = b'(θ) and Var[Y] = b''(θ) × a(φ).
Maximum Likelihood Estimation
GLMs are fitted by maximum likelihood. The log-likelihood follows from the exponential family form:

ℓ(β) = Σᵢ [ (yᵢθᵢ − b(θᵢ)) / a(φ) + c(yᵢ, φ) ]

Taking derivatives and setting to zero gives the score equations:

Σᵢ (yᵢ − μᵢ) xᵢⱼ / [ Var(Yᵢ) g′(μᵢ) ] = 0   for each coefficient βⱼ

These can't be solved in closed form (except for the Normal/identity case, which reduces to OLS). Instead, they're solved iteratively using Iteratively Reweighted Least Squares (IRLS):

β⁽ᵗ⁺¹⁾ = (XᵀW⁽ᵗ⁾X)⁻¹ XᵀW⁽ᵗ⁾z⁽ᵗ⁾,   with Wᵢᵢ = 1 / [ Var(Yᵢ) g′(μᵢ)² ]   and   zᵢ = ηᵢ + (yᵢ − μᵢ) g′(μᵢ)
This is just weighted least squares applied iteratively — at each step, the weights W and working response z are updated, then a weighted regression is run. Convergence is typically fast (5-10 iterations).
IRLS reveals that GLMs are fundamentally "just" repeated weighted regressions. This is why GLMs are fast to fit and computationally stable. It also explains why standard regression diagnostics (leverage, Cook's distance) extend naturally to GLMs — they're computed from the final iteration's weights.
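The "repeated weighted regressions" view can be made explicit in a few lines. Here is a minimal numpy sketch of IRLS for a Poisson/log GLM on simulated data (the simulation and coefficient values are illustrative, not from any real dataset; for Poisson with log link the weights simplify to W = μ and the working response to z = η + (y − μ)/μ):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta              # linear predictor
        mu = np.exp(eta)            # inverse log link, always positive
        W = mu                      # Poisson/log: 1/(Var * g'(mu)^2) = mu
        z = eta + (y - mu) / mu     # working response
        XtW = X.T * W               # weight each observation's column
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Simulated data: intercept plus one covariate, known coefficients.
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = irls_poisson(X, y)   # should land close to true_beta
```

Each pass is an ordinary weighted regression; only the weights and working response change between passes, which is exactly why leverage and Cook's distance carry over.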
Poisson Regression for Goals
The single most important GLM in football analytics. Goals are counts — non-negative integers with a right-skewed distribution. A Poisson GLM with a log link is the natural choice:

goals ~ Poisson(λ),   log(λ) = μ + αᵢ − δⱼ + γ·home

where αᵢ is team i's attack strength and δⱼ is opponent j's defensive strength. This is the core of the Dixon-Coles model — the foundation of almost every match prediction system (Dixon and Coles start from this independent-Poisson structure and add a correction for low-scoring results). The log link ensures λ > 0 (can't have negative expected goals) and gives multiplicative effects: a team's goal rate is their attack strength × opponent's defence weakness × home advantage, all on the exponential scale.
Interpreting Coefficients
Because of the log link, coefficients are on the log-rate scale. To interpret them, exponentiate: e^β is a rate ratio, the factor by which the expected goal rate is multiplied for a one-unit increase in that predictor. For example, a home-advantage coefficient of 0.25 would multiply the goal rate by e^0.25 ≈ 1.28.

Overdispersion

Poisson assumes Var[Y] = E[Y]. But football goals are often overdispersed — the variance exceeds the mean (some matches are 5-4, most are 1-0). Solutions: (1) Quasi-Poisson — estimate a dispersion parameter φ that inflates standard errors; (2) Negative Binomial — a proper distribution with Var = μ + μ²/k; (3) Check if overdispersion is "real" or just due to missing predictors.
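A quick simulation (with illustrative parameters, not real match data) shows the Negative Binomial variance formula in action: the variance exceeds the mean by μ²/k, which is exactly the extra spread Poisson cannot represent.

```python
import numpy as np

# Negative Binomial with mean mu and size k has Var = mu + mu^2/k.
# numpy parameterises NB by (n, p) with mean n(1-p)/p, so set p = k/(k+mu).
rng = np.random.default_rng(7)
mu, k = 1.4, 5.0                       # ~1.4 goals per team, mild overdispersion
p = k / (k + mu)
draws = rng.negative_binomial(k, p, size=200_000)

theoretical_var = mu + mu**2 / k       # 1.4 + 1.96/5 = 1.792, above the mean
```

As k → ∞ the extra term vanishes and the Negative Binomial collapses back to Poisson, which is why it is the standard fallback when φ̂ ≫ 1.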
Logistic Regression for Probabilities
The second most important GLM in football: modelling binary outcomes. Did the shot result in a goal? Did the team win? Was the pass completed?

goal ~ Bernoulli(p),   logit(p) = β₀ + β₁·distance + β₂·angle + ...

This is literally an xG (expected goals) model. Every xG value you see on TV is the output of a logistic GLM (or a more complex model such as gradient boosting, but the principle is the same). The logit link maps probabilities (0,1) to the real line (-∞,+∞).
Interpreting Coefficients: Odds Ratios
Logistic regression coefficients are on the log-odds scale. Exponentiate to get odds ratios: e^β is the factor by which the odds of a goal change for a one-unit increase in that predictor. For example, an odds ratio of 0.5 for a header indicator would mean headed shots have half the odds of being scored, all else equal.

xG modelling is a natural fit for logistic regression: each shot has a binary outcome (goal/no goal), the features are continuous (distance, angle) and categorical (body part, assist type), and the output is a probability. The calibration property of logistic regression means that if you predict 0.15 xG for 1000 shots, roughly 150 should be goals. This is exactly what xG needs — well-calibrated probabilities.
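The calibration property can be checked directly: at the maximum likelihood solution of a logistic GLM with an intercept, the average predicted probability equals the observed success rate exactly. A sketch on simulated shots (the "distance" feature and its coefficients are hypothetical, chosen only so goals get rarer with distance):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a Bernoulli/logit GLM by IRLS (equivalent to Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # predicted probabilities
        W = mu * (1.0 - mu)               # Bernoulli variance weights
        z = eta + (y - mu) / W            # working response
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Toy "shots": one feature standing in for shot distance (hypothetical data).
rng = np.random.default_rng(0)
n = 5000
dist = rng.uniform(5, 30, size=n)
true_p = 1.0 / (1.0 + np.exp(-(2.0 - 0.2 * dist)))   # goals rarer from far out
y = rng.binomial(1, true_p)

X = np.column_stack([np.ones(n), dist])
beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
# With an intercept, mean(p_hat) matches mean(y) at the MLE: total predicted
# goals equal total actual goals, the calibration property xG relies on.
```

This identity follows from the score equations with the canonical logit link: Xᵀ(y − μ) = 0, and the intercept column of ones forces Σ(yᵢ − μᵢ) = 0.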
Model Comparison & Diagnostics
Deviance
In GLMs, deviance replaces the residual sum of squares from linear regression. It measures how far your model is from a perfect (saturated) model:

D = 2[ ℓ(saturated) − ℓ(model) ]

Lower deviance means a better fit; for the Normal distribution with identity link, the deviance is exactly the residual sum of squares.
Likelihood Ratio Tests
To test whether adding a predictor improves the model, compare deviances:

ΔD = D_reduced − D_full ~ χ²_q   (asymptotically, under the null)

where q is the number of extra parameters in the full model. If ΔD exceeds the χ²_q critical value, the added predictors significantly improve the fit.
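A sketch of the test using only Python's standard library; the deviance values are invented for illustration. For one added parameter, the chi-squared survival function has the closed form P(χ²₁ > x) = erfc(√(x/2)):

```python
import math

# Likelihood ratio test from residual deviances (illustrative numbers):
# reduced model without a predictor vs full model with it (1 extra parameter).
dev_reduced = 1312.4   # hypothetical residual deviance, reduced model
dev_full = 1305.9      # hypothetical residual deviance, full model

delta = dev_reduced - dev_full   # ~ chi-squared with 1 df under the null
# Survival function of chi-squared with 1 df via the complementary error function.
p_value = math.erfc(math.sqrt(delta / 2.0))
```

Here ΔD = 6.5 exceeds the 5% critical value of 3.84, so the extra predictor would be kept at that level.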
AIC for Model Selection

AIC = −2ℓ + 2p trades goodness of fit against model complexity: lower is better. Unlike the likelihood ratio test, AIC does not require the models being compared to be nested, which makes it useful for choosing between different link functions or predictor sets.
Key Diagnostics
- **Pearson residuals:** rᵢ = (yᵢ - μ̂ᵢ) / √V(μ̂ᵢ). Should be ≈ N(0,1) if the model is correct.
- **Deviance residuals:** each observation's contribution to the total deviance, given the sign of yᵢ - μ̂ᵢ. More symmetric than Pearson residuals.
- **Dispersion estimate:** φ̂ = X²/(n-p). If φ̂ ≫ 1, you have overdispersion — use quasi-likelihood or NB.
- **Leverage and influence:** the hat matrix diagonal hᵢᵢ and Cook's distance extend from OLS via the IRLS weights.
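A sketch of the dispersion check on simulated Poisson data (`mu_hat` stands in for fitted means from a real model, and `p` is a pretend parameter count): when the Poisson assumption holds, φ̂ sits near 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated fitted means and Poisson outcomes (stand-ins for a real fit).
mu_hat = rng.uniform(0.5, 3.0, size=4000)
y = rng.poisson(mu_hat)
p = 3   # pretend the model used 3 parameters

pearson = (y - mu_hat) / np.sqrt(mu_hat)   # V(mu) = mu for Poisson
X2 = np.sum(pearson**2)                    # Pearson chi-squared statistic
phi_hat = X2 / (len(y) - p)                # dispersion estimate, ~1 here
```

Rerunning this with overdispersed outcomes (e.g. Negative Binomial draws at the same means) would push φ̂ well above 1, which is the signal to switch families.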
Football Applications
**Match prediction.** The Dixon-Coles model: goals ~ Poisson, log(λ) = μ + αᵢ - δⱼ + γ·home. Fit on historical results to estimate attack/defence strengths, then predict future matches. The log link gives multiplicative structure — City's attack × Newcastle's defence × home advantage. From λ₁ and λ₂, compute P(Home win), P(Draw), P(Away win) via the Poisson PMF.
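The final step, turning the two goal rates into outcome probabilities, can be sketched in plain Python (independent Poisson scorelines, without the Dixon-Coles low-score correction; the rates 1.8 and 1.1 are illustrative):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def match_probabilities(lam_home, lam_away, max_goals=10):
    """Outcome probabilities from two independent Poisson goal rates."""
    p_home = p_draw = p_away = 0.0
    for i in range(max_goals + 1):          # home goals
        for j in range(max_goals + 1):      # away goals
            p = poisson_pmf(i, lam_home) * poisson_pmf(j, lam_away)
            if i > j:
                p_home += p
            elif i == j:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Illustrative rates, e.g. from exp(linear predictor) for each side.
p_h, p_d, p_a = match_probabilities(1.8, 1.1)
```

Truncating at 10 goals per side loses negligible probability mass at these rates, so the three outcomes sum to essentially 1.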
**Expected goals (xG).** goal ~ Bernoulli, logit(p) = β₀ + β₁·dist + β₂·angle + β₃·header + β₄·big_chance + .... Each shot gets a probability of being scored. Sum over all shots to get team xG. The logistic GLM provides well-calibrated probabilities — the foundation of all modern analytics.

**Fouls and cards.** fouls ~ Poisson, log(λ) = β₀ + β₁·away + β₂·aggression + β₃·referee_strictness. Fouls are overdispersed counts — some matches have 30+ fouls while others have 10. A Negative Binomial GLM handles this better than Poisson. Useful for betting on cards markets and referee analysis.

**Pass completion.** completed/attempted ~ Binomial(n, p), logit(p) = β₀ + β₁·distance + β₂·pressure + β₃·progressive. Model a player's pass completion rate as a function of difficulty. This gives "expected pass completion" — comparing actual to expected reveals which players are elite passers beyond what their pass selection explains.

**Time to first chance.** time_to_first_shot ~ Gamma, log(μ) = β₀ + β₁·pressing_intensity + β₂·possession. How quickly does a team create their first chance? Time-to-event data is positive and right-skewed — a Gamma GLM with log link is the natural choice. Also used for time between goals, recovery time from injuries, and minutes played before substitution.

**Market efficiency.** home_win ~ Bernoulli, logit(p) = β₀ + β₁·implied_prob + β₂·our_model_prob. If bookmaker odds are efficient, β₂ should be zero (our model adds nothing beyond market odds). If β₂ is significantly positive, our model captures information the market misses — the basis for value betting.
GLMs and Beyond
GLMs are the starting point — many extensions build on the same framework:
- **Mixed-effects GLMs (GLMMs):** add random effects for grouped data. GLM + partial pooling. See the Mixed Effects article.
- **Generalised Additive Models (GAMs):** replace linear terms with smooth splines: g(μ) = β₀ + f₁(x₁) + f₂(x₂). Captures nonlinearity while keeping interpretability.
- **Regularised GLMs:** add penalty terms to prevent overfitting with many predictors. glmnet in R fits L1/L2 regularised GLMs efficiently.
- **Bayesian GLMs:** place priors on β to get posterior distributions instead of point estimates. Natural regularisation plus full uncertainty.
GLMs are interpretable — every coefficient has a clear meaning (log-odds ratio, rate ratio). Machine learning models (XGBoost, neural nets) often predict better but can't tell you why. In football analytics, this matters: you want to know that headers convert 48% less than foot shots, not just that the model's AUC is 0.79. Start with a GLM to understand the problem, then upgrade to ML if prediction accuracy is paramount.
Practical Tips
- **Match the distribution to the data type:** Continuous → Normal. Binary → Bernoulli. Counts → Poisson (or NB if overdispersed). Proportions → Binomial. Positive continuous → Gamma. Don't force counts into a Normal model just because "it kinda works" — the wrong variance structure gives wrong inference even if predictions look OK.
- **Check for overdispersion:** After fitting a Poisson GLM, compute φ̂ = residual deviance / residual df. If φ̂ > 1.5, you have overdispersion — switch to quasi-Poisson or Negative Binomial. For logistic regression, φ̂ ≈ 1 always (Bernoulli has no dispersion parameter).
- **Use offsets for unequal exposure:** If teams play different numbers of matches, use an offset: goals ~ Poisson, log(λ) = log(matches) + β₀ + β₁x. The log(matches) offset converts the model from total goals to goals per match. Without it, teams with more matches look artificially stronger.
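A small numerical check of the offset logic (team totals are hypothetical): for an intercept-only Poisson model with a log(matches) offset, the maximum likelihood per-match rate is simply total goals divided by total matches, which a crude grid search recovers.

```python
import numpy as np

# Intercept-only Poisson with offset log(matches):
#   log(lambda_i) = log(matches_i) + beta0
matches = np.array([38, 38, 19, 10])    # unequal exposure (hypothetical)
goals = np.array([68, 44, 31, 12])      # goal totals (hypothetical)

def log_lik(beta0):
    """Poisson log-likelihood up to a constant; the offset has coefficient 1."""
    lam = matches * np.exp(beta0)
    return np.sum(goals * np.log(lam) - lam)

grid = np.linspace(-1.0, 1.5, 2001)
beta0_hat = grid[np.argmax([log_lik(b) for b in grid])]
rate_hat = np.exp(beta0_hat)                 # fitted goals per match
closed_form = goals.sum() / matches.sum()    # 155/105, about 1.476
```

Without the offset, the intercept would estimate total goals per team, conflating strength with exposure; with it, every coefficient is on the per-match scale.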
- **Software:** R: glm() in base R handles everything; MASS::glm.nb() for Negative Binomial. Python: statsmodels' GLM, plus sklearn.linear_model.PoissonRegressor and LogisticRegression.
Summary
- GLM = distribution + linear predictor + link function
- Link maps E[Y] to (-∞,+∞) for the linear predictor
- Poisson/log link for goals → Dixon-Coles model
- Binomial/logit link for probabilities → xG model
- Fitted by IRLS (iteratively reweighted least squares)
- Deviance and AIC for model comparison
GLMs are the workhorse of football analytics. The Poisson GLM underpins every match prediction model. The logistic GLM underpins every xG model. Understanding GLMs means understanding why these models work — why the log link gives multiplicative team strengths, why logistic regression gives calibrated probabilities, and why you can't just use linear regression for everything. Before reaching for gradient boosting or neural networks, make sure you understand what a GLM would tell you first.