Why Generalised Linear Models?
Football is full of data that isn't normally distributed. Goals are non-negative integers (not continuous). Shots are either on target or not (binary). Pass completion rates are bounded between 0 and 1. A team's fouls in a match are counts that can't go below zero.
Standard linear regression assumes your outcome is continuous, unbounded, and has normally distributed errors. Force it onto count data and you get nonsense — predictions of -0.3 goals, confidence intervals stretching into negative territory, and standard errors that are systematically wrong.
Generalised Linear Models (GLMs) solve this by extending regression to handle any outcome distribution in the exponential family. Instead of one special technique per data type, GLMs provide a unified framework: choose your distribution, choose your link function, and the same estimation machinery handles everything.
If you've used linear regression, logistic regression, or Poisson regression, you've already used GLMs — they're all special cases. A GLM isn't a new model; it's the generalisation that reveals these are the same algorithm with different settings.
The Three Components
Every GLM has exactly three components. Once you understand these, you can construct a model for any type of outcome:
1. **Random component:** The probability distribution of your outcome variable Y. Must be from the exponential family: Normal, Poisson, Binomial, Gamma, Negative Binomial, Inverse Gaussian, etc. This determines the variance structure — Poisson has variance = mean, Binomial has variance = np(1-p), and so on.
2. **Systematic component:** The linear predictor η = β₀ + β₁x₁ + β₂x₂ + .... This is always a linear combination of your predictors — it's why the model is called "linear" even though the relationship between predictors and the outcome can be highly nonlinear. The linearity is in the parameters, not the response.
3. **Link function:** The function g() that connects the expected value of Y to the linear predictor: g(E[Y]) = η. The link transforms E[Y] from its natural range (e.g., [0,1] for probabilities, [0,∞) for counts) to the entire real line (-∞, +∞) where the linear predictor lives.
Link Functions
The link function is the key innovation of GLMs. It's the "bridge" between the linear predictor (which can be any real number) and the expected value of Y (which is constrained by the distribution).
| Distribution | Canonical Link | g(μ) | g⁻¹(η) | Football Use |
|---|---|---|---|---|
| Normal | Identity | μ | η | xG, player ratings |
| Binomial | Logit | log(μ/(1-μ)) | 1/(1+e^(−η)) | Shot conversion, win/loss |
| Poisson | Log | log(μ) | e^η | Goals, shots, fouls |
| Gamma | Inverse | 1/μ | 1/η | Time between events |
| Neg. Binomial | Log | log(μ) | e^η | Overdispersed counts |
Each distribution has a canonical link — the one that makes the maths simplest (sufficient statistics are linear in parameters). But you can use any link with any distribution. For example, Poisson with a log link is standard for goals, but Poisson with an identity link is used in some xG models where you want additive rather than multiplicative effects.
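To make the link idea concrete, here is a minimal numpy sketch (illustrative only) of the two most common inverse links from the table. It shows that they map the unconstrained linear predictor into each distribution's natural range, and that the log link makes effects multiplicative:

```python
import numpy as np

# Inverse link functions: map the linear predictor (any real number)
# back into the mean's natural range.
def inv_logit(eta):
    """Binomial canonical inverse link: real line -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

def inv_log(eta):
    """Poisson canonical inverse link: real line -> (0, inf)."""
    return np.exp(eta)

eta = np.linspace(-5, 5, 101)   # linear predictor values
p = inv_logit(eta)              # always valid probabilities, no clipping
lam = inv_log(eta)              # always positive Poisson rates

# The log link makes effects multiplicative: adding 0.1 to eta
# multiplies the rate by exp(0.1), whatever the baseline rate is.
ratio = inv_log(eta + 0.1) / inv_log(eta)
```

This is why a log-link coefficient reads as a rate ratio: the same additive change on the η scale is the same multiplicative change on the μ scale.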
The Mathematics
Exponential Family Form
All GLM distributions can be written in the exponential family form:

f(y; θ, φ) = exp[ (yθ − b(θ)) / a(φ) + c(y, φ) ]

where θ is the natural parameter (related to the mean), φ is the dispersion parameter (related to variance), and b(θ) determines the specific distribution. The key property: E[Y] = b'(θ) and Var[Y] = b''(θ) × a(φ).
Maximum Likelihood Estimation
GLMs are fitted by maximum likelihood. The log-likelihood follows from the exponential family form:

ℓ(β) = Σᵢ [ (yᵢθᵢ − b(θᵢ)) / a(φ) + c(yᵢ, φ) ]

Taking derivatives and setting to zero gives the score equations:

Σᵢ (yᵢ − μᵢ) xᵢⱼ / [ Var(Yᵢ) g′(μᵢ) ] = 0   for each coefficient βⱼ

These can't be solved in closed form (except for the Normal/identity case, which reduces to OLS). Instead, they're solved iteratively using Iteratively Reweighted Least Squares (IRLS):

β⁽ᵗ⁺¹⁾ = (XᵀW⁽ᵗ⁾X)⁻¹ XᵀW⁽ᵗ⁾z⁽ᵗ⁾,   with Wᵢᵢ = 1 / [ Var(Yᵢ) g′(μᵢ)² ]   and   zᵢ = ηᵢ + (yᵢ − μᵢ) g′(μᵢ)
This is just weighted least squares applied iteratively — at each step, the weights W and working response z are updated, then a weighted regression is run. Convergence is typically fast (5-10 iterations).
IRLS reveals that GLMs are fundamentally "just" repeated weighted regressions. This is why GLMs are fast to fit and computationally stable. It also explains why standard regression diagnostics (leverage, Cook's distance) extend naturally to GLMs — they're computed from the final iteration's weights.
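The "repeated weighted regressions" view can be made explicit in a few lines. Here is a minimal numpy sketch of IRLS for a Poisson/log GLM on simulated data (the simulation and coefficient values are illustrative, not from any real dataset; for Poisson with log link the weights simplify to W = μ and the working response to z = η + (y − μ)/μ):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta              # linear predictor
        mu = np.exp(eta)            # inverse log link, always positive
        W = mu                      # Poisson/log: 1/(Var * g'(mu)^2) = mu
        z = eta + (y - mu) / mu     # working response
        XtW = X.T * W               # weight each observation's column
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Simulated data: intercept plus one covariate, known coefficients.
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = irls_poisson(X, y)   # should land close to true_beta
```

Each pass is an ordinary weighted regression; only the weights and working response change between passes, which is exactly why leverage and Cook's distance carry over.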
Poisson Regression for Goals
The single most important GLM in football analytics. Goals are counts — non-negative integers with a right-skewed distribution. A Poisson GLM with a log link is the natural choice:

goals ~ Poisson(λ),   log(λ) = μ + αᵢ − δⱼ + γ·home

where αᵢ is team i's attack strength and δⱼ is opponent j's defensive strength. This is the core of the Dixon-Coles model — the foundation of almost every match prediction system (Dixon and Coles start from this independent-Poisson structure and add a correction for low-scoring results). The log link ensures λ > 0 (can't have negative expected goals) and gives multiplicative effects: a team's goal rate is their attack strength × opponent's defence weakness × home advantage, all on the exponential scale.
Interpreting Coefficients
Because of the log link, coefficients are on the log-rate scale. To interpret them, exponentiate: e^β is a rate ratio, the factor by which the expected goal rate is multiplied for a one-unit increase in that predictor. For example, a home-advantage coefficient of 0.25 would multiply the goal rate by e^0.25 ≈ 1.28.

Overdispersion

Poisson assumes Var[Y] = E[Y]. But football goals are often overdispersed — the variance exceeds the mean (some matches are 5-4, most are 1-0). Solutions: (1) Quasi-Poisson — estimate a dispersion parameter φ that inflates standard errors; (2) Negative Binomial — a proper distribution with Var = μ + μ²/k; (3) Check if overdispersion is "real" or just due to missing predictors.
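A quick simulation (with illustrative parameters, not real match data) shows the Negative Binomial variance formula in action: the variance exceeds the mean by μ²/k, which is exactly the extra spread Poisson cannot represent.

```python
import numpy as np

# Negative Binomial with mean mu and size k has Var = mu + mu^2/k.
# numpy parameterises NB by (n, p) with mean n(1-p)/p, so set p = k/(k+mu).
rng = np.random.default_rng(7)
mu, k = 1.4, 5.0                       # ~1.4 goals per team, mild overdispersion
p = k / (k + mu)
draws = rng.negative_binomial(k, p, size=200_000)

theoretical_var = mu + mu**2 / k       # 1.4 + 1.96/5 = 1.792, above the mean
```

As k → ∞ the extra term vanishes and the Negative Binomial collapses back to Poisson, which is why it is the standard fallback when φ̂ ≫ 1.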
Logistic Regression for Probabilities
The second most important GLM in football: modelling binary outcomes. Did the shot result in a goal? Did the team win? Was the pass completed?

goal ~ Bernoulli(p),   logit(p) = β₀ + β₁·distance + β₂·angle + ...

This is literally an xG (expected goals) model. Every xG value you see on TV is the output of a logistic GLM (or a more complex model such as gradient boosting, but the principle is the same). The logit link maps probabilities (0,1) to the real line (-∞,+∞).
Interpreting Coefficients: Odds Ratios
Logistic regression coefficients are on the log-odds scale. Exponentiate to get odds ratios: e^β is the factor by which the odds of a goal change for a one-unit increase in that predictor. For example, an odds ratio of 0.5 for a header indicator would mean headed shots have half the odds of being scored, all else equal.

xG modelling is a natural fit for logistic regression: each shot has a binary outcome (goal/no goal), the features are continuous (distance, angle) and categorical (body part, assist type), and the output is a probability. The calibration property of logistic regression means that if you predict 0.15 xG for 1000 shots, roughly 150 should be goals. This is exactly what xG needs — well-calibrated probabilities.
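The calibration property can be checked directly: at the maximum likelihood solution of a logistic GLM with an intercept, the average predicted probability equals the observed success rate exactly. A sketch on simulated shots (the "distance" feature and its coefficients are hypothetical, chosen only so goals get rarer with distance):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a Bernoulli/logit GLM by IRLS (equivalent to Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # predicted probabilities
        W = mu * (1.0 - mu)               # Bernoulli variance weights
        z = eta + (y - mu) / W            # working response
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Toy "shots": one feature standing in for shot distance (hypothetical data).
rng = np.random.default_rng(0)
n = 5000
dist = rng.uniform(5, 30, size=n)
true_p = 1.0 / (1.0 + np.exp(-(2.0 - 0.2 * dist)))   # goals rarer from far out
y = rng.binomial(1, true_p)

X = np.column_stack([np.ones(n), dist])
beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
# With an intercept, mean(p_hat) matches mean(y) at the MLE: total predicted
# goals equal total actual goals, the calibration property xG relies on.
```

This identity follows from the score equations with the canonical logit link: Xᵀ(y − μ) = 0, and the intercept column of ones forces Σ(yᵢ − μᵢ) = 0.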
Model Comparison & Diagnostics
Deviance
In GLMs, deviance replaces the residual sum of squares from linear regression. It measures how far your model is from a perfect (saturated) model:

D = 2[ ℓ(saturated) − ℓ(model) ]

Lower deviance means a better fit; for the Normal distribution with identity link, the deviance is exactly the residual sum of squares.
Likelihood Ratio Tests
To test whether adding a predictor improves the model, compare deviances:

ΔD = D_reduced − D_full ~ χ²_q   (asymptotically, under the null)

where q is the number of extra parameters in the full model. If ΔD exceeds the χ²_q critical value, the added predictors significantly improve the fit.
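A sketch of the test using only Python's standard library; the deviance values are invented for illustration. For one added parameter, the chi-squared survival function has the closed form P(χ²₁ > x) = erfc(√(x/2)):

```python
import math

# Likelihood ratio test from residual deviances (illustrative numbers):
# reduced model without a predictor vs full model with it (1 extra parameter).
dev_reduced = 1312.4   # hypothetical residual deviance, reduced model
dev_full = 1305.9      # hypothetical residual deviance, full model

delta = dev_reduced - dev_full   # ~ chi-squared with 1 df under the null
# Survival function of chi-squared with 1 df via the complementary error function.
p_value = math.erfc(math.sqrt(delta / 2.0))
```

Here ΔD = 6.5 exceeds the 5% critical value of 3.84, so the extra predictor would be kept at that level.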
AIC for Model Selection

AIC = −2ℓ + 2p trades goodness of fit against model complexity: lower is better. Unlike the likelihood ratio test, AIC does not require the models being compared to be nested, which makes it useful for choosing between different link functions or predictor sets.
Key Diagnostics
- **Pearson residuals:** rᵢ = (yᵢ - μ̂ᵢ) / √V(μ̂ᵢ). Should be ≈ N(0,1) if the model is correct.
- **Deviance residuals:** each observation's contribution to the total deviance, given the sign of yᵢ - μ̂ᵢ. More symmetric than Pearson residuals.
- **Dispersion estimate:** φ̂ = X²/(n-p). If φ̂ ≫ 1, you have overdispersion — use quasi-likelihood or NB.
- **Leverage and influence:** the hat matrix diagonal hᵢᵢ and Cook's distance extend from OLS via the IRLS weights.
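A sketch of the dispersion check on simulated Poisson data (`mu_hat` stands in for fitted means from a real model, and `p` is a pretend parameter count): when the Poisson assumption holds, φ̂ sits near 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated fitted means and Poisson outcomes (stand-ins for a real fit).
mu_hat = rng.uniform(0.5, 3.0, size=4000)
y = rng.poisson(mu_hat)
p = 3   # pretend the model used 3 parameters

pearson = (y - mu_hat) / np.sqrt(mu_hat)   # V(mu) = mu for Poisson
X2 = np.sum(pearson**2)                    # Pearson chi-squared statistic
phi_hat = X2 / (len(y) - p)                # dispersion estimate, ~1 here
```

Rerunning this with overdispersed outcomes (e.g. Negative Binomial draws at the same means) would push φ̂ well above 1, which is the signal to switch families.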
Football Applications
**Match prediction.** The Dixon-Coles model: goals ~ Poisson, log(λ) = μ + αᵢ - δⱼ + γ·home. Fit on historical results to estimate attack/defence strengths, then predict future matches. The log link gives multiplicative structure — City's attack × Newcastle's defence × home advantage. From λ₁ and λ₂, compute P(Home win), P(Draw), P(Away win) via the Poisson PMF.
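The final step, turning the two goal rates into outcome probabilities, can be sketched in plain Python (independent Poisson scorelines, without the Dixon-Coles low-score correction; the rates 1.8 and 1.1 are illustrative):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def match_probabilities(lam_home, lam_away, max_goals=10):
    """Outcome probabilities from two independent Poisson goal rates."""
    p_home = p_draw = p_away = 0.0
    for i in range(max_goals + 1):          # home goals
        for j in range(max_goals + 1):      # away goals
            p = poisson_pmf(i, lam_home) * poisson_pmf(j, lam_away)
            if i > j:
                p_home += p
            elif i == j:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Illustrative rates, e.g. from exp(linear predictor) for each side.
p_h, p_d, p_a = match_probabilities(1.8, 1.1)
```

Truncating at 10 goals per side loses negligible probability mass at these rates, so the three outcomes sum to essentially 1.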
**Expected goals (xG).** goal ~ Bernoulli, logit(p) = β₀ + β₁·dist + β₂·angle + β₃·header + β₄·big_chance + .... Each shot gets a probability of being scored. Sum over all shots to get team xG. The logistic GLM provides well-calibrated probabilities — the foundation of all modern analytics.

**Fouls and cards.** fouls ~ Poisson, log(λ) = β₀ + β₁·away + β₂·aggression + β₃·referee_strictness. Fouls are overdispersed counts — some matches have 30+ fouls while others have 10. A Negative Binomial GLM handles this better than Poisson. Useful for betting on cards markets and referee analysis.

**Pass completion.** completed/attempted ~ Binomial(n, p), logit(p) = β₀ + β₁·distance + β₂·pressure + β₃·progressive. Model a player's pass completion rate as a function of difficulty. This gives "expected pass completion" — comparing actual to expected reveals which players are elite passers beyond what their pass selection explains.

**Time to first chance.** time_to_first_shot ~ Gamma, log(μ) = β₀ + β₁·pressing_intensity + β₂·possession. How quickly does a team create their first chance? Time-to-event data is positive and right-skewed — a Gamma GLM with log link is the natural choice. Also used for time between goals, recovery time from injuries, and minutes played before substitution.

**Market efficiency.** home_win ~ Bernoulli, logit(p) = β₀ + β₁·implied_prob + β₂·our_model_prob. If bookmaker odds are efficient, β₂ should be zero (our model adds nothing beyond market odds). If β₂ is significantly positive, our model captures information the market misses — the basis for value betting.
GLMs and Beyond
GLMs are the starting point — many extensions build on the same framework:
- **Mixed-effects GLMs (GLMMs):** add random effects for grouped data. GLM + partial pooling. See the Mixed Effects article.
- **Generalised Additive Models (GAMs):** replace linear terms with smooth splines: g(μ) = β₀ + f₁(x₁) + f₂(x₂). Captures nonlinearity while keeping interpretability.
- **Regularised GLMs:** add penalty terms to prevent overfitting with many predictors. glmnet in R fits L1/L2 regularised GLMs efficiently.
- **Bayesian GLMs:** place priors on β to get posterior distributions instead of point estimates. Natural regularisation plus full uncertainty.
GLMs are interpretable — every coefficient has a clear meaning (log-odds ratio, rate ratio). Machine learning models (XGBoost, neural nets) often predict better but can't tell you why. In football analytics, this matters: you want to know that headers convert 48% less than foot shots, not just that the model's AUC is 0.79. Start with a GLM to understand the problem, then upgrade to ML if prediction accuracy is paramount.
Practical Tips
- **Match the distribution to the data type:** Continuous → Normal. Binary → Bernoulli. Counts → Poisson (or NB if overdispersed). Proportions → Binomial. Positive continuous → Gamma. Don't force counts into a Normal model just because "it kinda works" — the wrong variance structure gives wrong inference even if predictions look OK.
- **Check for overdispersion:** After fitting a Poisson GLM, compute φ̂ = residual deviance / residual df. If φ̂ > 1.5, you have overdispersion — switch to quasi-Poisson or Negative Binomial. For logistic regression, φ̂ ≈ 1 always (Bernoulli has no dispersion parameter).
- **Use offsets for unequal exposure:** If teams play different numbers of matches, use an offset: goals ~ Poisson, log(λ) = log(matches) + β₀ + β₁x. The log(matches) offset converts the model from total goals to goals per match. Without it, teams with more matches look artificially stronger.
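A small numerical check of the offset logic (team totals are hypothetical): for an intercept-only Poisson model with a log(matches) offset, the maximum likelihood per-match rate is simply total goals divided by total matches, which a crude grid search recovers.

```python
import numpy as np

# Intercept-only Poisson with offset log(matches):
#   log(lambda_i) = log(matches_i) + beta0
matches = np.array([38, 38, 19, 10])    # unequal exposure (hypothetical)
goals = np.array([68, 44, 31, 12])      # goal totals (hypothetical)

def log_lik(beta0):
    """Poisson log-likelihood up to a constant; the offset has coefficient 1."""
    lam = matches * np.exp(beta0)
    return np.sum(goals * np.log(lam) - lam)

grid = np.linspace(-1.0, 1.5, 2001)
beta0_hat = grid[np.argmax([log_lik(b) for b in grid])]
rate_hat = np.exp(beta0_hat)                 # fitted goals per match
closed_form = goals.sum() / matches.sum()    # 155/105, about 1.476
```

Without the offset, the intercept would estimate total goals per team, conflating strength with exposure; with it, every coefficient is on the per-match scale.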
- **Software:** R: glm() in base R handles everything; MASS::glm.nb() for Negative Binomial. Python: statsmodels' GLM, plus sklearn.linear_model.PoissonRegressor and LogisticRegression.
Summary
- GLM = distribution + linear predictor + link function
- Link maps E[Y] to (-∞,+∞) for the linear predictor
- Poisson/log link for goals → Dixon-Coles model
- Binomial/logit link for probabilities → xG model
- Fitted by IRLS (iteratively reweighted least squares)
- Deviance and AIC for model comparison
GLMs are the workhorse of football analytics. The Poisson GLM underpins every match prediction model. The logistic GLM underpins every xG model. Understanding GLMs means understanding why these models work — why the log link gives multiplicative team strengths, why logistic regression gives calibrated probabilities, and why you can't just use linear regression for everything. Before reaching for gradient boosting or neural networks, make sure you understand what a GLM would tell you first.