Regression Models
From simple linear relationships to probabilistic predictions: understanding regression for football analytics.
Machine Learning · Statistics · Prediction · Probability
What is Regression?

Regression is one of the most fundamental concepts in statistics and machine learning. At its core, regression answers a simple question: given some input variables, what output should we expect?

Think of it like this: if you know a team's expected goals (xG), can you predict how many actual goals they'll score? If you know the difference in team ratings, can you predict the probability of a home win? Regression gives us the mathematical tools to answer these questions.

Input (Features)

The information we have: xG, possession %, form, player ratings, odds, etc. Also called "independent variables" or "predictors."

Output (Target)

What we want to predict: goals scored, win probability, goal difference, etc. Also called "dependent variable" or "response."

The Core Idea

Regression finds the relationship between inputs and outputs using historical data. Once we know this relationship, we can use it to make predictions on new, unseen data.

Linear Regression: The Foundation
The simplest and most interpretable regression model

Linear regression assumes a straight-line relationship between inputs and outputs. Despite its simplicity, it's incredibly powerful and forms the foundation for understanding more complex models.

[Figure: Linear regression finds the best-fit line through goals scored vs. team xG (expected goals); residuals are the vertical distances from the data points to the line.]
The Equation
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
y = prediction
β₀ = intercept (baseline)
βᵢ = coefficients (weights)
ε = error term
The Mathematics: Ordinary Least Squares (OLS)

Linear regression finds coefficients by minimizing the sum of squared residuals. Here's the full derivation:

1. Define the Loss Function (Sum of Squared Errors)
SSE = Σᵢ(yᵢ - ŷᵢ)² = Σᵢ(yᵢ - β₀ - β₁xᵢ)²
2. Take Partial Derivatives and Set to Zero
∂SSE/∂β₀ = -2Σᵢ(yᵢ - β₀ - β₁xᵢ) = 0
∂SSE/∂β₁ = -2Σᵢxᵢ(yᵢ - β₀ - β₁xᵢ) = 0
3. Solve for Coefficients (Closed-Form Solution)
β₁ = Σᵢ(xᵢ - x̄)(yᵢ - ȳ) / Σᵢ(xᵢ - x̄)² = Cov(X,Y) / Var(X)
β₀ = ȳ - β₁x̄
4. Matrix Form (Multiple Regression)
β = (XᵀX)⁻¹Xᵀy
where X is the design matrix with a column of 1s for the intercept
Key Insight: OLS chooses the coefficients that minimize the sum of squared residuals. The matrix formula (XᵀX)⁻¹Xᵀy gives that exact minimizer in a single step, with no iteration needed (in practice, solvers work with the linear system XᵀXβ = Xᵀy rather than forming the inverse explicitly).
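The scalar and matrix formulas above give the same answer, which is easy to verify numerically; a minimal sketch on synthetic xG data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: goals = 0.15 + 0.9 * xG + noise (numbers illustrative)
xg = rng.uniform(0.5, 3.0, 200)
goals = 0.15 + 0.9 * xg + rng.normal(0.0, 0.3, 200)

# Scalar formulas: beta1 = Cov(X, Y) / Var(X), beta0 = ybar - beta1 * xbar
beta1 = np.cov(xg, goals, ddof=0)[0, 1] / np.var(xg)
beta0 = goals.mean() - beta1 * xg.mean()

# Matrix form: beta = (X^T X)^-1 X^T y, with a column of 1s for the intercept
X = np.column_stack([np.ones_like(xg), xg])
beta = np.linalg.solve(X.T @ X, X.T @ goals)

print(beta0, beta1)   # scalar solution
print(beta)           # matrix solution: the same [beta0, beta1]
```

Solving the linear system with np.linalg.solve is numerically safer than computing the inverse directly.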
Football Example

Predicting Goals from xG:

Goals = 0.15 + 0.92 × xG

Interpretation: For every 1.0 increase in xG, we expect 0.92 more goals. The intercept (0.15) is the baseline when xG = 0.

Interpreting Coefficients

Each coefficient tells you:

Sign: Positive = increases output, Negative = decreases
Magnitude: How much the output changes per unit input
Measuring Model Fit: R² (Coefficient of Determination)
R² = 1 - SS_res/SS_tot = 1 - Σᵢ(yᵢ - ŷᵢ)² / Σᵢ(yᵢ - ȳ)²
R² = 0.0 → model no better than predicting the mean
R² = 0.5 → explains 50% of the variance
R² = 1.0 → perfect fit
Statistical Assumptions (Gauss-Markov)
1. Linearity: E[Y|X] = Xβ (relationship is actually linear)
2. Homoscedasticity: Var(εᵢ) = σ² (constant error variance)
3. Independence: Cov(εᵢ, εⱼ) = 0 for i ≠ j (errors are uncorrelated)
4. No multicollinearity: Features are not perfectly correlated
5. Normality: ε ~ N(0, σ²) (for inference/confidence intervals)
Strengths
Highly interpretable — every coefficient has meaning
Fast to train, even on large datasets
Works well when relationships are actually linear
Good baseline to compare other models against
Weaknesses
Can't capture non-linear relationships
Sensitive to outliers
Assumes errors are normally distributed
Can predict impossible values (negative goals)
Understanding Distributions
Different data types require different assumptions

Before choosing a regression model, you need to understand what type of data you're predicting. The distribution of your target variable determines which regression technique is appropriate.

[Figure: Common distributions in football analytics. Normal (Gaussian): goal difference, player ratings. Poisson: goals per match, shots, corners. Binomial/Bernoulli: win/loss outcomes, BTTS, clean sheets. Choose by target type: continuous → Normal, count data → Poisson, binary → Binomial/Logistic.]
Probability Density/Mass Functions
Normal Distribution
f(x) = (1/√(2πσ²)) × e^(-(x-μ)²/(2σ²))
μ = mean, σ² = variance
E[X] = μ, Var(X) = σ²
Poisson Distribution
P(X = k) = (λᵏ × e⁻λ) / k!
λ = rate parameter
E[X] = λ, Var(X) = λ
Bernoulli Distribution
P(X = k) = pᵏ(1-p)¹⁻ᵏ, k ∈ {0,1}
p = probability of success
E[X] = p, Var(X) = p(1-p)
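These mean and variance identities can be sanity-checked with scipy.stats (the parameter values below are illustrative):

```python
from scipy.stats import norm, poisson, bernoulli

# Normal: E[X] = mu, Var(X) = sigma^2  (e.g. goal difference)
mu, sigma = 0.0, 1.3
print(norm.mean(loc=mu, scale=sigma), norm.var(loc=mu, scale=sigma))   # ≈ 0.0, 1.69

# Poisson: mean and variance are both lambda (equidispersion), e.g. goals per match
lam = 1.5
print(poisson.mean(lam), poisson.var(lam))    # 1.5, 1.5

# Bernoulli: E[X] = p, Var(X) = p(1 - p)  (e.g. a win/no-win outcome)
p = 0.45
print(bernoulli.mean(p), bernoulli.var(p))    # ≈ 0.45, 0.2475
```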
Normal (Gaussian)

Continuous values that can be positive or negative, clustered around a mean.

Examples:
  • Goal difference (-5 to +5)
  • xG difference
  • Player rating changes
Use: Linear Regression
Poisson

Count data — discrete, non-negative integers representing "how many."

Examples:
  • Goals scored (0, 1, 2, 3...)
  • Shots on target
  • Corners, fouls
Use: Poisson Regression
Binomial/Bernoulli

Binary outcomes — yes/no, win/lose, happened/didn't happen.

Examples:
  • Win or not win
  • Both teams score (BTTS)
  • Over/Under 2.5 goals
Use: Logistic Regression
Why Does This Matter?

Using the wrong distribution leads to poor predictions and invalid confidence intervals. Linear regression assumes errors are normally distributed — if you use it for count data (goals), you might predict negative goals or fractional values that don't make sense.

Logistic Regression: Predicting Probabilities
When your target is binary (yes/no, win/lose)

Logistic regression is the go-to model for binary classification. Despite the name, it's used for classification, not regression of continuous values. It outputs a probability between 0 and 1.

[Figure: The sigmoid curve. P(Home Win) plotted against xG difference (home minus away); observed outcomes sit at y = 1 (home win) and y = 0 (no win), and the output is always between 0 and 1.]
The Mathematics: From Odds to Probability
1. Start with Odds (not probability)
Odds = P(Y=1) / P(Y=0) = p / (1-p)
Odds of 2:1 means twice as likely to win as lose
2. Model the Log-Odds (Logit) as Linear
logit(p) = log(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + ...
Log-odds can range from -∞ to +∞, perfect for linear models
3. Invert to Get Probability (Sigmoid Function)
p = σ(z) = 1 / (1 + e⁻ᶻ)
where z = β₀ + β₁x₁ + ...
4. Properties of the Sigmoid
σ(-∞) = 0
σ(0) = 0.5
σ(+∞) = 1
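These properties, and the fact that the sigmoid inverts the logit, are quick to verify in NumPy (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

print(sigmoid(-20.0))   # ~0 for very negative z
print(sigmoid(0.0))     # exactly 0.5
print(sigmoid(20.0))    # ~1 for very positive z

# sigmoid and logit are inverses: probability -> log-odds -> probability
p = 0.62
print(sigmoid(logit(p)))   # ≈ 0.62
```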
Maximum Likelihood Estimation (MLE)

Unlike OLS, logistic regression uses MLE — we find parameters that maximize the probability of observing our data:

L(β) = Πᵢ p(xᵢ)^yᵢ × (1-p(xᵢ))^(1-yᵢ)

Taking the log (for computational stability):

ℓ(β) = Σᵢ [yᵢ log(p(xᵢ)) + (1-yᵢ) log(1-p(xᵢ))]

This log-likelihood is the negative of the cross-entropy loss, so maximizing it is equivalent to minimizing cross-entropy. There is no closed-form solution; it is optimized iteratively with gradient methods or Newton-Raphson.
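As a sketch of that optimization: the gradient of ℓ(β) is Xᵀ(y − p), so a few lines of plain gradient ascent recover the coefficients on synthetic data (all data and settings below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: intercept + one feature, true beta = [-0.5, 0.8]
n = 500
X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
true_beta = np.array([-0.5, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Gradient ascent on the log-likelihood; the gradient is X^T (y - p)
beta = np.zeros(2)
lr = 0.5
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += lr * X.T @ (y - p) / n   # averaged gradient keeps the step size stable

print(beta)   # close to the true coefficients [-0.5, 0.8]
```

In practice you would use scikit-learn or statsmodels, which apply faster second-order solvers.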

Football Example

Predicting Home Win:

Features: xG_diff, form_diff, h2h_record
log(P/(1-P)) = -0.5 + 0.8×xG_diff + 0.3×form_diff

If xG_diff = 0.5 and form_diff = 1.0, then z = -0.5 + 0.8×0.5 + 0.3×1.0 = 0.2, so the model outputs P(Home Win) = σ(0.2) ≈ 0.55

Interpreting Coefficients

Coefficients are in log-odds (not probabilities):

Positive: Increases probability of outcome
exp(β): Odds ratio — how much odds multiply per unit increase
If β = 0.8, then exp(0.8) = 2.23 — odds more than double per unit xG_diff
Multinomial Logistic (Softmax) for 1X2

For Home/Draw/Away prediction, we extend to K classes:

P(Y = k) = exp(zₖ) / Σⱼ exp(zⱼ)
where zₖ = β₀ₖ + β₁ₖx₁ + ... (one set of coefficients per class)
The softmax function ensures all probabilities sum to 1: P(Home) + P(Draw) + P(Away) = 1
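A minimal softmax sketch for 1X2 (the scores zₖ here are made up):

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical linear scores z_k for Home / Draw / Away
z = np.array([0.9, 0.1, -0.4])
probs = softmax(z)

for outcome, p in zip(["Home", "Draw", "Away"], probs):
    print(f"P({outcome}) = {p:.3f}")
print(f"Sum = {probs.sum():.3f}")   # always 1.000
```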
Poisson Regression: Predicting Counts
When your target is count data (goals, shots, etc.)

Poisson regression is designed for count data — non-negative integers like goals scored. It's the foundation of many football prediction models, especially for predicting scorelines.

[Figure: Poisson distribution of goals per match with λ = 1.5. P(0) = 22.3%, P(1) = 33.5%, P(2) = 25.1%, P(3) = 12.6%, P(4) = 4.7%, P(5+) ≈ 1.9%.]
The Mathematics: Generalized Linear Model
1. The Poisson Distribution
P(Y = k) = (λᵏ × e⁻λ) / k!
λ is both the mean and variance (equidispersion)
2. The Log Link Function
log(λ) = β₀ + β₁x₁ + β₂x₂ + ...
Ensures λ > 0 since λ = exp(Xβ)
3. Interpreting Coefficients
exp(βᵢ) = multiplicative effect on expected count
If β₁ = 0.3, then exp(0.3) = 1.35 → each unit increase in x₁ multiplies expected goals by 1.35
4. MLE for Poisson
ℓ(β) = Σᵢ [yᵢ log(λᵢ) - λᵢ - log(yᵢ!)]
= Σᵢ [yᵢ(Xᵢβ) - exp(Xᵢβ)] + const
The Dixon-Coles Model (1997)

The classic football prediction model extends basic Poisson:

log(λ_home) = μ + home_advantage + α_home - β_away
log(λ_away) = μ + α_away - β_home

Where α = attack strength, β = defense strength. Also includes a ρ parameter to adjust for correlation in low-scoring matches.
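Given fitted parameters, the expected goal rates come straight from exponentiating the linear predictors; a small sketch with hypothetical values (team labels and all numbers are purely illustrative):

```python
import numpy as np

# Hypothetical fitted parameters (purely illustrative)
mu = 0.1           # baseline log goal rate
home_adv = 0.25    # home-advantage term
alpha = {"TeamA": 0.30, "TeamB": 0.05}    # attack strengths
beta = {"TeamA": 0.20, "TeamB": -0.10}    # defense strengths

home, away = "TeamA", "TeamB"
lam_home = np.exp(mu + home_adv + alpha[home] - beta[away])
lam_away = np.exp(mu + alpha[away] - beta[home])

print(f"Expected goals: home {lam_home:.2f}, away {lam_away:.2f}")   # home 2.12, away 0.95
```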

Classic Football Model

Independent Poisson Model:

log(λ_home) = μ + attack_home + defense_away
log(λ_away) = μ + attack_away + defense_home

Each team's goals are modeled separately from its attack rating and the opponent's defense rating (parameterized here as defensive weakness, so a leakier defense increases expected goals; sign conventions vary between sources). This is the basis of many betting models.

From λ to Scoreline Probabilities

Once you have λ_home and λ_away:

P(Home=2, Away=1) = P(Home=2) × P(Away=1)
= Poisson(2; λ_home) × Poisson(1; λ_away)

Calculate P for each scoreline (0-0, 1-0, 0-1, ...), then sum to get match outcome probabilities.

Limitation: Independence Assumption

Basic Poisson regression assumes home and away goals are independent. In reality, they're often correlated (high-scoring games, defensive games). Advanced models use bivariate Poisson or copulas to model this correlation.

Regularized Regression
Preventing overfitting when you have many features

When you have many features (especially correlated ones), linear regression can overfit — the model becomes too tailored to training data and performs poorly on new data. Regularization adds a penalty for complex models.

Ridge (L2)

Adds a penalty proportional to the square of coefficient values.

Loss = Σᵢ(yᵢ - ŷᵢ)² + α × Σⱼ(βⱼ²)
Effect: Shrinks coefficients toward zero but never exactly zero. Good for correlated features.
Closed form: β = (XᵀX + αI)⁻¹Xᵀy
Lasso (L1)

Adds a penalty proportional to the absolute value of coefficients.

Loss = Σᵢ(yᵢ - ŷᵢ)² + α × Σⱼ|βⱼ|
Effect: Can shrink coefficients to exactly zero — automatic feature selection!
No closed form: Requires iterative optimization (coordinate descent)
Elastic Net

Combines Ridge and Lasso penalties. Best of both worlds.

Loss = MSE + α₁Σ|βⱼ| + α₂Σ(βⱼ²)
Effect: Feature selection + handles correlated features well.
Mixing param: ρ ∈ [0,1] controls L1/L2 balance
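The contrast between the two penalties is easy to see on synthetic data: Ridge shrinks every coefficient, while Lasso zeroes out irrelevant ones. A minimal scikit-learn sketch (data and alpha values are illustrative), which also verifies the Ridge closed form:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)

# Synthetic data: 5 features, only the first two actually matter
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print("Ridge:", ridge.coef_.round(2))   # all shrunk, none exactly zero
print("Lasso:", lasso.coef_.round(2))   # irrelevant features driven to exactly 0

# Ridge closed form (no intercept for simplicity): beta = (X^T X + alpha*I)^-1 X^T y
ridge_nc = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
beta = np.linalg.solve(X.T @ X + 10.0 * np.eye(5), X.T @ y)
print(np.allclose(beta, ridge_nc.coef_))   # True
```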
The Bias-Variance Tradeoff
E[(y - ŷ)²] = Bias(ŷ)² + Var(ŷ) + σ²
Bias²
Error from wrong assumptions (underfitting)
Variance
Error from sensitivity to training data (overfitting)
σ² (Irreducible)
Noise in the data we can't reduce

Regularization increases bias but decreases variance. The optimal α balances these — found via cross-validation.
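Choosing α by cross-validation can be a one-liner with scikit-learn's RidgeCV (the grid and data below are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)

# Synthetic data: 8 features, only two carry signal (illustrative)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 1.0, 300)

# RidgeCV tries every alpha on the grid and keeps the best by cross-validation
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("Best alpha:", model.alpha_)
```

For time-ordered football data, prefer a time-based split (as in the logistic example later) over random folds.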

When to Use Regularization
Many features relative to number of samples
Correlated features (multicollinearity)
Want to identify most important features
Model is overfitting (high train, low test performance)
Other Regression Types
Specialized models for specific situations
Negative Binomial

Like Poisson but handles overdispersion (variance > mean). More flexible for count data.

Var(Y) = μ + μ²/r (r = dispersion)
Use when: Poisson doesn't fit well, variance is higher than expected (e.g., goals in cup matches)
Ordinal Regression

For ordered categories (e.g., lose/draw/win, ratings 1-5). Uses cumulative link functions.

logit(P(Y ≤ k)) = αₖ - Xβ
Use when: Categories have a natural order but distances between them are unknown
Quantile Regression

Predicts specific percentiles (median, 90th percentile) rather than the mean.

Loss = Σᵢ ρτ(yᵢ - ŷᵢ) where ρτ = check function
Use when: You care about extreme scenarios or the distribution is skewed
Zero-Inflated Models

For count data with excess zeros. Two-component mixture model.

P(Y=0) = π + (1-π)×Poisson(0;λ)
Use when: Many zeros in data (e.g., goals scored by a goalkeeper)
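The excess-zero mechanism in that mixture can be sketched directly (π and λ below are made-up values):

```python
from scipy.stats import poisson

pi, lam = 0.3, 1.2   # illustrative: 30% structural zeros, Poisson rate 1.2

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: extra point mass at zero, scaled Poisson elsewhere."""
    base = (1 - pi) * poisson.pmf(k, lam)
    return pi + base if k == 0 else base

print(f"P(Y=0) inflated: {zip_pmf(0, pi, lam):.3f}")   # ≈ 0.511
print(f"P(Y=0) plain:    {poisson.pmf(0, lam):.3f}")   # ≈ 0.301
total = sum(zip_pmf(k, pi, lam) for k in range(50))
print(f"Total mass: {total:.6f}")                       # ≈ 1.0, still a valid distribution
```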
Evaluating Regression Models
How do you know if your model is good?
For Continuous Targets
R² (Coefficient of Determination)
R² = 1 - SS_res/SS_tot
How much variance the model explains. 1.0 = perfect, 0 = no better than mean.
RMSE (Root Mean Squared Error)
RMSE = √(Σᵢ(yᵢ-ŷᵢ)²/n)
Average error magnitude in original units. Lower = better.
MAE (Mean Absolute Error)
MAE = Σᵢ|yᵢ-ŷᵢ|/n
Average absolute error. Less sensitive to outliers than RMSE.
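All three metrics are one-liners with scikit-learn (the predictions below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Actual vs. predicted goal differences (illustrative)
y_true = np.array([1.0, -2.0, 0.0, 3.0, 1.0])
y_pred = np.array([0.5, -1.0, 0.5, 2.0, 1.5])

print(f"R2:   {r2_score(y_true, y_pred):.3f}")                       # 0.792
print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.3f}")    # 0.742
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")            # 0.700
```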
For Classification (Logistic)
Log Loss (Cross-Entropy)
L = -Σᵢ[yᵢlog(pᵢ) + (1-yᵢ)log(1-pᵢ)]
Measures probability calibration. Lower = better calibrated probabilities.
AUC-ROC
Area under ROC curve. Measures ranking ability. 1.0 = perfect ranking, 0.5 = random.
Brier Score
BS = Σᵢ(pᵢ - yᵢ)²/n
MSE for probabilities. Combines calibration and discrimination.
Football-Specific: Betting Metrics

For betting applications, also track ROI (return on investment), Closing Line Value (did you beat the closing odds?), and calibration plots (do 70% predictions actually happen 70% of the time?).

Application to Football
Practical examples with code
Common Use Cases
Poisson: Predict scorelines, goal totals
Logistic: Win/lose, BTTS, Over/Under
Multinomial: 1X2 match outcome
Linear: xG prediction, player ratings
Feature Ideas
Rolling xG averages (home/away)
Form (points per game last N matches)
Head-to-head historical record
Rest days, travel distance
Market odds (as a feature)
Example: Logistic Regression for Home Win
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np

# Features (calculated BEFORE each match)
features = ['home_xg_avg_5', 'away_xg_avg_5',
            'home_form_5', 'away_form_5',
            'xg_diff', 'elo_diff']

X = df[features]
y = df['home_win']  # 1 = home win, 0 = not home win

# Time-based split
tscv = TimeSeriesSplit(n_splits=5)

# Logistic Regression with regularization
model = LogisticRegression(
    C=1.0,              # Inverse of regularization strength
    penalty='l2',        # Ridge regularization
    max_iter=1000
)

# Train and evaluate
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    
    print(f"Log Loss: {log_loss(y_test, probs):.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_test, probs):.4f}")

# Interpret coefficients
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:.3f} (odds ratio: {np.exp(coef):.2f})")
Example: Poisson Regression for Goals
import statsmodels.api as sm
from scipy.stats import poisson
import numpy as np

# Prepare data: each row is one team in one match
# Features: attack strength, opponent defense strength
X = df[['attack_rating', 'opp_defense_rating', 'is_home']]
X = sm.add_constant(X)
y = df['goals_scored']

# Fit Poisson regression (GLM with log link)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

print(results.summary())

# Predict expected goals for a match
home_features = [1, 1.2, 0.9, 1]  # const, attack, opp_def, is_home
away_features = [1, 1.0, 1.1, 0]

lambda_home = np.exp(np.dot(home_features, results.params))
lambda_away = np.exp(np.dot(away_features, results.params))

print(f"Expected goals - Home: {lambda_home:.2f}, Away: {lambda_away:.2f}")

# Calculate scoreline probabilities
max_goals = 6
for h in range(max_goals + 1):
    for a in range(max_goals + 1):
        prob = poisson.pmf(h, lambda_home) * poisson.pmf(a, lambda_away)
        if prob > 0.01:  # Only show likely scorelines
            print(f"{h}-{a}: {prob:.1%}")
Key Takeaways
1
Match distribution to target type

Continuous → Linear, Binary → Logistic, Counts → Poisson

2
Coefficients are interpretable

Linear: direct effect, Logistic: log-odds, Poisson: log-rate

3
Use regularization for many features

Ridge, Lasso, or Elastic Net prevent overfitting

4
Great baselines for football

Poisson for goals, logistic for outcomes — simple but effective