Regression Models
From simple linear relationships to probabilistic predictions: understanding regression for football analytics.
Machine Learning · Statistics · Prediction · Probability
What is Regression?

Regression is one of the most fundamental concepts in statistics and machine learning. At its core, regression answers a simple question: given some input variables, what output should we expect?

Think of it like this: if you know a team's expected goals (xG), can you predict how many actual goals they'll score? If you know the difference in team ratings, can you predict the probability of a home win? Regression gives us the mathematical tools to answer these questions.

Input (Features)

The information we have: xG, possession %, form, player ratings, odds, etc. Also called "independent variables" or "predictors."

Output (Target)

What we want to predict: goals scored, win probability, goal difference, etc. Also called "dependent variable" or "response."

The Core Idea

Regression finds the relationship between inputs and outputs using historical data. Once we know this relationship, we can use it to make predictions on new, unseen data.

Linear Regression: The Foundation
The simplest and most interpretable regression model

Linear regression assumes a straight-line relationship between inputs and outputs. Despite its simplicity, it's incredibly powerful and forms the foundation for understanding more complex models.

[Figure: Linear regression finds the best-fit line through goals scored vs. team xG (expected goals); residuals are the vertical distances from the data points to the line.]
The Equation
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
y = prediction
β₀ = intercept (baseline)
βᵢ = coefficients (weights)
ε = error term
The Mathematics: Ordinary Least Squares (OLS)

Linear regression finds coefficients by minimizing the sum of squared residuals. Here's the full derivation:

1. Define the Loss Function (Sum of Squared Errors)
SSE = Σᵢ(yᵢ - ŷᵢ)² = Σᵢ(yᵢ - β₀ - β₁xᵢ)²
2. Take Partial Derivatives and Set to Zero
∂SSE/∂β₀ = -2Σᵢ(yᵢ - β₀ - β₁xᵢ) = 0
∂SSE/∂β₁ = -2Σᵢxᵢ(yᵢ - β₀ - β₁xᵢ) = 0
3. Solve for Coefficients (Closed-Form Solution)
β₁ = Σᵢ(xᵢ - x̄)(yᵢ - ȳ) / Σᵢ(xᵢ - x̄)² = Cov(X,Y) / Var(X)
β₀ = ȳ - β₁x̄
4. Matrix Form (Multiple Regression)
β = (XᵀX)⁻¹Xᵀy
where X is the design matrix with a column of 1s for the intercept
Key Insight: OLS chooses the coefficients that minimize the sum of squared residuals. The matrix formula (XᵀX)⁻¹Xᵀy gives that exact minimizer in a single step, with no iteration needed (in practice, solvers work with the linear system XᵀXβ = Xᵀy rather than forming the inverse explicitly).
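The scalar and matrix formulas above give the same answer, which is easy to verify numerically; a minimal sketch on synthetic xG data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: goals = 0.15 + 0.9 * xG + noise (numbers illustrative)
xg = rng.uniform(0.5, 3.0, 200)
goals = 0.15 + 0.9 * xg + rng.normal(0.0, 0.3, 200)

# Scalar formulas: beta1 = Cov(X, Y) / Var(X), beta0 = ybar - beta1 * xbar
beta1 = np.cov(xg, goals, ddof=0)[0, 1] / np.var(xg)
beta0 = goals.mean() - beta1 * xg.mean()

# Matrix form: beta = (X^T X)^-1 X^T y, with a column of 1s for the intercept
X = np.column_stack([np.ones_like(xg), xg])
beta = np.linalg.solve(X.T @ X, X.T @ goals)

print(beta0, beta1)   # scalar solution
print(beta)           # matrix solution: the same [beta0, beta1]
```

Solving the linear system with np.linalg.solve is numerically safer than computing the inverse directly.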
Football Example

Predicting Goals from xG:

Goals = 0.15 + 0.92 × xG

Interpretation: For every 1.0 increase in xG, we expect 0.92 more goals. The intercept (0.15) is the baseline when xG = 0.

Interpreting Coefficients

Each coefficient tells you:

Sign: Positive = increases output, Negative = decreases
Magnitude: How much the output changes per unit input
Measuring Model Fit: R² (Coefficient of Determination)
R² = 1 - SS_res/SS_tot = 1 - Σᵢ(yᵢ - ŷᵢ)² / Σᵢ(yᵢ - ȳ)²
R² = 0.0 → model no better than predicting the mean
R² = 0.5 → explains 50% of the variance
R² = 1.0 → perfect fit
Statistical Assumptions (Gauss-Markov)
1. Linearity: E[Y|X] = Xβ (relationship is actually linear)
2. Homoscedasticity: Var(εᵢ) = σ² (constant error variance)
3. Independence: Cov(εᵢ, εⱼ) = 0 for i ≠ j (errors are uncorrelated)
4. No multicollinearity: Features are not perfectly correlated
5. Normality: ε ~ N(0, σ²) (for inference/confidence intervals)
Strengths
Highly interpretable — every coefficient has meaning
Fast to train, even on large datasets
Works well when relationships are actually linear
Good baseline to compare other models against
Weaknesses
Can't capture non-linear relationships
Sensitive to outliers
Assumes errors are normally distributed
Can predict impossible values (negative goals)
Understanding Distributions
Different data types require different assumptions

Before choosing a regression model, you need to understand what type of data you're predicting. The distribution of your target variable determines which regression technique is appropriate.

[Figure: Common distributions in football analytics. Normal (Gaussian): goal difference, player ratings. Poisson: goals per match, shots, corners. Binomial/Bernoulli: win/loss outcomes, BTTS, clean sheets. Choose by target type: continuous → Normal, count data → Poisson, binary → Binomial/Logistic.]
Probability Density/Mass Functions
Normal Distribution
f(x) = (1/√(2πσ²)) × e^(-(x-μ)²/(2σ²))
μ = mean, σ² = variance
E[X] = μ, Var(X) = σ²
Poisson Distribution
P(X = k) = (λᵏ × e⁻λ) / k!
λ = rate parameter
E[X] = λ, Var(X) = λ
Bernoulli Distribution
P(X = k) = pᵏ(1-p)¹⁻ᵏ, k ∈ {0,1}
p = probability of success
E[X] = p, Var(X) = p(1-p)
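These mean and variance identities can be sanity-checked with scipy.stats (the parameter values below are illustrative):

```python
from scipy.stats import norm, poisson, bernoulli

# Normal: E[X] = mu, Var(X) = sigma^2  (e.g. goal difference)
mu, sigma = 0.0, 1.3
print(norm.mean(loc=mu, scale=sigma), norm.var(loc=mu, scale=sigma))   # ≈ 0.0, 1.69

# Poisson: mean and variance are both lambda (equidispersion), e.g. goals per match
lam = 1.5
print(poisson.mean(lam), poisson.var(lam))    # 1.5, 1.5

# Bernoulli: E[X] = p, Var(X) = p(1 - p)  (e.g. a win/no-win outcome)
p = 0.45
print(bernoulli.mean(p), bernoulli.var(p))    # ≈ 0.45, 0.2475
```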
Normal (Gaussian)

Continuous values that can be positive or negative, clustered around a mean.

Examples:
  • Goal difference (-5 to +5)
  • xG difference
  • Player rating changes
Use: Linear Regression
Poisson

Count data — discrete, non-negative integers representing "how many."

Examples:
  • Goals scored (0, 1, 2, 3...)
  • Shots on target
  • Corners, fouls
Use: Poisson Regression
Binomial/Bernoulli

Binary outcomes — yes/no, win/lose, happened/didn't happen.

Examples:
  • Win or not win
  • Both teams score (BTTS)
  • Over/Under 2.5 goals
Use: Logistic Regression
Why Does This Matter?

Using the wrong distribution leads to poor predictions and invalid confidence intervals. Linear regression assumes errors are normally distributed — if you use it for count data (goals), you might predict negative goals or fractional values that don't make sense.

Logistic Regression: Predicting Probabilities
When your target is binary (yes/no, win/lose)

Logistic regression is the go-to model for binary classification. Despite the name, it's used for classification, not regression of continuous values. It outputs a probability between 0 and 1.

[Figure: The sigmoid curve. P(Home Win) plotted against xG difference (home minus away); observed outcomes sit at y = 1 (home win) and y = 0 (no win), and the output is always between 0 and 1.]
The Mathematics: From Odds to Probability
1. Start with Odds (not probability)
Odds = P(Y=1) / P(Y=0) = p / (1-p)
Odds of 2:1 means twice as likely to win as lose
2. Model the Log-Odds (Logit) as Linear
logit(p) = log(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + ...
Log-odds can range from -∞ to +∞, perfect for linear models
3. Invert to Get Probability (Sigmoid Function)
p = σ(z) = 1 / (1 + e⁻ᶻ)
where z = β₀ + β₁x₁ + ...
4. Properties of the Sigmoid
σ(-∞) = 0
σ(0) = 0.5
σ(+∞) = 1
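These properties, and the fact that the sigmoid inverts the logit, are quick to verify in NumPy (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

print(sigmoid(-20.0))   # ~0 for very negative z
print(sigmoid(0.0))     # exactly 0.5
print(sigmoid(20.0))    # ~1 for very positive z

# sigmoid and logit are inverses: probability -> log-odds -> probability
p = 0.62
print(sigmoid(logit(p)))   # ≈ 0.62
```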
Maximum Likelihood Estimation (MLE)

Unlike OLS, logistic regression uses MLE — we find parameters that maximize the probability of observing our data:

L(β) = Πᵢ p(xᵢ)^yᵢ × (1-p(xᵢ))^(1-yᵢ)

Taking the log (for computational stability):

ℓ(β) = Σᵢ [yᵢ log(p(xᵢ)) + (1-yᵢ) log(1-p(xᵢ))]

This log-likelihood is the negative of the cross-entropy loss, so maximizing it is equivalent to minimizing cross-entropy. There is no closed-form solution; it is optimized iteratively with gradient methods or Newton-Raphson.
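As a sketch of that optimization: the gradient of ℓ(β) is Xᵀ(y − p), so a few lines of plain gradient ascent recover the coefficients on synthetic data (all data and settings below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: intercept + one feature, true beta = [-0.5, 0.8]
n = 500
X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
true_beta = np.array([-0.5, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Gradient ascent on the log-likelihood; the gradient is X^T (y - p)
beta = np.zeros(2)
lr = 0.5
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += lr * X.T @ (y - p) / n   # averaged gradient keeps the step size stable

print(beta)   # close to the true coefficients [-0.5, 0.8]
```

In practice you would use scikit-learn or statsmodels, which apply faster second-order solvers.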

Football Example

Predicting Home Win:

Features: xG_diff, form_diff, h2h_record
log(P/(1-P)) = -0.5 + 0.8×xG_diff + 0.3×form_diff

If xG_diff = 0.5 and form_diff = 1.0, then z = -0.5 + 0.8×0.5 + 0.3×1.0 = 0.2, so the model outputs P(Home Win) = σ(0.2) ≈ 0.55

Interpreting Coefficients

Coefficients are in log-odds (not probabilities):

Positive: Increases probability of outcome
exp(β): Odds ratio — how much odds multiply per unit increase
If β = 0.8, then exp(0.8) = 2.23 — odds more than double per unit xG_diff
Multinomial Logistic (Softmax) for 1X2

For Home/Draw/Away prediction, we extend to K classes:

P(Y = k) = exp(zₖ) / Σⱼ exp(zⱼ)
where zₖ = β₀ₖ + β₁ₖx₁ + ... (one set of coefficients per class)
The softmax function ensures all probabilities sum to 1: P(Home) + P(Draw) + P(Away) = 1
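A minimal softmax sketch for 1X2 (the scores zₖ here are made up):

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical linear scores z_k for Home / Draw / Away
z = np.array([0.9, 0.1, -0.4])
probs = softmax(z)

for outcome, p in zip(["Home", "Draw", "Away"], probs):
    print(f"P({outcome}) = {p:.3f}")
print(f"Sum = {probs.sum():.3f}")   # always 1.000
```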
Poisson Regression: Predicting Counts
When your target is count data (goals, shots, etc.)

Poisson regression is designed for count data — non-negative integers like goals scored. It's the foundation of many football prediction models, especially for predicting scorelines.

[Figure: Poisson distribution of goals per match with λ = 1.5. P(0) = 22.3%, P(1) = 33.5%, P(2) = 25.1%, P(3) = 12.6%, P(4) = 4.7%, P(5+) ≈ 1.9%.]
The Mathematics: Generalized Linear Model
1. The Poisson Distribution
P(Y = k) = (λᵏ × e⁻λ) / k!
λ is both the mean and variance (equidispersion)
2. The Log Link Function
log(λ) = β₀ + β₁x₁ + β₂x₂ + ...
Ensures λ > 0 since λ = exp(Xβ)
3. Interpreting Coefficients
exp(βᵢ) = multiplicative effect on expected count
If β₁ = 0.3, then exp(0.3) = 1.35 → each unit increase in x₁ multiplies expected goals by 1.35
4. MLE for Poisson
ℓ(β) = Σᵢ [yᵢ log(λᵢ) - λᵢ - log(yᵢ!)]
= Σᵢ [yᵢ(Xᵢβ) - exp(Xᵢβ)] + const
The Dixon-Coles Model (1997)

The classic football prediction model extends basic Poisson:

log(λ_home) = μ + home_advantage + α_home - β_away
log(λ_away) = μ + α_away - β_home

Where α = attack strength, β = defense strength. Also includes a ρ parameter to adjust for correlation in low-scoring matches.
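Given fitted parameters, the expected goal rates come straight from exponentiating the linear predictors; a small sketch with hypothetical values (team labels and all numbers are purely illustrative):

```python
import numpy as np

# Hypothetical fitted parameters (purely illustrative)
mu = 0.1           # baseline log goal rate
home_adv = 0.25    # home-advantage term
alpha = {"TeamA": 0.30, "TeamB": 0.05}    # attack strengths
beta = {"TeamA": 0.20, "TeamB": -0.10}    # defense strengths

home, away = "TeamA", "TeamB"
lam_home = np.exp(mu + home_adv + alpha[home] - beta[away])
lam_away = np.exp(mu + alpha[away] - beta[home])

print(f"Expected goals: home {lam_home:.2f}, away {lam_away:.2f}")   # home 2.12, away 0.95
```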

Classic Football Model

Independent Poisson Model:

log(λ_home) = μ + attack_home + defense_away
log(λ_away) = μ + attack_away + defense_home

Each team's goals are modeled separately from its attack rating and the opponent's defense rating (parameterized here as defensive weakness, so a leakier defense increases expected goals; sign conventions vary between sources). This is the basis of many betting models.

From λ to Scoreline Probabilities

Once you have λ_home and λ_away:

P(Home=2, Away=1) = P(Home=2) × P(Away=1)
= Poisson(2; λ_home) × Poisson(1; λ_away)

Calculate P for each scoreline (0-0, 1-0, 0-1, ...), then sum to get match outcome probabilities.

Limitation: Independence Assumption

Basic Poisson regression assumes home and away goals are independent. In reality, they're often correlated (high-scoring games, defensive games). Advanced models use bivariate Poisson or copulas to model this correlation.

Regularized Regression
Preventing overfitting when you have many features

When you have many features (especially correlated ones), linear regression can overfit — the model becomes too tailored to training data and performs poorly on new data. Regularization adds a penalty for complex models.

Ridge (L2)

Adds a penalty proportional to the square of coefficient values.

Loss = Σᵢ(yᵢ - ŷᵢ)² + α × Σⱼ(βⱼ²)
Effect: Shrinks coefficients toward zero but never exactly zero. Good for correlated features.
Closed form: β = (XᵀX + αI)⁻¹Xᵀy
Lasso (L1)

Adds a penalty proportional to the absolute value of coefficients.

Loss = Σᵢ(yᵢ - ŷᵢ)² + α × Σⱼ|βⱼ|
Effect: Can shrink coefficients to exactly zero — automatic feature selection!
No closed form: Requires iterative optimization (coordinate descent)
Elastic Net

Combines Ridge and Lasso penalties. Best of both worlds.

Loss = MSE + α₁Σ|βⱼ| + α₂Σ(βⱼ²)
Effect: Feature selection + handles correlated features well.
Mixing param: ρ ∈ [0,1] controls L1/L2 balance
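The contrast between the two penalties is easy to see on synthetic data: Ridge shrinks every coefficient, while Lasso zeroes out irrelevant ones. A minimal scikit-learn sketch (data and alpha values are illustrative), which also verifies the Ridge closed form:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)

# Synthetic data: 5 features, only the first two actually matter
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print("Ridge:", ridge.coef_.round(2))   # all shrunk, none exactly zero
print("Lasso:", lasso.coef_.round(2))   # irrelevant features driven to exactly 0

# Ridge closed form (no intercept for simplicity): beta = (X^T X + alpha*I)^-1 X^T y
ridge_nc = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
beta = np.linalg.solve(X.T @ X + 10.0 * np.eye(5), X.T @ y)
print(np.allclose(beta, ridge_nc.coef_))   # True
```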
The Bias-Variance Tradeoff
E[(y - ŷ)²] = Bias(ŷ)² + Var(ŷ) + σ²
Bias²
Error from wrong assumptions (underfitting)
Variance
Error from sensitivity to training data (overfitting)
σ² (Irreducible)
Noise in the data we can't reduce

Regularization increases bias but decreases variance. The optimal α balances these — found via cross-validation.
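Choosing α by cross-validation can be a one-liner with scikit-learn's RidgeCV (the grid and data below are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)

# Synthetic data: 8 features, only two carry signal (illustrative)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 1.0, 300)

# RidgeCV tries every alpha on the grid and keeps the best by cross-validation
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("Best alpha:", model.alpha_)
```

For time-ordered football data, prefer a time-based split (as in the logistic example later) over random folds.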

When to Use Regularization
Many features relative to number of samples
Correlated features (multicollinearity)
Want to identify most important features
Model is overfitting (high train, low test performance)
Other Regression Types
Specialized models for specific situations
Negative Binomial

Like Poisson but handles overdispersion (variance > mean). More flexible for count data.

Var(Y) = μ + μ²/r (r = dispersion)
Use when: Poisson doesn't fit well, variance is higher than expected (e.g., goals in cup matches)
Ordinal Regression

For ordered categories (e.g., lose/draw/win, ratings 1-5). Uses cumulative link functions.

logit(P(Y ≤ k)) = αₖ - Xβ
Use when: Categories have a natural order but distances between them are unknown
Quantile Regression

Predicts specific percentiles (median, 90th percentile) rather than the mean.

Loss = Σᵢ ρτ(yᵢ - ŷᵢ) where ρτ = check function
Use when: You care about extreme scenarios or the distribution is skewed
Zero-Inflated Models

For count data with excess zeros. Two-component mixture model.

P(Y=0) = π + (1-π)×Poisson(0;λ)
Use when: Many zeros in data (e.g., goals scored by a goalkeeper)
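The excess-zero mechanism in that mixture can be sketched directly (π and λ below are made-up values):

```python
from scipy.stats import poisson

pi, lam = 0.3, 1.2   # illustrative: 30% structural zeros, Poisson rate 1.2

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: extra point mass at zero, scaled Poisson elsewhere."""
    base = (1 - pi) * poisson.pmf(k, lam)
    return pi + base if k == 0 else base

print(f"P(Y=0) inflated: {zip_pmf(0, pi, lam):.3f}")   # ≈ 0.511
print(f"P(Y=0) plain:    {poisson.pmf(0, lam):.3f}")   # ≈ 0.301
total = sum(zip_pmf(k, pi, lam) for k in range(50))
print(f"Total mass: {total:.6f}")                       # ≈ 1.0, still a valid distribution
```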
Evaluating Regression Models
How do you know if your model is good?
For Continuous Targets
R² (Coefficient of Determination)
R² = 1 - SS_res/SS_tot
How much variance the model explains. 1.0 = perfect, 0 = no better than mean.
RMSE (Root Mean Squared Error)
RMSE = √(Σᵢ(yᵢ-ŷᵢ)²/n)
Average error magnitude in original units. Lower = better.
MAE (Mean Absolute Error)
MAE = Σᵢ|yᵢ-ŷᵢ|/n
Average absolute error. Less sensitive to outliers than RMSE.
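All three metrics are one-liners with scikit-learn (the predictions below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Actual vs. predicted goal differences (illustrative)
y_true = np.array([1.0, -2.0, 0.0, 3.0, 1.0])
y_pred = np.array([0.5, -1.0, 0.5, 2.0, 1.5])

print(f"R2:   {r2_score(y_true, y_pred):.3f}")                       # 0.792
print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.3f}")    # 0.742
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")            # 0.700
```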
For Classification (Logistic)
Log Loss (Cross-Entropy)
L = -Σᵢ[yᵢlog(pᵢ) + (1-yᵢ)log(1-pᵢ)]
Measures probability calibration. Lower = better calibrated probabilities.
AUC-ROC
Area under ROC curve. Measures ranking ability. 1.0 = perfect ranking, 0.5 = random.
Brier Score
BS = Σᵢ(pᵢ - yᵢ)²/n
MSE for probabilities. Combines calibration and discrimination.
Football-Specific: Betting Metrics

For betting applications, also track ROI (return on investment), Closing Line Value (did you beat the closing odds?), and calibration plots (do 70% predictions actually happen 70% of the time?).

Application to Football
Practical examples with code
Common Use Cases
Poisson: Predict scorelines, goal totals
Logistic: Win/lose, BTTS, Over/Under
Multinomial: 1X2 match outcome
Linear: xG prediction, player ratings
Feature Ideas
Rolling xG averages (home/away)
Form (points per game last N matches)
Head-to-head historical record
Rest days, travel distance
Market odds (as a feature)
Example: Logistic Regression for Home Win
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np

# Features (calculated BEFORE each match)
features = ['home_xg_avg_5', 'away_xg_avg_5',
            'home_form_5', 'away_form_5',
            'xg_diff', 'elo_diff']

X = df[features]
y = df['home_win']  # 1 = home win, 0 = not home win

# Time-based split
tscv = TimeSeriesSplit(n_splits=5)

# Logistic Regression with regularization
model = LogisticRegression(
    C=1.0,              # Inverse of regularization strength
    penalty='l2',        # Ridge regularization
    max_iter=1000
)

# Train and evaluate
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    
    print(f"Log Loss: {log_loss(y_test, probs):.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_test, probs):.4f}")

# Interpret coefficients
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:.3f} (odds ratio: {np.exp(coef):.2f})")
Example: Poisson Regression for Goals
import statsmodels.api as sm
from scipy.stats import poisson
import numpy as np

# Prepare data: each row is one team in one match
# Features: attack strength, opponent defense strength
X = df[['attack_rating', 'opp_defense_rating', 'is_home']]
X = sm.add_constant(X)
y = df['goals_scored']

# Fit Poisson regression (GLM with log link)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

print(results.summary())

# Predict expected goals for a match
home_features = [1, 1.2, 0.9, 1]  # const, attack, opp_def, is_home
away_features = [1, 1.0, 1.1, 0]

lambda_home = np.exp(np.dot(home_features, results.params))
lambda_away = np.exp(np.dot(away_features, results.params))

print(f"Expected goals - Home: {lambda_home:.2f}, Away: {lambda_away:.2f}")

# Calculate scoreline probabilities
max_goals = 6
for h in range(max_goals + 1):
    for a in range(max_goals + 1):
        prob = poisson.pmf(h, lambda_home) * poisson.pmf(a, lambda_away)
        if prob > 0.01:  # Only show likely scorelines
            print(f"{h}-{a}: {prob:.1%}")
Key Takeaways
1
Match distribution to target type

Continuous → Linear, Binary → Logistic, Counts → Poisson

2
Coefficients are interpretable

Linear: direct effect, Logistic: log-odds, Poisson: log-rate

3
Use regularization for many features

Ridge, Lasso, or Elastic Net prevent overfitting

4
Great baselines for football

Poisson for goals, logistic for outcomes — simple but effective