Probaballer - Football Analytics & Betting Insights

Premier League: Bivariate Poisson + Dixon-Coles (v5)

Building independent match odds using bivariate Poisson, rolling xG, ClubElo ratings, form (PPG), shots-on-target data, and XGBoost ensemble validation.

Overview

This model generates independent 1X2 (home / draw / away) probabilities for every upcoming Premier League fixture. Rather than relying on bookmaker odds, we build our own probability estimates from first principles using a Bivariate Poisson goal-scoring framework (Karlis & Ntzoufras, 2003) with the Dixon-Coles low-score correction. The inputs are rolling expected goals (xG) averages, ClubElo ratings, decay-weighted form (PPG), shots-on-target data from football-data.co.uk, and season possession percentages from FBref.

The system runs as a Python pipeline that writes predictions to a Postgres database, served to the frontend via an API endpoint and displayed alongside bookmaker odds in the fixtures UI — letting us see exactly where our model disagrees with the market.

Data Sources

Four complementary data sources feeding the prediction model

FotMob Match Data

Primary

Schedule and match statistics CSVs covering three full Premier League seasons:

• 2023–24: 380 matches (extra historical depth)
• 2024–25: 380 matches (historical context for promoted/relegated teams)
• 2025–26: 380 matches (220 completed + 160 upcoming)
• Key metric: Expected Goals (xG) per team per match

ClubElo Ratings

Secondary

Daily Elo ratings for all 20 Premier League teams, capturing long-term strength and form.

• Range: ~1677 (promoted sides) to ~2064 (title contenders)
• Pulled via soccerdata Python library
• Captures factors xG misses: squad depth, transfers, motivation

Football-Data.co.uk

Tertiary

Historical match statistics from football-data.co.uk covering three seasons (1,000+ matches, 120 columns each):

• Full-Time Result (FTR) — walk-forward target variable
• Match Dates — form windows & rest-day calculation
• Closing Odds (B365H/D/A, MaxH/D/A) — market blend & Kelly staking

FBref Season Stats

Quaternary

Season-level team statistics from FBref (StatsBomb-powered), pulled via soccerdata:

• Possession % — game control indicator
• Teams with higher possession dominate territory and create more chances
• Range: ~39% (defensive teams) to ~61% (possession-dominant sides)

Team Name Mapping Challenge

A persistent headache in football data: every source names teams differently. Our pipeline maintains explicit mappings between ClubElo, FotMob, football-data.co.uk, FBref, and The Odds API names. On the frontend, fuzzy matching (token overlap + Levenshtein similarity ≥ 0.75) handles the remaining discrepancies.

ClubElo	FotMob	FD.co.uk	FBref
Man City	Manchester City	Man City	Manchester City
Forest	Nottingham Forest	Nott'm Forest	Nott'ham Forest
Bournemouth	AFC Bournemouth	Bournemouth	Bournemouth
Man United	Manchester United	Man United	Manchester Utd
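The fuzzy-matching rule described above can be sketched in Python. This is an illustration only: the function names, the use of Jaccard overlap for tokens, and the choice to accept a pair when either signal reaches the threshold are assumptions, not the site's actual frontend code. Only the 0.75 threshold comes from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein similarity in [0, 1] on lower-cased names."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (assumed definition)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Accept the pair if either signal reaches the threshold (assumption)."""
    return max(similarity(a, b), token_overlap(a, b)) >= threshold


print(is_match("Nott'm Forest", "Nott'ham Forest"))  # True
```

Pairs such as "Man City" and "Manchester City" fall below the threshold on both measures, which is why the explicit mapping table is still needed.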

Model Architecture

13 steps from raw data to decimal odds

Step 1: Build Match-xG Table

For every completed match, join the FotMob schedule with top-stats data to create a row containing home/away teams, actual goals, and expected goals (xG). When xG is unavailable, fall back to actual goals.
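A minimal sketch of this join, using plain dictionaries. The field names (match_id, home_goals, home_xg, and so on) and the sample rows are made up for illustration; the real FotMob CSV columns are not shown in this article.

```python
def build_match_xg_table(schedule, top_stats):
    """Join completed fixtures with their xG, falling back to goals scored.

    schedule:  list of dicts with hypothetical keys match_id, home, away,
               home_goals, away_goals (goals are None for unplayed fixtures).
    top_stats: dict mapping match_id -> {"home_xg": ..., "away_xg": ...};
               whole entries or individual values may be missing.
    """
    rows = []
    for match in schedule:
        if match["home_goals"] is None or match["away_goals"] is None:
            continue  # upcoming fixture: nothing to learn from yet
        stats = top_stats.get(match["match_id"], {})
        home_xg = stats.get("home_xg")
        away_xg = stats.get("away_xg")
        rows.append({
            "home": match["home"],
            "away": match["away"],
            "home_goals": match["home_goals"],
            "away_goals": match["away_goals"],
            # Fall back to actual goals when xG is unavailable.
            "home_xg": float(match["home_goals"]) if home_xg is None else home_xg,
            "away_xg": float(match["away_goals"]) if away_xg is None else away_xg,
        })
    return rows


# Illustrative data only.
schedule = [
    {"match_id": 1, "home": "Team A", "away": "Team B", "home_goals": 2, "away_goals": 0},
    {"match_id": 2, "home": "Team C", "away": "Team D", "home_goals": 1, "away_goals": 1},
    {"match_id": 3, "home": "Team E", "away": "Team F", "home_goals": None, "away_goals": None},
]
top_stats = {1: {"home_xg": 1.9, "away_xg": 0.6}}  # match 2 has no xG recorded
table = build_match_xg_table(schedule, top_stats)
```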

Step 2: Compute Rolling xG Strength

For each team, compute rolling 8-match xG averages across both home and away appearances:

• Attack: average xG created
• Defence: average xG conceded

A window of 8 matches captures current form: a team's last two months are more predictive than its September results. Teams with fewer than 3 matches get league-average defaults.
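The rolling calculation can be sketched as follows. It assumes matches arrive in chronological order as dictionaries with hypothetical keys home, away, home_xg and away_xg, and that "league average" means the mean xG per team per match over all supplied matches. The window and minimum-match values are the ones quoted in this step.

```python
from collections import defaultdict


def rolling_xg_strength(matches, window=8, min_matches=3):
    """Return ({team: {"attack": ..., "defence": ...}}, league_avg)."""
    created = defaultdict(list)   # xG a team generated, home and away combined
    conceded = defaultdict(list)  # xG a team allowed
    for m in matches:
        created[m["home"]].append(m["home_xg"])
        conceded[m["home"]].append(m["away_xg"])
        created[m["away"]].append(m["away_xg"])
        conceded[m["away"]].append(m["home_xg"])

    all_xg = [x for values in created.values() for x in values]
    league_avg = sum(all_xg) / len(all_xg) if all_xg else 0.0

    strength = {}
    for team, xg_for in created.items():
        if len(xg_for) < min_matches:
            # Too little data: fall back to league-average defaults.
            strength[team] = {"attack": league_avg, "defence": league_avg}
            continue
        recent_for = xg_for[-window:]
        recent_against = conceded[team][-window:]
        strength[team] = {
            "attack": sum(recent_for) / len(recent_for),
            "defence": sum(recent_against) / len(recent_against),
        }
    return strength, league_avg
```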

Step 3: Load Football-Data.co.uk Match Stats

Download match data from football-data.co.uk for both seasons (690+ matches, 120 columns each). Compute rolling 8-match averages for:

• SoT For: average shots on target created
• SoT Against: average shots on target conceded

Shots on target provide a complementary signal to xG — a team generating many SoT is creating real danger even when xG models disagree.

Step 4: Fetch FBref Season Possession

Pull season-level Possession % from FBref for all 20 teams. Higher possession indicates greater game control and territory dominance.

Range: ~39% (deep-defending sides) to ~61% (possession-dominant teams like Man City/Chelsea/Liverpool). Used as a multiplier on lambda to reward teams that control the ball.

Step 5: Calculate Expected Goals (λ)

λ_home = avg_league_home_xG × (atk_home / avg_xG) × (def_away / avg_xG)

λ_away = avg_league_away_xG × (atk_away / avg_xG) × (def_home / avg_xG)

Three intuitive components: (1) league baseline encoding home advantage, (2) how strong this team's attack is relative to average, (3) how leaky the opponent's defence is.
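The two formulas translate directly into code. The sketch below assumes avg_xG is the league-wide mean xG per team per match and that the attack and defence inputs are the rolling averages from Step 2; the argument names are illustrative.

```python
def expected_goals(atk_home, def_home, atk_away, def_away,
                   avg_xg, avg_league_home_xg, avg_league_away_xg):
    """Return (lambda_home, lambda_away) before any corrections."""
    lam_home = avg_league_home_xg * (atk_home / avg_xg) * (def_away / avg_xg)
    lam_away = avg_league_away_xg * (atk_away / avg_xg) * (def_home / avg_xg)
    return lam_home, lam_away


# Illustrative inputs: a strong home attack against a leaky away defence.
lam_home, lam_away = expected_goals(
    atk_home=1.8, def_home=1.1, atk_away=1.2, def_away=1.6,
    avg_xg=1.45, avg_league_home_xg=1.6, avg_league_away_xg=1.3,
)
```

When both teams are exactly league-average, the ratios are 1 and each λ collapses to its league baseline, which is how home advantage enters the model.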

Step 6: Elo + SoT + Possession Corrections

λ *= 1 + (elo_home − elo_away) / 3000

λ *= 1 + 0.15 × (sot_atk_ratio × opp_sot_def_ratio − 1)

λ *= 1 + 0.10 × (poss / avg_poss − 1)

Three multiplicative corrections on lambda: (1) Elo captures squad quality, (2) SoT ratio rewards clinical finishing, (3) possession ratio rewards game control. All lambdas are clamped to [0.3, 4.5].
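A sketch of the three corrections applied in sequence. The default constants mirror the formulas in this step and are exposed as parameters so tuned values can be swapped in. Treating elo_diff as "this team's Elo minus the opponent's" (so the sign flips for the away side) is an assumption, since the text only shows the home-minus-away form.

```python
def apply_corrections(lam, elo_diff, sot_atk_ratio, opp_sot_def_ratio,
                      poss, avg_poss,
                      elo_divisor=3000.0, sot_weight=0.15, poss_weight=0.10,
                      lam_min=0.3, lam_max=4.5):
    """Scale a raw lambda by Elo, shots-on-target and possession factors."""
    lam *= 1 + elo_diff / elo_divisor
    lam *= 1 + sot_weight * (sot_atk_ratio * opp_sot_def_ratio - 1)
    lam *= 1 + poss_weight * (poss / avg_poss - 1)
    # Clamp so extreme inputs cannot produce implausible goal expectations.
    return min(max(lam, lam_min), lam_max)


# A 300-point Elo advantage with otherwise neutral inputs adds 10%.
boosted = apply_corrections(1.0, elo_diff=300, sot_atk_ratio=1.0,
                            opp_sot_def_ratio=1.0, poss=50.0, avg_poss=50.0)
```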

Step 7: Poisson Score Matrix

Construct a 9×9 score probability matrix (max 8 goals per side). Each cell's base probability is the product of independent Poisson distributions:

P(i, j) = Poisson(i; λ_home) × Poisson(j; λ_away) × DC(i, j)

The independent Poisson assumption dates back to Maher (1982) and remains a strong baseline for football.
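The independent part of the matrix can be built with the standard library alone. This sketch covers only the product of the two Poisson terms and leaves the DC(i, j) factor at 1; function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def score_matrix(lam_home: float, lam_away: float, max_goals: int = 8):
    """matrix[i][j] = P(home scores i) * P(away scores j), 0..max_goals each."""
    home = [poisson_pmf(i, lam_home) for i in range(max_goals + 1)]
    away = [poisson_pmf(j, lam_away) for j in range(max_goals + 1)]
    return [[ph * pa for pa in away] for ph in home]


matrix = score_matrix(1.5, 1.2)
total = sum(sum(row) for row in matrix)  # slightly below 1: scores above 8 are cut off
```

With typical λ values the probability mass beyond 8 goals per side is negligible, which is why a 9×9 grid is enough.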

Step 8: Dixon-Coles Correction (ρ = −0.13)

The independent Poisson assumption slightly misprices low-scoring outcomes. Dixon & Coles (1997) introduced correction factors for the four lowest scorelines:

Score	Correction	Effect
0 – 0	1 − λ_h × λ_a × ρ	↑ more likely
1 – 0	1 + λ_a × ρ	↓ less likely
0 – 1	1 + λ_h × ρ	↓ less likely
1 – 1	1 − ρ	↑ more likely
Other	1.0	unchanged

This reflects the empirical observation that in tight, defensive games, draws become more likely than the independent model predicts. Matrix renormalised to sum to 1 after correction.
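The correction table maps onto a small lookup function followed by a renormalisation. In this sketch rho is passed explicitly, the helper that builds the uncorrected matrix is included only to make the example self-contained, and all names are illustrative.

```python
import math


def dixon_coles_factor(i, j, lam_h, lam_a, rho):
    """Correction for scoreline i-j (home goals i, away goals j)."""
    if i == 0 and j == 0:
        return 1 - lam_h * lam_a * rho
    if i == 1 and j == 0:
        return 1 + lam_a * rho
    if i == 0 and j == 1:
        return 1 + lam_h * rho
    if i == 1 and j == 1:
        return 1 - rho
    return 1.0


def apply_dixon_coles(matrix, lam_h, lam_a, rho):
    """Multiply each cell by its factor, then renormalise to sum to 1."""
    adjusted = [[p * dixon_coles_factor(i, j, lam_h, lam_a, rho)
                 for j, p in enumerate(row)]
                for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in adjusted)
    return [[p / total for p in row] for row in adjusted]


def independent_matrix(lam_h, lam_a, max_goals=8):
    """Uncorrected independent-Poisson grid, for demonstration only."""
    pmf = lambda k, lam: math.exp(-lam) * lam ** k / math.factorial(k)
    return [[pmf(i, lam_h) * pmf(j, lam_a) for j in range(max_goals + 1)]
            for i in range(max_goals + 1)]


raw = independent_matrix(1.4, 1.1)
corrected = apply_dixon_coles(raw, 1.4, 1.1, rho=-0.13)
```

With a negative ρ the mass moved onto 0–0 and 1–1 exactly offsets the mass removed from 1–0 and 0–1, so the draw probability rises while the home and away totals shrink.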

Step 9: Extract 1X2 Probabilities

Sum regions of the score matrix: home win (below diagonal), draw (diagonal), away win (above diagonal). A probability floor of 2% is applied per outcome, then renormalised.
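Summing the three regions and applying the floor looks roughly like this. The sketch assumes rows index home goals and columns index away goals, matching P(i, j) above. Because the floor is applied before renormalising, a floored outcome can end up marginally below 2%.

```python
def outcome_probabilities(matrix, floor=0.02):
    """Return (p_home, p_draw, p_away) from a score matrix."""
    home = draw = away = 0.0
    for i, row in enumerate(matrix):      # i = home goals
        for j, p in enumerate(row):       # j = away goals
            if i > j:
                home += p                 # below the diagonal
            elif i == j:
                draw += p                 # on the diagonal
            else:
                away += p                 # above the diagonal
    floored = [max(p, floor) for p in (home, draw, away)]
    total = sum(floored)
    return tuple(p / total for p in floored)


# Tiny 2x2 example: rows are home goals 0..1, columns are away goals 0..1.
p_home, p_draw, p_away = outcome_probabilities([[0.25, 0.25],
                                                [0.50, 0.00]])
```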

Step 10: Bivariate Poisson (v5)

Instead of modelling goals independently, the Bivariate Poisson (Karlis & Ntzoufras, 2003) decomposes home goals X and away goals Y as: X = X₁ + X₃, Y = X₂ + X₃, where X₃ ~ Poi(λ₃) is a shared component capturing positive goal correlation.

X₁ ~ Poi(λ₁), X₂ ~ Poi(λ₂), X₃ ~ Poi(λ₃)

Cov(X, Y) = λ₃ = 0.08

The shared component naturally increases draw scoreline probabilities without the flat inflation of v4. When one team scores, conditions that led to the goal (open play, fatigue) make it more likely the other team also scores.

Step 11: Form (PPG) Adjustment (v5)

Recent form captured via decay-weighted Points Per Game over each team’s last 5 matches, computed walk-forward from football-data.co.uk results:

λ *= 1 + FORM_WEIGHT × (team_form / avg_form − 1)

With FORM_WEIGHT = 0.12, a team in excellent form (PPG 2.5 vs avg 1.4) gets ~9% more expected goals. Form was the #1 most important feature in XGBoost validation (gain: 2.0–2.1).
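A sketch of the form adjustment. The multiplier follows the formula above; the decay weighting is an assumption about how "decay-weighted" is implemented (most recent match weighted 1, each older match multiplied by a further 0.85, borrowing DECAY_FACTOR from the hyper-parameter table).

```python
def decay_weighted_ppg(points, decay=0.85, last_n=5):
    """Weighted points per game over the most recent matches.

    points: chronological list of 3 / 1 / 0 results, oldest first.
    Returns None when there are no matches to average.
    """
    recent = points[-last_n:]
    if not recent:
        return None
    # Newest match has age 0 (weight 1); older matches decay geometrically.
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    return sum(w * p for w, p in zip(weights, recent)) / sum(weights)


def form_multiplier(team_form, avg_form, form_weight=0.12):
    """Factor applied to lambda: 1 + FORM_WEIGHT * (team_form / avg_form - 1)."""
    return 1 + form_weight * (team_form / avg_form - 1)


# The worked example from the text: PPG 2.5 against an average of 1.4.
boost = form_multiplier(2.5, 1.4)  # about 1.094, i.e. roughly 9% more expected goals
```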

Step 12: Power Shrinkage (v5)

Raw Poisson probabilities tend to be overconfident. Power shrinkage compresses probabilities toward the centre by raising each to a power α < 1 before renormalising:

p_H = p_H ^ SHRINK_POWER

p_D = p_D ^ SHRINK_POWER

p_A = p_A ^ SHRINK_POWER

With SHRINK_POWER = 0.90, a raw 60% / 20% / 20% split becomes roughly 57% / 21% / 21% after renormalising. The net effect reduces overconfidence on heavy favourites and increases the model’s willingness to predict draws when probabilities are close.
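Power shrinkage is a two-line operation: raise each probability to the power, then divide by the new total. A minimal sketch, with an illustrative function name:

```python
def power_shrink(p_home, p_draw, p_away, power=0.90):
    """Compress 1X2 probabilities toward each other, then renormalise."""
    raised = [p ** power for p in (p_home, p_draw, p_away)]
    total = sum(raised)
    return tuple(r / total for r in raised)


shrunk = power_shrink(0.60, 0.20, 0.20)
```

Because x ** 0.9 grows small values proportionally more than large ones, the favourite loses share and the outsiders gain it, while a perfectly even 1/3 split is left unchanged.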

Step 13: Convert to Decimal Odds

odds = 1 / probability

Displayed alongside bookmaker odds in the UI for direct comparison of where the model disagrees with the market.
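The conversion is a reciprocal, with no bookmaker margin added. A small sketch, where the guard against non-positive input is an addition for safety rather than something described in the article:

```python
def decimal_odds(probability: float) -> float:
    """Fair decimal odds for a probability in (0, 1]."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


# Home probability from the first row of the sample output.
home_odds = round(decimal_odds(0.649), 2)  # 1.54
```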

Hyper-Parameters

Parameter	Value	Description
ROLLING_WINDOW	10	Recent matches for xG averages
RHO	−0.10	Dixon-Coles correction parameter
ELO_DIVISOR	4000	~2.5% shift per 100 Elo
SOT_WEIGHT	0.25	Shots-on-target ratio influence on λ
DECAY_FACTOR	0.85	Exponential decay per match (half-life ≈ 4.3)
VENUE_WEIGHT	0.30	Blend of venue-specific vs overall rolling stats
DRAW_BOOST_MAX	0.08	Max draw boost for evenly-matched, low-scoring matchups
MARKET_WEIGHT	0.15	Blend with bookmaker implied probabilities
BVP_LAMBDA3	0.08	Bivariate Poisson covariance (structural draw correction) — v5
FORM_WEIGHT	0.12	Decay-weighted PPG form influence on λ — v5
SHRINK_POWER	0.90	Power shrinkage α<1 reduces overconfidence — v5
KELLY_FRAC	0.25	Quarter-Kelly for optimal bet sizing

Values selected via grid search (108 combinations) on ~389 tune matches across 2024–25 + 2025–26 seasons, then validated on ~260 holdout matches the search never saw. The 2023–24 season provides warm-up data for form calculations. This temporal split prevents overfitting hyperparameters to evaluation data.
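The temporal split can be sketched as a sort followed by a single cut. The 60% fraction matches the backtest description; the "date" key is a hypothetical field name, and any sortable value (ISO date strings, datetime objects) would work.

```python
def temporal_split(matches, tune_fraction=0.60):
    """Split chronologically: earliest matches for tuning, the rest held out."""
    ordered = sorted(matches, key=lambda m: m["date"])
    cut = int(len(ordered) * tune_fraction)
    return ordered[:cut], ordered[cut:]


# 649 placeholder matches, deliberately supplied out of order.
placeholder = [{"date": day} for day in range(648, -1, -1)]
tune, holdout = temporal_split(placeholder)
print(len(tune), len(holdout))  # 389 260
```

Cutting by date rather than at random is what keeps the grid search from seeing matches played after the ones it is evaluated on.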

Pipeline Architecture

FotMob CSVs ──┐
ClubElo API ──┤
FD.co.uk    ──┼──→ build_predictions.py ──→ Neon Postgres
FBref API   ──┘                                   │
                                                  ▼
                                /api/predictions (Next.js API route)
                                                  │
                                                  ▼
                          Fixtures UI (model odds alongside bookmaker odds)

The Python script runs nightly via GitHub Actions at 00:00 UTC, pulling fresh data from all four sources, generating predictions for every upcoming fixture, and writing them to the database. The API route queries the database with an optional date filter. The React frontend fetches predictions in parallel with fixture data and matches them via fuzzy team name comparison.

Sample Output

Match	Home % (odds)	Draw % (odds)	Away % (odds)
Man City vs Wolves	64.9% (1.54)	23.6% (4.23)	11.5% (8.65)
Arsenal vs Man Utd	38.0% (2.63)	28.1% (3.56)	33.9% (2.95)
Bournemouth vs Liverpool	25.1% (3.98)	24.7% (4.05)	50.2% (1.99)

Backtest Results

Walk-forward backtest on 649 completed matches across 2024–25 + 2025–26 seasons (2023–24 as warm-up). Hyperparameters tuned on first 60% (~389 matches), evaluated on ~260 holdout matches the grid search never saw.

Metric	v4 (holdout)	v5 Poisson	v5 + XGBoost
Accuracy	42.2%	49.6%	56.5%
Brier score	0.6561	0.6207	0.5374
Log loss	1.0796	1.0288	0.8997
Value-bet ROI (5% edge)	+28.6%	+26.7%	+88.0%
Value-bet P/L	+45.6u	+43.8u	+197.2u
Full walk-forward (649 matches)	53.5% accuracy • Brier 0.5785 • +42.8% ROI (+193.8u)
Kelly ROI (full 649)	+39.7% on 22.6u staked (+8.9u profit)
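For reference, the two calibration metrics can be computed as below. This sketch assumes the multi-class form of the Brier score (squared error summed over the three outcomes, then averaged over matches), which is consistent with the magnitudes in the table: a uniform 1/3 forecast scores about 0.667 on Brier and ln 3 ≈ 1.0986 on log loss.

```python
import math

OUTCOMES = ("H", "D", "A")  # home win, draw, away win


def brier_score(forecasts, results):
    """Mean over matches of the summed squared error across H/D/A."""
    total = 0.0
    for probs, result in zip(forecasts, results):
        for p, outcome in zip(probs, OUTCOMES):
            actual = 1.0 if outcome == result else 0.0
            total += (p - actual) ** 2
    return total / len(results)


def log_loss(forecasts, results, eps=1e-15):
    """Mean negative log-probability assigned to the actual result."""
    total = 0.0
    for probs, result in zip(forecasts, results):
        p = probs[OUTCOMES.index(result)]
        total -= math.log(min(max(p, eps), 1.0))  # clip to avoid log(0)
    return total / len(results)


uniform = [(1 / 3, 1 / 3, 1 / 3)] * 3
baseline_brier = brier_score(uniform, ["H", "D", "A"])
baseline_log_loss = log_loss(uniform, ["H", "D", "A"])
```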

On the holdout set, v5 improves on v4 in accuracy, Brier score, and log loss. The v5 Poisson model improves calibration and lifts accuracy from 42.2% to 49.6%, and the XGBoost ensemble raises accuracy further to 56.5%. Form (PPG) is the #1 most important feature (XGBoost gain: 2.0–2.1). Value-bet ROI is roughly flat for v5 Poisson alone (+26.7% vs +28.6% for v4) and rises to +88.0% with the ensemble.
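The value-bet and Kelly figures depend on how edge and stake are defined, which this article does not spell out. The sketch below assumes edge means expected profit per unit staked (model probability × decimal odds − 1) and uses the standard Kelly formula scaled by KELLY_FRAC = 0.25; both function names are illustrative.

```python
def edge(model_prob: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, according to the model (assumed definition)."""
    return model_prob * decimal_odds - 1.0


def kelly_stake(model_prob: float, decimal_odds: float, fraction: float = 0.25) -> float:
    """Fraction of bankroll to stake; 0 when the model sees no value."""
    b = decimal_odds - 1.0  # net winnings per unit staked
    full_kelly = (model_prob * b - (1.0 - model_prob)) / b
    return max(0.0, fraction * full_kelly)


# Model says 50%, bookmaker offers 2.20: a 10% edge, above a 5% threshold.
is_value_bet = edge(0.50, 2.20) >= 0.05
stake = kelly_stake(0.50, 2.20)
```

Staking a quarter of the full Kelly amount trades some expected growth for much lower variance, which matters when the probabilities themselves are estimates.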


Future Improvements

XGBoost in Production

Currently used only in backtest validation; could be integrated into the live pipeline for real-time ensemble predictions.

Player-level features

Injuries, suspensions, and lineup data to capture squad disruption and rotation patterns.

Multi-league expansion

Extend to La Liga, Serie A, Bundesliga, and Ligue 1 using the same data sources.

Bayesian parameter estimation

Replace grid search with MCMC for continuous parameter optimisation and uncertainty quantification.

References

Dixon, M. J. and Coles, S. G. (1997). “Modelling Association Football Scores and Inefficiencies in the Football Betting Market.” Journal of the Royal Statistical Society: Series C, 46(2), 265–280.

Karlis, D. and Ntzoufras, I. (2003). “Analysis of Sports Data by Using Bivariate Poisson Models.” The Statistician, 52(3), 381–393.

Maher, M. J. (1982). “Modelling Association Football Scores.” Statistica Neerlandica, 36(3), 109–118.

ClubElo — Historical and live Elo ratings for European football clubs.

FotMob — Match statistics including expected goals (xG).

Football-Data.co.uk — Historical match data with shots, corners, cards, and closing market odds.

FBref — Season-level team statistics powered by StatsBomb data.