⚽💷📊
Premier League: Bivariate Poisson + Dixon-Coles (v5)
Building independent match odds using bivariate Poisson, rolling xG, ClubElo ratings, form (PPG), shots-on-target data, and XGBoost ensemble validation.
Overview

This model generates independent 1X2 (home / draw / away) probabilities for every upcoming Premier League fixture. Rather than relying on bookmaker odds, we build our own probability estimates from first principles using a Bivariate Poisson goal-scoring framework (Karlis & Ntzoufras, 2003) with the Dixon-Coles low-score correction. The inputs are rolling expected goals (xG) averages, ClubElo ratings, decay-weighted form (PPG), shots-on-target data from football-data.co.uk, and season possession percentages from FBref.

The system runs as a Python pipeline that writes predictions to a Postgres database, served to the frontend via an API endpoint and displayed alongside bookmaker odds in the fixtures UI, letting us see exactly where our model disagrees with the market.

Data Sources
Four complementary data sources feeding the prediction model
FotMob Match Data
Primary

Schedule and match statistics CSVs covering three full Premier League seasons:

  • 2023–24: 380 matches (extra historical depth)
  • 2024–25: 380 matches (historical context for promoted/relegated teams)
  • 2025–26: 380 matches (220 completed + 160 upcoming)
  • Key metric: Expected Goals (xG) per team per match
ClubElo Ratings
Secondary

Daily Elo ratings for all 20 Premier League teams, capturing long-term strength and form.

  • Range: ~1677 (promoted sides) to ~2064 (title contenders)
  • Pulled via soccerdata Python library
  • Captures factors xG misses: squad depth, transfers, motivation
Football-Data.co.uk
Tertiary

Historical match statistics from football-data.co.uk covering three seasons (1,000+ matches, 120 columns each):

  • Full-Time Result (FTR): walk-forward target variable
  • Match Dates: form windows & rest-day calculation
  • Closing Odds (B365H/D/A, MaxH/D/A): market blend & Kelly staking
FBref Season Stats
Quaternary

Season-level team statistics from FBref (StatsBomb-powered), pulled via soccerdata:

  • Possession %: game control indicator
  • Teams with higher possession dominate territory and create more chances
  • Range: ~39% (defensive teams) to ~61% (possession-dominant sides)
Team Name Mapping Challenge

A persistent headache in football data: every source names teams differently. Our pipeline maintains explicit mappings between ClubElo, FotMob, football-data.co.uk, FBref, and The Odds API names. On the frontend, fuzzy matching (token overlap + Levenshtein similarity ≥ 0.75) handles the remaining discrepancies.

ClubElo | FotMob | FD.co.uk | FBref
Man City | Manchester City | Man City | Manchester City
Forest | Nottingham Forest | Nott'm Forest | Nott'ham Forest
Bournemouth | AFC Bournemouth | Bournemouth | Bournemouth
Man United | Manchester United | Man United | Manchester Utd
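
The fuzzy fallback can be sketched in Python as follows. This is an illustration only: the frontend's actual code is not shown here, and the function names, the Jaccard form of token overlap, and taking the maximum of the two scores are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: max of token overlap (Jaccard) and Levenshtein similarity."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    tokens_a, tokens_b = set(a.split()), set(b.split())
    token_overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    lev_similarity = 1.0 - levenshtein(a, b) / max(len(a), len(b))
    return max(token_overlap, lev_similarity)


def match_team(name, candidates, threshold=0.75):
    """Return the best-scoring candidate at or above the threshold, else None."""
    best = max(candidates, key=lambda c: similarity(name, c), default=None)
    if best is not None and similarity(name, best) >= threshold:
        return best
    return None
```

Under these assumptions "Manchester Utd" scores about 0.82 against "Manchester United" but also 0.80 against "Manchester City", which illustrates why explicit mappings come first and fuzzy matching only handles the remainder.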
Model Architecture
13 steps from raw data to decimal odds
Step 1: Build Match-xG Table

For every completed match, join the FotMob schedule with top-stats data to create a row containing home/away teams, actual goals, and expected goals (xG). When xG is unavailable, fall back to actual goals.
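
A minimal sketch of this join in plain Python, assuming both inputs are lists of dicts keyed by a hypothetical `match_id` field; the real CSV column names are not given in this document, and the sample rows are made up.

```python
def build_match_xg(schedule, stats):
    """Join schedule rows with per-match stats on match_id (hypothetical key).

    Unplayed fixtures (goals are None) are skipped. When a side's xG is
    missing, its actual goal count is used instead.
    """
    stats_by_id = {row["match_id"]: row for row in stats}
    table = []
    for match in schedule:
        if match["home_goals"] is None or match["away_goals"] is None:
            continue  # fixture not completed yet
        row = stats_by_id.get(match["match_id"], {})
        home_xg = row.get("home_xg")
        away_xg = row.get("away_xg")
        table.append({
            "home": match["home"],
            "away": match["away"],
            "home_goals": match["home_goals"],
            "away_goals": match["away_goals"],
            "home_xg": float(match["home_goals"]) if home_xg is None else home_xg,
            "away_xg": float(match["away_goals"]) if away_xg is None else away_xg,
        })
    return table


# Made-up rows: match 2 has no xG, match 3 has not been played
schedule = [
    {"match_id": 1, "home": "Team A", "away": "Team B", "home_goals": 2, "away_goals": 0},
    {"match_id": 2, "home": "Team C", "away": "Team D", "home_goals": 1, "away_goals": 1},
    {"match_id": 3, "home": "Team E", "away": "Team F", "home_goals": None, "away_goals": None},
]
stats = [{"match_id": 1, "home_xg": 1.8, "away_xg": 0.4}]
match_xg = build_match_xg(schedule, stats)
```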

Step 2: Compute Rolling xG Strength

For each team, compute rolling xG averages over its most recent matches, home and away combined (window length: ROLLING_WINDOW in the Hyper-Parameters table):

Attack: average xG created
Defence: average xG conceded

A short window captures current form: a team's last couple of months is more predictive than its September results. Teams with fewer than 3 matches get league-average defaults.
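
The rolling averages can be sketched like this. The field names are hypothetical, the window is passed in explicitly rather than hard-coded, and using the overall mean xG per side as the league-average default is an assumption.

```python
from collections import defaultdict, deque


def rolling_strength(matches, window, min_matches=3):
    """Return {team: (attack, defence)} from chronologically ordered matches.

    attack  = mean xG created over the team's last `window` matches
    defence = mean xG conceded over the same matches
    """
    created = defaultdict(lambda: deque(maxlen=window))
    conceded = defaultdict(lambda: deque(maxlen=window))
    total_xg, sides = 0.0, 0
    for m in matches:
        created[m["home"]].append(m["home_xg"])
        conceded[m["home"]].append(m["away_xg"])
        created[m["away"]].append(m["away_xg"])
        conceded[m["away"]].append(m["home_xg"])
        total_xg += m["home_xg"] + m["away_xg"]
        sides += 2
    league_avg = total_xg / sides if sides else 0.0

    strengths = {}
    for team, xs in created.items():
        if len(xs) < min_matches:
            strengths[team] = (league_avg, league_avg)  # too little data
        else:
            strengths[team] = (sum(xs) / len(xs),
                               sum(conceded[team]) / len(conceded[team]))
    return strengths


# Made-up data; a window of 3 keeps the example small
matches = [
    {"home": "A", "away": "B", "home_xg": 1.0, "away_xg": 0.5},
    {"home": "C", "away": "A", "home_xg": 1.0, "away_xg": 2.0},
    {"home": "A", "away": "D", "home_xg": 3.0, "away_xg": 1.5},
    {"home": "B", "away": "A", "home_xg": 0.5, "away_xg": 4.0},
]
strengths = rolling_strength(matches, window=3)
```

Team A has four matches, so only its last three count; teams B, C and D have fewer than three and fall back to the league average.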

Step 3: Load Football-Data.co.uk Match Stats

Download match data from football-data.co.uk for all three seasons (1,000+ matches, 120 columns each, as listed under Data Sources). Using the same rolling window as Step 2, compute averages for:

SoT For: average shots on target created
SoT Against: average shots on target conceded

Shots on target provide a complementary signal to xG: a team generating many SoT is creating real danger even when xG models disagree.

Step 4: Fetch FBref Season Possession

Pull season-level Possession % from FBref for all 20 teams. Higher possession indicates greater game control and territory dominance.

Range: ~39% (deep-defending sides) to ~61% (possession-dominant teams like Man City/Chelsea/Liverpool). Used as a multiplier on lambda to reward teams that control the ball.

Step 5: Calculate Expected Goals (λ)
λ_home = avg_league_home_xG × (atk_home / avg_xG) × (def_away / avg_xG)
λ_away = avg_league_away_xG × (atk_away / avg_xG) × (def_home / avg_xG)

Three intuitive components: (1) league baseline encoding home advantage, (2) how strong this team's attack is relative to average, (3) how leaky the opponent's defence is.
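
As a worked sketch of the two formulas, assuming `avg_xG` is the league-wide average xG per team per match and using made-up input values:

```python
def base_lambdas(atk_home, def_home, atk_away, def_away,
                 avg_home_xg, avg_away_xg, avg_xg):
    """Expected goals before any Elo, SoT or possession correction."""
    lam_home = avg_home_xg * (atk_home / avg_xg) * (def_away / avg_xg)
    lam_away = avg_away_xg * (atk_away / avg_xg) * (def_home / avg_xg)
    return lam_home, lam_away


# Illustrative inputs, not taken from the pipeline
lam_home, lam_away = base_lambdas(
    atk_home=1.8, def_home=1.2,   # home side: strong attack, tight defence
    atk_away=1.2, def_away=1.5,   # away side: weak attack, average defence
    avg_home_xg=1.6, avg_away_xg=1.3, avg_xg=1.5,
)
# lam_home = 1.6 * 1.2 * 1.0 = 1.92
# lam_away = 1.3 * 0.8 * 0.8 = 0.832
```

A team with exactly average attack facing an exactly average defence gets the league baseline unchanged, which is a quick sanity check on any implementation.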

Step 6: Elo + SoT + Possession Corrections
λ *= 1 + (elo_home − elo_away) / ELO_DIVISOR
λ *= 1 + SOT_WEIGHT × (sot_atk_ratio × opp_sot_def_ratio − 1)
λ *= 1 + 0.10 × (poss / avg_poss − 1)

Three multiplicative corrections on lambda: (1) Elo captures squad quality, (2) the SoT ratio rewards teams that put more shots on target, (3) the possession ratio rewards game control. ELO_DIVISOR and SOT_WEIGHT take the values listed under Hyper-Parameters. All lambdas are clamped to [0.3, 4.5].
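
A sketch of the three corrections followed by the clamp. The constants passed in below (4000 and 0.25) are the ELO_DIVISOR and SOT_WEIGHT values from the Hyper-Parameters table. Flipping the Elo difference for the away side is an assumption, since only the home form is written out above.

```python
def apply_corrections(lam, elo_team, elo_opp, sot_atk_ratio, opp_sot_def_ratio,
                      poss, avg_poss, elo_divisor, sot_weight,
                      poss_weight=0.10, lam_min=0.3, lam_max=4.5):
    """Apply the Elo, shots-on-target and possession multipliers, then clamp."""
    lam *= 1 + (elo_team - elo_opp) / elo_divisor
    lam *= 1 + sot_weight * (sot_atk_ratio * opp_sot_def_ratio - 1)
    lam *= 1 + poss_weight * (poss / avg_poss - 1)
    return min(max(lam, lam_min), lam_max)


# Illustrative inputs: +200 Elo, SoT ratios of 1.1 each, 55% possession vs a 50% average
corrected = apply_corrections(
    1.5, elo_team=2000, elo_opp=1800,
    sot_atk_ratio=1.1, opp_sot_def_ratio=1.1,
    poss=55.0, avg_poss=50.0,
    elo_divisor=4000, sot_weight=0.25,
)
# 1.5 * 1.05 * 1.0525 * 1.01 ≈ 1.674
```

For the away side, the same function would be called with `elo_team` and `elo_opp` swapped.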

Step 7: Poisson Score Matrix

Construct a 9×9 score probability matrix (0 to 8 goals per side). Each cell is the product of two independent Poisson probabilities and the Dixon-Coles factor DC(i, j) described in Step 8:

P(i, j) = Poisson(i; λ_home) × Poisson(j; λ_away) × DC(i, j)

The independent Poisson assumption dates back to Maher (1982) and remains a strong baseline for football.
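
The independent part of the matrix (before any Dixon-Coles factor) can be sketched with the standard library alone. Rows index home goals and columns away goals; the lambda values are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def score_matrix(lam_home, lam_away, max_goals=8):
    """matrix[i][j] = P(home scores i) * P(away scores j), for 0..max_goals."""
    size = max_goals + 1
    return [[poisson_pmf(i, lam_home) * poisson_pmf(j, lam_away)
             for j in range(size)]
            for i in range(size)]


matrix = score_matrix(1.5, 1.2)
total_mass = sum(sum(row) for row in matrix)  # slightly below 1 due to truncation
```

Capping at 8 goals discards only a tiny tail for typical lambdas, which is why renormalising after the correction step is enough.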

Step 8: Dixon-Coles Correction (ρ = RHO)

The independent Poisson assumption slightly misprices low-scoring outcomes. Dixon & Coles (1997) introduced correction factors for the four lowest scorelines:

Score | Correction | Effect
0 – 0 | 1 − λ_h × λ_a × ρ | ↑ more likely
1 – 0 | 1 + λ_a × ρ | ↓ less likely
0 – 1 | 1 + λ_h × ρ | ↓ less likely
1 – 1 | 1 − ρ | ↑ more likely
Other | 1.0 | unchanged

This reflects the empirical observation that in tight, defensive games, draws become more likely than the independent model predicts. Matrix renormalised to sum to 1 after correction.
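
A self-contained sketch of the correction and renormalisation, with scores written as home goals then away goals. The lambdas are illustrative, and ρ = −0.10 is the RHO value from the Hyper-Parameters table.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dc_factor(i, j, lam_home, lam_away, rho):
    """Dixon-Coles factor for the score i (home) - j (away)."""
    if i == 0 and j == 0:
        return 1 - lam_home * lam_away * rho
    if i == 1 and j == 0:
        return 1 + lam_away * rho
    if i == 0 and j == 1:
        return 1 + lam_home * rho
    if i == 1 and j == 1:
        return 1 - rho
    return 1.0


def dixon_coles_matrix(lam_home, lam_away, rho, max_goals=8):
    """Poisson score matrix with the low-score correction, renormalised to 1."""
    size = max_goals + 1
    raw = [[poisson_pmf(i, lam_home) * poisson_pmf(j, lam_away)
            * dc_factor(i, j, lam_home, lam_away, rho)
            for j in range(size)]
           for i in range(size)]
    total = sum(sum(row) for row in raw)
    return [[p / total for p in row] for row in raw]


def draw_probability(matrix):
    return sum(matrix[k][k] for k in range(len(matrix)))


independent = dixon_coles_matrix(1.5, 1.2, rho=0.0)
corrected = dixon_coles_matrix(1.5, 1.2, rho=-0.10)
```

With a negative ρ, `draw_probability(corrected)` is higher than `draw_probability(independent)`, while 1–0 and 0–1 lose a little mass.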

Step 9: Extract 1X2 Probabilities

Sum regions of the score matrix: home win (below diagonal), draw (diagonal), away win (above diagonal). A probability floor of 2% is applied per outcome, then renormalised.
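
A sketch of the extraction, assuming rows are home goals and columns away goals, so home wins sit below the diagonal. Applying the floor once and then renormalising is one reading of the text; after renormalising, a floored outcome can end up marginally under 2%.

```python
def outcome_probs(matrix, floor=0.02):
    """Return (home, draw, away) probabilities from a score matrix."""
    size = len(matrix)
    home = sum(matrix[i][j] for i in range(size) for j in range(size) if i > j)
    draw = sum(matrix[k][k] for k in range(size))
    away = sum(matrix[i][j] for i in range(size) for j in range(size) if i < j)
    floored = [max(p, floor) for p in (home, draw, away)]
    total = sum(floored)
    return tuple(p / total for p in floored)


# Tiny made-up 3x3 matrix (0-2 goals) that already sums to 1
toy = [
    [0.10, 0.05, 0.05],
    [0.20, 0.10, 0.05],
    [0.30, 0.10, 0.05],
]
p_home, p_draw, p_away = outcome_probs(toy)  # 0.60, 0.25, 0.15
```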

Step 10: Bivariate Poisson (v5)

Instead of modelling goals independently, the Bivariate Poisson (Karlis & Ntzoufras, 2003) decomposes home goals X and away goals Y as X = X₁ + X₃ and Y = X₂ + X₃, where X₃ ~ Poi(λ₃) is a shared component capturing positive goal correlation.

X₁ ~ Poi(λ₁), X₂ ~ Poi(λ₂), X₃ ~ Poi(λ₃)
Cov(X, Y) = λ₃ = 0.08

The shared component naturally increases draw scoreline probabilities without the flat inflation of v4. When one team scores, conditions that led to the goal (open play, fatigue) make it more likely the other team also scores.
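
The joint probability follows directly from the decomposition by summing over the possible values of the shared component. In the sketch below, setting λ₁ = λ_home − λ₃ and λ₂ = λ_away − λ₃ so that the marginal means are preserved is an assumption; the document does not say how the pipeline splits the rates.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X = x, Y = y) for X = X1 + X3, Y = X2 + X3 with independent Poisson parts."""
    return sum(
        poisson_pmf(x - k, lam1) * poisson_pmf(y - k, lam2) * poisson_pmf(k, lam3)
        for k in range(min(x, y) + 1)
    )


def draw_probability(lam1, lam2, lam3, max_goals=12):
    return sum(bivariate_poisson_pmf(g, g, lam1, lam2, lam3)
               for g in range(max_goals + 1))


LAMBDA3 = 0.08                 # shared-component rate from the text
lam_home, lam_away = 1.5, 1.2  # illustrative expected goals

draw_independent = draw_probability(lam_home, lam_away, 0.0)
draw_bivariate = draw_probability(lam_home - LAMBDA3, lam_away - LAMBDA3, LAMBDA3)
```

With the marginal means held fixed, `draw_bivariate` comes out higher than `draw_independent`, which is the structural draw correction described above.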

Step 11: Form (PPG) Adjustment (v5)

Recent form is captured via decay-weighted Points Per Game over each team's last 5 matches, computed walk-forward from football-data.co.uk results:

λ *= 1 + FORM_WEIGHT × (team_form / avg_form − 1)

With FORM_WEIGHT = 0.12, a team in excellent form (PPG 2.5 vs avg 1.4) gets ~9% more expected goals. Form was the #1 most important feature in XGBoost validation (gain: 2.0–2.1).
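
A sketch of the form calculation and the resulting multiplier. Using DECAY_FACTOR = 0.85 as the per-match weight inside the PPG average is an assumption, as is the most-recent-first ordering of the points list.

```python
def decay_weighted_ppg(points, decay=0.85):
    """Weighted points per game; points[0] is the most recent match (3, 1 or 0)."""
    if not points:
        raise ValueError("need at least one match")
    weights = [decay ** k for k in range(len(points))]
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)


def form_multiplier(team_form, avg_form, form_weight=0.12):
    """Multiplier applied to lambda: 1 + FORM_WEIGHT * (team_form / avg_form - 1)."""
    return 1 + form_weight * (team_form / avg_form - 1)


recent = [3, 3, 1, 0, 3]            # made-up last five results, newest first
team_form = decay_weighted_ppg(recent)
boost = form_multiplier(2.5, 1.4)   # the example from the text: about 1.094
```

Because of the decay, a win last week counts for more than a win five matches ago, so `[3, 0]` scores higher than `[0, 3]`.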

Step 12: Power Shrinkage (v5)

Raw Poisson probabilities tend to be overconfident. Power shrinkage compresses probabilities toward the centre by raising each to a power α < 1 and then renormalising so the three outcomes sum to 1:

p_H = p_H ^ SHRINK_POWER
p_D = p_D ^ SHRINK_POWER
p_A = p_A ^ SHRINK_POWER

With SHRINK_POWER = 0.90, a raw 60% / 20% / 20% split becomes roughly 57% / 21% / 21% after renormalising. The net effect reduces overconfidence on heavy favourites and increases the model's willingness to predict draws when probabilities are close.
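
The shrinkage step is short enough to check by hand. The sketch below raises each probability to the power and renormalises; the function name is illustrative.

```python
def power_shrink(p_home, p_draw, p_away, power=0.90):
    """Raise each probability to `power` (< 1) and renormalise to sum to 1."""
    raised = [p ** power for p in (p_home, p_draw, p_away)]
    total = sum(raised)
    return tuple(r / total for r in raised)


shrunk = power_shrink(0.60, 0.20, 0.20)
# 0.60**0.9 ≈ 0.6314, 0.20**0.9 ≈ 0.2349, sum ≈ 1.1013
# -> roughly (0.573, 0.213, 0.213)
```

The transformation preserves the ranking of the three outcomes and leaves a uniform 1/3 split untouched; it only pulls the extremes toward the middle.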

Step 13: Convert to Decimal Odds
odds = 1 / probability

Displayed alongside bookmaker odds in the UI for direct comparison of where the model disagrees with the market.
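
As a small check of the conversion, the sketch below reproduces the Arsenal vs Man Utd row from the Sample Output table. Rounding to two decimals is an assumption about presentation.

```python
def to_decimal_odds(probability):
    """Fair decimal odds for a probability in (0, 1]."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


row = {"home": 0.380, "draw": 0.281, "away": 0.339}
odds = {outcome: round(to_decimal_odds(p), 2) for outcome, p in row.items()}
# {'home': 2.63, 'draw': 3.56, 'away': 2.95}
```

These are fair odds with no margin, so they will usually sit above a bookmaker's price for the same probability.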

Hyper-Parameters
Parameter | Value | Description
ROLLING_WINDOW | 10 | Recent matches for xG averages
RHO | −0.10 | Dixon-Coles correction parameter
ELO_DIVISOR | 4000 | ~2.5% shift per 100 Elo
SOT_WEIGHT | 0.25 | Shots-on-target ratio influence on λ
DECAY_FACTOR | 0.85 | Exponential decay per match (half-life ≈ 4.3 matches)
VENUE_WEIGHT | 0.30 | Blend of venue-specific vs overall rolling stats
DRAW_BOOST_MAX | 0.08 | Max draw boost for evenly-matched, low-scoring matchups
MARKET_WEIGHT | 0.15 | Blend with bookmaker implied probabilities
BVP_LAMBDA3 | 0.08 | Bivariate Poisson covariance (structural draw correction), v5
FORM_WEIGHT | 0.12 | Decay-weighted PPG form influence on λ, v5
SHRINK_POWER | 0.90 | Power shrinkage α < 1 reduces overconfidence, v5
KELLY_FRAC | 0.25 | Quarter-Kelly for optimal bet sizing

Values selected via grid search (108 combinations) on ~389 tune matches across 2024–25 + 2025–26 seasons, then validated on ~260 holdout matches the search never saw. The 2023–24 season provides warm-up data for form calculations. This temporal split prevents overfitting hyperparameters to evaluation data.
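
The temporal split can be sketched as a chronological cut. The 60% tune fraction matches the backtest description; the `date` field and the synthetic matches are illustrative.

```python
from datetime import date, timedelta


def temporal_split(matches, tune_fraction=0.6):
    """Sort by date, then cut: earlier matches tune, later matches hold out."""
    ordered = sorted(matches, key=lambda m: m["date"])
    cut = int(len(ordered) * tune_fraction)
    return ordered[:cut], ordered[cut:]


# 649 synthetic matches, deliberately supplied in reverse order
matches = [{"date": date(2024, 8, 16) + timedelta(days=i)}
           for i in reversed(range(649))]
tune, holdout = temporal_split(matches)
# len(tune) == 389, len(holdout) == 260
```

Cutting by date rather than sampling at random means every holdout match is played after every tune match, so the grid search cannot see future results.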

Pipeline Architecture
FotMob CSVs ──┐
ClubElo API ──┤
FD.co.uk ─────┼──→ build_predictions.py ──→ Neon Postgres
FBref API ────┘                                  │
                                                 ▼
                              /api/predictions (Next.js API route)
                                                 │
                                                 ▼
                              Fixtures UI (model odds alongside bookmaker odds)

The Python script runs nightly via GitHub Actions at 00:00 UTC, pulling fresh data from all four sources, generating predictions for every upcoming fixture, and writing them to the database. The API route queries the database with an optional date filter. The React frontend fetches predictions in parallel with fixture data and matches them via fuzzy team name comparison.

Sample Output
Match | Home | Draw | Away
Man City vs Wolves | 64.9% (1.54) | 23.6% (4.23) | 11.5% (8.65)
Arsenal vs Man Utd | 38.0% (2.63) | 28.1% (3.56) | 33.9% (2.95)
Bournemouth vs Liverpool | 25.1% (3.98) | 24.7% (4.05) | 50.2% (1.99)
Backtest Results
Walk-forward backtest on 649 completed matches across 2024–25 + 2025–26 seasons (2023–24 as warm-up). Hyperparameters tuned on first 60% (~389 matches), evaluated on ~260 holdout matches the grid search never saw.
Metric | v4 (holdout) | v5 Poisson | v5 + XGBoost
Accuracy | 42.2% | 49.6% | 56.5%
Brier score | 0.6561 | 0.6207 | 0.5374
Log loss | 1.0796 | 1.0288 | 0.8997
Value-bet ROI (5% edge) | +28.6% | +26.7% | +88.0%
Value-bet P/L | +45.6u | +43.8u | +197.2u
Full walk-forward (649 matches) | 53.5% accuracy • Brier 0.5785 • +42.8% ROI (+193.8u)
Kelly ROI (full 649) | +39.7% on 22.6u staked (+8.9u profit)

On the holdout set, v5 Poisson improves accuracy, Brier score, and log loss over v4, although its value-bet ROI is slightly lower (+26.7% vs +28.6%). Adding the XGBoost ensemble lifts accuracy from 42.2% to 56.5% and cuts the Brier score from 0.6561 to 0.5374. Form (PPG) is the most important feature (XGBoost gain: 2.0–2.1). With the ensemble, value-bet ROI rises from +28.6% to +88.0% on holdout.
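
For reference, the three headline metrics can be computed as below. The multi-class definitions used here (Brier score summed over the three outcomes, natural-log loss) are assumptions about how the backtest defines them.

```python
import math

OUTCOMES = ("H", "D", "A")


def brier_score(probs, results):
    """Mean over matches of the squared error summed across H, D and A."""
    total = 0.0
    for p, r in zip(probs, results):
        total += sum((p[k] - (1.0 if OUTCOMES[k] == r else 0.0)) ** 2
                     for k in range(3))
    return total / len(results)


def log_loss(probs, results, eps=1e-15):
    """Mean negative log-probability assigned to the actual result."""
    total = 0.0
    for p, r in zip(probs, results):
        total -= math.log(max(p[OUTCOMES.index(r)], eps))
    return total / len(results)


def accuracy(probs, results):
    """Share of matches where the most probable outcome was the actual one."""
    hits = sum(OUTCOMES[max(range(3), key=lambda k: p[k])] == r
               for p, r in zip(probs, results))
    return hits / len(results)


# A forecaster that always says 1/3 each, scored on one of each result
uniform = [(1 / 3, 1 / 3, 1 / 3)] * 3
actual = ["H", "D", "A"]
baseline_brier = brier_score(uniform, actual)   # 2/3 ≈ 0.667
baseline_logloss = log_loss(uniform, actual)    # ln 3 ≈ 1.099
```

Under these definitions an uninformed 1/3-each forecast scores about 0.667 and 1.099, which gives a yardstick for reading the table.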

Future Improvements
XGBoost in Production

Currently used only in backtest validation; could be integrated into the live pipeline for real-time ensemble predictions.

Player-level features

Injuries, suspensions, and lineup data to capture squad disruption and rotation patterns.

Multi-league expansion

Extend to La Liga, Serie A, Bundesliga, and Ligue 1 using the same data sources.

Bayesian parameter estimation

Replace grid search with MCMC for continuous parameter optimisation and uncertainty quantification.

References

Dixon, M. J. and Coles, S. G. (1997). “Modelling Association Football Scores and Inefficiencies in the Football Betting Market.” Journal of the Royal Statistical Society: Series C, 46(2), 265–280.

Karlis, D. and Ntzoufras, I. (2003). “Analysis of Sports Data by Using Bivariate Poisson Models.” The Statistician, 52(3), 381–393.

Maher, M. J. (1982). “Modelling Association Football Scores.” Statistica Neerlandica, 36(3), 109–118.

ClubElo: Historical and live Elo ratings for European football clubs.

FotMob: Match statistics including expected goals (xG).

Football-Data.co.uk: Historical match data with shots, corners, cards, and closing market odds.

FBref: Season-level team statistics powered by StatsBomb data.