Probaballer - Football Analytics & Betting Insights

Research Deep Dive · ACM 2025

Diffoot: Graph-Conditioned Diffusion for Football Trajectories

A diffusion-based model that predicts defending team movements over the next 4 seconds, using heterogeneous graphs to capture attacker-defender-ball interactions from Bundesliga tracking data.

Diffusion Models · Heterogeneous Graphs · 4s Prediction Horizon · Tactical Analysis · 40 min read

Source: German Football League (DFL) Bundesliga Data

The Tactical Problem

Football tactics planning can be broken down into three steps:

1. Set attacking positions: where should our players be?
2. Predict defending response: how will opponents react?
3. Adjust attacking plan: optimize based on the predicted defense

Diffoot focuses on Step 2: Given the positions and movements of both teams over the past 4 seconds, predict where the defending team's players will be over the next 4 seconds. This 4+4 second window (200 frames at 25 FPS) enables tactical planning at the possession level.

Why Existing Methods Fall Short

The paper identifies three key limitations of existing trajectory prediction models (Social-LSTM, Social-GAN, AgentFormer, etc.) when applied to football:

❌ Unbounded Movement

Most models are designed for pedestrians or vehicles with no spatial constraints. Football players must stay within the pitch boundaries.

❌ Wrong Motion Patterns

Models trained on road traffic or crowd behavior don't capture football-specific movements: pressing, marking, covering, off-ball runs.

❌ Missing Interactions

They don't model the structured interactions in football: attacker-defender marking, ball-player relationships, teammate coordination.

Diffoot's Solution

Combine a Heterogeneous Graph Neural Network (to model attacker-defender-ball interactions) with a Conditional Diffusion Model (to generate diverse, realistic trajectories). The graph provides rich context; the diffusion model captures multi-modal uncertainty.

Diffoot Architecture: Two Main Components

Graph encoding + Conditional diffusion

Diffoot has two core components that work together:

Component 1: Heterogeneous Graph Encoder

Models relationships between players and ball using a multi-relation graph with different edge types for different interactions.

• Attacker ↔ Attacker (passing opportunities)
• Defender ↔ Defender (coordination)
• Attacker ↔ Defender (marking pressure)
• Player ↔ Ball (possession/proximity)
• Temporal edges (same player across frames)

Component 2: Conditional Diffusion Model

Generates future trajectories by denoising random noise, conditioned on the graph embeddings from the encoder.

• 1000 noise steps (cosine schedule)
• DDIM sampling with 50 steps at inference
• v-prediction loss for stable training
• Cross-attention + FiLM conditioning

Heterogeneous Graph Construction

How Diffoot models player-ball interactions

Unlike simple distance-based graphs, Diffoot constructs a heterogeneous graph with multiple node types and edge types that capture football-specific interactions.

Node Types & Features

⚽ Ball Node

• Position: (x, y)

• Velocity: (vx, vy)

• Other features: -1 (placeholder)

🔴 Attacker Nodes

• Position: (x, y)

• Velocity: (vx, vy)

• Distance to ball

• Distance to goal

• Possession time τ

🔵 Defender Nodes

• Position: (x, y)

• Velocity: (vx, vy)

• Distance to ball

• Distance to goal

• Possession time τ

Node type is appended as the last feature dimension so the GNN can distinguish between player types.

Edge Types & Weights

Edges are weighted by W = W_dist + W_situation, where W_dist captures proximity and W_situation captures football-specific context. Edges whose weights fall below the thresholds δ_dist and δ_situation (listed under Graph Encoder Hyperparameters) are removed.

Attacker ↔ Attacker Edges

W_situation reflects passing opportunities. If one player has the ball (τ > 0), the edge weight increases based on passing lane openness.

σ(A_i, A_j) = 1 / (1 + e^(-τ_i)) # Higher if player has ball longer

Defender ↔ Defender Edges

W_situation = 0 (no special context). Only distance-based weighting for coordination.

W(D_i, D_j) = W_dist(D_i, D_j) # Pure proximity

Attacker ↔ Defender Edges

W_situation reflects defensive pressure intensity — how aggressively the defender is closing down the attacker.

σ(A_i, D_j) = max(0, -cos(θ)) × ||v_D|| # Approach speed × direction
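
The pressure term can be sketched in a few lines. The source does not define θ, so this sketch assumes θ is the angle between the defender's velocity and the vector pointing from the attacker to the defender, which makes -cos(θ) positive when the defender is closing in. The function name `pressure_weight` is illustrative, not taken from the Diffoot code.

```python
import math

def pressure_weight(attacker_pos, defender_pos, defender_vel):
    """max(0, -cos(theta)) * ||v_D||.

    Assumption: theta is the angle between the defender's velocity and
    the attacker->defender vector (the source leaves theta undefined).
    """
    dx = defender_pos[0] - attacker_pos[0]
    dy = defender_pos[1] - attacker_pos[1]
    dist = math.hypot(dx, dy)
    speed = math.hypot(defender_vel[0], defender_vel[1])
    if dist == 0.0 or speed == 0.0:
        return 0.0  # angle undefined; treat as no pressure
    cos_theta = (dx * defender_vel[0] + dy * defender_vel[1]) / (dist * speed)
    return max(0.0, -cos_theta) * speed

# Defender 10 m away, running straight at the attacker at 3 m/s
closing = pressure_weight((0.0, 0.0), (10.0, 0.0), (-3.0, 0.0))
# Same defender running away from the attacker
retreating = pressure_weight((0.0, 0.0), (10.0, 0.0), (3.0, 0.0))
```

Under this reading, a defender moving directly toward the attacker contributes their full speed, while one moving sideways or away contributes nothing.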

Player ↔ Ball Edges

Distance to ball is weighted more heavily than player-player distance. For defenders, W_situation considers approach speed toward ball.

W_dist(P, B) weighted higher than W_dist(P_i, P_j)

Temporal Edges

Connect the same node across consecutive frames (t → t+1). This allows the GNN to learn how individual players' states evolve over time.

E_temporal = {(v_i,t → v_i,t+1) for all players across T frames}
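
As a rough illustration of the temporal edge set, the sketch below links each node to itself in the next frame. It assumes 23 nodes (22 players plus the ball) and a 100-frame observation window (4 s at 25 FPS); whether the ball gets temporal edges and whether the graph spans every frame are assumptions, and `temporal_edges` is a hypothetical helper.

```python
def temporal_edges(num_nodes, num_frames):
    """Directed edges ((i, t), (i, t + 1)) linking each node to itself
    in the following frame."""
    return [((i, t), (i, t + 1))
            for i in range(num_nodes)
            for t in range(num_frames - 1)]

# 22 players + ball over 100 observed frames (assumed sizes)
edges = temporal_edges(num_nodes=23, num_frames=100)
```

Each node contributes one edge per consecutive frame pair, so the edge count grows linearly with both the number of nodes and the window length.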

🎯 Why This Graph Design Matters

Standard approach: Connect all players within distance threshold, uniform edge weights.

Diffoot approach: Different edge types for different tactical relationships, with weights that encode football semantics (passing lanes, pressing intensity, ball proximity).

This richer graph structure is the likely reason Diffoot beats Social-STGCNN and the other baselines on ADE and direction error, even though Social-STGCNN keeps a small edge on FDE and Fréchet distance.

Graph Attention Encoding

Converting the graph into a conditioning vector

The heterogeneous graph is processed by a Graph Attention Network (GAT) to produce a conditioning vector for the diffusion model.

Encoding Pipeline

# 1. Embed position features (categorical → continuous)

h_i^(0) = Embedding(position_i) ⊕ other_features_i

# 2. Two layers of Graph Attention

for l in 1..2:

    α_ij = softmax_j(LeakyReLU(a^T [W·h_i^(l-1) || W·h_j^(l-1)]))

    h_i^(l) = σ(Σ_j α_ij · W · h_j^(l-1))

# 3. Attention Pooling → single condition vector

α_i = softmax(w_p^T · tanh(W_p · h_i^(L)))

c = Σ_i α_i · h_i^(L) # Final conditioning vector
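
Step 3 of the pipeline (attention pooling) can be sketched in plain Python. This is a toy version with hand-supplied weights standing in for the learned parameters W_p and w_p; the function name `attention_pool` and the list-of-lists layout are illustrative choices, not the authors' implementation.

```python
import math

def attention_pool(node_embeddings, W_p, w_p):
    """Pool node embeddings h_i into c = sum_i alpha_i * h_i,
    with alpha = softmax_i(w_p . tanh(W_p h_i))."""
    scores = []
    for h in node_embeddings:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W_p]
        scores.append(sum(a * b for a, b in zip(w_p, hidden)))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(node_embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, node_embeddings))
              for d in range(dim)]
    return pooled, alphas

# Two toy 2-d node embeddings; a zero scoring vector gives uniform attention
c, alphas = attention_pool([[1.0, 0.0], [0.0, 1.0]],
                           W_p=[[1.0, 0.0], [0.0, 1.0]],
                           w_p=[0.0, 0.0])
```

With a zero scoring vector the pooling reduces to a plain average; learned weights let the encoder emphasize the nodes most relevant to the current situation.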

Graph Encoder Hyperparameters

Hidden dimension: 256

Position embedding dim: 8

GAT layers: 2

Edge thresholds (δ_dist, δ_situation): 0.1, 0.05

What the Condition Encodes

• Current tactical configuration
• Who is pressing whom
• Passing lane availability
• Ball possession context
• Team formations & spacing

Conditional Diffusion Model

The trajectory generation component

Forward Process: Adding Noise

Diffoot uses a cosine noise schedule (not linear) for more stable training:

f(k) = cos((k/K + s) / (1 + s) × π/2)²

ᾱ_k = f(k) / f(0)

β_k = 1 - ᾱ_k / ᾱ_{k-1}

# K = 1000 total steps, s = 0.008 (schedule correction)
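
The schedule formulas above translate directly into code. The sketch below computes ᾱ_k and β_k for K = 1000 and s = 0.008; clipping β at 0.999 is a common practice for cosine schedules and is an assumption here, since the source does not mention it. `cosine_schedule` is an illustrative name.

```python
import math

def cosine_schedule(K=1000, s=0.008, max_beta=0.999):
    """Return (alpha_bar, beta), each indexed 0..K.

    beta[0] is unused and set to 0. The max_beta clip is an assumption
    (common practice), not something stated in the source.
    """
    def f(k):
        return math.cos((k / K + s) / (1 + s) * math.pi / 2) ** 2

    alpha_bar = [f(k) / f(0) for k in range(K + 1)]
    beta = [0.0] + [min(1 - alpha_bar[k] / alpha_bar[k - 1], max_beta)
                    for k in range(1, K + 1)]
    return alpha_bar, beta

alpha_bar, beta = cosine_schedule()
```

ᾱ starts at 1 (no noise) and decays smoothly toward 0, so the last step would have β close to 1 without the clip.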

Denoising Network Architecture

The denoising network is based on CSDI (Conditional Score-based Diffusion for Imputation), modified with LinFormer for efficient computation:

Diffusion Step Embedding

Timestep k is converted to embedding k_emb and added to each residual block input.

Condition Injection

Graph condition c is injected via Cross-Attention and FiLM (Feature-wise Linear Modulation) rather than simple addition.

H' = CrossAttention(H, c) # Query: H, Key/Value: c
H'' = γ(c) × H' + β(c) # FiLM: scale and shift
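
To make the FiLM step concrete, here is a minimal sketch in which γ(c) and β(c) are plain linear maps of the condition vector. The linear form, the weight names, and the helper `film` are assumptions for illustration; the source only states that FiLM scales and shifts the features.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(h, c, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise scale and shift of h, with both computed from c."""
    gamma = linear(W_gamma, b_gamma, c)
    beta = linear(W_beta, b_beta, c)
    return [g * x + b for g, x, b in zip(gamma, h, beta)]

# 1-d condition modulating a 2-d feature vector
out = film(h=[3.0, 4.0], c=[2.0],
           W_gamma=[[1.0], [0.5]], b_gamma=[0.0, 0.0],
           W_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0])
```

Because the scale and shift depend on c, the same denoising features are modulated differently for different tactical situations.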

Spatiotemporal Processing

LinFormer (efficient transformer) processes temporal and spatial features separately, then combines them.

y_t = LinFormer_temporal(Z) # Along time axis
y_f = LinFormer_spatial(Z) # Along player axis
H = Combine(y_t, y_f)

Gated Output

A gating mechanism filters useful information before the final prediction.

gate = σ(Conv(H)) # Sigmoid gate
output = gate ⊙ tanh(Conv(H))

Loss Function

Diffoot uses v-prediction (not ε-prediction) for more stable training, plus an NLL term for variance prediction:

# v-prediction target (better than noise prediction)

v = √ᾱ_k × ε - √(1-ᾱ_k) × x_0

L_simple = ||v - v_θ(x_k, k, c)||²

# Variance prediction for uncertainty

L_vlb = -log p_θ(x_{k-1} | x_k, c)

# Final loss

L_total = L_simple + λ × L_vlb # λ = 0.001
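
The v-prediction target has a convenient property: given the noisy input x_k and v, both the clean signal x_0 and the noise ε can be recovered in closed form. The sketch below checks this on a toy vector, assuming the standard forward process x_k = √ᾱ_k · x_0 + √(1-ᾱ_k) · ε; the function names are illustrative.

```python
import math

def diffuse(x0, eps, alpha_bar_k):
    """Forward process: x_k = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a, s = math.sqrt(alpha_bar_k), math.sqrt(1 - alpha_bar_k)
    return [a * x + s * e for x, e in zip(x0, eps)]

def v_target(x0, eps, alpha_bar_k):
    """v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * x0."""
    a, s = math.sqrt(alpha_bar_k), math.sqrt(1 - alpha_bar_k)
    return [a * e - s * x for x, e in zip(x0, eps)]

def recover_from_v(x_k, v, alpha_bar_k):
    """Invert the two relations above to get (x0_hat, eps_hat)."""
    a, s = math.sqrt(alpha_bar_k), math.sqrt(1 - alpha_bar_k)
    x0_hat = [a * xk - s * vi for xk, vi in zip(x_k, v)]
    eps_hat = [s * xk + a * vi for xk, vi in zip(x_k, v)]
    return x0_hat, eps_hat

x0, eps, a_bar = [1.0, -2.0], [0.5, 0.3], 0.7
x_k = diffuse(x0, eps, a_bar)
x0_hat, eps_hat = recover_from_v(x_k, v_target(x0, eps, a_bar), a_bar)
```

In a sampler, the network's v_θ output takes the place of the exact target, and the same two formulas yield the x̂_0 and ε_θ estimates that the sampling update needs.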

Inference: DDIM Sampling

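At inference, Diffoot uses DDIM with only 50 steps instead of 1000, which cuts the number of denoising network evaluations by about 20× relative to full-length DDPM sampling:

# DDIM sampling (fewer steps; fully deterministic only when η = 0)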

x_{k-1} = √ᾱ_{k-1} × x̂_0 + √(1-ᾱ_{k-1}-σ²) × ε_θ + σ × z

# η controls stochasticity (η=0.2 in Diffoot)

σ = η × √((1-ᾱ_{k-1})/(1-ᾱ_k)) × √(1-ᾱ_k/ᾱ_{k-1})

η = 0.2 adds slight randomness for diversity while keeping trajectories coherent.
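
A single DDIM update following the two formulas above might look like the sketch below. It works on flat lists of floats and takes the noise estimate as an argument rather than calling a network; `ddim_sigma` and `ddim_step` are illustrative names, and x̂_0 is derived from ε_θ with the standard relation x̂_0 = (x_k - √(1-ᾱ_k) · ε_θ) / √ᾱ_k.

```python
import math
import random

def ddim_sigma(alpha_bar_k, alpha_bar_prev, eta=0.2):
    """Noise scale for one DDIM step; eta = 0 gives a deterministic update."""
    return (eta * math.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_k))
            * math.sqrt(1 - alpha_bar_k / alpha_bar_prev))

def ddim_step(x_k, eps_hat, alpha_bar_k, alpha_bar_prev, eta=0.2, rng=random):
    """Map x_k to x_{k-1} given the predicted noise eps_hat."""
    sigma = ddim_sigma(alpha_bar_k, alpha_bar_prev, eta)
    out = []
    for xk, e in zip(x_k, eps_hat):
        x0_hat = (xk - math.sqrt(1 - alpha_bar_k) * e) / math.sqrt(alpha_bar_k)
        direction = math.sqrt(1 - alpha_bar_prev - sigma ** 2) * e
        out.append(math.sqrt(alpha_bar_prev) * x0_hat + direction
                   + sigma * rng.gauss(0.0, 1.0))
    return out

# One step from a_bar = 0.5 to a_bar = 0.8 with the paper's eta
x_prev = ddim_step([0.9, -1.2], [0.5, 0.3], 0.5, 0.8, eta=0.2)
```

With η = 0 the random term vanishes and repeated runs give identical trajectories; η = 0.2 keeps a small amount of fresh noise at each step.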

Training & Implementation Details

Model Hyperparameters

Noise steps K: 1000

Model channels: 256

Attention heads: 4

Residual blocks: 5

LinFormer compress dim: 32

DDIM inference steps: 50

DDIM η: 0.2

Training Configuration

Epochs: 30

Batch size: 16

Optimizer: AdamW

Learning rate: 1e-4

LR scheduler: ×0.75 on plateau

Hardware: NVIDIA L40S

Key Implementation Detail: Relative Trajectories

Instead of predicting absolute future positions, Diffoot predicts relative displacements from the last observed frame. This makes learning easier and improves generalization.

Ŷ_rel = Y_{T+1:T+T'} - Y_T # Relative to last observed position
Y_pred = Y_T + Ŷ_rel # Convert back to absolute
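
The conversion in both directions is a simple subtraction and addition per frame. A small sketch for a single player, with trajectories as lists of (x, y) tuples; the helper names are illustrative.

```python
def to_relative(future, last_observed):
    """future: (x, y) positions for frames T+1..T+T';
    last_observed: (x, y) at frame T."""
    x0, y0 = last_observed
    return [(x - x0, y - y0) for x, y in future]

def to_absolute(relative, last_observed):
    """Add the last observed position back onto relative displacements."""
    x0, y0 = last_observed
    return [(dx + x0, dy + y0) for dx, dy in relative]

last = (10.0, 5.0)
future = [(11.0, 5.5), (12.0, 6.0)]
rel = to_relative(future, last)
restored = to_absolute(rel, last)
```

The model only ever sees displacements near the origin, so the same motion pattern looks identical regardless of where on the pitch it happens.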

Dataset: Bundesliga Tracking Data

Diffoot was trained on official Bundesliga tracking data from the German Football League (DFL), captured using ChyronHego's TRACAB system at 25 FPS.

Dataset Statistics

Match ID	Frames	Samples	Split
DFL-MAT-J03WOH	137,214	2,297	Train
DFL-MAT-J03WOY	142,536	2,468	Train
DFL-MAT-J03WPY	146,211	3,056	Train
DFL-MAT-J03WQQ	142,345	2,462	Train
DFL-MAT-J03WMX	145,967	2,203	Validation
DFL-MAT-J03WR9	146,810	2,203	Test

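Total: 6 matches, ~21,887 samples after preprocessing, with a 70:15:15 train/val/test split. Note that the per-match sample counts in the table sum to 14,689 (10,283 train, 2,203 validation, 2,203 test), which matches the 70:15:15 ratio but not the ~21,887 total.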

Data Preprocessing

• Extract attack sequences ≥8 seconds without possession changes
• Sliding window: 4s past + 4s future = 200 frames per sample
• Linear interpolation for missing positions (occlusion)
• Z-score normalization on position/velocity features
• Random vertical flip applied with 70% probability for data augmentation
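
The sliding-window step can be sketched as index arithmetic over one possession sequence. The 4 s + 4 s window at 25 FPS comes from the text; the stride between consecutive windows is not stated in the source, so `stride` is a hypothetical parameter, as is the helper name `sliding_windows`.

```python
FPS = 25
PAST = 4 * FPS     # 100 observed frames
FUTURE = 4 * FPS   # 100 frames to predict

def sliding_windows(num_frames, stride=25):
    """Return (start, split, end) frame indices for each sample.

    Frames [start, split) are the observed past and [split, end) the
    future target. The stride value is an assumption.
    """
    window = PAST + FUTURE
    return [(s, s + PAST, s + window)
            for s in range(0, num_frames - window + 1, stride)]

# A 10-second possession (250 frames) with a 1-second stride
windows = sliding_windows(250, stride=25)
```

A sequence shorter than 200 frames yields no samples, which is consistent with the 8-second minimum on attack sequences.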

Experimental Results

Comparison with baseline models

Diffoot was evaluated against trajectory prediction baselines using Best-of-20 sampling (minADE₂₀, minFDE₂₀, etc.) for stochastic models.
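
For readers unfamiliar with best-of-N evaluation, the sketch below computes minADE and minFDE for one player from a set of sampled trajectories. It uses the common convention of taking each minimum independently over the samples; the paper's exact aggregation across players and scenes is not described here, so treat this as an illustration.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance per frame."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: distance at the last frame."""
    return math.dist(pred[-1], truth[-1])

def best_of_n(samples, truth):
    """Return (minADE, minFDE) over the candidate trajectories."""
    return (min(ade(s, truth) for s in samples),
            min(fde(s, truth) for s in samples))

truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # constant 1 m offset
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.5)],   # exact until the last frame
]
min_ade, min_fde = best_of_n(samples, truth)
```

In this toy case the second sample wins on ADE and the first on FDE, which shows why the two minima can come from different samples.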

Performance Comparison

Model	ADE ↓	FDE ↓	Fréchet ↓	Direction Error ↓
Vanilla-LSTM	9.724 ± 2.17	12.095 ± 3.85	12.834 ± 3.73	87.01° ± 41.4°
Transformer	8.276 ± 1.25	9.669 ± 1.94	10.632 ± 1.88	57.26° ± 33.8°
Social-LSTM	4.075 ± 1.55	7.711 ± 3.24	7.778 ± 3.20	85.80° ± 21.0°
Social-GAN	3.603 ± 1.04	6.028 ± 2.22	6.706 ± 2.08	64.16° ± 19.4°
Social-STGCNN	3.478 ± 0.92	5.799 ± 2.05	6.239 ± 1.89	51.86° ± 30.3°
Diffoot (Ours)	3.425 ± 0.97	6.179 ± 1.98	6.479 ± 1.98	38.05° ± 20.7°

Lower is better for every metric (↓). ADE/FDE in meters, Direction Error in degrees.

Diffoot Wins: ADE & Direction

Best ADE (3.425m) — most accurate average position prediction.
Best Direction Error (38.05°) — captures movement direction far better than baselines.

Social-STGCNN Wins: FDE & Fréchet

Slightly better endpoint accuracy (FDE) and worst-case deviation (Fréchet), but much worse at capturing direction of movement.

🎯 Key Insight: Direction Matters

The 38° vs 52° direction error gap is significant. Social-STGCNN uses a fixed graph structure that can't adapt to changing tactical situations. Diffoot's heterogeneous graph with attention-weighted edges captures which interactions matter right now, leading to better directional predictions.

Qualitative Observations

• Social-LSTM & Social-GAN: Systematically underpredict trajectory lengths
• Social-STGCNN: Unstable, noisy outputs; fails to capture coherent patterns
• Diffoot: Trajectory lengths and directions closely match ground truth

Limitations & Future Directions

What the authors acknowledge and where research could go

Paper-Acknowledged Limitations

1. Limited Generalization

Trained on only 6 Bundesliga matches from 2022/23. Not validated on other leagues, playing styles, or seasons.

Improvement: Train on multi-league data (Premier League, La Liga, Serie A) to learn style-invariant representations.

2. High Computational Cost

Diffusion models are slow. Even with DDIM (50 steps), inference is too slow for real-time tactical analysis during live matches.

Improvement: Consistency distillation, latent diffusion, or faster samplers (DPM-Solver++).

3. Missing Event Information

The graph doesn't include event data (passes, shots, tackles) or player-specific tendencies (dribbling skill, passing accuracy).

Improvement: Integrate event stream as additional conditioning; add player embeddings learned from historical data.

4. Single-Point Prediction Focus

The model predicts trajectories but doesn't explicitly output a probabilistic distribution over future positions at each timestep.

Improvement: Add uncertainty quantification heads; visualize prediction confidence over the pitch.

Additional Improvement Opportunities

1. Longer Prediction Horizons

Currently limited to 4 seconds. Extending to 6-10 seconds would enable analysis of full attacking sequences and set pieces.

• Hierarchical diffusion: coarse trajectory → fine trajectory
• Multi-scale temporal encoding
• Autoregressive extension with re-conditioning

2. Bidirectional Prediction

Currently predicts only defenders given attackers. Could extend to predict both teams jointly, or predict attackers given defenders.

• Joint attacker-defender prediction for full match simulation
• Counterfactual: "What if defender X had positioned here?"

3. Tactical Guidance (Classifier-Free)

Add controllable generation: "Generate trajectories where team presses high" or "Show defensive shape against counter-attack."

• Train with tactical labels (formation, pressing intensity)
• Classifier-free guidance at inference
• Enables what-if scenario exploration

4. Ball Trajectory Integration

Currently the ball is a node in the conditioning graph, but its future position isn't predicted. Joint player-ball prediction would be more complete.

• Ball as additional output channel
• Physics-informed ball motion (passes, shots)
• Condition on predicted pass/shot events

5. Physics Constraints

Like other neural methods, Diffoot can predict physically impossible movements. Adding physics-informed losses would improve realism.

• Max acceleration: ~5 m/s² for elite players
• Max sustained speed: ~9.5 m/s
• Collision avoidance between players
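
As a simpler stand-in for a physics-informed loss, the speed and acceleration limits above can be checked after generation with finite differences. The sketch below flags a trajectory that exceeds either limit at 25 FPS; it is a post-hoc filter rather than a training loss, and `violates_limits` is a hypothetical helper.

```python
import math

def violates_limits(traj, fps=25, max_speed=9.5, max_accel=5.0):
    """traj: list of (x, y) positions in meters, one per frame.

    Speeds and accelerations are estimated with finite differences.
    """
    dt = 1.0 / fps
    vels = [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]
    speeds = [math.hypot(vx, vy) for vx, vy in vels]
    accels = [math.hypot(vx1 - vx0, vy1 - vy0) / dt
              for (vx0, vy0), (vx1, vy1) in zip(vels, vels[1:])]
    return (any(s > max_speed for s in speeds)
            or any(a > max_accel for a in accels))

steady_run = [(0.2 * i, 0.0) for i in range(10)]   # 5 m/s, constant
teleport = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]    # 25 m/s between frames
```

The same finite-difference quantities could be turned into a penalty term during training, which is closer to what the bullet points suggest.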

Diffoot vs. TranSPORTmer vs. Other Approaches

Aspect	TranSPORTmer	Social-STGCNN	Diffoot
Architecture	Set Attention + Temporal Transformer	Spatio-Temporal Graph Conv	Heterogeneous GAT + Diffusion
Output Type	Point prediction	Point prediction	Multi-modal samples
Prediction Horizon	Configurable	Short (1-2s typical)	4 seconds
Graph Type	Implicit (attention)	Fixed distance-based	Heterogeneous (multiple edge types)
Multi-Task	Yes (forecast + impute + classify)	No	No (prediction only)
Inference Speed	Fast (~5ms)	Fast (~10ms)	Slower (50 DDIM steps)
Direction Accuracy	Good	Moderate (51.9°)	Best (38.1°)
Best Use Case	Real-time multi-task	Quick baseline	Tactical analysis, counterfactuals

When to Use Each

TranSPORTmer: Real-time applications, multi-task needs (forecasting + imputation + classification)

Social-STGCNN: Quick prototyping, limited compute, when direction isn't critical

Diffoot: Post-match tactical analysis, opponent scouting, counterfactual exploration, highest direction accuracy

Resources & Further Reading

Read the Diffoot Paper

ACM Digital Library

Official Code Repository

GitHub - minsuh99