Football tactics planning can be broken down into three steps:
- Set attacking positions — where should our players be?
- Predict defending response — how will opponents react?
- Adjust attacking plan — optimize based on predicted defense
Diffoot focuses on Step 2: Given the positions and movements of both teams over the past 4 seconds, predict where the defending team's players will be over the next 4 seconds. This 4+4 second window (200 frames at 25 FPS) enables tactical planning at the possession level.
The paper identifies three key limitations of existing trajectory prediction models (Social-LSTM, Social-GAN, AgentFormer, etc.) for football:
Most models are designed for pedestrians or vehicles with no spatial constraints. Football players must stay within the pitch boundaries.
Models trained on road traffic or crowd behavior don't capture football-specific movements: pressing, marking, covering, off-ball runs.
They don't model the structured interactions in football: attacker-defender marking, ball-player relationships, teammate coordination.
Diffoot's answer is to combine a Heterogeneous Graph Neural Network (to model attacker-defender-ball interactions) with a Conditional Diffusion Model (to generate diverse, realistic trajectories). The graph provides rich context; the diffusion model captures multi-modal uncertainty.
Diffoot has two core components that work together:
The heterogeneous graph encoder models relationships between players and the ball using a multi-relation graph, with a different edge type for each kind of interaction:
- Attacker ↔ Attacker (passing opportunities)
- Defender ↔ Defender (coordination)
- Attacker ↔ Defender (marking pressure)
- Player ↔ Ball (possession/proximity)
- Temporal edges (same player across frames)
The conditional diffusion model generates future trajectories by denoising random noise, conditioned on the graph embeddings from the encoder:
- 1000 noise steps (cosine schedule)
- DDIM sampling with 50 steps at inference
- v-prediction loss for stable training
- Cross-attention + FiLM conditioning
Unlike simple distance-based graphs, Diffoot constructs a heterogeneous graph with multiple node types and edge types that capture football-specific interactions.
Node Types & Features
Ball node:
- Position: (x, y)
- Velocity: (vx, vy)
- Other features: -1 (placeholder)
Attacker and defender nodes (the same features are listed for each):
- Position: (x, y)
- Velocity: (vx, vy)
- Distance to ball
- Distance to goal
- Possession time τ
Node type is appended as the last feature dimension so the GNN can distinguish between player types.
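As a rough illustration of the feature layout described above, the sketch below assembles one feature vector per node. The feature order, the numeric type codes, and the function names are assumptions made for this example; the source only states which features exist and that the node type is appended last.

```python
import math

# Hypothetical type codes; the source only says the node type is the last feature.
BALL, ATTACKER, DEFENDER = 0.0, 1.0, 2.0


def player_features(pos, vel, ball_pos, goal_pos, possession_time, node_type):
    """Return [x, y, vx, vy, dist_to_ball, dist_to_goal, tau, node_type]."""
    return [
        pos[0], pos[1], vel[0], vel[1],
        math.dist(pos, ball_pos),   # distance to ball in meters
        math.dist(pos, goal_pos),   # distance to goal in meters
        possession_time,            # tau, seconds in possession
        node_type,
    ]


def ball_features(pos, vel):
    """Ball nodes reuse the same layout, with -1 placeholders in unused slots."""
    return [pos[0], pos[1], vel[0], vel[1], -1.0, -1.0, -1.0, BALL]


attacker = player_features((3.0, 4.0), (1.0, 0.0), (0.0, 0.0), (3.0, 0.0), 0.8, ATTACKER)
ball = ball_features((0.0, 0.0), (2.0, 1.0))
```

Keeping every node type at the same width lets a single tensor hold all nodes, with the last column telling the GNN which type each row is.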
Edge Types & Weights
Edges are weighted by W = W_dist + W_situation, where W_dist captures proximity and W_situation captures football-specific context. Edges whose weight falls below a threshold are removed.
Attacker ↔ Attacker: W_situation reflects passing opportunities. If one player has the ball (τ > 0), the edge weight increases with passing lane openness.
Defender ↔ Defender: W_situation = 0 (no special context), so only distance-based weighting is used for coordination.
Attacker ↔ Defender: W_situation reflects defensive pressure intensity, i.e. how aggressively the defender is closing down the attacker.
Player ↔ Ball: Distance to ball is weighted more heavily than player-player distance. For defenders, W_situation considers approach speed toward the ball.
Temporal: The same node is connected across consecutive frames (t → t+1), which lets the GNN learn how individual players' states evolve over time.
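The weighting rule W = W_dist + W_situation with threshold pruning can be sketched as below. The exponential form of W_dist, the `scale` and `threshold` values, and the function names are all assumptions for illustration; the source does not give the exact formulas.

```python
import math


def distance_weight(p, q, scale=10.0):
    """Assumed form of W_dist: decays exponentially with distance in meters."""
    return math.exp(-math.dist(p, q) / scale)


def build_edges(positions, situation, threshold=0.3):
    """Return {(i, j): weight} for node pairs whose combined weight meets the threshold.

    positions: list of (x, y) tuples.
    situation: dict mapping (i, j) -> W_situation; missing pairs count as 0,
               which matches the defender-defender case described above.
    """
    edges = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            w = distance_weight(positions[i], positions[j]) + situation.get((i, j), 0.0)
            if w >= threshold:  # weak edges are pruned
                edges[(i, j)] = w
    return edges


positions = [(0.0, 0.0), (5.0, 0.0), (40.0, 0.0)]
# Nodes 0 and 2 are far apart, but a situational bonus keeps their edge alive.
edges = build_edges(positions, situation={(0, 2): 0.5})
```

In this toy case the distant pair (1, 2) is dropped, while (0, 2) survives only because of its situational term, which is the behavior a purely distance-based graph cannot reproduce.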
Standard approach: Connect all players within distance threshold, uniform edge weights.
Diffoot approach: Different edge types for different tactical relationships, with weights that encode football semantics (passing lanes, pressing intensity, ball proximity).
This richer graph structure is credited for Diffoot's advantage over Social-STGCNN and the other baselines on ADE and direction error; Social-STGCNN, which uses a simpler graph construction, remains ahead on FDE and Fréchet distance.
The heterogeneous graph is processed by a Graph Attention Network (GAT) to produce a conditioning vector for the diffusion model. That vector is meant to summarize:
- Current tactical configuration
- Who is pressing whom
- Passing lane availability
- Ball possession context
- Team formations & spacing
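For readers unfamiliar with graph attention, the sketch below shows the basic single-head GAT aggregation on a toy graph. It is a generic illustration, not Diffoot's encoder: it assumes the features are already linearly projected, and it leaves out the per-edge-type parameters and edge weights that a heterogeneous model would add. All names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_aggregate(h, neighbors, a_src, a_dst):
    """Single-head GAT aggregation over already-projected node features.

    h: list of feature vectors, one per node.
    neighbors: dict node -> list of neighbor ids (include the node itself for a self-loop).
    a_src, a_dst: the two halves of the attention vector.
    """
    out = []
    for i, feats in enumerate(h):
        nbrs = neighbors[i]
        # Unnormalized attention score for each neighbor j of node i.
        scores = [_leaky_relu(_dot(a_src, feats) + _dot(a_dst, h[j])) for j in nbrs]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of neighbor features.
        out.append([sum(a * h[j][d] for a, j in zip(alphas, nbrs))
                    for d in range(len(feats))])
    return out


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [1], 2: [0, 1, 2]}
uniform = gat_aggregate(h, neighbors, a_src=[0.0, 0.0], a_dst=[0.0, 0.0])
```

With a zero attention vector every neighbor gets equal weight, so the output reduces to a neighborhood mean; a learned attention vector shifts that weight toward the interactions that matter most.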
Forward Process: Adding Noise
Diffoot uses a cosine noise schedule rather than a linear one, for more stable training.
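A minimal sketch of a cosine schedule and the forward noising step follows. It assumes Diffoot uses the common cosine form from Nichol & Dhariwal with offset s = 0.008; the source names the schedule but does not give its formula, so the exact parameters here are assumptions.

```python
import math


def cosine_alpha_bar(num_steps=1000, s=0.008):
    """Cumulative signal level alpha_bar_k for k = 0..num_steps (assumed cosine form)."""
    def f(k):
        return math.cos((k / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    f0 = f(0)
    return [f(k) / f0 for k in range(num_steps + 1)]


def add_noise(x0, eps, alpha_bar_k):
    """Forward process sample: x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps."""
    a = math.sqrt(alpha_bar_k)
    b = math.sqrt(1.0 - alpha_bar_k)
    return [a * x + b * e for x, e in zip(x0, eps)]


alpha_bar = cosine_alpha_bar()  # 1000 noise steps, as in the summary above
```

Compared with a linear schedule, the cosine curve removes signal more gradually at the start and end of the chain, which is the usual argument for its training stability.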
Denoising Network Architecture
The denoising network is based on CSDI (Conditional Score-based Diffusion for Imputation), modified with LinFormer for efficient computation:
Timestep k is converted to embedding k_emb and added to each residual block input.
Graph condition c is injected via Cross-Attention and FiLM (Feature-wise Linear Modulation), not simple addition.
H'' = γ(c) × H' + β(c) # FiLM: scale and shift
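The FiLM step can be sketched in a few lines. The tiny `linear` helper standing in for the networks that produce γ(c) and β(c), and all numeric values, are assumptions for illustration only.

```python
def film(h, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]


def linear(c, weight, bias):
    """Toy dense layer used here to derive gamma(c) and beta(c) from the condition c."""
    return [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(weight, bias)]


c = [1.0, 2.0]  # graph conditioning vector (toy values)
gamma = linear(c, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
beta = linear(c, [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.5, 0.5, 0.5])
h_mod = film([10.0, 10.0, 10.0], gamma, beta)
```

Because the condition controls a per-channel scale as well as a shift, it can amplify or suppress individual features, which plain addition of c cannot do.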
LinFormer (efficient transformer) processes temporal and spatial features separately, then combines them.
y_t = LinFormer_temporal(Z) # Along time axis
y_f = LinFormer_spatial(Z) # Along player axis
H = Combine(y_t, y_f)
A gating mechanism filters useful information before the final prediction.
output = gate ⊙ tanh(Conv(H))
Loss Function
Diffoot uses v-prediction (not ε-prediction) for more stable training, plus an NLL term for variance prediction.
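The sketch below shows the standard v-parameterization (as introduced by Salimans & Ho), assuming Diffoot follows that definition. It covers only the v-prediction part of the loss; the NLL variance term is omitted because the source does not give its form.

```python
import math


def v_target(x0, eps, alpha_bar):
    """Training target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def x0_from_v(x_k, v, alpha_bar):
    """Recover the clean signal from a noisy sample and a v prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x - b * vi for x, vi in zip(x_k, v)]


def eps_from_v(x_k, v, alpha_bar):
    """Recover the noise from a noisy sample and a v prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [b * x + a * vi for x, vi in zip(x_k, v)]


def mse(pred, target):
    """Mean squared error between a predicted v and the target v."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

Since both x0 and ε can be recovered from v, a v-predicting network can still feed a sampler that is written in terms of predicted noise.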
Inference: DDIM Sampling
At inference, Diffoot uses DDIM with only 50 steps instead of 1000, which means roughly 20× fewer denoising passes than full DDPM sampling.
η = 0.2 adds slight randomness for diversity while keeping trajectories coherent.
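A single DDIM update with stochasticity η can be sketched as follows, using the standard DDIM formula. The evenly spaced timestep selection and the function names are assumptions; the step takes a predicted noise ε, which for a v-predicting network would first be derived from v.

```python
import math
import random


def ddim_timesteps(train_steps=1000, sample_steps=50):
    """Evenly spaced subsequence of training timesteps, highest first (assumed spacing)."""
    stride = train_steps // sample_steps
    return list(range(train_steps - 1, -1, -stride))


def ddim_step(x_t, eps_pred, ab_t, ab_prev, eta=0.2, rng=random):
    """One DDIM update from signal level ab_t to the less noisy level ab_prev.

    Requires 0 < ab_t < ab_prev <= 1. eta = 0 is fully deterministic;
    eta = 1 matches DDPM-style ancestral sampling.
    """
    sigma = eta * math.sqrt((1 - ab_prev) / (1 - ab_t)) * math.sqrt(1 - ab_t / ab_prev)
    dir_coeff = math.sqrt(max(1 - ab_prev - sigma ** 2, 0.0))
    out = []
    for x, e in zip(x_t, eps_pred):
        x0_pred = (x - math.sqrt(1 - ab_t) * e) / math.sqrt(ab_t)  # predicted clean value
        noise = rng.gauss(0.0, 1.0) if sigma > 0 else 0.0          # fresh noise, scaled by sigma
        out.append(math.sqrt(ab_prev) * x0_pred + dir_coeff * e + sigma * noise)
    return out
```

With η = 0.2 the fresh-noise term is small relative to the deterministic part, which is consistent with the stated goal of adding diversity while keeping trajectories coherent.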
Instead of predicting absolute future positions, Diffoot predicts relative displacements from the last observed frame. This makes learning easier and improves generalization.
Y_pred = Y_T + Ŷ_rel # Convert back to absolute
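The conversion between absolute positions and displacements relative to the last observed frame Y_T can be sketched as below for a single player; the function names are hypothetical.

```python
def to_relative(future, last_observed):
    """Y_rel = Y_future - Y_T for every future frame; positions are (x, y) tuples."""
    x_t, y_t = last_observed
    return [(x - x_t, y - y_t) for x, y in future]


def to_absolute(relative, last_observed):
    """Y_pred = Y_T + Y_rel, converting predicted displacements back to pitch coordinates."""
    x_t, y_t = last_observed
    return [(dx + x_t, dy + y_t) for dx, dy in relative]


last_observed = (10.0, 5.0)
future = [(10.5, 5.0), (11.0, 5.5)]
relative = to_relative(future, last_observed)
restored = to_absolute(relative, last_observed)
```

Working in displacements means the same run looks identical to the model wherever it happens on the pitch, which is one plausible reason for the generalization benefit mentioned above.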
Diffoot was trained on official Bundesliga tracking data from the German Football League (DFL), captured using ChyronHego's TRACAB system at 25 FPS.
| Match ID | Frames | Samples | Split |
|---|---|---|---|
| DFL-MAT-J03WOH | 137,214 | 2,297 | Train |
| DFL-MAT-J03WOY | 142,536 | 2,468 | Train |
| DFL-MAT-J03WPY | 146,211 | 3,056 | Train |
| DFL-MAT-J03WQQ | 142,345 | 2,462 | Train |
| DFL-MAT-J03WMX | 145,967 | 2,203 | Validation |
| DFL-MAT-J03WR9 | 146,810 | 2,203 | Test |
Total: 6 matches and 14,689 samples after preprocessing (10,283 train, 2,203 validation, 2,203 test), i.e. a 70:15:15 train/val/test split.
- Extract attack sequences ≥8 seconds without possession changes
- Sliding window: 4s past + 4s future = 200 frames per sample
- Linear interpolation for missing positions (occlusion)
- Z-score normalization on position/velocity features
- 70% random vertical flip for data augmentation
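The sliding-window step from the list above can be sketched as follows. The one-second stride is an assumption (the source does not state it), and frames are represented by plain indices for brevity.

```python
FPS = 25
PAST = 4 * FPS     # 100 observed frames
FUTURE = 4 * FPS   # 100 frames to predict


def sliding_windows(sequence, stride=FPS):
    """Split one attack sequence (a list of frames) into (past, future) pairs.

    Sequences shorter than 8 seconds (200 frames) yield no samples.
    """
    window = PAST + FUTURE
    samples = []
    for start in range(0, len(sequence) - window + 1, stride):
        past = sequence[start:start + PAST]
        future = sequence[start + PAST:start + window]
        samples.append((past, future))
    return samples


frames = list(range(10 * FPS))  # a 10-second sequence, frames stood in for by indices
samples = sliding_windows(frames)
```

A 10-second sequence with a one-second stride yields three overlapping samples, which shows how a single match can produce a few thousand samples.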
Diffoot was evaluated against trajectory prediction baselines using Best-of-20 sampling (minADE₂₀, minFDE₂₀, etc.) for stochastic models.
| Model | ADE ↓ | FDE ↓ | Fréchet ↓ | Direction Error ↓ |
|---|---|---|---|---|
| Vanilla-LSTM | 9.724 ± 2.17 | 12.095 ± 3.85 | 12.834 ± 3.73 | 87.01° ± 41.4° |
| Transformer | 8.276 ± 1.25 | 9.669 ± 1.94 | 10.632 ± 1.88 | 57.26° ± 33.8° |
| Social-LSTM | 4.075 ± 1.55 | 7.711 ± 3.24 | 7.778 ± 3.20 | 85.80° ± 21.0° |
| Social-GAN | 3.603 ± 1.04 | 6.028 ± 2.22 | 6.706 ± 2.08 | 64.16° ± 19.4° |
| Social-STGCNN | 3.478 ± 0.92 | **5.799 ± 2.05** | **6.239 ± 1.89** | 51.86° ± 30.3° |
| Diffoot (Ours) | **3.425 ± 0.97** | 6.179 ± 1.98 | 6.479 ± 1.98 | **38.05° ± 20.7°** |
Bold = best. ADE, FDE, and Fréchet distance are in meters; Direction Error is in degrees.
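The Best-of-20 protocol behind the ADE and FDE columns can be sketched as below for one player. Taking the minimum of each metric independently over the samples is a common convention and an assumption here; the source does not say whether both minima must come from the same sample.

```python
import math


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all future frames."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last future frame."""
    return math.dist(pred[-1], truth[-1])


def best_of_k(samples, truth):
    """Return (minADE_k, minFDE_k) over k sampled trajectories."""
    return (min(ade(s, truth) for s in samples),
            min(fde(s, truth) for s in samples))


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # parallel track, 1 m off throughout
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],   # accurate until the last frame
]
min_ade, min_fde = best_of_k(samples, truth)
```

Best-of-k scoring rewards a stochastic model for covering the true outcome with at least one sample, so it is not directly comparable with the single prediction of a deterministic model.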
Diffoot has the best ADE (3.425 m), i.e. the most accurate average position prediction.
Diffoot also has the best Direction Error (38.05°), capturing movement direction far better than the baselines.
Social-STGCNN has slightly better endpoint accuracy (FDE, 5.799 m vs 6.179 m) and worst-case deviation (Fréchet, 6.239 m vs 6.479 m), but is much worse at capturing direction of movement (51.86° vs 38.05°).
The direction error gap (38.05° vs 51.86°) is the largest difference between the two models, although the reported standard deviations (±20.7° and ±30.3°) are wide. Social-STGCNN uses a fixed graph structure that can't adapt to changing tactical situations. Diffoot's heterogeneous graph with attention-weighted edges captures which interactions matter right now, leading to better directional predictions.
- Social-LSTM & Social-GAN: Systematically underpredict trajectory lengths
- Social-STGCNN: Unstable, noisy outputs; fails to capture coherent patterns
- Diffoot: Trajectory lengths and directions closely match ground truth
Paper-Acknowledged Limitations
Trained on only 6 Bundesliga matches from 2022/23. Not validated on other leagues, playing styles, or seasons.
Diffusion models are slow. Even with DDIM (50 steps), inference is too slow for real-time tactical analysis during live matches.
The graph doesn't include event data (passes, shots, tackles) or player-specific tendencies (dribbling skill, passing accuracy).
The model predicts trajectories but doesn't explicitly output a probabilistic distribution over future positions at each timestep.
Additional Improvement Opportunities
Prediction is currently limited to 4 seconds. Extending to 6-10 seconds would enable analysis of full attacking sequences and set pieces.
- Hierarchical diffusion: coarse trajectory → fine trajectory
- Multi-scale temporal encoding
- Autoregressive extension with re-conditioning
Diffoot currently predicts only defenders given attackers. It could be extended to predict both teams jointly, or to predict attackers given defenders.
- Joint attacker-defender prediction for full match simulation
- Counterfactual: "What if defender X had positioned here?"
Add controllable generation: "Generate trajectories where team presses high" or "Show defensive shape against counter-attack."
- Train with tactical labels (formation, pressing intensity)
- Classifier-free guidance at inference
- Enables what-if scenario exploration
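Classifier-free guidance, mentioned in the list above, combines a conditional and an unconditional prediction from the same network. The sketch below shows the standard combination rule; how tactical labels would be encoded, and the choice of guidance scale, are open questions that the source does not address.

```python
def guided_prediction(pred_cond, pred_uncond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward,
    and for scales above 1 beyond, the conditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


pred_cond = [1.0, 2.0]     # toy prediction given a tactical label, e.g. a high press
pred_uncond = [0.0, 1.0]   # toy prediction with the label dropped
strong = guided_prediction(pred_cond, pred_uncond, guidance_scale=2.0)
```

A scale of 1 reproduces the conditional prediction, while larger scales exaggerate the effect of the label at some cost to sample diversity.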
Currently the ball is a node in the conditioning graph, but its future position isn't predicted. Joint player-ball prediction would be more complete.
- Ball as additional output channel
- Physics-informed ball motion (passes, shots)
- Condition on predicted pass/shot events
Like other neural methods, Diffoot can predict physically impossible movements. Adding physics-informed losses would improve realism.
- Max acceleration: ~5 m/s² for elite players
- Max sustained speed: ~9.5 m/s
- Collision avoidance between players
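One way such a physics-informed term could look is a hinge penalty on speed and acceleration computed by finite differences. This is a sketch of the idea rather than anything proposed in the paper: the limits come from the list above, while the squared-hinge form and the function name are assumptions, and collision avoidance is left out.

```python
import math

FPS = 25
MAX_SPEED = 9.5   # m/s, from the list above
MAX_ACCEL = 5.0   # m/s^2, from the list above


def kinematic_penalty(track, dt=1.0 / FPS):
    """Sum of squared violations of the speed and acceleration limits.

    track: list of (x, y) positions in meters for one player, one per frame.
    """
    # Finite-difference velocities between consecutive frames.
    vel = [((x1 - x0) / dt, (y1 - y0) / dt)
           for (x0, y0), (x1, y1) in zip(track, track[1:])]
    # Finite-difference accelerations between consecutive velocities.
    acc = [((vx1 - vx0) / dt, (vy1 - vy0) / dt)
           for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])]
    speed_pen = sum(max(math.hypot(vx, vy) - MAX_SPEED, 0.0) ** 2 for vx, vy in vel)
    accel_pen = sum(max(math.hypot(ax, ay) - MAX_ACCEL, 0.0) ** 2 for ax, ay in acc)
    return speed_pen + accel_pen


steady_run = [(0.2 * i, 0.0) for i in range(10)]   # 5 m/s in a straight line
teleport = [(0.0, 0.0), (1.0, 0.0)]                # 1 m in one frame = 25 m/s
```

A plausible trajectory scores zero, so the term only influences training when a prediction breaks the limits.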
| Aspect | TranSPORTmer | Social-STGCNN | Diffoot |
|---|---|---|---|
| Architecture | Set Attention + Temporal Transformer | Spatio-Temporal Graph Conv | Heterogeneous GAT + Diffusion |
| Output Type | Point prediction | Point prediction | Multi-modal samples |
| Prediction Horizon | Configurable | Short (1-2s typical) | 4 seconds |
| Graph Type | Implicit (attention) | Fixed distance-based | Heterogeneous (multiple edge types) |
| Multi-Task | Yes (forecast + impute + classify) | No | No (prediction only) |
| Inference Speed | Fast (~5ms) | Fast (~10ms) | Slower (50 DDIM steps) |
| Direction Accuracy | Good | Moderate (51.9°) | Best (38.1°) |
| Best Use Case | Real-time multi-task | Quick baseline | Tactical analysis, counterfactuals |
TranSPORTmer: Real-time applications, multi-task needs (forecasting + imputation + classification)
Social-STGCNN: Quick prototyping, limited compute, when direction isn't critical
Diffoot: Post-match tactical analysis, opponent scouting, counterfactual exploration, highest direction accuracy
- DDPM (Ho et al., 2020): Denoising Diffusion Probabilistic Models
- DDIM (Song et al., 2021): Faster deterministic sampling
- CSDI (Tashiro et al., 2021): Conditional Score-based Diffusion for Imputation
- Social-LSTM (Alahi et al., 2016): Social pooling for pedestrians
- Social-GAN (Gupta et al., 2018): GAN for multi-agent trajectories
- Social-STGCNN (Mohamed et al., 2020): Graph convolutions for trajectories