Research Deep Dive · ACM 2025
Diffoot: Graph-Conditioned Diffusion for Football Trajectories
A diffusion-based model that predicts defending team movements over the next 4 seconds, using heterogeneous graphs to capture attacker-defender-ball interactions from Bundesliga tracking data.
Diffusion Models · Heterogeneous Graphs · 4s Prediction Horizon · Tactical Analysis · 40 min read
Source: German Football League (DFL) Bundesliga Data
The Tactical Problem

Football tactical planning can be broken down into three steps:

  1. Set attacking positions — where should our players be?
  2. Predict defending response — how will opponents react?
  3. Adjust attacking plan — optimize based on predicted defense

Diffoot focuses on Step 2: Given the positions and movements of both teams over the past 4 seconds, predict where the defending team's players will be over the next 4 seconds. This 4+4 second window (200 frames at 25 FPS) enables tactical planning at the possession level.
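The 4+4 windowing can be sketched as a fixed-stride slicer over a tracking array (the 1-second stride and the array shapes here are assumptions; the paper's exact sampling procedure may differ):

```python
import numpy as np

FPS = 25
PAST_F, FUTURE_F = 4 * FPS, 4 * FPS   # 100 past + 100 future = 200 frames

def make_windows(track, stride=FPS):
    """Slice a (T, players, 2) tracking array into (past, future) pairs."""
    windows = []
    for start in range(0, len(track) - PAST_F - FUTURE_F + 1, stride):
        past = track[start : start + PAST_F]
        future = track[start + PAST_F : start + PAST_F + FUTURE_F]
        windows.append((past, future))
    return windows

# 10 seconds of dummy positions for 22 players
track = np.zeros((10 * FPS, 22, 2))
pairs = make_windows(track)
```

With 250 frames of input, a 1-second stride yields three 8-second samples.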

Why Existing Methods Fall Short

The paper identifies three key limitations of existing trajectory prediction models (Social-LSTM, Social-GAN, Agentformer, etc.) for football:

❌ Unbounded Movement

Most models are designed for pedestrians or vehicles with no spatial constraints. Football players must stay within the pitch boundaries.

❌ Wrong Motion Patterns

Models trained on road traffic or crowd behavior don't capture football-specific movements: pressing, marking, covering, off-ball runs.

❌ Missing Interactions

They don't model the structured interactions in football: attacker-defender marking, ball-player relationships, teammate coordination.

Diffoot's Solution

Combine a Heterogeneous Graph Neural Network (to model attacker-defender-ball interactions) with a Conditional Diffusion Model (to generate diverse, realistic trajectories). The graph provides rich context; the diffusion model captures multi-modal uncertainty.

Diffoot Architecture: Two Main Components
Graph encoding + Conditional diffusion

Diffoot has two core components that work together:

Component 1: Heterogeneous Graph Encoder

Models relationships between players and ball using a multi-relation graph with different edge types for different interactions.

  • Attacker ↔ Attacker (passing opportunities)
  • Defender ↔ Defender (coordination)
  • Attacker ↔ Defender (marking pressure)
  • Player ↔ Ball (possession/proximity)
  • Temporal edges (same player across frames)
Component 2: Conditional Diffusion Model

Generates future trajectories by denoising random noise, conditioned on the graph embeddings from the encoder.

  • 1000 noise steps (cosine schedule)
  • DDIM sampling with 50 steps at inference
  • v-prediction loss for stable training
  • Cross-attention + FiLM conditioning
Heterogeneous Graph Construction
How Diffoot models player-ball interactions

Unlike simple distance-based graphs, Diffoot constructs a heterogeneous graph with multiple node types and edge types that capture football-specific interactions.

Node Types & Features

⚽ Ball Node

• Position: (x, y)

• Velocity: (vx, vy)

• Other features: -1 (placeholder)

🔴 Attacker Nodes

• Position: (x, y)

• Velocity: (vx, vy)

• Distance to ball

• Distance to goal

• Possession time τ

🔵 Defender Nodes

• Position: (x, y)

• Velocity: (vx, vy)

• Distance to ball

• Distance to goal

• Possession time τ

Node type is appended as the last feature dimension so the GNN can distinguish between player types.
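A minimal sketch of this feature layout (the feature ordering and the numeric type codes are assumptions; the source only specifies that ball-irrelevant features are -1 placeholders and that node type is the last dimension):

```python
import numpy as np

# node-type codes: an assumed encoding, not specified in the paper
BALL, ATTACKER, DEFENDER = 0, 1, 2

def node_features(pos, vel, d_ball, d_goal, tau, node_type):
    """Pack one node's feature vector; ball-irrelevant fields become -1."""
    if node_type == BALL:
        d_ball = d_goal = tau = -1.0  # placeholders, as described above
    return np.array([*pos, *vel, d_ball, d_goal, tau, node_type],
                    dtype=np.float32)

ball = node_features((52.5, 34.0), (3.0, 0.0), None, None, None, BALL)
att = node_features((50.0, 30.0), (1.0, 0.5), 4.7, 55.0, 1.2, ATTACKER)
```

Every node then has the same dimensionality, which lets a single GNN process the mixed node set while still distinguishing types.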

Edge Types & Weights

Edges are weighted by W = W_dist + W_situation, where W_dist captures proximity and W_situation captures football-specific context. Edges below threshold are removed.

Attacker ↔ Attacker Edges

W_situation reflects passing opportunities. If one player has the ball (τ > 0), the edge weight increases based on passing lane openness.

σ(A_i, A_j) = 1 / (1 + e^(-τ_i))   # Higher if player has ball longer
Defender ↔ Defender Edges

W_situation = 0 (no special context). Only distance-based weighting for coordination.

W(D_i, D_j) = W_dist(D_i, D_j)   # Pure proximity
Attacker ↔ Defender Edges

W_situation reflects defensive pressure intensity — how aggressively the defender is closing down the attacker.

σ(A_i, D_j) = max(0, -cos(θ)) × ||v_D||   # Approach speed × direction
Player ↔ Ball Edges

Distance to ball is weighted more heavily than player-player distance. For defenders, W_situation considers approach speed toward ball.

W_dist(P, B) weighted higher than W_dist(P_i, P_j)
Temporal Edges

Connect the same node across consecutive frames (t → t+1). This allows the GNN to learn how individual players' states evolve over time.

E_temporal = {(v_i,t → v_i,t+1) for all players across T frames}
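The edge-weighting scheme above can be sketched as follows. The exponential distance decay and the exact angle convention for pressing are assumptions; this summary gives only the functional forms, not the paper's precise definitions:

```python
import numpy as np

def w_dist(p_i, p_j, scale=10.0):
    """Proximity weight: decays with Euclidean distance (assumed form)."""
    return float(np.exp(-np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)) / scale))

def w_pressure(p_att, p_def, v_def):
    """Attacker-defender situation weight: defender approach speed toward attacker."""
    to_att = np.asarray(p_att, float) - np.asarray(p_def, float)
    dist = np.linalg.norm(to_att)
    speed = np.linalg.norm(v_def)
    if dist == 0 or speed == 0:
        return 0.0
    cos_theta = float(np.dot(v_def, to_att) / (dist * speed))
    return max(0.0, cos_theta) * speed  # 0 if the defender moves away

def edge_weight(w_d, w_s, delta=0.1):
    """Combine W_dist + W_situation; prune edges below the threshold."""
    w = w_d + w_s
    return w if w >= delta else 0.0

# defender at (5, 0) closing down vs. backing off from an attacker at (0, 0)
pressing = w_pressure((0.0, 0.0), (5.0, 0.0), np.array([-2.0, 0.0]))
retreating = w_pressure((0.0, 0.0), (5.0, 0.0), np.array([2.0, 0.0]))
```

A defender sprinting straight at the attacker gets weight equal to its speed; one retreating gets zero, so the edge can be pruned.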
🎯 Why This Graph Design Matters

Standard approach: Connect all players within distance threshold, uniform edge weights.

Diffoot approach: Different edge types for different tactical relationships, with weights that encode football semantics (passing lanes, pressing intensity, ball proximity).

This rich graph structure is what allows Diffoot to outperform Social-STGCNN and other baselines that use simpler graph constructions.

Graph Attention Encoding
Converting the graph into a conditioning vector

The heterogeneous graph is processed by a Graph Attention Network (GAT) to produce a conditioning vector for the diffusion model.

Encoding Pipeline
# 1. Embed position features (categorical → continuous)
h_i^(0) = Embedding(position_i) ⊕ other_features_i

# 2. Two layers of Graph Attention
for l in 1..2:
    α_ij = softmax(LeakyReLU(a^T [W·h_i || W·h_j]))
    h_i^(l) = σ(Σ_j α_ij · W · h_j^(l-1))

# 3. Attention Pooling → single condition vector
α_i = softmax(w_p^T · tanh(W_p · h_i^(L)))
c = Σ_i α_i · h_i^(L)   # Final conditioning vector
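Step 3 (attention pooling) in isolation, with random matrices standing in for the learned parameters W_p and w_p:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 23, 256                      # 22 players + ball, hidden dim per the paper

H = rng.standard_normal((N, D))     # node embeddings after the GAT layers
W_p = rng.standard_normal((D, D)) * 0.01
w_p = rng.standard_normal(D)

scores = np.tanh(H @ W_p) @ w_p     # one scalar relevance score per node
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                # softmax over nodes
c = alpha @ H                       # (D,) conditioning vector
```

The softmax makes α a distribution over nodes, so c is a learned weighted average: tactically salient nodes dominate the condition.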
Graph Encoder Hyperparameters
Hidden dimension: 256
Position embedding dim: 8
GAT layers: 2
Edge thresholds (δ_dist, δ_situation): 0.1, 0.05
What the Condition Encodes
  • Current tactical configuration
  • Who is pressing whom
  • Passing lane availability
  • Ball possession context
  • Team formations & spacing
Conditional Diffusion Model
The trajectory generation component

Forward Process: Adding Noise

Diffoot uses a cosine noise schedule (not linear) for more stable training:

f(k) = cos((k/K + s) / (1 + s) × π/2)²
ᾱ_k = f(k) / f(0)
β_k = 1 - ᾱ_k / ᾱ_{k-1}
# K = 1000 total steps, s = 0.008 (schedule correction)
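The schedule can be computed directly from these formulas; clipping β at 0.999 follows the common cosine-schedule recipe (an assumption, not stated in this summary):

```python
import numpy as np

K, s = 1000, 0.008                         # total steps, schedule correction

k = np.arange(K + 1)
f = np.cos(((k / K + s) / (1 + s)) * np.pi / 2) ** 2
alpha_bar = f / f[0]                       # ᾱ_k, decreasing from 1 toward 0
beta = 1 - alpha_bar[1:] / alpha_bar[:-1]  # β_k per step
beta = np.clip(beta, 0.0, 0.999)           # standard numerical safeguard
```

Unlike a linear schedule, ᾱ_k decays slowly at both ends, which keeps early steps nearly noise-free and avoids wasted steps near pure noise.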

Denoising Network Architecture

The denoising network is based on CSDI (Conditional Score-based Diffusion for Imputation), modified with LinFormer for efficient computation:

1. Diffusion Step Embedding

Timestep k is converted to embedding k_emb and added to each residual block input.

2. Condition Injection

Graph condition c is injected via Cross-Attention and FiLM (Feature-wise Linear Modulation) — not simple addition.

H' = CrossAttention(H, c)   # Query: H, Key/Value: c
H'' = γ(c) × H' + β(c)   # FiLM: scale and shift
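A shape-level sketch of the two conditioning mechanisms. All weights are random stand-ins, and the condition is given as a few tokens so the attention is non-degenerate; with the single pooled vector the paper uses, cross-attention reduces to a learned projection of c:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, M = 100, 256, 4             # frames, channels, condition tokens (M illustrative)
H = rng.standard_normal((T, D))   # denoiser hidden states
C = rng.standard_normal((M, D))   # condition tokens

Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))

# cross-attention: queries from H, keys/values from the condition
scores = (H @ Wq) @ (C @ Wk).T / np.sqrt(D)        # (T, M)
scores = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = scores / scores.sum(axis=1, keepdims=True)  # each row sums to 1
H1 = attn @ (C @ Wv)                               # (T, D)

# FiLM: the condition predicts a per-channel scale γ and shift β
W_gamma, W_beta = (rng.standard_normal((D, D)) * 0.05 for _ in range(2))
c = C.mean(axis=0)                # pooled condition vector
gamma, beta = W_gamma @ c, W_beta @ c
H2 = gamma * H1 + beta            # (T, D), modulated features
```

FiLM lets the condition rescale entire feature channels, while cross-attention lets each frame pull different information from the condition; the combination is stronger than adding c to every timestep.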
3. Spatiotemporal Processing

LinFormer (efficient transformer) processes temporal and spatial features separately, then combines them.

y_t = LinFormer_temporal(Z)   # Along time axis
y_f = LinFormer_spatial(Z)   # Along player axis
H = Combine(y_t, y_f)
4. Gated Output

A gating mechanism filters useful information before the final prediction.

gate = σ(Conv(H))   # Sigmoid gate
output = gate ⊙ tanh(Conv(H))
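With the 1×1 convolutions written as matrix multiplies over channels, the gating looks like this (random weights as stand-ins):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
H = rng.standard_normal((256, 100))                        # channels × frames
Wg, Wf = (rng.standard_normal((256, 256)) * 0.05 for _ in range(2))

gate = sigmoid(Wg @ H)          # values in (0, 1): how much passes through
out = gate * np.tanh(Wf @ H)    # bounded, gated features in (-1, 1)
```

This is the standard gated-activation unit (as in WaveNet-style residual blocks): the sigmoid branch decides what to keep, the tanh branch decides the content.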

Loss Function

Diffoot uses v-prediction (not ε-prediction) for more stable training, plus an NLL term for variance prediction:

# v-prediction target (better than noise prediction)
v = √ᾱ_k × ε - √(1-ᾱ_k) × x_0
L_simple = ||v - v_θ(x_k, k, c)||²
# Variance prediction for uncertainty
L_vlb = -log p_θ(x_{k-1} | x_k, c)
# Final loss
L_total = L_simple + λ × L_vlb   # λ = 0.001
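The v-parameterization has a useful identity: both x_0 and ε can be recovered linearly from (x_k, v), which is part of why it trains more stably than ε-prediction at high noise levels. A quick check:

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.standard_normal((100, 2))    # clean relative trajectory (frames × xy)
eps = rng.standard_normal(x0.shape)   # forward-process Gaussian noise
a_bar = 0.7                           # ᾱ_k for some step k

# forward-process sample and v-prediction target
x_k = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps
v = np.sqrt(a_bar) * eps - np.sqrt(1 - a_bar) * x0

# linear recovery of x0 and eps from (x_k, v)
x0_rec = np.sqrt(a_bar) * x_k - np.sqrt(1 - a_bar) * v
eps_rec = np.sqrt(1 - a_bar) * x_k + np.sqrt(a_bar) * v

# the simple loss is then an MSE against the network's v estimate
l_simple = np.mean((v - v) ** 2)      # zero for a perfect prediction
```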

Inference: DDIM Sampling

At inference, Diffoot uses DDIM with only 50 steps (not 1000), making it ~20× faster than DDPM:

# DDIM sampling (deterministic + faster)
x_{k-1} = √ᾱ_{k-1} × x̂_0 + √(1-ᾱ_{k-1}-σ²) × ε_θ + σ × z
# η controls stochasticity (η=0.2 in Diffoot)
σ = η × √((1-ᾱ_{k-1})/(1-ᾱ_k)) × √(1-ᾱ_k/ᾱ_{k-1})

η = 0.2 adds slight randomness for diversity while keeping trajectories coherent.
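A single DDIM update, written directly from the two formulas above (the function signature is mine, not the paper's):

```python
import numpy as np

def ddim_step(x_k, eps_pred, a_bar_k, a_bar_prev, eta=0.2, rng=None):
    """One DDIM update x_k → x_{k-1} given the model's predicted noise."""
    # estimate of the clean sample implied by the noise prediction
    x0_hat = (x_k - np.sqrt(1 - a_bar_k) * eps_pred) / np.sqrt(a_bar_k)
    # stochasticity: eta = 0 is deterministic, eta = 1 recovers DDPM-like noise
    sigma = (eta * np.sqrt((1 - a_bar_prev) / (1 - a_bar_k))
                 * np.sqrt(1 - a_bar_k / a_bar_prev))
    z = rng.standard_normal(x_k.shape) if rng is not None else 0.0
    return (np.sqrt(a_bar_prev) * x0_hat
            + np.sqrt(1 - a_bar_prev - sigma**2) * eps_pred
            + sigma * z)
```

With η = 0 and a perfect noise prediction, stepping to ᾱ = 1 recovers the clean sample exactly, which is the sanity check below.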

Training & Implementation Details
Model Hyperparameters
Noise steps K: 1000
Model channels: 256
Attention heads: 4
Residual blocks: 5
LinFormer compress dim: 32
DDIM inference steps: 50
DDIM η: 0.2
Training Configuration
Epochs: 30
Batch size: 16
Optimizer: AdamW
Learning rate: 1e-4
LR scheduler: ×0.75 on plateau
Hardware: NVIDIA L40S
Key Implementation Detail: Relative Trajectories

Instead of predicting absolute future positions, Diffoot predicts relative displacements from the last observed frame. This makes learning easier and improves generalization.

Ŷ_rel = Y_{T+1:T+T'} - Y_T   # Relative to last observed position
Y_pred = Y_T + Ŷ_rel   # Convert back to absolute
Y_pred = Y_T + Ŷ_rel   # Convert back to absolute
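The relative/absolute conversion is a two-liner:

```python
import numpy as np

def to_relative(past, future):
    """Express future positions as displacements from the last observed frame."""
    return future - past[-1]

def to_absolute(past, rel_pred):
    """Map predicted displacements back to absolute pitch coordinates."""
    return past[-1] + rel_pred

past = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])   # last observed at (2, 1)
future = np.array([[3.0, 1.5], [4.0, 2.0]])
rel = to_relative(past, future)
```

Because the model only ever sees displacements, two identical runs at opposite ends of the pitch become the same training example.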
Dataset: Bundesliga Tracking Data

Diffoot was trained on official Bundesliga tracking data from the German Football League (DFL), captured using Chyron Hego's TRACAB system at 25 FPS.

Dataset Statistics
Match ID       | Frames  | Samples | Split
DFL-MAT-J03WOH | 137,214 | 2,297   | Train
DFL-MAT-J03WOY | 142,536 | 2,468   | Train
DFL-MAT-J03WPY | 146,211 | 3,056   | Train
DFL-MAT-J03WQQ | 142,345 | 2,462   | Train
DFL-MAT-J03WMX | 145,967 | 2,203   | Validation
DFL-MAT-J03WR9 | 146,810 | 2,203   | Test

Total: 6 matches, 14,689 samples after preprocessing (the per-match counts above sum to 14,689). The 4/1/1 match split corresponds to roughly a 70:15:15 train/validation/test ratio by sample count.

Data Preprocessing
  • Extract attack sequences ≥8 seconds without possession changes
  • Sliding window: 4s past + 4s future = 200 frames per sample
  • Linear interpolation for missing positions (occlusion)
  • Z-score normalization on position/velocity features
  • 70% random vertical flip for data augmentation
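The normalization and flip-augmentation steps can be sketched as below; the 68 m pitch width and the helper names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
PITCH_WIDTH = 68.0   # assumed standard pitch width in metres

def zscore(x, mean, std):
    """Standardize a feature given dataset statistics."""
    return (x - mean) / std

def maybe_vflip(positions, p=0.7, rng=rng):
    """Flip y-coordinates across the pitch midline with probability p."""
    if rng.random() < p:
        flipped = positions.copy()
        flipped[..., 1] = PITCH_WIDTH - flipped[..., 1]
        return flipped
    return positions
```

The vertical flip works because football tactics are (approximately) symmetric about the long axis of the pitch, so it doubles the effective data without distorting motion patterns.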
Experimental Results
Comparison with baseline models

Diffoot was evaluated against trajectory prediction baselines using Best-of-20 sampling (minADE₂₀, minFDE₂₀, etc.) for stochastic models.

Performance Comparison
Model          | ADE ↓        | FDE ↓         | Fréchet ↓     | Direction Error ↓
Vanilla-LSTM   | 9.724 ± 2.17 | 12.095 ± 3.85 | 12.834 ± 3.73 | 87.01° ± 41.4°
Transformer    | 8.276 ± 1.25 | 9.669 ± 1.94  | 10.632 ± 1.88 | 57.26° ± 33.8°
Social-LSTM    | 4.075 ± 1.55 | 7.711 ± 3.24  | 7.778 ± 3.20  | 85.80° ± 21.0°
Social-GAN     | 3.603 ± 1.04 | 6.028 ± 2.22  | 6.706 ± 2.08  | 64.16° ± 19.4°
Social-STGCNN  | 3.478 ± 0.92 | 5.799 ± 2.05  | 6.239 ± 1.89  | 51.86° ± 30.3°
Diffoot (Ours) | 3.425 ± 0.97 | 6.179 ± 1.98  | 6.479 ± 1.98  | 38.05° ± 20.7°

Best per column: Diffoot for ADE and Direction Error; Social-STGCNN for FDE and Fréchet. ADE/FDE/Fréchet in meters, Direction Error in degrees; lower is better.

Diffoot Wins: ADE & Direction

Best ADE (3.425m) — most accurate average position prediction.
Best Direction Error (38.05°) — captures movement direction far better than baselines.

Social-STGCNN Wins: FDE & Fréchet

Slightly better endpoint accuracy (FDE) and worst-case deviation (Fréchet), but much worse at capturing direction of movement.

🎯 Key Insight: Direction Matters

The 38° vs 52° direction error gap is significant. Social-STGCNN uses a fixed graph structure that can't adapt to changing tactical situations. Diffoot's heterogeneous graph with attention-weighted edges captures which interactions matter right now, leading to better directional predictions.

Qualitative Observations
  • Social-LSTM & Social-GAN: Systematically underpredict trajectory lengths
  • Social-STGCNN: Unstable, noisy outputs; fails to capture coherent patterns
  • Diffoot: Trajectory lengths and directions closely match ground truth
Limitations & Future Directions
What the authors acknowledge and where research could go

Paper-Acknowledged Limitations

1. Limited Generalization

Trained on only 6 Bundesliga matches from 2022/23. Not validated on other leagues, playing styles, or seasons.

Improvement: Train on multi-league data (Premier League, La Liga, Serie A) to learn style-invariant representations.
2. High Computational Cost

Diffusion models are slow. Even with DDIM (50 steps), inference is too slow for real-time tactical analysis during live matches.

Improvement: Consistency distillation, latent diffusion, or faster samplers (DPM-Solver++).
3. Missing Event Information

The graph doesn't include event data (passes, shots, tackles) or player-specific tendencies (dribbling skill, passing accuracy).

Improvement: Integrate event stream as additional conditioning; add player embeddings learned from historical data.
4. Single-Point Prediction Focus

The model predicts trajectories but doesn't explicitly output a probabilistic distribution over future positions at each timestep.

Improvement: Add uncertainty quantification heads; visualize prediction confidence over the pitch.

Additional Improvement Opportunities

1. Longer Prediction Horizons

Currently limited to 4 seconds. Extending to 6-10 seconds would enable analysis of full attacking sequences and set pieces.

  • Hierarchical diffusion: coarse trajectory → fine trajectory
  • Multi-scale temporal encoding
  • Autoregressive extension with re-conditioning
2. Bidirectional Prediction

Currently predicts only defenders given attackers. Could extend to predict both teams jointly, or predict attackers given defenders.

  • Joint attacker-defender prediction for full match simulation
  • Counterfactual: "What if defender X had positioned here?"
3. Tactical Guidance (Classifier-Free)

Add controllable generation: "Generate trajectories where team presses high" or "Show defensive shape against counter-attack."

  • Train with tactical labels (formation, pressing intensity)
  • Classifier-free guidance at inference
  • Enables what-if scenario exploration
4. Ball Trajectory Integration

Currently ball is a node in the conditioning graph, but future ball position isn't predicted. Joint player-ball prediction would be more complete.

  • Ball as additional output channel
  • Physics-informed ball motion (passes, shots)
  • Condition on predicted pass/shot events
5. Physics Constraints

Like other neural methods, can predict physically impossible movements. Adding physics-informed losses would improve realism.

  • Max acceleration: ~5 m/s² for elite players
  • Max sustained speed: ~9.5 m/s
  • Collision avoidance between players
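One way to enforce the first two limits is a post-hoc projection of a predicted trajectory onto the speed and acceleration bounds (a sketch of the idea, not part of Diffoot):

```python
import numpy as np

DT = 1 / 25            # 25 FPS frame interval
V_MAX = 9.5            # m/s, elite sprint speed
A_MAX = 5.0            # m/s², elite acceleration

def clamp_physics(traj):
    """Project a predicted (T, 2) trajectory onto speed/acceleration limits."""
    out = traj.astype(float).copy()
    v_prev = np.zeros(2)
    for t in range(1, len(out)):
        v = (out[t] - out[t - 1]) / DT
        # limit acceleration first, then speed
        dv = v - v_prev
        a = np.linalg.norm(dv) / DT
        if a > A_MAX:
            v = v_prev + dv * (A_MAX / a)
        speed = np.linalg.norm(v)
        if speed > V_MAX:
            v = v * (V_MAX / speed)
        out[t] = out[t - 1] + v * DT
        v_prev = v
    return out
```

A soft penalty on violations during training would likely work better than hard projection at inference, since projection can distort the trajectory shape; the sketch only illustrates the constraint itself.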
Diffoot vs. TranSPORTmer vs. Other Approaches
Aspect             | TranSPORTmer                         | Social-STGCNN              | Diffoot
Architecture       | Set Attention + Temporal Transformer | Spatio-Temporal Graph Conv | Heterogeneous GAT + Diffusion
Output Type        | Point prediction                     | Point prediction           | Multi-modal samples
Prediction Horizon | Configurable                         | Short (1-2s typical)       | 4 seconds
Graph Type         | Implicit (attention)                 | Fixed distance-based       | Heterogeneous (multiple edge types)
Multi-Task         | Yes (forecast + impute + classify)   | No                         | No (prediction only)
Inference Speed    | Fast (~5ms)                          | Fast (~10ms)               | Slower (50 DDIM steps)
Direction Accuracy | Good                                 | Moderate (51.9°)           | Best (38.1°)
Best Use Case      | Real-time multi-task                 | Quick baseline             | Tactical analysis, counterfactuals
When to Use Each

TranSPORTmer: Real-time applications, multi-task needs (forecasting + imputation + classification)

Social-STGCNN: Quick prototyping, limited compute, when direction isn't critical

Diffoot: Post-match tactical analysis, opponent scouting, counterfactual exploration, highest direction accuracy

Resources & Further Reading
Key References from the Paper

DDPM (Ho et al., 2020) — Denoising Diffusion Probabilistic Models

DDIM (Song et al., 2021) — Faster deterministic sampling

CSDI (Tashiro et al., 2021) — Conditional Score-based Diffusion for Imputation

Social-LSTM (Alahi et al., 2016) — Social pooling for pedestrians

Social-GAN (Gupta et al., 2018) — GAN for multi-agent trajectories

Social-STGCNN (Mohamed et al., 2020) — Graph convolutions for trajectories

Key Takeaways
4-second horizon: Predicts defender movements for next 4s given past 4s
Heterogeneous graph: Multiple edge types for attacker-defender-ball interactions
Best direction accuracy: 38° vs 52° for Social-STGCNN
Multi-modal sampling: Generate diverse plausible futures via diffusion
DDIM inference: 50 steps (vs 1000 DDPM) for faster generation
Open source: Code available on GitHub
⚠️Limited data: Only 6 Bundesliga matches, generalization unverified
⚠️No events: Doesn't integrate pass/shot/tackle information