5. Spatiotemporal GNNs for Football
The ultimate architecture for football analytics: combining graph neural networks with temporal modeling to understand how player interactions evolve during a match.
The Challenge: Space AND Time

In the previous articles, we learned about GNNs for modeling spatial relationships between players, and RNNs for modeling temporal patterns. But football involves both simultaneously: player relationships that evolve over time.

Think about a through ball: to understand whether it will succeed, you need to know where the striker is relative to defenders (spatial), how fast everyone is moving (temporal), and how these relationships are changing: is the defender closing down, or is the striker pulling away? This interplay between space and time is what makes football so tactically rich, and it's exactly what STGNNs are designed to capture.

The Problem: Neither GNNs nor RNNs Alone Are Enough

GNNs alone capture player relationships at a single snapshot but miss how these relationships evolve. RNNs alone capture temporal dynamics but treat each player independently, missing crucial interactions. A midfielder's run only makes sense in the context of both who is nearby and when they started moving.

The Solution: Spatiotemporal GNNs

Spatiotemporal GNNs (STGNNs) combine the best of both worlds. They model spatial relationships between entities (players) using graph neural networks, while simultaneously capturing temporal dynamics using recurrent or convolutional architectures over time. The result: a model that understands both who interacts with whom and how these interactions evolve.

What Makes Football Data Spatiotemporal?

Modern tracking systems capture player positions at high frequency (typically 25 frames per second). This creates a rich dataset where each "frame" is a snapshot of all 22 players plus the ball, and consecutive frames show how everyone moves. The challenge is that you can't analyze these frames independently: a player's movement only makes sense in the context of what happened before and what their teammates/opponents are doing.

Spatial Dimension
  • 22 players with (x, y) positions
  • Distances and angles between players
  • Team formations and shapes
  • Passing lanes and spaces
  • Proximity-based interactions
Temporal Dimension
  • Tracking data at 25 Hz (25 frames/sec)
  • Velocities and accelerations
  • Sequential plays and movements
  • Phase transitions (attack → defense)
  • Historical context for predictions
[Figure: Spatiotemporal data as graphs that change over time: G^(0) at t = 0, G^(1) at t = 1, G^(2) at t = 2. Nodes move and edges change, but identity persists across time.]
Formal Definition: Spatiotemporal Graphs
The mathematical foundation

A spatiotemporal graph extends the static graph concept to include time. Instead of a single graph, we have a sequence of graphs โ€” one for each timestep.

A spatiotemporal graph sequence:
𝒢 = { G^(1), G^(2), ..., G^(T) } where G^(t) = (V^(t), E^(t), X^(t))
T = Number of timesteps in the sequence
G^(t) = Graph at timestep t
V^(t) = Set of nodes at time t (players)
E^(t) = Set of edges at time t (may change!)
X^(t) = Node feature matrix at time t (N ร— F)

Key Properties

Static vs. Dynamic Graphs

Static: Same nodes and edges across time. Only node features X^(t) change.

Dynamic: Both structure (edges) and features can change. More realistic for football!

Node Correspondence

In football, we typically have fixed node identity: player 7 at t=0 is the same as player 7 at t=10. This makes temporal modeling easier since we can track individuals.

The Spatiotemporal Signal

We can represent the full spatiotemporal data as a 3D tensor:
X โˆˆ โ„^(T ร— N ร— F)
T = Time dimension (e.g., 50 frames = 2 seconds at 25 Hz)
N = Node dimension (e.g., 22 players + 1 ball = 23)
F = Feature dimension (e.g., [x, y, vx, vy, ax, ay] = 6 features)
Football Example

For a 2-second window of tracking data: X ∈ ℝ^(50 × 23 × 6). This means 50 timesteps, 23 entities (22 players + ball), each with 6 features (x, y, vx, vy, ax, ay). The adjacency matrix A^(t) ∈ ℝ^(23 × 23) defines who is connected to whom at each timestep.
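To make the shapes concrete, here is a minimal NumPy sketch that derives velocity and acceleration features from raw (x, y) positions and stacks them into the T × N × F tensor. The synthetic tracking data is purely illustrative, not from a real provider:

```python
import numpy as np

FPS = 25                       # tracking frequency (Hz)
T, N = 50, 23                  # 2-second window; 22 players + ball
rng = np.random.default_rng(0)

# Synthetic raw tracking: (x, y) positions in metres per frame and entity.
pos = rng.uniform(0, 100, size=(1, N, 2)) + np.cumsum(
    rng.normal(scale=0.1, size=(T, N, 2)), axis=0)

# Derive velocities and accelerations by finite differences, scaled to SI units.
vel = np.gradient(pos, axis=0) * FPS           # m/s
acc = np.gradient(vel, axis=0) * FPS           # m/s^2

# Stack into the spatiotemporal signal X with F = 6 features per node.
X = np.concatenate([pos, vel, acc], axis=-1)   # shape (T, N, F) = (50, 23, 6)
print(X.shape)
```

This mirrors how feature tensors are commonly built from provider tracking feeds: positions come from the data, and the kinematic features are derived.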

STGNN Architecture Overview
How spatial and temporal modules work together

The core idea of STGNNs is to alternate or combine spatial processing (GNNs) with temporal processing (RNNs, TCNs, or Transformers). There are two main design patterns.

[Figure: STGNN architecture. An input graph sequence G^(t-2), G^(t-1), G^(t) feeds a spatial module (GCN / GAT / GraphSAGE) that captures player relationships, then a temporal module (LSTM / GRU / TCN / Transformer) that captures time dynamics, then a prediction head (MLP / decoder) that outputs predictions ŷ^(t+1), ŷ^(t+2), ... for future trajectories and actions. Key design choice: Option A is spatial-first (GNN per frame, then RNN across time); Option B is interleaved (alternating GNN and temporal layers).]

The Two Main Design Patterns

The key architectural question is: how do we combine spatial and temporal information? Do we first extract spatial features from each frame, then model the temporal evolution? Or do we interleave spatial and temporal processing at every layer? Both approaches have merit, and the choice depends on your specific use case and computational budget.

[Figure: The two main design patterns. Pattern A (spatial-first, encoder-processor): a GNN encodes each frame G^(t-2), G^(t-1), G^(t), then an LSTM / Transformer models the sequence across time. Pattern B (interleaved ST-blocks): spatial and temporal layers alternate at each level, letting spatial information evolve with temporal context. Pattern A is the simpler baseline; Pattern B is more expressive and often better for complex dynamics.]
Pattern A: Spatial-First

Apply GNN to each timestep independently, then feed the sequence of node embeddings to a temporal model.

H^(t) = GNN(G^(t)) for each t
Y = Temporal([H^(1), ..., H^(T)])
Pros: Simple, modular, easy to implement
Cons: Spatial and temporal learned separately
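The Pattern A recipe can be sketched in a few lines of NumPy. This is a toy illustration with random weights: a mean-aggregation GNN stands in for the spatial module and a plain tanh RNN for the temporal module:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, F, H = 10, 23, 6, 16                     # toy sizes

X = rng.standard_normal((T, N, F))             # node features per timestep
A = (rng.random((T, N, N)) < 0.3).astype(float)
A = np.maximum(A, A.transpose(0, 2, 1))        # symmetric edges
idx = np.arange(N)
A[:, idx, idx] = 1.0                           # self-loops

W_s = rng.standard_normal((F, H)) * 0.1        # shared spatial weights
W_x = rng.standard_normal((H, H)) * 0.1        # RNN input weights
W_h = rng.standard_normal((H, H)) * 0.1        # RNN recurrent weights

# Step 1: GNN applied to each frame independently (mean over neighbours).
deg = A.sum(axis=-1, keepdims=True)            # (T, N, 1)
H_spatial = np.tanh((A @ X) / deg @ W_s)       # (T, N, H)

# Step 2: temporal model across the sequence, one hidden state per node.
h = np.zeros((N, H))
for t in range(T):
    h = np.tanh(H_spatial[t] @ W_x + h @ W_h)  # final per-player embedding
print(h.shape)
```

The point is the structure, not the specific modules: any GNN can fill step 1 and any sequence model step 2, which is exactly why Pattern A is so modular.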
Pattern B: Interleaved ST-Blocks

Stack alternating spatial and temporal layers, allowing information to flow between both dimensions at each level.

H^(l+1) = Temporal(Spatial(H^(l)))
(repeat for L layers)
Pros: More expressive, joint ST learning
Cons: More complex, harder to train
Which to Choose?

For football trajectory prediction, Pattern B (Interleaved) typically performs better because player movements depend on both current neighbors (spatial) and recent history (temporal), and these interact. However, Pattern A is a great starting point and baseline.

Spatial Module: GNN Layer
Capturing player relationships at each timestep

The spatial module is typically a standard GNN (GCN, GAT, or GraphSAGE) applied independently to each timestep's graph. This captures who is interacting with whom at that moment.

Crucially, the same GNN weights are shared across all timesteps: we're not learning separate networks for t=1, t=2, etc. This parameter sharing makes the model efficient and helps it generalize: the same "how to aggregate neighbor information" logic applies whether it's the 1st frame or the 50th frame.

Spatial Layer Operation

GCN spatial layer (per timestep):
H^(t,l+1) = σ(D̃^(-½) Ã^(t) D̃^(-½) H^(t,l) W_s)
H^(t,l) = Node embeddings at time t, layer l
Ã^(t) = Adjacency matrix at time t (with self-loops)
W_s = Shared spatial weight matrix (same across time)
σ = Activation function (ReLU)
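The GCN formula above translates directly to NumPy. This is an illustrative sketch (random data, no training), but the symmetric normalization and the weight matrix shared across all timesteps follow the equation:

```python
import numpy as np

def gcn_layer(A, H, W):
    """ReLU(D^(-1/2) (A + I) D^(-1/2) H W), applied to every timestep at once.
    A: (T, N, N) adjacency without self-loops, H: (T, N, F), W: (F, H_out)."""
    N = A.shape[-1]
    A_tilde = A + np.eye(N)                              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=-1))     # (T, N)
    A_norm = A_tilde * d_inv_sqrt[..., :, None] * d_inv_sqrt[..., None, :]
    return np.maximum(0.0, A_norm @ H @ W)               # ReLU activation

rng = np.random.default_rng(2)
T, N, F, H_out = 50, 23, 6, 32
A = (rng.random((T, N, N)) < 0.2).astype(float)
A = np.maximum(A, A.transpose(0, 2, 1))                  # make symmetric
idx = np.arange(N)
A[:, idx, idx] = 0.0                                     # self-loops added inside

X = rng.standard_normal((T, N, F))
W_s = rng.standard_normal((F, H_out)) * 0.1              # ONE matrix for all T frames

H1 = gcn_layer(A, X, W_s)
print(H1.shape)                                          # one spatial layer of embeddings
```

Note how the batched matrix products process all 50 timesteps in one call while using a single `W_s`, which is the parameter sharing described above.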

Graph Construction for Football

One of the most important design decisions is how to define edges between players. Unlike social networks where friendships are explicit, football interactions are implicit: we must infer them from positions. There's no single "correct" answer; the best choice depends on your task and computational budget.

Distance-Based Edges

Connect players within a distance threshold.

A_ij^(t) = 1 if ||p_i^(t) - p_j^(t)|| < threshold
Common threshold: 10-20 meters
K-Nearest Neighbors

Each player connects to their K closest players.

A_ij^(t) = 1 if j ∈ KNN(i, K)
Typical K: 5-10 neighbors
Fully Connected (per team)

All teammates connected; opponents connected separately.

Two complete subgraphs + inter-team edges
Simple but ignores distance
Learned Adjacency

Let the model learn which connections matter.

A^(t) = softmax(MLP([h_i || h_j]))
Most expressive, most expensive
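The first two strategies take only a few lines for a single frame. This sketch uses a 15 m threshold and k = 5, illustrative choices within the ranges given above:

```python
import numpy as np

def distance_adjacency(pos, threshold=15.0):
    """Connect players within `threshold` metres. pos: (N, 2) positions."""
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    A = (dist < threshold).astype(float)
    np.fill_diagonal(A, 0.0)                    # no self-loops here
    return A

def knn_adjacency(pos, k=5):
    """Each player connects to its k nearest neighbours (directed edges)."""
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # exclude self from neighbours
    nearest = np.argsort(dist, axis=1)[:, :k]   # (N, k) neighbour indices
    A = np.zeros_like(dist)
    A[np.arange(len(pos))[:, None], nearest] = 1.0
    return A

rng = np.random.default_rng(3)
pos = rng.uniform(0, 100, size=(22, 2))         # one frame of player positions (m)

A_dist = distance_adjacency(pos)
A_knn = knn_adjacency(pos, k=5)
print(A_dist.shape, A_knn.sum(axis=1))          # every kNN row has exactly 5 edges
```

Note the asymmetry trade-off: the distance graph is symmetric by construction, while kNN is directed (i may be among j's nearest neighbours without the reverse holding).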
Edge Features

Don't just use binary edges! Add edge features like distance, relative velocity, angle, team membership. For a GAT, these can be incorporated into the attention mechanism: α_ij = attention(h_i, h_j, e_ij).

Temporal Module Options
Capturing dynamics across time

The temporal module processes the sequence of spatial embeddings. After the GNN has encoded each timestep's graph into node embeddings, we need to model how these embeddings evolve over time. There are three main architectural families, each with distinct strengths and trade-offs.

The choice of temporal module significantly impacts both model performance and computational cost. For real-time applications (like live match analysis), you might prefer faster options; for offline analysis where accuracy is paramount, you can afford more expensive architectures.

Option 1: Recurrent Networks (LSTM/GRU)

Recurrent networks are the classic choice for sequence modeling. They process one timestep at a time, maintaining a "hidden state" that summarizes everything seen so far. For each node (player), we run a separate LSTM that takes the spatial embedding at each timestep and updates its hidden state.

LSTM for Temporal Modeling

Process the sequence of node embeddings through an LSTM, maintaining a hidden state across time. The LSTM's gates (forget, input, output) learn what to remember and what to forget, which is crucial for deciding whether a player's position 2 seconds ago is still relevant.

h_i^(t), c_i^(t) = LSTM(H_spatial_i^(t), h_i^(t-1), c_i^(t-1))

The hidden state h_i^(t) captures the temporal context for player i up to time t. The cell state c_i^(t) provides longer-term memory. After processing all timesteps, the final hidden state can be used for prediction, or we can use hidden states at all timesteps for dense prediction.
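A minimal hand-rolled LSTM cell in NumPy makes the recurrence explicit; the same weights process every player in parallel at each step. This is a sketch with random, untrained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step for all N nodes at once.
    x: (N, F) spatial embeddings; h, c: (N, H) hidden and cell states."""
    H = h.shape[1]
    z = x @ W + h @ U + b              # all four gates in one matmul: (N, 4H)
    i = sigmoid(z[:, :H])              # input gate
    f = sigmoid(z[:, H:2*H])           # forget gate
    o = sigmoid(z[:, 2*H:3*H])         # output gate
    g = np.tanh(z[:, 3*H:])            # candidate cell update
    c = f * c + i * g                  # cell state: long-term memory
    h = o * np.tanh(c)                 # hidden state: temporal context
    return h, c

rng = np.random.default_rng(4)
T, N, F, Hd = 50, 23, 32, 64
H_spatial = rng.standard_normal((T, N, F))     # per-frame GNN output
W = rng.standard_normal((F, 4 * Hd)) * 0.1
U = rng.standard_normal((Hd, 4 * Hd)) * 0.1
b = np.zeros(4 * Hd)

h, c = np.zeros((N, Hd)), np.zeros((N, Hd))
for t in range(T):                             # sequential: cannot parallelize over t
    h, c = lstm_step(H_spatial[t], h, c, W, U, b)
print(h.shape)
```

The explicit Python loop over `t` is exactly the sequential bottleneck mentioned below: each step needs the previous hidden state before it can run.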

✓ Handles variable-length sequences
✓ Good for long-range dependencies
✗ Sequential processing (slow)
✗ Vanishing gradients for very long sequences

When to use LSTMs: LSTMs work well when you have variable-length sequences and need to make predictions at arbitrary future horizons. They're also a good choice when your sequences are relatively short (<100 timesteps) and you want a simple, well-understood baseline. The downside is that training can be slow since you can't parallelize across time.

Option 2: Temporal Convolutional Networks (TCN)

TCNs apply 1D convolutions along the time axis, treating time like a spatial dimension. Instead of "remembering" through a hidden state, they look at a fixed-size window of past timesteps through the convolutional kernel. The key innovation is dilated convolutions: by skipping timesteps, we can see further back in time without increasing parameters.

[Figure: Temporal convolutional network. A Conv1D with kernel size 3 slides along the time axis over inputs h^0 ... h^7. Dilated convolutions (d = 1, 2, 4) skip timesteps to see further back in time, so stacking dilated layers gives an exponentially growing receptive field without losing resolution.]
TCN for Temporal Modeling

Apply 1D convolutions along the time axis. By stacking layers with increasing dilation factors (1, 2, 4, 8...), the receptive field grows exponentially, allowing the network to "see" 128 timesteps back with just 7 layers, while keeping the same number of parameters as looking back 7 timesteps!

H^(l+1) = ReLU(Conv1D(H^(l), kernel_size=k, dilation=d^l))

TCNs use causal convolutions: they only look at past timesteps, not future ones. This makes them suitable for real-time prediction where we can't peek ahead. The output at time t only depends on inputs at times ≤ t.
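A causal dilated convolution is simple to write by hand. This sketch (kernel size 3, dilations 1, 2, 4, 8, random weights) left-pads with zeros so the output at time t never sees inputs after t:

```python
import numpy as np

def causal_conv1d(H, W, dilation=1):
    """Causal dilated 1D convolution over time.
    H: (T, C_in), W: (k, C_in, C_out). Output at t uses only t, t-d, t-2d, ..."""
    k, T = W.shape[0], H.shape[0]
    pad = (k - 1) * dilation
    H_pad = np.concatenate([np.zeros((pad, H.shape[1])), H], axis=0)
    out = np.zeros((T, W.shape[2]))
    for j in range(k):                          # sum over the k kernel taps
        out += H_pad[j * dilation : j * dilation + T] @ W[j]
    return out

rng = np.random.default_rng(5)
T, C = 50, 16
H_l = rng.standard_normal((T, C))
# Stacking dilations 1, 2, 4, 8 with k = 3 gives a receptive field of
# 1 + (k - 1) * (1 + 2 + 4 + 8) = 31 timesteps.
for d in (1, 2, 4, 8):
    W = rng.standard_normal((3, H_l.shape[1], C)) * 0.1
    H_l = np.maximum(0.0, causal_conv1d(H_l, W, dilation=d))
print(H_l.shape)
```

Unlike the LSTM loop, each layer here is a handful of matrix multiplies over the whole window at once, which is why TCNs train fast.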

✓ Parallel processing (fast!)
✓ Stable gradients
✓ Flexible receptive field via dilation
✗ Fixed input length at training time

When to use TCNs: TCNs are the workhorse of many STGNN architectures (like ST-GCN and STGCN from the research literature). They're fast to train because convolutions can be parallelized across time. Choose TCNs when you have fixed-length sequences, want fast training, and don't need to handle very long-range dependencies beyond your receptive field.

Option 3: Temporal Attention / Transformers

Transformers use self-attention to allow each timestep to "look at" any other timestep directly, without the information having to flow through intermediate states. This is particularly powerful for capturing long-range dependencies: if a player's movement at t=50 is directly influenced by their position at t=5, attention can model this directly.

[Figure: Temporal attention. The query at h^t attends over keys/values h^0 ... h^4 with learned weights (here α = 0.05, 0.10, 0.45, 0.25, 0.15) and outputs Σ α_i · h^i. The model learns, for example, that a recent pass matters more than a position from 5 seconds ago.]
Temporal Attention

Use self-attention across timesteps. Each timestep computes a Query (what am I looking for?), and all timesteps provide Keys (what do I contain?) and Values (what information do I provide?). The attention weights determine how much each past timestep contributes to the current timestep's representation.

H_out = softmax(QK^T / √d) V where Q, K, V = linear(H_in)

The attention weights are interpretable: if α_(t,t') is high, it means the representation at time t heavily relies on information from time t'. This can reveal which past moments are most predictive of future movement; for example, "the moment the midfielder received the ball" might have high attention weight.
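The attention equation maps directly to NumPy. This sketch adds a causal mask, an assumption suited to real-time prediction; offline analysis could attend to the full window:

```python
import numpy as np

def temporal_attention(H, Wq, Wk, Wv, causal=True):
    """Scaled dot-product self-attention across time. H: (T, C)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (T, T) pairwise scores
    if causal:                                           # block attention to the future
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(6)
T, C, d = 50, 32, 32
H = rng.standard_normal((T, C))                          # temporal node embeddings
Wq, Wk, Wv = (rng.standard_normal((C, d)) * 0.1 for _ in range(3))

out, alpha = temporal_attention(H, Wq, Wk, Wv)
print(out.shape, alpha.shape)
```

The returned `alpha` matrix is the interpretable part: row t shows how much each past timestep contributed to the representation at t.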

✓ Captures any temporal dependency directly
✓ Parallel processing
✓ Interpretable attention weights
✗ O(T²) complexity: expensive for long sequences

When to use Transformers: Transformers excel when you need to capture complex, long-range temporal dependencies and can afford the computational cost. They're ideal for understanding complete plays (10-30 seconds) where early movements set up later outcomes. The quadratic complexity means they don't scale well to very long sequences (hundreds of timesteps), but for typical football analysis windows, they're often the most expressive choice.

Comparison Table

| Aspect | LSTM/GRU | TCN | Transformer |
| --- | --- | --- | --- |
| Parallelizable | No ❌ | Yes ✓ | Yes ✓ |
| Long-range dependencies | Good | Needs dilation | Excellent |
| Complexity | O(T) | O(T) | O(T²) |
| Variable length | Yes ✓ | Tricky | Yes ✓ |
| Memory usage | Low | Medium | High |
| Best football use | Short-term prediction | STGCN-style models | Full play analysis |
Hybrid Approaches

In practice, many state-of-the-art models combine approaches. For example, you might use TCN for local temporal patterns (the last 0.5 seconds) and attention for longer-range patterns (capturing how the play started). Or use an LSTM encoder with attention-based decoding. Don't feel limited to just one option!

Football Applications
STGNNs on the pitch

STGNNs are perfectly suited for football analytics because matches are fundamentally spatiotemporal: player positions and interactions evolving over time. Here are the key applications:

[Figure: Trajectory prediction. Observed history (frames t-3 through t for players P1, P2) feeds an STGNN encoder-decoder, which predicts the future positions at t+1 through t+3.]
🎯 Player Trajectory Prediction

The flagship application. Given 2 seconds of tracking data, predict where each of the 22 players will be in the next 3-5 seconds.

Input: X ∈ ℝ^(50×23×6), i.e. 50 frames, 23 entities, [x, y, vx, vy, ax, ay]
Output: Ŷ ∈ ℝ^(75×22×2), i.e. 75 future frames, 22 players, [x, y]
Metrics: ADE, FDE, collision rate
⚽ Ball Trajectory Prediction

Predict where the ball will go next, considering player positions and movements. Crucial for anticipating passes and shots.

Challenge: Ball physics (bouncing, spinning) + player interactions (kicks, headers)
Approach: Hybrid model with physics priors + STGNN for player context
📊 Expected Possession Value (EPV)

Predict the probability of a possession ending in a goal, given the current spatiotemporal context. Powers real-time valuation.

Input: Sequence of frames leading to current state
Output: Probability in [0, 1] of eventual goal
Architecture: STGNN encoder → graph pooling → MLP → sigmoid
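The head of that pipeline (pooling, MLP, sigmoid) can be sketched in NumPy. Mean-pooling is one common readout choice, an assumption here rather than the only option, and the weights are random stand-ins for a trained model:

```python
import numpy as np

def epv_head(H_nodes, W1, b1, w2, b2):
    """Graph readout -> MLP -> sigmoid. H_nodes: (N, H) final STGNN embeddings."""
    h_game = H_nodes.mean(axis=0)                   # mean-pool readout over players: (H,)
    hidden = np.maximum(0.0, h_game @ W1 + b1)      # one ReLU MLP layer
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))             # probability of eventual goal

rng = np.random.default_rng(7)
N, Hd = 23, 64
H_nodes = rng.standard_normal((N, Hd))              # pretend STGNN encoder output
W1 = rng.standard_normal((Hd, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.standard_normal(32) * 0.1
b2 = 0.0

p_goal = epv_head(H_nodes, W1, b1, w2, b2)
print(float(p_goal))
```

In a real system the encoder and head would be trained end-to-end on possession outcomes; the sketch only shows how the pieces compose.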
🔄 Action Recognition & Anticipation

Classify what action is happening (pass, shot, dribble) and predict what will happen next.

Recognition: Classify current action from trajectory context
Anticipation: Predict next action before it happens
Use case: Real-time commentary, defensive alerts
๐Ÿƒ Counterfactual Analysis

"What would have happened if the defender had moved differently?" Generate alternative futures to evaluate decisions.

Approach: Train CVAE-STGNN to sample multiple possible futures
Analysis: Compare actual vs. counterfactual trajectories
Insight: Quantify "space created" or "threat neutralized"
🎮 Game State Representation

Use STGNN to learn a compact embedding of the current game state for downstream tasks.

Embedding: h_game = Readout(STGNN(trajectory_sequence))
Uses: Similar play retrieval, tactical clustering, pre-match analysis
Why STGNNs Are Perfect for Football

Football tracking data is inherently a dynamic graph: fixed nodes (players) with time-varying edges (who's near whom) and features (positions, velocities). No other architecture captures this structure as naturally. The spatial component models tactical shape and interactions; the temporal component models how plays develop and unfold.

Evaluation Metrics
How to measure STGNN performance

Evaluating trajectory predictions requires specialized metrics that capture different aspects of prediction quality.

Average Displacement Error (ADE)
ADE = (1/T_pred) Σ_t ||ŷ^t - y^t||₂

Average L2 distance between predicted and ground truth positions across all prediction timesteps. Lower is better.

Final Displacement Error (FDE)
FDE = ||ŷ^(T_pred) - y^(T_pred)||₂

L2 error at the final prediction timestep only. Important when endpoint accuracy matters most.

Miss Rate @ Threshold
MR@2m = % of predictions with FDE > 2 meters

Percentage of predictions that exceed a distance threshold. Captures failure cases.
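ADE, FDE, and miss rate are each only a few lines in NumPy. The synthetic "predictions" here are just ground truth plus roughly 1 m of noise, purely for illustration:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error. pred, gt: (T_pred, N, 2) positions in metres."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final displacement error: L2 error at the last timestep, averaged over players."""
    return np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()

def miss_rate(pred, gt, threshold=2.0):
    """Fraction of players whose final error exceeds `threshold` metres."""
    final_err = np.linalg.norm(pred[-1] - gt[-1], axis=-1)   # (N,)
    return (final_err > threshold).mean()

rng = np.random.default_rng(8)
T_pred, N = 75, 22
gt = rng.uniform(0, 100, size=(T_pred, N, 2))
pred = gt + rng.normal(scale=1.0, size=gt.shape)             # ~1 m of noise

print(ade(pred, gt), fde(pred, gt), miss_rate(pred, gt))
```

ADE summarizes the whole horizon while FDE and miss rate focus on the endpoint, so reporting all three gives a fuller picture than any one alone.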

Negative Log-Likelihood (NLL)
NLL = -log p(y | μ̂, Σ̂)

For probabilistic models. Measures how well the predicted distribution covers the true future.

Football-Specific Metrics

Collision Rate

Percentage of predicted trajectories that result in player-player collisions. Should be low!

Off-Pitch Rate

Percentage of predictions that leave the pitch boundaries. Should be ~0%.

Formation Preservation

Do predicted formations maintain realistic team shapes? Measure with Procrustes distance.

Physics Plausibility

Are predicted velocities and accelerations within human limits? Check max speed, jerk.
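The football-specific sanity checks above are also straightforward to compute. This sketch assumes a standard 105 × 68 m pitch with the origin at one corner and a ~12 m/s sprint-speed ceiling; both are assumptions to adjust for your data:

```python
import numpy as np

FPS = 25
PITCH = (105.0, 68.0)          # assumed pitch dimensions in metres
MAX_SPEED = 12.0               # assumed upper bound on human sprint speed, m/s

def off_pitch_rate(pred):
    """Fraction of predicted positions outside the pitch. pred: (T, N, 2)."""
    outside = (pred[..., 0] < 0) | (pred[..., 0] > PITCH[0]) | \
              (pred[..., 1] < 0) | (pred[..., 1] > PITCH[1])
    return outside.mean()

def speed_violation_rate(pred):
    """Fraction of per-frame speeds above the assumed human limit."""
    speed = np.linalg.norm(np.diff(pred, axis=0), axis=-1) * FPS   # (T-1, N) in m/s
    return (speed > MAX_SPEED).mean()

rng = np.random.default_rng(9)
# Synthetic predictions: a gentle random walk around the centre spot.
pred = np.cumsum(rng.normal(scale=0.2, size=(75, 22, 2)), axis=0) + [52.5, 34.0]

print(off_pitch_rate(pred), speed_violation_rate(pred))
```

Checks like these are cheap to run on every batch of predictions and catch failure modes that ADE and FDE alone can hide.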

Full Architecture Comparison
When to use what

Let's bring together everything we've learned across the series and compare all the architectures.

| Aspect | CNN | RNN | GNN | STGNN |
| --- | --- | --- | --- | --- |
| Data type | Grids (images) | Sequences | Static graphs | Dynamic graphs |
| Spatial info | ✓ (grid) | ✗ | ✓ (graph) | ✓ (graph) |
| Temporal info | ✗ | ✓ | ✗ | ✓ |
| Key operation | Convolution | Recurrence | Message passing | ST message passing |
| Complexity | O(H×W) | O(T) | O(N+E) | O(T×(N+E)) |
| Best football use | Heatmaps | Event sequences | Snapshots | Trajectories ⭐ |
The Bottom Line

For modern football analytics with tracking data, STGNNs are the gold standard. They capture the full richness of the data: player identities, spatial relationships, and temporal dynamics. While simpler models (RNN-only, GNN-only) can work as baselines, STGNNs consistently achieve state-of-the-art results on trajectory prediction, action recognition, and value estimation tasks.