5. Spatiotemporal GNNs for Football
The ultimate architecture for football analytics: combining graph neural networks with temporal modeling to understand how player interactions evolve during a match.
The Challenge: Space AND Time

In the previous articles, we learned about GNNs for modeling spatial relationships between players, and RNNs for modeling temporal patterns. But football involves both simultaneously: player relationships that evolve over time.

Think about a through ball: to understand whether it will succeed, you need to know where the striker is relative to defenders (spatial), how fast everyone is moving (temporal), and how these relationships are changing: is the defender closing down, or is the striker pulling away? This interplay between space and time is what makes football so tactically rich, and it's exactly what STGNNs are designed to capture.

The Problem: Neither GNNs nor RNNs Alone Are Enough

GNNs alone capture player relationships at a single snapshot but miss how these relationships evolve. RNNs alone capture temporal dynamics but treat each player independently, missing crucial interactions. A midfielder's run only makes sense in the context of both who is nearby and when they started moving.

The Solution: Spatiotemporal GNNs

Spatiotemporal GNNs (STGNNs) combine the best of both worlds. They model spatial relationships between entities (players) using graph neural networks, while simultaneously capturing temporal dynamics using recurrent or convolutional architectures over time. The result: a model that understands both who interacts with whom and how these interactions evolve.

What Makes Football Data Spatiotemporal?

Modern tracking systems capture player positions at high frequency (typically 25 frames per second). This creates a rich dataset where each "frame" is a snapshot of all 22 players plus the ball, and consecutive frames show how everyone moves. The challenge is that you can't analyze these frames independently: a player's movement only makes sense in the context of what happened before and what their teammates/opponents are doing.

Spatial Dimension
  • 22 players with (x, y) positions
  • Distances and angles between players
  • Team formations and shapes
  • Passing lanes and spaces
  • Proximity-based interactions
Temporal Dimension
  • Tracking data at 25 Hz (25 frames/sec)
  • Velocities and accelerations
  • Sequential plays and movements
  • Phase transitions (attack → defense)
  • Historical context for predictions
[Figure: Spatiotemporal data as graphs that change over time: G^(0) at t = 0, G^(1) at t = 1, G^(2) at t = 2. Nodes move and edges change, but identity persists across time.]
Formal Definition: Spatiotemporal Graphs
The mathematical foundation

A spatiotemporal graph extends the static graph concept to include time. Instead of a single graph, we have a sequence of graphs โ€” one for each timestep.

A spatiotemporal graph sequence:
𝒢 = { G^(1), G^(2), ..., G^(T) } where G^(t) = (V^(t), E^(t), X^(t))
T = Number of timesteps in the sequence
G^(t) = Graph at timestep t
V^(t) = Set of nodes at time t (players)
E^(t) = Set of edges at time t (may change!)
X^(t) = Node feature matrix at time t (N ร— F)

Key Properties

Static vs. Dynamic Graphs

Static: Same nodes and edges across time. Only node features X^(t) change.

Dynamic: Both structure (edges) and features can change. More realistic for football!

Node Correspondence

In football, we typically have fixed node identity: player 7 at t=0 is the same as player 7 at t=10. This makes temporal modeling easier since we can track individuals.

The Spatiotemporal Signal

We can represent the full spatiotemporal data as a 3D tensor:
X โˆˆ โ„^(T ร— N ร— F)
T = Time dimension (e.g., 50 frames = 2 seconds at 25 Hz)
N = Node dimension (e.g., 22 players + 1 ball = 23)
F = Feature dimension (e.g., [x, y, vx, vy, ax, ay] = 6 features)
Football Example

For a 2-second window of tracking data: X ∈ ℝ^(50 × 23 × 6). This means 50 timesteps, 23 entities (22 players + ball), each with 6 features (x, y, vx, vy, ax, ay). The adjacency matrix A^(t) ∈ ℝ^(23 × 23) defines who is connected to whom at each timestep.
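To make the shapes concrete, here is a minimal NumPy sketch that derives velocity and acceleration features from raw (x, y) positions and stacks them into the T × N × F tensor. The synthetic tracking data is purely illustrative, not from a real provider:

```python
import numpy as np

FPS = 25                       # tracking frequency (Hz)
T, N = 50, 23                  # 2-second window; 22 players + ball
rng = np.random.default_rng(0)

# Synthetic raw tracking: (x, y) positions in metres per frame and entity.
pos = rng.uniform(0, 100, size=(1, N, 2)) + np.cumsum(
    rng.normal(scale=0.1, size=(T, N, 2)), axis=0)

# Derive velocities and accelerations by finite differences, scaled to SI units.
vel = np.gradient(pos, axis=0) * FPS           # m/s
acc = np.gradient(vel, axis=0) * FPS           # m/s^2

# Stack into the spatiotemporal signal X with F = 6 features per node.
X = np.concatenate([pos, vel, acc], axis=-1)   # shape (T, N, F) = (50, 23, 6)
print(X.shape)
```

This mirrors how feature tensors are commonly built from provider tracking feeds: positions come from the data, and the kinematic features are derived.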

STGNN Architecture Overview
How spatial and temporal modules work together

The core idea of STGNNs is to alternate or combine spatial processing (GNNs) with temporal processing (RNNs, TCNs, or Transformers). There are two main design patterns.

[Figure: STGNN architecture. An input graph sequence G^(t-2), G^(t-1), G^(t) feeds a spatial module (GCN / GAT / GraphSAGE) that captures player relationships, then a temporal module (LSTM / GRU / TCN / Transformer) that captures time dynamics, then a prediction head (MLP / decoder) that outputs predictions ŷ^(t+1), ŷ^(t+2), ... for future trajectories and actions. Key design choice: Option A is spatial-first (GNN per frame, then RNN across time); Option B is interleaved (alternating GNN and temporal layers).]

The Two Main Design Patterns

The key architectural question is: how do we combine spatial and temporal information? Do we first extract spatial features from each frame, then model the temporal evolution? Or do we interleave spatial and temporal processing at every layer? Both approaches have merit, and the choice depends on your specific use case and computational budget.

[Figure: The two main design patterns. Pattern A (spatial-first, encoder-processor): a GNN encodes each frame G^(t-2), G^(t-1), G^(t), then an LSTM / Transformer models the sequence across time. Pattern B (interleaved ST-blocks): spatial and temporal layers alternate at each level, letting spatial information evolve with temporal context. Pattern A is the simpler baseline; Pattern B is more expressive and often better for complex dynamics.]
Pattern A: Spatial-First

Apply GNN to each timestep independently, then feed the sequence of node embeddings to a temporal model.

H^(t) = GNN(G^(t)) for each t
Y = Temporal([H^(1), ..., H^(T)])
Pros: Simple, modular, easy to implement
Cons: Spatial and temporal learned separately
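The Pattern A recipe can be sketched in a few lines of NumPy. This is a toy illustration with random weights: a mean-aggregation GNN stands in for the spatial module and a plain tanh RNN for the temporal module:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, F, H = 10, 23, 6, 16                     # toy sizes

X = rng.standard_normal((T, N, F))             # node features per timestep
A = (rng.random((T, N, N)) < 0.3).astype(float)
A = np.maximum(A, A.transpose(0, 2, 1))        # symmetric edges
idx = np.arange(N)
A[:, idx, idx] = 1.0                           # self-loops

W_s = rng.standard_normal((F, H)) * 0.1        # shared spatial weights
W_x = rng.standard_normal((H, H)) * 0.1        # RNN input weights
W_h = rng.standard_normal((H, H)) * 0.1        # RNN recurrent weights

# Step 1: GNN applied to each frame independently (mean over neighbours).
deg = A.sum(axis=-1, keepdims=True)            # (T, N, 1)
H_spatial = np.tanh((A @ X) / deg @ W_s)       # (T, N, H)

# Step 2: temporal model across the sequence, one hidden state per node.
h = np.zeros((N, H))
for t in range(T):
    h = np.tanh(H_spatial[t] @ W_x + h @ W_h)  # final per-player embedding
print(h.shape)
```

The point is the structure, not the specific modules: any GNN can fill step 1 and any sequence model step 2, which is exactly why Pattern A is so modular.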
Pattern B: Interleaved ST-Blocks

Stack alternating spatial and temporal layers, allowing information to flow between both dimensions at each level.

H^(l+1) = Temporal(Spatial(H^(l)))
(repeat for L layers)
Pros: More expressive, joint ST learning
Cons: More complex, harder to train
Which to Choose?

For football trajectory prediction, Pattern B (Interleaved) typically performs better because player movements depend on both current neighbors (spatial) and recent history (temporal), and these interact. However, Pattern A is a great starting point and baseline.

Spatial Module: GNN Layer
Capturing player relationships at each timestep

The spatial module is typically a standard GNN (GCN, GAT, or GraphSAGE) applied independently to each timestep's graph. This captures who is interacting with whom at that moment.

Crucially, the same GNN weights are shared across all timesteps: we're not learning separate networks for t=1, t=2, etc. This parameter sharing makes the model efficient and helps it generalize: the same "how to aggregate neighbor information" logic applies whether it's the 1st frame or the 50th frame.

Spatial Layer Operation

GCN spatial layer (per timestep):
H^(t,l+1) = σ(D̃^(-½) Ã^(t) D̃^(-½) H^(t,l) W_s)
H^(t,l) = Node embeddings at time t, layer l
Ã^(t) = Adjacency matrix at time t (with self-loops)
W_s = Shared spatial weight matrix (same across time)
σ = Activation function (ReLU)
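The GCN formula above translates directly to NumPy. This is an illustrative sketch (random data, no training), but the symmetric normalization and the weight matrix shared across all timesteps follow the equation:

```python
import numpy as np

def gcn_layer(A, H, W):
    """ReLU(D^(-1/2) (A + I) D^(-1/2) H W), applied to every timestep at once.
    A: (T, N, N) adjacency without self-loops, H: (T, N, F), W: (F, H_out)."""
    N = A.shape[-1]
    A_tilde = A + np.eye(N)                              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=-1))     # (T, N)
    A_norm = A_tilde * d_inv_sqrt[..., :, None] * d_inv_sqrt[..., None, :]
    return np.maximum(0.0, A_norm @ H @ W)               # ReLU activation

rng = np.random.default_rng(2)
T, N, F, H_out = 50, 23, 6, 32
A = (rng.random((T, N, N)) < 0.2).astype(float)
A = np.maximum(A, A.transpose(0, 2, 1))                  # make symmetric
idx = np.arange(N)
A[:, idx, idx] = 0.0                                     # self-loops added inside

X = rng.standard_normal((T, N, F))
W_s = rng.standard_normal((F, H_out)) * 0.1              # ONE matrix for all T frames

H1 = gcn_layer(A, X, W_s)
print(H1.shape)                                          # one spatial layer of embeddings
```

Note how the batched matrix products process all 50 timesteps in one call while using a single `W_s`, which is the parameter sharing described above.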

Graph Construction for Football

One of the most important design decisions is how to define edges between players. Unlike social networks where friendships are explicit, football interactions are implicit: we must infer them from positions. There's no single "correct" answer; the best choice depends on your task and computational budget.

Distance-Based Edges

Connect players within a distance threshold.

A_ij^(t) = 1 if ||p_i^(t) - p_j^(t)|| < threshold
Common threshold: 10-20 meters
K-Nearest Neighbors

Each player connects to their K closest players.

A_ij^(t) = 1 if j ∈ KNN(i, K)
Typical K: 5-10 neighbors
Fully Connected (per team)

All teammates connected; opponents connected separately.

Two complete subgraphs + inter-team edges
Simple but ignores distance
Learned Adjacency

Let the model learn which connections matter.

A^(t) = softmax(MLP([h_i || h_j]))
Most expressive, most expensive
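The first two strategies take only a few lines for a single frame. This sketch uses a 15 m threshold and k = 5, illustrative choices within the ranges given above:

```python
import numpy as np

def distance_adjacency(pos, threshold=15.0):
    """Connect players within `threshold` metres. pos: (N, 2) positions."""
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    A = (dist < threshold).astype(float)
    np.fill_diagonal(A, 0.0)                    # no self-loops here
    return A

def knn_adjacency(pos, k=5):
    """Each player connects to its k nearest neighbours (directed edges)."""
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # exclude self from neighbours
    nearest = np.argsort(dist, axis=1)[:, :k]   # (N, k) neighbour indices
    A = np.zeros_like(dist)
    A[np.arange(len(pos))[:, None], nearest] = 1.0
    return A

rng = np.random.default_rng(3)
pos = rng.uniform(0, 100, size=(22, 2))         # one frame of player positions (m)

A_dist = distance_adjacency(pos)
A_knn = knn_adjacency(pos, k=5)
print(A_dist.shape, A_knn.sum(axis=1))          # every kNN row has exactly 5 edges
```

Note the asymmetry trade-off: the distance graph is symmetric by construction, while kNN is directed (i may be among j's nearest neighbours without the reverse holding).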
Edge Features

Don't just use binary edges! Add edge features like distance, relative velocity, angle, team membership. For a GAT, these can be incorporated into the attention mechanism: α_ij = attention(h_i, h_j, e_ij).

Temporal Module Options
Capturing dynamics across time

The temporal module processes the sequence of spatial embeddings. After the GNN has encoded each timestep's graph into node embeddings, we need to model how these embeddings evolve over time. There are three main architectural families, each with distinct strengths and trade-offs.

The choice of temporal module significantly impacts both model performance and computational cost. For real-time applications (like live match analysis), you might prefer faster options; for offline analysis where accuracy is paramount, you can afford more expensive architectures.

Option 1: Recurrent Networks (LSTM/GRU)

Recurrent networks are the classic choice for sequence modeling. They process one timestep at a time, maintaining a "hidden state" that summarizes everything seen so far. For each node (player), we run a separate LSTM that takes the spatial embedding at each timestep and updates its hidden state.

LSTM for Temporal Modeling

Process the sequence of node embeddings through an LSTM, maintaining a hidden state across time. The LSTM's gates (forget, input, output) learn what to remember and what to forget, which is crucial for deciding whether a player's position 2 seconds ago is still relevant.

h_i^(t), c_i^(t) = LSTM(H_spatial_i^(t), h_i^(t-1), c_i^(t-1))

The hidden state h_i^(t) captures the temporal context for player i up to time t. The cell state c_i^(t) provides longer-term memory. After processing all timesteps, the final hidden state can be used for prediction, or we can use hidden states at all timesteps for dense prediction.
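A minimal hand-rolled LSTM cell in NumPy makes the recurrence explicit; the same weights process every player in parallel at each step. This is a sketch with random, untrained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step for all N nodes at once.
    x: (N, F) spatial embeddings; h, c: (N, H) hidden and cell states."""
    H = h.shape[1]
    z = x @ W + h @ U + b              # all four gates in one matmul: (N, 4H)
    i = sigmoid(z[:, :H])              # input gate
    f = sigmoid(z[:, H:2*H])           # forget gate
    o = sigmoid(z[:, 2*H:3*H])         # output gate
    g = np.tanh(z[:, 3*H:])            # candidate cell update
    c = f * c + i * g                  # cell state: long-term memory
    h = o * np.tanh(c)                 # hidden state: temporal context
    return h, c

rng = np.random.default_rng(4)
T, N, F, Hd = 50, 23, 32, 64
H_spatial = rng.standard_normal((T, N, F))     # per-frame GNN output
W = rng.standard_normal((F, 4 * Hd)) * 0.1
U = rng.standard_normal((Hd, 4 * Hd)) * 0.1
b = np.zeros(4 * Hd)

h, c = np.zeros((N, Hd)), np.zeros((N, Hd))
for t in range(T):                             # sequential: cannot parallelize over t
    h, c = lstm_step(H_spatial[t], h, c, W, U, b)
print(h.shape)
```

The explicit Python loop over `t` is exactly the sequential bottleneck mentioned below: each step needs the previous hidden state before it can run.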

✓ Handles variable-length sequences
✓ Good for long-range dependencies
✗ Sequential processing (slow)
✗ Vanishing gradients for very long sequences

When to use LSTMs: LSTMs work well when you have variable-length sequences and need to make predictions at arbitrary future horizons. They're also a good choice when your sequences are relatively short (<100 timesteps) and you want a simple, well-understood baseline. The downside is that training can be slow since you can't parallelize across time.

Option 2: Temporal Convolutional Networks (TCN)

TCNs apply 1D convolutions along the time axis, treating time like a spatial dimension. Instead of "remembering" through a hidden state, they look at a fixed-size window of past timesteps through the convolutional kernel. The key innovation is dilated convolutions: by skipping timesteps, we can see further back in time without increasing parameters.

[Figure: Temporal convolutional network. A Conv1D with kernel size 3 slides along the time axis over inputs h^0 ... h^7. Dilated convolutions (d = 1, 2, 4) skip timesteps to see further back in time, so stacking dilated layers gives an exponentially growing receptive field without losing resolution.]
TCN for Temporal Modeling

Apply 1D convolutions along the time axis. By stacking layers with increasing dilation factors (1, 2, 4, 8...), the receptive field grows exponentially, allowing the network to "see" 128 timesteps back with just 7 layers, while keeping the same number of parameters as looking back 7 timesteps!

H^(l+1) = ReLU(Conv1D(H^(l), kernel_size=k, dilation=d^l))

TCNs use causal convolutions: they only look at past timesteps, not future ones. This makes them suitable for real-time prediction where we can't peek ahead. The output at time t only depends on inputs at times ≤ t.
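A causal dilated convolution is simple to write by hand. This sketch (kernel size 3, dilations 1, 2, 4, 8, random weights) left-pads with zeros so the output at time t never sees inputs after t:

```python
import numpy as np

def causal_conv1d(H, W, dilation=1):
    """Causal dilated 1D convolution over time.
    H: (T, C_in), W: (k, C_in, C_out). Output at t uses only t, t-d, t-2d, ..."""
    k, T = W.shape[0], H.shape[0]
    pad = (k - 1) * dilation
    H_pad = np.concatenate([np.zeros((pad, H.shape[1])), H], axis=0)
    out = np.zeros((T, W.shape[2]))
    for j in range(k):                          # sum over the k kernel taps
        out += H_pad[j * dilation : j * dilation + T] @ W[j]
    return out

rng = np.random.default_rng(5)
T, C = 50, 16
H_l = rng.standard_normal((T, C))
# Stacking dilations 1, 2, 4, 8 with k = 3 gives a receptive field of
# 1 + (k - 1) * (1 + 2 + 4 + 8) = 31 timesteps.
for d in (1, 2, 4, 8):
    W = rng.standard_normal((3, H_l.shape[1], C)) * 0.1
    H_l = np.maximum(0.0, causal_conv1d(H_l, W, dilation=d))
print(H_l.shape)
```

Unlike the LSTM loop, each layer here is a handful of matrix multiplies over the whole window at once, which is why TCNs train fast.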

✓ Parallel processing (fast!)
✓ Stable gradients
✓ Flexible receptive field via dilation
✗ Fixed input length at training time

When to use TCNs: TCNs are the workhorse of many STGNN architectures (like ST-GCN and STGCN from the research literature). They're fast to train because convolutions can be parallelized across time. Choose TCNs when you have fixed-length sequences, want fast training, and don't need to handle very long-range dependencies beyond your receptive field.

Option 3: Temporal Attention / Transformers

Transformers use self-attention to allow each timestep to "look at" any other timestep directly, without the information having to flow through intermediate states. This is particularly powerful for capturing long-range dependencies: if a player's movement at t=50 is directly influenced by their position at t=5, attention can model this directly.

[Figure: Temporal attention. The query at h^t attends over keys/values h^0 ... h^4 with learned weights (here α = 0.05, 0.10, 0.45, 0.25, 0.15) and outputs Σ α_i · h^i. The model learns, for example, that a recent pass matters more than a position from 5 seconds ago.]
Temporal Attention

Use self-attention across timesteps. Each timestep computes a Query (what am I looking for?), and all timesteps provide Keys (what do I contain?) and Values (what information do I provide?). The attention weights determine how much each past timestep contributes to the current timestep's representation.

H_out = softmax(QK^T / √d) V where Q, K, V = linear(H_in)

The attention weights are interpretable: if α_(t,t') is high, it means the representation at time t heavily relies on information from time t'. This can reveal which past moments are most predictive of future movement; for example, "the moment the midfielder received the ball" might have high attention weight.
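The attention equation maps directly to NumPy. This sketch adds a causal mask, an assumption suited to real-time prediction; offline analysis could attend to the full window:

```python
import numpy as np

def temporal_attention(H, Wq, Wk, Wv, causal=True):
    """Scaled dot-product self-attention across time. H: (T, C)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (T, T) pairwise scores
    if causal:                                           # block attention to the future
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(6)
T, C, d = 50, 32, 32
H = rng.standard_normal((T, C))                          # temporal node embeddings
Wq, Wk, Wv = (rng.standard_normal((C, d)) * 0.1 for _ in range(3))

out, alpha = temporal_attention(H, Wq, Wk, Wv)
print(out.shape, alpha.shape)
```

The returned `alpha` matrix is the interpretable part: row t shows how much each past timestep contributed to the representation at t.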

✓ Captures any temporal dependency directly
✓ Parallel processing
✓ Interpretable attention weights
✗ O(T²) complexity: expensive for long sequences

When to use Transformers: Transformers excel when you need to capture complex, long-range temporal dependencies and can afford the computational cost. They're ideal for understanding complete plays (10-30 seconds) where early movements set up later outcomes. The quadratic complexity means they don't scale well to very long sequences (hundreds of timesteps), but for typical football analysis windows, they're often the most expressive choice.

Comparison Table

| Aspect | LSTM/GRU | TCN | Transformer |
| --- | --- | --- | --- |
| Parallelizable | No ❌ | Yes ✓ | Yes ✓ |
| Long-range dependencies | Good | Needs dilation | Excellent |
| Complexity | O(T) | O(T) | O(T²) |
| Variable length | Yes ✓ | Tricky | Yes ✓ |
| Memory usage | Low | Medium | High |
| Best football use | Short-term prediction | STGCN-style models | Full play analysis |
Hybrid Approaches

In practice, many state-of-the-art models combine approaches. For example, you might use TCN for local temporal patterns (the last 0.5 seconds) and attention for longer-range patterns (capturing how the play started). Or use an LSTM encoder with attention-based decoding. Don't feel limited to just one option!

Football Applications
STGNNs on the pitch

STGNNs are perfectly suited for football analytics because matches are fundamentally spatiotemporal: player positions and interactions evolving over time. Here are the key applications:

[Figure: Trajectory prediction. Observed history (frames t-3 through t for players P1, P2) feeds an STGNN encoder-decoder, which predicts the future positions at t+1 through t+3.]
🎯 Player Trajectory Prediction

The flagship application. Given 2 seconds of tracking data, predict where each of the 22 players will be in the next 3-5 seconds.

Input: X ∈ ℝ^(50×23×6), i.e. 50 frames, 23 entities, [x, y, vx, vy, ax, ay]
Output: Ŷ ∈ ℝ^(75×22×2), i.e. 75 future frames, 22 players, [x, y]
Metrics: ADE, FDE, collision rate
⚽ Ball Trajectory Prediction

Predict where the ball will go next, considering player positions and movements. Crucial for anticipating passes and shots.

Challenge: Ball physics (bouncing, spinning) + player interactions (kicks, headers)
Approach: Hybrid model with physics priors + STGNN for player context
📊 Expected Possession Value (EPV)

Predict the probability of a possession ending in a goal, given the current spatiotemporal context. Powers real-time valuation.

Input: Sequence of frames leading to current state
Output: Probability in [0, 1] of eventual goal
Architecture: STGNN encoder → graph pooling → MLP → sigmoid
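The head of that pipeline (pooling, MLP, sigmoid) can be sketched in NumPy. Mean-pooling is one common readout choice, an assumption here rather than the only option, and the weights are random stand-ins for a trained model:

```python
import numpy as np

def epv_head(H_nodes, W1, b1, w2, b2):
    """Graph readout -> MLP -> sigmoid. H_nodes: (N, H) final STGNN embeddings."""
    h_game = H_nodes.mean(axis=0)                   # mean-pool readout over players: (H,)
    hidden = np.maximum(0.0, h_game @ W1 + b1)      # one ReLU MLP layer
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))             # probability of eventual goal

rng = np.random.default_rng(7)
N, Hd = 23, 64
H_nodes = rng.standard_normal((N, Hd))              # pretend STGNN encoder output
W1 = rng.standard_normal((Hd, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.standard_normal(32) * 0.1
b2 = 0.0

p_goal = epv_head(H_nodes, W1, b1, w2, b2)
print(float(p_goal))
```

In a real system the encoder and head would be trained end-to-end on possession outcomes; the sketch only shows how the pieces compose.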
🔄 Action Recognition & Anticipation

Classify what action is happening (pass, shot, dribble) and predict what will happen next.

Recognition: Classify current action from trajectory context
Anticipation: Predict next action before it happens
Use case: Real-time commentary, defensive alerts
๐Ÿƒ Counterfactual Analysis

"What would have happened if the defender had moved differently?" Generate alternative futures to evaluate decisions.

Approach: Train CVAE-STGNN to sample multiple possible futures
Analysis: Compare actual vs. counterfactual trajectories
Insight: Quantify "space created" or "threat neutralized"
🎮 Game State Representation

Use STGNN to learn a compact embedding of the current game state for downstream tasks.

Embedding: h_game = Readout(STGNN(trajectory_sequence))
Uses: Similar play retrieval, tactical clustering, pre-match analysis
Why STGNNs Are Perfect for Football

Football tracking data is inherently a dynamic graph: fixed nodes (players) with time-varying edges (who's near whom) and features (positions, velocities). No other architecture captures this structure as naturally. The spatial component models tactical shape and interactions; the temporal component models how plays develop and unfold.

Evaluation Metrics
How to measure STGNN performance

Evaluating trajectory predictions requires specialized metrics that capture different aspects of prediction quality.

Average Displacement Error (ADE)
ADE = (1/T_pred) Σ_t ||ŷ^t - y^t||₂

Average L2 distance between predicted and ground truth positions across all prediction timesteps. Lower is better.

Final Displacement Error (FDE)
FDE = ||ŷ^(T_pred) - y^(T_pred)||₂

L2 error at the final prediction timestep only. Important when endpoint accuracy matters most.

Miss Rate @ Threshold
MR@2m = % of predictions with FDE > 2 meters

Percentage of predictions that exceed a distance threshold. Captures failure cases.
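ADE, FDE, and miss rate are each only a few lines in NumPy. The synthetic "predictions" here are just ground truth plus roughly 1 m of noise, purely for illustration:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error. pred, gt: (T_pred, N, 2) positions in metres."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final displacement error: L2 error at the last timestep, averaged over players."""
    return np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()

def miss_rate(pred, gt, threshold=2.0):
    """Fraction of players whose final error exceeds `threshold` metres."""
    final_err = np.linalg.norm(pred[-1] - gt[-1], axis=-1)   # (N,)
    return (final_err > threshold).mean()

rng = np.random.default_rng(8)
T_pred, N = 75, 22
gt = rng.uniform(0, 100, size=(T_pred, N, 2))
pred = gt + rng.normal(scale=1.0, size=gt.shape)             # ~1 m of noise

print(ade(pred, gt), fde(pred, gt), miss_rate(pred, gt))
```

ADE summarizes the whole horizon while FDE and miss rate focus on the endpoint, so reporting all three gives a fuller picture than any one alone.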

Negative Log-Likelihood (NLL)
NLL = -log p(y | μ̂, Σ̂)

For probabilistic models. Measures how well the predicted distribution covers the true future.

Football-Specific Metrics

Collision Rate

Percentage of predicted trajectories that result in player-player collisions. Should be low!

Off-Pitch Rate

Percentage of predictions that leave the pitch boundaries. Should be ~0%.

Formation Preservation

Do predicted formations maintain realistic team shapes? Measure with Procrustes distance.

Physics Plausibility

Are predicted velocities and accelerations within human limits? Check max speed, jerk.
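The football-specific sanity checks above are also straightforward to compute. This sketch assumes a standard 105 × 68 m pitch with the origin at one corner and a ~12 m/s sprint-speed ceiling; both are assumptions to adjust for your data:

```python
import numpy as np

FPS = 25
PITCH = (105.0, 68.0)          # assumed pitch dimensions in metres
MAX_SPEED = 12.0               # assumed upper bound on human sprint speed, m/s

def off_pitch_rate(pred):
    """Fraction of predicted positions outside the pitch. pred: (T, N, 2)."""
    outside = (pred[..., 0] < 0) | (pred[..., 0] > PITCH[0]) | \
              (pred[..., 1] < 0) | (pred[..., 1] > PITCH[1])
    return outside.mean()

def speed_violation_rate(pred):
    """Fraction of per-frame speeds above the assumed human limit."""
    speed = np.linalg.norm(np.diff(pred, axis=0), axis=-1) * FPS   # (T-1, N) in m/s
    return (speed > MAX_SPEED).mean()

rng = np.random.default_rng(9)
# Synthetic predictions: a gentle random walk around the centre spot.
pred = np.cumsum(rng.normal(scale=0.2, size=(75, 22, 2)), axis=0) + [52.5, 34.0]

print(off_pitch_rate(pred), speed_violation_rate(pred))
```

Checks like these are cheap to run on every batch of predictions and catch failure modes that ADE and FDE alone can hide.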

Full Architecture Comparison
When to use what

Let's bring together everything we've learned across the series and compare all the architectures.

| Aspect | CNN | RNN | GNN | STGNN |
| --- | --- | --- | --- | --- |
| Data type | Grids (images) | Sequences | Static graphs | Dynamic graphs |
| Spatial info | ✓ (grid) | ✗ | ✓ (graph) | ✓ (graph) |
| Temporal info | ✗ | ✓ | ✗ | ✓ |
| Key operation | Convolution | Recurrence | Message passing | ST message passing |
| Complexity | O(H×W) | O(T) | O(N+E) | O(T×(N+E)) |
| Best football use | Heatmaps | Event sequences | Snapshots | Trajectories ⭐ |
The Bottom Line

For modern football analytics with tracking data, STGNNs are the gold standard. They capture the full richness of the data: player identities, spatial relationships, and temporal dynamics. While simpler models (RNN-only, GNN-only) can work as baselines, STGNNs consistently achieve state-of-the-art results on trajectory prediction, action recognition, and value estimation tasks.