In the previous articles, we learned about GNNs for modeling spatial relationships between players, and RNNs for modeling temporal patterns. But football involves both simultaneously: player relationships that evolve over time.
Think about a through ball: to understand whether it will succeed, you need to know where the striker is relative to defenders (spatial), how fast everyone is moving (temporal), and how these relationships are changing. Is the defender closing down, or is the striker pulling away? This interplay between space and time is what makes football so tactically rich, and it's exactly what STGNNs are designed to capture.
GNNs alone capture player relationships at a single snapshot but miss how these relationships evolve. RNNs alone capture temporal dynamics but treat each player independently, missing crucial interactions. A midfielder's run only makes sense in the context of both who is nearby and when they started moving.
Spatiotemporal GNNs (STGNNs) combine the best of both worlds. They model spatial relationships between entities (players) using graph neural networks, while simultaneously capturing temporal dynamics using recurrent or convolutional architectures over time. The result: a model that understands both who interacts with whom and how these interactions evolve.
What Makes Football Data Spatiotemporal?
Modern tracking systems capture player positions at high frequency (typically 25 frames per second). This creates a rich dataset where each "frame" is a snapshot of all 22 players plus the ball, and consecutive frames show how everyone moves. The challenge is that you can't analyze these frames independently; a player's movement only makes sense in the context of what happened before and what their teammates and opponents are doing.
Spatial dimension:
- 22 players with (x, y) positions
- Distances and angles between players
- Team formations and shapes
- Passing lanes and spaces
- Proximity-based interactions

Temporal dimension:
- Tracking data at 25 Hz (25 frames/sec)
- Velocities and accelerations
- Sequential plays and movements
- Phase transitions (attack → defense)
- Historical context for predictions
A spatiotemporal graph extends the static graph concept to include time. Instead of a single graph, we have a sequence of graphs โ one for each timestep.
Key Properties
Static: Same nodes and edges across time. Only node features X^(t) change.
Dynamic: Both structure (edges) and features can change. More realistic for football!
In football, we typically have fixed node identity: player 7 at t=0 is the same as player 7 at t=10. This makes temporal modeling easier since we can track individuals.
The Spatiotemporal Signal
For a 2-second window of tracking data: X ∈ ℝ^(50 × 23 × 6). This means 50 timesteps, 23 entities (22 players + ball), each with 6 features (x, y, vx, vy, ax, ay). The adjacency matrix A^(t) ∈ ℝ^(23 × 23) defines who is connected to whom at each timestep.
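As a concrete, purely illustrative sketch, the tensor X and the per-timestep adjacency matrices can be assembled from raw positions like this (random positions and a 10 m connection radius are stand-in choices, not values from the article):

```python
import numpy as np

# Illustrative: build the (50, 23, 6) spatiotemporal tensor from positions.
T, N = 50, 23                      # timesteps, entities (22 players + ball)
FPS = 25                           # tracking frequency
rng = np.random.default_rng(0)

positions = rng.uniform([0, 0], [105, 68], size=(T, N, 2))   # (x, y) in metres
velocities = np.gradient(positions, 1 / FPS, axis=0)         # vx, vy (finite differences)
accelerations = np.gradient(velocities, 1 / FPS, axis=0)     # ax, ay

X = np.concatenate([positions, velocities, accelerations], axis=-1)
print(X.shape)                     # (50, 23, 6)

# One adjacency matrix per timestep: connect entities within 10 m (assumed radius).
dists = np.linalg.norm(positions[:, :, None, :] - positions[:, None, :, :], axis=-1)
A = (dists < 10.0).astype(float)                 # (50, 23, 23)
A[:, np.arange(N), np.arange(N)] = 0.0           # no self-loops
print(A.shape)                     # (50, 23, 23)
```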
The core idea of STGNNs is to alternate or combine spatial processing (GNNs) with temporal processing (RNNs, TCNs, or Transformers). There are two main design patterns.
The Two Main Design Patterns
The key architectural question is: how do we combine spatial and temporal information? Do we first extract spatial features from each frame, then model the temporal evolution? Or do we interleave spatial and temporal processing at every layer? Both approaches have merit, and the choice depends on your specific use case and computational budget.
Pattern A (Factorized): apply a GNN to each timestep independently, then feed the sequence of node embeddings to a temporal model.
Y = Temporal([H^(1), ..., H^(T)])
Cons: Spatial and temporal learned separately
Pattern B (Interleaved): stack alternating spatial and temporal layers, allowing information to flow between both dimensions at each level.
(repeat for L layers)
Cons: More complex, harder to train
For football trajectory prediction, Pattern B (Interleaved) typically performs better because player movements depend on both current neighbors (spatial) and recent history (temporal), and these two factors interact. However, Pattern A is a great starting point and baseline.
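A minimal Pattern A skeleton might look like the following. The spatial layer is a deliberately simple mean-aggregation step, and the temporal module is a trivial time-average standing in for a real LSTM/TCN/attention module; all weights are random, so this is a shape-level sketch only:

```python
import numpy as np

def gnn_layer(X_t, A_t, W):
    """One simplified mean-aggregation message-passing step (GraphSAGE-like)."""
    deg = A_t.sum(axis=1, keepdims=True).clip(min=1.0)
    neigh = (A_t @ X_t) / deg               # average neighbour features
    return np.tanh((X_t + neigh) @ W)       # combine self + neighbours

def temporal_mean(H):
    """Trivial stand-in temporal module: average embeddings over time.
    A real model would use an LSTM, TCN, or attention here."""
    return H.mean(axis=0)

T, N, F, D = 50, 23, 6, 16
rng = np.random.default_rng(1)
X = rng.normal(size=(T, N, F))                      # node features per timestep
A = (rng.random((T, N, N)) < 0.2).astype(float)     # random per-timestep graphs
W = rng.normal(size=(F, D)) * 0.1                   # ONE weight matrix, shared over time

H = np.stack([gnn_layer(X[t], A[t], W) for t in range(T)])  # spatial first...
Y = temporal_mean(H)                                        # ...then temporal
print(Y.shape)  # (23, 16)
```

Note that the same `W` is applied at every timestep, which is exactly the parameter sharing discussed below.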
The spatial module is typically a standard GNN (GCN, GAT, or GraphSAGE) applied independently to each timestep's graph. This captures who is interacting with whom at that moment.
Crucially, the same GNN weights are shared across all timesteps; we're not learning separate networks for t=1, t=2, etc. This parameter sharing makes the model efficient and helps it generalize: the same "how to aggregate neighbor information" logic applies whether it's the 1st frame or the 50th.
Spatial Layer Operation
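One plausible concrete choice for this per-timestep operation (the article does not pin down the layer type) is the standard GCN update H' = σ(D̂^(-1/2)(A + I)D̂^(-1/2) H W), applied to a single frame's graph:

```python
import numpy as np

def gcn_layer(H, A, W):
    """Standard GCN propagation for one timestep's graph (assumed layer choice)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU activation

rng = np.random.default_rng(2)
H = rng.normal(size=(23, 6))            # one frame: 23 entities, 6 features
A = (rng.random((23, 23)) < 0.2).astype(float)
A = np.maximum(A, A.T)                  # symmetrise the random graph
np.fill_diagonal(A, 0.0)
W = rng.normal(size=(6, 16)) * 0.1      # shared across all timesteps
print(gcn_layer(H, A, W).shape)         # (23, 16)
```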
Graph Construction for Football
One of the most important design decisions is how to define edges between players. Unlike social networks where friendships are explicit, football interactions are implicit; we must infer them from positions. There's no single "correct" answer; the best choice depends on your task and computational budget.
Distance threshold: connect players within a fixed radius.
K-nearest neighbors: each player connects to their K closest players.
Team-based: all teammates connected; opponents connected separately.
Learned (attention-based): let the model learn which connections matter.
Don't just use binary edges! Add edge features like distance, relative velocity, angle, and team membership. For a GAT, these can be incorporated into the attention mechanism: α_ij = attention(h_i, h_j, e_ij).
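The first two strategies (distance threshold and K-nearest neighbors) can be sketched in a few lines; the 10 m radius and K=5 are illustrative values, not recommendations from the article:

```python
import numpy as np

def distance_edges(pos, radius=10.0):
    """Connect players within `radius` metres (symmetric adjacency)."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    A = (d < radius).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def knn_edges(pos, k=5):
    """Connect each player to their k nearest players (not symmetric in general)."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self
    A = np.zeros_like(d)
    nearest = np.argsort(d, axis=1)[:, :k]      # indices of k closest players
    np.put_along_axis(A, nearest, 1.0, axis=1)
    return A

rng = np.random.default_rng(3)
pos = rng.uniform([0, 0], [105, 68], size=(23, 2))   # one frame of positions
print(distance_edges(pos).shape)                     # (23, 23)
print(knn_edges(pos).sum(axis=1))                    # every row sums to k
```

KNN guarantees every node has neighbours even in sparse areas of the pitch, while the distance threshold can leave an isolated player with no edges at all; that trade-off is often what decides between the two.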
The temporal module processes the sequence of spatial embeddings. After the GNN has encoded each timestep's graph into node embeddings, we need to model how these embeddings evolve over time. There are three main architectural families, each with distinct strengths and trade-offs.
The choice of temporal module significantly impacts both model performance and computational cost. For real-time applications (like live match analysis), you might prefer faster options; for offline analysis where accuracy is paramount, you can afford more expensive architectures.
Option 1: Recurrent Networks (LSTM/GRU)
Recurrent networks are the classic choice for sequence modeling. They process one timestep at a time, maintaining a "hidden state" that summarizes everything seen so far. For each node (player), we run a separate LSTM that takes the spatial embedding at each timestep and updates its hidden state.
Process the sequence of node embeddings through an LSTM, maintaining a hidden state across time. The LSTM's gates (forget, input, output) learn what to remember and what to forget โ crucial for deciding whether a player's position 2 seconds ago is still relevant.
The hidden state h_i^(t) captures the temporal context for player i up to time t. The cell state c_i^(t) provides longer-term memory. After processing all timesteps, the final hidden state can be used for prediction, or we can use hidden states at all timesteps for dense prediction.
When to use LSTMs: LSTMs work well when you have variable-length sequences and need to make predictions at arbitrary future horizons. They're also a good choice when your sequences are relatively short (<100 timesteps) and you want a simple, well-understood baseline. The downside is that training can be slow since you can't parallelize across time.
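For intuition, here is a from-scratch per-node GRU (a close cousin of the LSTM with fewer gates) run over the spatial-embedding sequence. Weights are random and shared across players, with a separate hidden state per player; in practice you would use `torch.nn.LSTM` or `torch.nn.GRU`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU update for all players at once. x: (N, D), h: (N, Dh)."""
    Wz, Wr, Wh = params                       # each (D + Dh, Dh), shared across players
    xh = np.concatenate([x, h], axis=-1)
    z = sigmoid(xh @ Wz)                      # update gate: how much to rewrite memory
    r = sigmoid(xh @ Wr)                      # reset gate: how much past to use
    h_tilde = np.tanh(np.concatenate([x, r * h], axis=-1) @ Wh)
    return (1 - z) * h + z * h_tilde

T, N, D, Dh = 50, 23, 16, 32
rng = np.random.default_rng(4)
H = rng.normal(size=(T, N, D))                # spatial embeddings from the GNN
params = [rng.normal(size=(D + Dh, Dh)) * 0.1 for _ in range(3)]

h = np.zeros((N, Dh))                         # one hidden state per player
for t in range(T):                            # sequential: cannot parallelise over time
    h = gru_step(H[t], h, params)
print(h.shape)  # (23, 32) — final temporal context per player
```

The explicit Python loop over `t` is exactly why recurrent models are slow to train: each step depends on the previous one.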
Option 2: Temporal Convolutional Networks (TCN)
TCNs apply 1D convolutions along the time axis, treating time like a spatial dimension. Instead of "remembering" through a hidden state, they look at a fixed-size window of past timesteps through the convolutional kernel. The key innovation is dilated convolutions: by skipping timesteps, we can see further back in time without increasing parameters.
Apply 1D convolutions along the time axis. By stacking layers with increasing dilation factors (1, 2, 4, 8...), the receptive field grows exponentially, allowing the network to "see" 128 timesteps back with just 7 layers while keeping the same number of parameters as looking back 7 timesteps!
TCNs use causal convolutions: they only look at past timesteps, not future ones. This makes them suitable for real-time prediction where we can't peek ahead. The output at time t only depends on inputs at times ≤ t.
When to use TCNs: TCNs are the workhorse of many STGNN architectures (like ST-GCN and STGCN from the research literature). They're fast to train because convolutions can be parallelized across time. Choose TCNs when you have fixed-length sequences, want fast training, and don't need to handle very long-range dependencies beyond your receptive field.
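A sketch of the dilated causal convolution idea, including the receptive-field arithmetic behind the "128 timesteps with 7 layers" claim (kernel size 2 assumed, random weights, shape-level illustration only):

```python
import numpy as np

def causal_conv(x, w, dilation):
    """Causal dilated conv, kernel size 2. x: (T, D), w: (2, D, D).
    Output at time t sees only inputs at times <= t."""
    T = x.shape[0]
    pad = np.zeros((dilation, x.shape[1]))
    x_past = np.concatenate([pad, x], axis=0)[:T]    # x shifted back by `dilation` steps
    return x_past @ w[0] + x @ w[1]

def receptive_field(n_layers, kernel=2):
    """Receptive field with dilations 1, 2, 4, ... doubling per layer."""
    return 1 + sum((kernel - 1) * 2**l for l in range(n_layers))

T, D = 50, 16
rng = np.random.default_rng(5)
x = rng.normal(size=(T, D))
for layer in range(3):                               # dilations 1, 2, 4
    w = rng.normal(size=(2, D, D)) * 0.1
    x = np.tanh(causal_conv(x, w, dilation=2**layer))

print(x.shape)              # (50, 16): same length, causal padding preserves T
print(receptive_field(7))   # 128 — the figure quoted above
```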
Option 3: Temporal Attention / Transformers
Transformers use self-attention to allow each timestep to "look at" any other timestep directly, without the information having to flow through intermediate states. This is particularly powerful for capturing long-range dependencies: if a player's movement at t=50 is directly influenced by their position at t=5, attention can model that link in a single step.
Use self-attention across timesteps. Each timestep computes a Query (what am I looking for?), and all timesteps provide Keys (what do I contain?) and Values (what information do I provide?). The attention weights determine how much each past timestep contributes to the current timestep's representation.
The attention weights are interpretable: if α_(t,t') is high, the representation at time t relies heavily on information from time t'. This can reveal which past moments are most predictive of future movement; e.g., "the moment the midfielder received the ball" might carry high attention weight.
When to use Transformers: Transformers excel when you need to capture complex, long-range temporal dependencies and can afford the computational cost. They're ideal for understanding complete plays (10-30 seconds) where early movements set up later outcomes. The quadratic complexity means they don't scale well to very long sequences (hundreds of timesteps), but for typical football analysis windows, they're often the most expressive choice.
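A single-head causal temporal self-attention sketch over one player's embedding sequence (random weights; a real model adds positional encodings, multiple heads, and layer norm). The returned α matrix is the interpretable weight matrix discussed above:

```python
import numpy as np

def temporal_attention(H, Wq, Wk, Wv, causal=True):
    """Single-head scaled dot-product attention over time. H: (T, D)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) pairwise scores
    if causal:                                       # mask out future timesteps
        T = H.shape[0]
        scores = np.where(np.tril(np.ones((T, T), bool)), scores, -1e9)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return alpha @ V, alpha

T, D = 50, 16
rng = np.random.default_rng(6)
H = rng.normal(size=(T, D))                          # one player's embeddings over time
Ws = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
out, alpha = temporal_attention(H, *Ws)
print(out.shape, alpha.shape)                        # (50, 16) (50, 50)
# Each row of alpha sums to 1 and is zero above the diagonal (causal mask).
```

The O(T²) cost mentioned in the table below is visible here: `scores` is a full T-by-T matrix.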
Comparison Table
| Aspect | LSTM/GRU | TCN | Transformer |
|---|---|---|---|
| Parallelizable | No ✗ | Yes ✓ | Yes ✓ |
| Long-range deps | Good | Needs dilation | Excellent |
| Complexity | O(T) | O(T) | O(T²) |
| Variable length | Yes ✓ | Tricky | Yes ✓ |
| Memory usage | Low | Medium | High |
| Best football use | Short-term prediction | STGCN-style models | Full play analysis |
In practice, many state-of-the-art models combine approaches. For example, you might use TCN for local temporal patterns (the last 0.5 seconds) and attention for longer-range patterns (capturing how the play started). Or use an LSTM encoder with attention-based decoding. Don't feel limited to just one option!
STGNNs are perfectly suited for football analytics because matches are fundamentally spatiotemporal: player positions and interactions evolving over time. Here are the key applications:
Trajectory prediction: the flagship application. Given 2 seconds of tracking data, predict where each of the 22 players will be in the next 3-5 seconds.
Output: Ŷ ∈ ℝ^(75 × 22 × 2), i.e. 75 future frames, 22 players, [x, y]
Metrics: ADE, FDE, collision rate
Ball trajectory prediction: predict where the ball will go next, considering player positions and movements. Crucial for anticipating passes and shots.
Approach: Hybrid model with physics priors + STGNN for player context
Possession value (goal probability): predict the probability of a possession ending in a goal, given the current spatiotemporal context. Powers real-time valuation.
Output: Probability in [0, 1] of eventual goal
Architecture: STGNN encoder → graph pooling → MLP → sigmoid
Action recognition and anticipation: classify what action is happening (pass, shot, dribble) and predict what will happen next.
Anticipation: Predict next action before it happens
Use case: Real-time commentary, defensive alerts
"What would have happened if the defender had moved differently?" Generate alternative futures to evaluate decisions.
Analysis: Compare actual vs. counterfactual trajectories
Insight: Quantify "space created" or "threat neutralized"
Game state representation: use an STGNN to learn a compact embedding of the current game state for downstream tasks.
Uses: Similar play retrieval, tactical clustering, pre-match analysis
Football tracking data is inherently a dynamic graph: fixed nodes (players) with time-varying edges (who's near whom) and features (positions, velocities). No other architecture captures this structure as naturally. The spatial component models tactical shape and interactions; the temporal component models how plays develop and unfold.
Evaluating trajectory predictions requires specialized metrics that capture different aspects of prediction quality.
ADE (Average Displacement Error): average L2 distance between predicted and ground truth positions across all prediction timesteps. Lower is better.
FDE (Final Displacement Error): L2 error at the final prediction timestep only. Important when endpoint accuracy matters most.
Miss rate: percentage of predictions that exceed a distance threshold. Captures failure cases.
NLL (negative log-likelihood): for probabilistic models. Measures how well the predicted distribution covers the true future.
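Reference-style implementations of ADE, FDE, and miss rate, assuming predictions and ground truth of shape (T_pred, N, 2) in metres; the 2 m miss threshold is an illustrative choice:

```python
import numpy as np

def ade(pred, true):
    """Average L2 error over all timesteps and players."""
    return np.linalg.norm(pred - true, axis=-1).mean()

def fde(pred, true):
    """L2 error at the final timestep only, averaged over players."""
    return np.linalg.norm(pred[-1] - true[-1], axis=-1).mean()

def miss_rate(pred, true, threshold=2.0):
    """Fraction of players whose final error exceeds `threshold` metres."""
    final_err = np.linalg.norm(pred[-1] - true[-1], axis=-1)
    return (final_err > threshold).mean()

# Synthetic sanity check: prediction = ground truth plus small Gaussian noise.
rng = np.random.default_rng(7)
true = rng.uniform([0, 0], [105, 68], size=(75, 22, 2))
pred = true + rng.normal(scale=0.5, size=true.shape)
print(ade(pred, true), fde(pred, true), miss_rate(pred, true))
```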
Football-Specific Metrics
Collision rate: percentage of predicted trajectories that result in player-player collisions. Should be low!
Out-of-bounds rate: percentage of predictions that leave the pitch boundaries. Should be ~0%.
Formation coherence: do predicted formations maintain realistic team shapes? Measure with Procrustes distance.
Physical plausibility: are predicted velocities and accelerations within human limits? Check max speed, jerk.
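Sketches of the out-of-bounds, speed-plausibility, and collision checks, assuming 25 Hz trajectories of shape (T, N, 2) on a 105 m by 68 m pitch; the 12 m/s speed ceiling is an illustrative threshold:

```python
import numpy as np

FPS = 25  # tracking frequency assumed throughout

def out_of_bounds_rate(traj, length=105.0, width=68.0):
    """Fraction of predicted positions outside the pitch."""
    inside = ((traj[..., 0] >= 0) & (traj[..., 0] <= length) &
              (traj[..., 1] >= 0) & (traj[..., 1] <= width))
    return 1.0 - inside.mean()

def speed_violation_rate(traj, max_speed=12.0):
    """Fraction of frame-to-frame speeds above an (illustrative) sprint ceiling."""
    v = np.linalg.norm(np.diff(traj, axis=0) * FPS, axis=-1)  # m/s
    return (v > max_speed).mean()

def min_pairwise_distance(traj):
    """Smallest distance between any two players; near-zero values flag collisions."""
    d = np.linalg.norm(traj[:, :, None, :] - traj[:, None, :, :], axis=-1)
    N = traj.shape[1]
    d[:, np.arange(N), np.arange(N)] = np.inf    # ignore self-distances
    return d.min()

rng = np.random.default_rng(8)
traj = rng.uniform([0, 0], [105, 68], size=(75, 22, 2))  # synthetic trajectories
print(out_of_bounds_rate(traj), min_pairwise_distance(traj))
```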
Let's bring together everything we've learned across the series and compare all the architectures.
| Aspect | CNN | RNN | GNN | STGNN |
|---|---|---|---|---|
| Data type | Grids (images) | Sequences | Static graphs | Dynamic graphs |
| Spatial info | ✓ (grid) | ✗ | ✓ (graph) | ✓ (graph) |
| Temporal info | ✗ | ✓ | ✗ | ✓ |
| Key operation | Convolution | Recurrence | Message passing | ST message passing |
| Complexity | O(H×W) | O(T) | O(N+E) | O(T×(N+E)) |
| Best football use | Heatmaps | Event seq | Snapshots | Trajectories ⭐ |
For modern football analytics with tracking data, STGNNs are the gold standard. They capture the full richness of the data: player identities, spatial relationships, and temporal dynamics. While simpler models (RNN-only, GNN-only) can work as baselines, STGNNs consistently achieve state-of-the-art results on trajectory prediction, action recognition, and value estimation tasks.