In the previous articles, we explored architectures that treat all inputs somewhat equally. CNNs apply the same filters everywhere; RNNs process each timestep with the same weights; GNNs aggregate all neighbors with equal or fixed importance. But in the real world — and especially in football — not everything deserves equal attention.
Consider an RNN processing a 10-second football sequence. All the information from the first 8 seconds must squeeze through a single hidden state vector to reach the prediction. Important early events (like a player starting a run) get compressed and potentially lost. We need a way to directly access any part of the input, weighted by relevance.
Real Football Examples
When predicting where a midfielder will pass, the ball-receiving players matter far more than the goalkeeper 60 meters away. But a standard model treats all 22 players equally.
The crucial moment was 5 seconds ago when possession changed — but an RNN might have "forgotten" that by the time the shot happens. We need to directly reference that moment.
A defender should focus on the attacker they're marking, not the player on the opposite wing. The model needs to learn these selective focus patterns.
During a corner kick, the players in the box are what matter — the holding midfielders staying back are largely irrelevant for predicting the outcome.
Attention mechanisms solve this by letting the model learn which inputs are relevant for each output. Instead of treating all players/timesteps equally, the model computes attention weights that dynamically scale each input's contribution. High weight = important; low weight = ignore.
Attention can be understood through an analogy: imagine you're searching for information in a database. You have a query (what you're looking for), the database has keys (labels for each entry), and each entry has values (the actual information). Attention computes how well your query matches each key, then returns a weighted combination of the values.
The Three Components
The query represents what the current element is seeking. In football, if the striker is trying to understand the game state, their embedding becomes the query.
Keys are labels that describe what each element offers. Each player's embedding is projected into a key that describes their characteristics (position, role, state).
Values contain the actual information that gets aggregated. Once attention weights are computed, we take a weighted sum of values.
The Attention Formula

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

where Q (queries), K (keys), and V (values) are matrices of projected embeddings, and the √d_k scaling keeps the dot products from saturating the softmax.

Step-by-Step Example

Suppose a striker (embedding h_striker) attends over five teammates h_1 … h_5:

q = h_striker · W^Q = [query: what the striker is looking for]
K_all = [h_1, h_2, h_3, h_4, h_5] · W^K = [key for each player]
V_all = [h_1, h_2, h_3, h_4, h_5] · W^V = [value for each player]
scores = q · K_all^T / √d_k = [0.2, 0.5, 1.1, 2.3, 1.8] (example values)
weights = softmax(scores) ≈ [0.06, 0.08, 0.14, 0.45, 0.28]
↑ LW (player 4) has highest attention! (makes sense — best passing option)
output = weights · V_all
→ Output is dominated by LW's information!
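The softmax step in the example above can be checked numerically. A minimal NumPy sketch (the value dimension d_v = 4 and the random value vectors are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract max for numerical stability
    return e / e.sum()

# Relevance scores from the worked example (one per player; the 4th is the LW)
scores = np.array([0.2, 0.5, 1.1, 2.3, 1.8])
weights = softmax(scores)
# weights ≈ [0.06, 0.08, 0.14, 0.45, 0.28]: the LW dominates

# Toy value vectors (d_v = 4 is arbitrary); the attention output is their
# weighted sum, so it is pulled strongly toward V[3], the LW's value vector.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
output = weights @ V             # shape (4,)
```

Note how a score gap of ~0.5 between the LW and the RW (1.8) already translates into a ~1.6× weight gap after the softmax.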
Self-attention is a special case where the queries, keys, and values all come from the same sequence. Every element attends to every other element (including itself), creating a complete picture of relationships within the sequence.
In football terms: every player looks at every other player to understand the full game state. The striker considers the goalkeeper's position, the midfielders' runs, the defenders' positioning — all weighted by relevance to their own situation.
Output = softmax(Q · K^T / √d_k) · V ∈ ℝ^(N × d)
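As a sketch, self-attention over one frame of tracking data might look like this in NumPy (the embedding size, random inputs, and random projection matrices are all toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 22, 16                          # 22 players, toy embedding size
X = rng.normal(size=(N, d))            # one embedding per player

# Q, K, V all come from the same sequence X (random projections here)
W_q, W_k, W_v = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)          # (N, N): every player scores every player
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)     # row-wise softmax: attention weights
out = A @ V                            # (N, d): context-aware update per player
```

Each row of `A` sums to 1 and describes how one player distributes attention over all 22.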
Properties of Self-Attention
If you reorder the players, the outputs are reordered the same way. No fixed position assumptions — perfect for sets of players!
Unlike GNNs where you define edges, self-attention connects everyone. The model learns which connections matter.
Unlike RNNs, all attention computations can happen in parallel — no sequential bottleneck. Massive speedup on GPUs!
Attention weights are directly inspectable — you can visualize which players the model focuses on for each decision.
Self-attention computes an N×N matrix, making it O(N²) in both memory and time. For 22 players, that's 484 attention pairs — totally fine. But for very long sequences (1000+ timesteps), this becomes expensive. Various "efficient attention" methods exist to address this.
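The permutation-equivariance property above is easy to verify directly. A minimal NumPy check (toy sizes and random weights):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(X.shape[1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(1)
N, d = 22, 8
X = rng.normal(size=(N, d))
Ws = [rng.normal(size=(d, d)) for _ in range(3)]

perm = rng.permutation(N)                  # shuffle the player order
out = self_attention(X, *Ws)
out_perm = self_attention(X[perm], *Ws)
assert np.allclose(out[perm], out_perm)    # outputs reorder identically
```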
A single attention pattern can only capture one type of relationship at a time. But players have multiple relevant relationships simultaneously: who's in passing range? Who's marking whom? Who has space? Multi-head attention runs multiple attention mechanisms in parallel, each learning a different pattern.
MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O
where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
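In code, the split-into-heads bookkeeping looks roughly like this (NumPy sketch; the sizes and random weights are illustrative assumptions):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Minimal multi-head self-attention (no masking, single sequence).

    X: (N, d); each W_* is (d, d); every head works in d_k = d // n_heads dims.
    """
    N, d = X.shape
    d_k = d // n_heads

    def split(W):
        # Project, then split the last dim into heads: (n_heads, N, d_k)
        return (X @ W).reshape(N, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (n_heads, N, N)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # softmax per head, per row
    heads = A @ V                                      # (n_heads, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(N, d)    # concatenate all heads
    return concat @ W_o                                # final output projection

rng = np.random.default_rng(0)
N, d, h = 22, 64, 8                    # toy sizes: each head sees d_k = 8 dims
X = rng.normal(size=(N, d))
Ws = [rng.normal(size=(d, d)) * d**-0.5 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=h)          # (22, 64)
```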
What Each Head Might Learn (Football)
Attends to nearby players — those within immediate passing/marking range.
Attends to teammates in the same tactical unit (e.g., the defensive line).
Strongly attends to whoever has/is near the ball — the center of action.
Attends to players moving in similar/opposing directions — tracking runs.
Defenders attend to the attackers they're marking, attackers to their markers.
Attends to players in open space — potential passing/running options.
Standard Transformer uses 8 heads. For a 512-dimensional model, each head operates in d_k = 512/8 = 64 dimensions. Total parameters are roughly the same as a single large attention, but you get 8 different perspectives. The final output concatenates all heads and projects back to the model dimension.
Self-attention is permutation equivariant — it treats inputs as a set, not a sequence. But for temporal data (like tracking over time), order matters! A player at t=0 is different from the same player at t=10, even if their features are identical. Positional encodings inject position information.
Types of Positional Encoding
Uses sine and cosine functions of different frequencies to create unique position vectors.
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Cons: No learnable parameters, so the encoding cannot adapt to the task
Learn a separate embedding vector for each position (up to max sequence length).
Cons: Fixed max length, doesn't generalize to longer sequences
Encode relative distance between positions rather than absolute position.
Good for: Tasks where relative timing matters more than absolute
Encodes position by rotating the embedding vector — elegant mathematical properties.
Benefit: Naturally captures relative position via rotation
For football trajectories, "position" typically means time step (t=0, t=1, ...). But you might also encode player identity (player 1-22) or even spatial position on pitch as separate embeddings added to the input. The right choice depends on your task.
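The sinusoidal scheme is straightforward to implement. A sketch, where 100 frames (4 s at 25 Hz) and d = 32 are arbitrary toy sizes:

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    """Sinusoidal encodings: PE[pos, 2i] = sin(...), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_positions)[:, None]     # (T, 1) positions
    i = np.arange(d // 2)[None, :]            # (1, d/2) frequency indices
    angles = pos / 10000 ** (2 * i / d)       # (T, d/2)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 32)                   # one vector per frame
# X = player_embeddings + pe                  # hypothetical (100, 32) input
```

The encodings are simply added to the input embeddings, so the attention layers see position and content mixed together.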
The Transformer (Vaswani et al., 2017) combined all these ideas into a complete architecture that revolutionized deep learning. Originally for machine translation, it's now the foundation of GPT, BERT, and countless other models — including state-of-the-art sports analytics systems.
Transformer Components
Processes the input sequence (e.g., 50 frames of player positions). Each layer has:
- Multi-head self-attention: Every position attends to every other position
- Feed-forward network: 2-layer MLP applied position-wise
- Residual connections: Add input to output (helps gradient flow)
- Layer normalization: Stabilizes training
Generates output sequence (e.g., future trajectories). Additional components:
- Masked self-attention: Can only attend to past outputs (no peeking at the future!)
- Cross-attention: Attends to encoder output — "what input is relevant for this output?"
- Same feed-forward, residuals, and layer norm as the encoder
Applied independently to each position after attention:
Typically expands dimensionality 4× (e.g., 512 → 2048 → 512). This is where "processing" happens after attention gathers information.
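These pieces don't have to be hand-written. Assuming PyTorch, the encoder stack can be sketched with built-in layers (all sizes here are toy assumptions):

```python
import torch
import torch.nn as nn

# Toy sizes: 50 frames per sequence, d_model = 64 features per frame
d_model, n_heads, n_layers = 64, 4, 2
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads,
    dim_feedforward=4 * d_model,   # the 4x FFN expansion described above
    dropout=0.1, batch_first=True,
    norm_first=True,               # pre-norm: often trains more stably
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

x = torch.randn(8, 50, d_model)    # (batch, frames, features)
out = encoder(x)                   # bidirectional self-attention

# For decoder-style (causal) attention, pass a mask instead:
causal = nn.Transformer.generate_square_subsequent_mask(50)
out_causal = encoder(x, mask=causal)
```

With `norm_first=True` each sublayer computes `x + sublayer(norm(x))`, matching the pre-norm variant discussed later.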
Transformer Variants
Just the encoder. Good for understanding/classification tasks. Bidirectional attention.
Just the decoder. Good for generation. Causal (left-to-right) attention.
Full architecture. Good for sequence-to-sequence tasks.
RNNs process sequences sequentially — information from early timesteps must flow through every intermediate state to reach later timesteps. Attention provides direct access to any part of the sequence, fundamentally changing how information flows.
Key Advantages of Attention
In RNNs, information from t=0 must pass through O(T) steps to reach t=T. In attention, it's a direct connection — O(1) path length. No signal degradation over distance!
RNNs compute h_t from h_{t-1} — inherently sequential. Attention computes all outputs simultaneously. Massive speedup on GPUs, especially for long sequences.
Attention weights are directly inspectable — you can see which inputs the model focused on for each output. RNN hidden states are opaque.
RNNs suffer from gradients vanishing over long sequences. Attention has direct gradient paths, making training more stable.
Attention's O(N²) complexity means it uses more memory than RNNs for long sequences. For a 1000-token sequence, that's 1 million attention pairs! Various "efficient attention" methods (Linear Attention, Sparse Attention, FlashAttention) address this.
Attention is particularly powerful for football because the game is fundamentally about selective focus. Players don't consider all 21 other players equally — they focus on immediate threats, passing options, and tactical targets. Attention learns these focus patterns from data.
Predict where each player will move by attending to relevant nearby players and past positions.
Temporal attention: Which past moments are predictive of my next move?
Models: TranSPORTmer, Graph Attention Networks for trajectories
Predict pass destination by attending to receiving players weighted by openness, distance, and tactical role.
Keys/Values: All potential receivers
Attention: Learns to weight open players in good positions highest
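A pass-prediction head of this kind reduces to one query (the passer) scored against many keys (the receivers). A NumPy sketch with random stand-in features:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16                                   # toy feature size (assumption)
passer = rng.normal(size=d)              # query: the ball carrier's state
receivers = rng.normal(size=(10, d))     # keys: 10 potential receivers

# Random projections stand in for learned W^Q, W^K
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
scores = (receivers @ W_k) @ (passer @ W_q) / np.sqrt(d)
p_receive = softmax(scores)              # probability each teammate gets the pass
```

Trained on real tracking data, the projections would learn to score open, well-positioned receivers highest.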
Model sequences of match events (pass, dribble, shot, ...) with Transformers to predict next actions.
Attention: Captures dependencies between events (e.g., turnover → counter-attack)
Output: Probability distribution over next event type/location
Use attention to identify which players are executing coordinated movements or tactical plays.
Clustering: Players with high mutual attention form tactical units
Insight: Identify pressing triggers, overload patterns, defensive structures
Enhance expected goals and possession value models with attention over the game state.
With attention: Model learns which players/positions matter for goal probability
Interpretability: Attention weights show "why" a shot was high/low xG
GATs combine GNNs with attention: instead of aggregating all neighbors equally, attention weights are learned for each edge. This is perfect for football — not all nearby players are equally relevant. The marking defender matters more than a distant teammate.
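A single GAT head can be sketched in NumPy following Veličković et al. (2018); the adjacency rule (e.g. "players within range", here random edges plus self-loops) is an illustrative assumption:

```python
import numpy as np

def gat_attention(h, adj, W, a):
    """One GAT head: attention is computed only over graph edges.

    h: (N, d) player features; adj: (N, N) 0/1 adjacency; W: (d, d_out);
    a: (2 * d_out,) attention vector, as in Velickovic et al. (2018).
    """
    z = h @ W
    N = z.shape[0]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]) for every ordered pair (i, j)
    pairs = np.concatenate([np.repeat(z, N, axis=0), np.tile(z, (N, 1))], axis=1)
    e = (pairs @ a).reshape(N, N)
    e = np.where(e > 0, e, 0.2 * e)                 # LeakyReLU, slope 0.2
    e = np.where(adj > 0, e, -1e9)                  # mask out non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)       # softmax over neighbors only
    return alpha @ z                                # weighted neighbor aggregation

rng = np.random.default_rng(0)
h = rng.normal(size=(22, 8))
adj = np.eye(22)                                    # self-loops keep every row valid
adj[rng.random((22, 22)) < 0.2] = 1.0               # random "within range" edges
out = gat_attention(h, adj, rng.normal(size=(8, 8)), rng.normal(size=16))
```

The only difference from plain self-attention is the mask: players attend to graph neighbors rather than to everyone.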
Standard: 8 heads for d=512. More heads = more patterns, but with diminishing returns. For football with 22 players, 4-8 heads typically suffice.
Original Transformer: 6 encoder + 6 decoder layers. For smaller datasets (like football tracking), 2-4 layers often works better to avoid overfitting.
Apply dropout to attention weights and FFN outputs. Typical rate: 0.1. Helps prevent overfitting, especially with limited training data.
Pre-norm (norm before attention) vs. post-norm (norm after). Pre-norm often trains more stably, especially for deeper models.
Transformers are sensitive to initial learning rate. Use linear warmup (4000 steps typical), then decay. Critical for stable training.
Memory scales as O(N²). For 25 Hz tracking data, a 4-second window = 100 frames = 10,000 attention pairs. Manageable, but be aware.
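The warmup-then-decay schedule mentioned above is the one from Vaswani et al. (2017); a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    linear warmup for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)                 # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises to its peak at step == warmup, then decays
lrs = [transformer_lr(s) for s in (100, 4000, 40000)]
```

In practice you would wrap this in your framework's LR-scheduler API rather than calling it by hand.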
- Combine with GNNs: Use attention for temporal and GNNs for spatial, or GATs for both
- Player embeddings: Add learnable embeddings for each player identity (not just position)
- Team awareness: Include a team indicator in embeddings so attention can learn team-specific patterns
- Relative positions: Consider encoding relative (not absolute) pitch positions
- ✓ Attention: dynamic weighting of inputs
- ✓ Query, Key, Value framework
- ✓ Self-attention: every element attends to all
- ✓ Multi-head: multiple attention patterns
- ✓ Positional encoding: adding order info
- ✓ Transformer architecture: encoder-decoder
- ✓ Advantages over RNNs: parallelism, direct paths
- ✓ Football applications: trajectories, passes, tactics
- Attention lets models focus on what matters
- Transformers replaced RNNs as the dominant sequence architecture
- Perfect for football: naturally models variable-importance relationships
- Combine with GNNs for spatiotemporal football data
- Attention weights are interpretable — see what the model focuses on
Attention is the key innovation that enabled the deep learning revolution of the 2020s. From GPT to image generation to sports analytics, attention mechanisms are everywhere. For football, the combination of Graph Neural Networks (spatial relationships) + Attention/Transformers (temporal and importance weighting) creates the most powerful models available today. This combination — explored in the STGNN literature — represents the cutting edge of sports analytics AI.
Ready to see attention in action? 🎯 Check out our implementation guides for Graph Attention Networks and Transformer-based trajectory prediction.