6. Attention & Transformers
The revolutionary mechanism that lets neural networks focus on what matters — enabling models to dynamically weight the importance of different players, positions, and moments in football analytics.
Attention Mechanism · Transformers · 45 min read
The Problem: Not Everything Matters Equally

In the previous articles, we explored architectures that treat all inputs somewhat equally. CNNs apply the same filters everywhere; RNNs process each timestep with the same weights; GNNs aggregate all neighbors with equal or fixed importance. But in the real world — and especially in football — not everything deserves equal attention.

The Bottleneck Problem

Consider an RNN processing a 10-second football sequence. All the information from the first 8 seconds must squeeze through a single hidden state vector to reach the prediction. Important early events (like a player starting a run) get compressed and potentially lost. We need a way to directly access any part of the input, weighted by relevance.

Real Football Examples

Predicting a Pass

When predicting where a midfielder will pass, the potential receivers nearby matter far more than the goalkeeper 60 meters away. But a standard model treats all 22 players equally.

Analyzing a Counter-Attack

The crucial moment was 5 seconds ago when possession changed — but an RNN might have "forgotten" that by the time the shot happens. We need to directly reference that moment.

Defensive Coverage

A defender should focus on the attacker they're marking, not the player on the opposite wing. The model needs to learn these selective focus patterns.

Set Piece Analysis

During a corner kick, the players in the box are what matter — the holding midfielders staying back are largely irrelevant for predicting the outcome.

The Solution: Attention

Attention mechanisms solve this by letting the model learn which inputs are relevant for each output. Instead of treating all players/timesteps equally, the model computes attention weights that dynamically scale each input's contribution. High weight = important; low weight = ignore.

The Attention Mechanism: Query, Key, Value
The core building block of modern deep learning

Attention can be understood through an analogy: imagine you're searching for information in a database. You have a query (what you're looking for), the database has keys (labels for each entry), and each entry has values (the actual information). Attention computes how well your query matches each key, then returns a weighted combination of the values.

[Figure: Query–Key–Value attention. The striker's query ("What am I looking for?") is scored against each player's key ("What do I contain?"); softmax turns the scores into weights that sum to 1 (e.g. α = 0.05, 0.10, 0.20, 0.40, 0.25), and the output is the weighted sum Σ α_i · V_i of the values. The LW's key matches the striker's query best — the most relevant player for the pass.]

The Three Components

Query (Q): "What am I looking for?"

The query represents what the current element is seeking. In football, if the striker is trying to understand the game state, their embedding becomes the query.

Q = X · W^Q (project input into query space)
Key (K): "What do I contain?"

Keys are labels that describe what each element offers. Each player's embedding is projected into a key that describes their characteristics (position, role, state).

K = X · W^K (project input into key space)
Value (V): "What information do I provide?"

Values contain the actual information that gets aggregated. Once attention weights are computed, we take a weighted sum of values.

V = X · W^V (project input into value space)

The Attention Formula

Scaled Dot-Product Attention:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Q · K^T = Compute similarity between queries and keys (dot product)
/ √d_k = Scale by square root of key dimension (keeps dot products from growing large and saturating the softmax, which would shrink gradients)
softmax = Convert scores to probabilities (weights sum to 1)
· V = Weighted sum of values using attention weights
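The formula above is a few lines of code. Here is a minimal NumPy sketch of scaled dot-product attention (the player count and embedding size are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                     # weighted sum of values

# toy example: 5 player embeddings of dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out, w = scaled_dot_product_attention(X, X, X)      # out: (5, 4), w: (5, 5)
```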

Step-by-Step Example

Setup: 5 players
GK: embedding h_1 = [0.1, 0.2, ...]
CB: embedding h_2 = [0.3, 0.1, ...]
CM: embedding h_3 = [0.5, 0.4, ...]
LW: embedding h_4 = [0.7, 0.6, ...]
ST: embedding h_5 = [0.8, 0.5, ...] ← query
Question: Who should ST focus on?
The striker (ST) wants to know which players are most relevant for their next action. We compute attention from ST to all players.
Step 1: Compute Q, K, V
Q_ST = h_5 · W^Q = [query vector for striker]
K_all = [h_1, h_2, h_3, h_4, h_5] · W^K = [key for each player]
V_all = [h_1, h_2, h_3, h_4, h_5] · W^V = [value for each player]
Step 2: Compute Attention Scores
scores = Q_ST · K_all^T / √d_k
scores = [0.2, 0.5, 1.1, 2.3, 1.8] (example values)
Step 3: Apply Softmax
α = softmax(scores) = [0.05, 0.08, 0.15, 0.45, 0.27]
↑ LW has highest attention! (makes sense — best passing option)
Step 4: Weighted Sum of Values
output = 0.05·V_GK + 0.08·V_CB + 0.15·V_CM + 0.45·V_LW + 0.27·V_ST
→ Output is dominated by LW's information!
Result: The striker's updated representation now emphasizes information from the left winger — the most relevant player for their current situation!
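Steps 3 and 4 can be checked numerically. This sketch reuses the example scores from Step 2 (the one-hot values are a toy choice so the output directly mirrors the weights):

```python
import numpy as np

# scaled scores from ST's query against each player's key (Step 2)
scores = np.array([0.2, 0.5, 1.1, 2.3, 1.8])      # GK, CB, CM, LW, ST
alpha = np.exp(scores) / np.exp(scores).sum()      # Step 3: softmax

players = ["GK", "CB", "CM", "LW", "ST"]
focus = players[int(alpha.argmax())]               # LW gets the largest weight

# Step 4: weighted sum of (toy one-hot) values
V = np.eye(5)
output = alpha @ V                                 # output equals alpha here
```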
Self-Attention: Everyone Attends to Everyone
The foundation of Transformers

Self-attention is a special case where the queries, keys, and values all come from the same sequence. Every element attends to every other element (including itself), creating a complete picture of relationships within the sequence.

In football terms: every player looks at every other player to understand the full game state. The striker considers the goalkeeper's position, the midfielders' runs, the defenders' positioning — all weighted by relevance to their own situation.

[Figure: self-attention among 5 players (GK, CB, CM, ST, LW). Each player "looks at" all the others via a 5×5 attention matrix and learns who is relevant for each position; diagonal entries are self-attention weights, off-diagonal entries are attention to others.]
Self-attention for N players:
Q = X · W^Q, K = X · W^K, V = X · W^V (all from same X)
Output = softmax(Q · K^T / √d) · V ∈ ℝ^(N × d)
The attention matrix A = softmax(QK^T/√d) is N×N — every player attends to every other player. A[i,j] tells us how much player i attends to player j.
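A self-attention layer is the same computation with Q, K, and V all projected from one X. A minimal NumPy sketch (random projections stand in for learned weights) that also makes the permutation-equivariance property concrete:

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 5, 8                                   # 5 players, 8-dim embeddings
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # all three projections from the same X
    S = Q @ K.T / np.sqrt(d)                  # (N, N) score matrix
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)        # A[i, j]: how much player i attends to j
    return A @ V, A

out, A = self_attention(X)                    # out: (N, d) updated representations
```

Reordering the rows of `X` reorders the rows of `out` identically — no fixed position assumptions.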

Properties of Self-Attention

Permutation Equivariant

If you reorder the players, the outputs are reordered the same way. No fixed position assumptions — perfect for sets of players!

Fully Connected

Unlike GNNs where you define edges, self-attention connects everyone. The model learns which connections matter.

Parallelizable

Unlike RNNs, all attention computations can happen in parallel — no sequential bottleneck. Massive speedup on GPUs!

Interpretable

Attention weights are directly inspectable — you can visualize which players the model focuses on for each decision.

The Quadratic Cost

Self-attention computes an N×N matrix, making it O(N²) in both memory and time. For 22 players, that's 484 attention pairs — totally fine. But for very long sequences (1000+ timesteps), this becomes expensive. Various "efficient attention" methods exist to address this.

Multi-Head Attention: Multiple Perspectives
Why one attention pattern isn't enough

A single attention pattern can only capture one type of relationship at a time. But players have multiple relevant relationships simultaneously: who's in passing range? Who's marking whom? Who has space? Multi-head attention runs multiple attention mechanisms in parallel, each learning a different pattern.

[Figure: multi-head attention. The player embeddings (N × d) feed several parallel heads (e.g. Head 1: position focus, Head 2: velocity focus, Head 3: team focus, ...); the head outputs are concatenated and projected through a linear layer W^O. Each head learns a different type of relationship.]
Multi-head attention formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Each head has its own W^Q, W^K, W^V projections, learning different aspects of relevance.
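The formula maps directly to code. A minimal NumPy sketch of multi-head attention (dimensions chosen for a 22-player example; the random matrices stand in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, n_heads):
    """Concat of h independent attention heads, then an output projection W^O."""
    heads = []
    for Wq, Wk, Wv in params["heads"]:          # each head: its own projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv        # (N, d_k) each
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ params["Wo"]  # back to (N, d)

rng = np.random.default_rng(0)
N, d, h = 22, 32, 4                             # 22 players, 4 heads, d_k = 8
params = {
    "heads": [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)],
    "Wo": rng.normal(size=(d, d)),
}
out = multi_head_attention(rng.normal(size=(N, d)), params, h)
```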

What Each Head Might Learn (Football)

Head 1: Proximity

Attends to nearby players — those within immediate passing/marking range.

Head 2: Team Structure

Attends to teammates in the same tactical unit (e.g., the defensive line).

Head 3: Ball Focus

Strongly attends to whoever has/is near the ball — the center of action.

Head 4: Velocity

Attends to players moving in similar/opposing directions — tracking runs.

Head 5: Marking

Defenders attend to the attackers they're marking, attackers to their markers.

Head 6: Space

Attends to players in open space — potential passing/running options.

Typical Configuration

Standard Transformer uses 8 heads. For a 512-dimensional model, each head operates in d_k = 512/8 = 64 dimensions. Total parameters are roughly the same as a single large attention, but you get 8 different perspectives. The final output concatenates all heads and projects back to the model dimension.

Positional Encoding: Adding Order
How Transformers know position without recurrence

Self-attention is permutation equivariant — it treats inputs as a set, not a sequence. But for temporal data (like tracking over time), order matters! A player at t=0 is different from the same player at t=10, even if their features are identical. Positional encodings inject position information.

[Figure: without position information, {GK, CB, CM, ST} is indistinguishable from {ST, CM, CB, GK}; adding PE(t=0)...PE(t=3) to each embedding makes every position unique. Sinusoidal encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).]

Types of Positional Encoding

Sinusoidal (Original)

Uses sine and cosine functions of different frequencies to create unique position vectors.

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Pros: Works for any sequence length, captures relative positions
Cons: No learnable parameters
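The sinusoidal formula above is easy to implement. A minimal NumPy sketch (assumes an even model dimension; the 100-frame window is illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]             # even dimension indices
    angles = pos / (10000 ** (i / d))           # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(100, 64)   # e.g. 100 frames of tracking, d = 64
```

The encoding is simply added to the input embeddings before the first attention layer.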
Learned Embeddings

Learn a separate embedding vector for each position (up to max sequence length).

PE(pos) = Embedding[pos] (lookup table)
Pros: More flexible, can learn task-specific patterns
Cons: Fixed max length, doesn't generalize to longer sequences
Relative Position (RPE)

Encode relative distance between positions rather than absolute position.

Attention modified: A_ij += bias(i - j)
Used in: Transformer-XL, Music Transformer
Good for: Tasks where relative timing matters more than absolute
Rotary Position (RoPE)

Encodes position by rotating the embedding vector — elegant mathematical properties.

Rotation matrices applied to Q and K
Used in: LLaMA, modern LLMs
Benefit: Naturally captures relative position via rotation
Football: What Position Means

For football trajectories, "position" typically means time step (t=0, t=1, ...). But you might also encode player identity (player 1-22) or even spatial position on pitch as separate embeddings added to the input. The right choice depends on your task.

The Transformer Architecture
The complete picture: encoder-decoder with attention

The Transformer (Vaswani et al., 2017) combined all these ideas into a complete architecture that revolutionized deep learning. Originally for machine translation, it's now the foundation of GPT, BERT, and countless other models — including state-of-the-art sports analytics systems.

[Figure: the Transformer encoder-decoder. Encoder: input embedding + positional encoding → ×N layers of (multi-head self-attention → Add & Norm → feed-forward network → Add & Norm). Decoder: output embedding + positional encoding → masked self-attention → cross-attention to the encoder output → feed-forward network → output probabilities. N = 6 layers in the original Transformer.]

Transformer Components

Encoder Stack

Processes the input sequence (e.g., 50 frames of player positions). Each layer has:

  • Multi-head self-attention: Every position attends to every other position
  • Feed-forward network: 2-layer MLP applied position-wise
  • Residual connections: Add input to output (helps gradient flow)
  • Layer normalization: Stabilizes training
Decoder Stack

Generates output sequence (e.g., future trajectories). Additional components:

  • Masked self-attention: Can only attend to past outputs (no peeking at future!)
  • Cross-attention: Attends to encoder output — "what input is relevant for this output?"
  • Same feed-forward, residuals, and layer norm as encoder
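The "no peeking at future" rule is implemented with a causal mask: scores for future positions are set to −∞ before the softmax, which turns their weights into exactly zero. A minimal NumPy sketch with illustrative random scores:

```python
import numpy as np

T = 5                                              # 5 output timesteps
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))                   # raw decoder attention scores

# causal mask: True above the diagonal marks "future" positions to hide
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf                             # softmax maps -inf to weight 0

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # position t attends only to <= t
```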
The Feed-Forward Network

Applied independently to each position after attention:

FFN(x) = ReLU(x · W_1 + b_1) · W_2 + b_2

Typically expands dimensionality 4× (e.g., 512 → 2048 → 512). This is where "processing" happens after attention gathers information.
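The position-wise FFN is two matrix multiplications with a ReLU in between, applied to each row independently. A minimal NumPy sketch with the usual 4× expansion (the small initialization scale is an illustrative choice):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, ReLU, project back; each row is independent."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 512, 2048                     # 512 -> 2048 -> 512
W1, b1 = rng.normal(size=(d, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.02, np.zeros(d)

x = rng.normal(size=(10, d))            # 10 positions
y = ffn(x, W1, b1, W2, b2)              # same shape out: (10, 512)
```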

Transformer Variants

Encoder-Only (BERT)

Just the encoder. Good for understanding/classification tasks. Bidirectional attention.

Football: Classify game states, predict labels
Decoder-Only (GPT)

Just the decoder. Good for generation. Causal (left-to-right) attention.

Football: Autoregressive trajectory generation
Encoder-Decoder (T5)

Full architecture. Good for sequence-to-sequence tasks.

Football: Input past → output future trajectories
Why Attention Beats Recurrence
The advantages that made Transformers dominant

RNNs process sequences sequentially — information from early timesteps must flow through every intermediate state to reach later timesteps. Attention provides direct access to any part of the sequence, fundamentally changing how information flows.

[Figure: RNN vs attention information flow. RNN: a sequential path t=0 → t=1 → ... → t=4, so information degrades over distance (path length O(T); sequential and slow). Attention: a direct path from any timestep to any other, with equal access regardless of distance (path length O(1); parallel and fast).]

Key Advantages of Attention

1. Constant Path Length

In RNNs, information from t=0 must pass through O(T) steps to reach t=T. In attention, it's a direct connection — O(1) path length. No signal degradation over distance!

2. Parallelization

RNNs compute h_t from h_{t-1} — inherently sequential. Attention computes all outputs simultaneously. Massive speedup on GPUs, especially for long sequences.

3. Interpretability

Attention weights are directly inspectable — you can see which inputs the model focused on for each output. RNN hidden states are opaque.

4. No Vanishing Gradients

RNNs suffer from gradients vanishing over long sequences. Attention has direct gradient paths, making training more stable.

The Tradeoff: Memory

Attention's O(N²) complexity means it uses more memory than RNNs for long sequences. For a 1000-token sequence, that's 1 million attention pairs! Various "efficient attention" methods (Linear Attention, Sparse Attention, FlashAttention) address this.

Attention in Football Analytics
Where attention mechanisms shine on the pitch

Attention is particularly powerful for football because the game is fundamentally about selective focus. Players don't consider all 21 other players equally — they focus on immediate threats, passing options, and tactical targets. Attention learns these focus patterns from data.

[Figure: attention weights on the pitch. The striker's query attends most strongly to the LW — the best passing option; thicker lines mean higher attention weights.]
🎯 Trajectory Prediction

Predict where each player will move by attending to relevant nearby players and past positions.

Spatial attention: Which other players influence my movement?
Temporal attention: Which past moments are predictive of my next move?
Models: TranSPORTmer, Graph Attention Networks for trajectories
⚽ Pass Prediction

Predict pass destination by attending to receiving players weighted by openness, distance, and tactical role.

Query: Ball carrier's embedding
Keys/Values: All potential receivers
Attention: Learns to weight open players in good positions highest
📊 Event Sequence Modeling

Model sequences of match events (pass, dribble, shot, ...) with Transformers to predict next actions.

Input: Sequence of event embeddings (type, location, player)
Attention: Captures dependencies between events (e.g., turnover → counter-attack)
Output: Probability distribution over next event type/location
🔍 Tactical Pattern Recognition

Use attention to identify which players are executing coordinated movements or tactical plays.

Visualization: Attention weights reveal which players are "working together"
Clustering: Players with high mutual attention form tactical units
Insight: Identify pressing triggers, overload patterns, defensive structures
🎮 xG and EPV Models

Enhance expected goals and possession value models with attention over the game state.

Traditional: Handcrafted features (distance, angle, defenders)
With attention: Model learns which players/positions matter for goal probability
Interpretability: Attention weights show "why" a shot was high/low xG
Graph Attention Networks (GAT)

GATs combine GNNs with attention: instead of aggregating all neighbors equally, attention weights are learned for each edge. This is perfect for football — not all nearby players are equally relevant. The marking defender matters more than a distant teammate.
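A GAT head is ordinary attention restricted to graph edges: scores for non-neighbors are masked out before the softmax. A minimal single-head NumPy sketch under the common formulation e_ij = LeakyReLU(a_src·Wh_i + a_dst·Wh_j) — the graph, sizes, and random parameters are illustrative:

```python
import numpy as np

def gat_layer(H, adj, W, a_src, a_dst):
    """One GAT head: attention computed only over edges where adj[i, j] is True."""
    Z = H @ W                                        # (N, d') transformed features
    e = (Z @ a_src)[:, None] + (Z @ a_dst)[None, :]  # (N, N) raw scores
    e = np.where(e > 0, e, 0.2 * e)                  # LeakyReLU, slope 0.2
    e = np.where(adj, e, -np.inf)                    # non-edges get zero attention
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)       # softmax over neighbors only
    return alpha @ Z, alpha

rng = np.random.default_rng(0)
N, d, d_out = 6, 4, 8
H = rng.normal(size=(N, d))                          # 6 players, e.g. nearest-neighbor graph
adj = rng.random((N, N)) < 0.5
np.fill_diagonal(adj, True)                          # self-loops keep every row valid

out, alpha = gat_layer(H, adj, rng.normal(size=(d, d_out)),
                       rng.normal(size=d_out), rng.normal(size=d_out))
```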

Practical Considerations
What to know when implementing attention
Number of Heads

Standard: 8 heads for d=512. More heads capture more patterns, but with diminishing returns. For football with 22 players, 4-8 heads typically suffice.

Number of Layers

Original Transformer: 6 encoder + 6 decoder layers. For smaller datasets (like football tracking), 2-4 layers often work better to avoid overfitting.

Dropout

Apply dropout to attention weights and FFN outputs. Typical rate: 0.1. Helps prevent overfitting, especially with limited training data.

Layer Normalization

Pre-norm (norm before attention) vs. post-norm (norm after). Pre-norm often trains more stably, especially for deeper models.

Learning Rate Warmup

Transformers are sensitive to initial learning rate. Use linear warmup (4000 steps typical), then decay. Critical for stable training.
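The warmup-then-decay schedule from the original Transformer paper (often called the "Noam" schedule) is one line of math — the learning rate grows linearly for `warmup` steps, then decays as the inverse square root of the step:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup to step=warmup, then inverse-sqrt decay; peaks at step=warmup."""
    step = max(step, 1)                 # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = transformer_lr(4000)             # maximum learning rate, end of warmup
```

In practice this multiplier is fed to the optimizer every step (e.g. via a scheduler callback).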

Sequence Length

Memory scales as O(N²). For 25 Hz tracking data, a 4-second window = 100 frames = 10,000 attention pairs. Manageable, but be aware.

Football-Specific Tips
  • Combine with GNNs: Use attention for temporal and GNNs for spatial, or GATs for both
  • Player embeddings: Add learnable embeddings for each player identity (not just position)
  • Team awareness: Include team indicator in embeddings so attention can learn team-specific patterns
  • Relative positions: Consider encoding relative (not absolute) pitch positions
Summary & What's Next
What You Learned
  • ✓ Attention: dynamic weighting of inputs
  • ✓ Query, Key, Value framework
  • ✓ Self-attention: every element attends to all
  • ✓ Multi-head: multiple attention patterns
  • ✓ Positional encoding: adding order info
  • ✓ Transformer architecture: encoder-decoder
  • ✓ Advantages over RNNs: parallelism, direct paths
  • ✓ Football applications: trajectories, passes, tactics
Key Takeaways
  • Attention lets models focus on what matters
  • Transformers replaced RNNs as the dominant architecture
  • Perfect for football: naturally models variable-importance relationships
  • Combine with GNNs for spatiotemporal football data
  • Attention weights are interpretable — see what the model focuses on
The Big Picture

Attention is the key innovation that enabled the deep learning revolution of the 2020s. From GPT to image generation to sports analytics, attention mechanisms are everywhere. For football, the combination of Graph Neural Networks (spatial relationships) + Attention/Transformers (temporal and importance weighting) creates the most powerful models available today. This combination — explored in the STGNN literature — represents the cutting edge of sports analytics AI.

Ready to see attention in action? 🎯 Check out our implementation guides for Graph Attention Networks and Transformer-based trajectory prediction.