4. Graph Neural Networks (GNNs)
How neural networks learn from relationships — modeling the complex web of interactions between players, passes, and positions on the football pitch.
Why Do We Need GNNs?

In the previous articles, we learned about CNNs for spatial patterns in grids and RNNs for temporal patterns in sequences. But what about data that doesn't fit neatly into a grid or a sequence?

The Problem: Not Everything is a Grid or Sequence

Consider a football match with 22 players on the pitch. Each player interacts with multiple others — passing, pressing, marking. These relationships form a network, not a grid (like an image) or a chain (like a sentence). CNNs assume regular spatial structure. RNNs assume a single sequence. Neither captures the arbitrary connectivity between entities in a network.

The Insight: Graphs Are Everywhere

Many real-world systems are naturally described as graphs — collections of entities (nodes) connected by relationships (edges). Social networks, molecules, road systems, and yes, football teams. To learn from this data, we need architectures that respect the graph structure.

Examples of Graph-Structured Data

In General
  • Social networks (users + friendships)
  • Molecules (atoms + bonds)
  • Citation networks (papers + references)
  • Road networks (intersections + roads)
  • Knowledge graphs (entities + relations)
In Football
  • Players on pitch (players + interactions)
  • Passing networks (players + passes)
  • Team formations (positions + connections)
  • Pressing structures (defenders + coverage)
  • Tactical relationships (roles + dependencies)

Graph Neural Networks solve this by operating directly on graph-structured data. They learn to combine information from connected nodes, allowing each entity to understand its context within the network.

What is a Graph?
The mathematical structure behind networks

Before diving into GNNs, let's make sure we understand what a graph is. A graph is simply a way to represent entities and their relationships.

A Simple Graph
[Figure: A simple graph as a social network. People (Alice, Bob, Carol, Dave, Eve, Frank) are nodes; friendships are edges.]

Formal Definition

A graph G is defined as:
G = (V, E)
V = Set of nodes (vertices) — the entities (e.g., players)
E = Set of edges — the relationships between entities (e.g., passes, proximity)

Additional Graph Components

X: Node features — attributes for each node (e.g., player position, velocity)
E_attr: Edge features — attributes for each edge (e.g., distance, pass probability)
A: Adjacency matrix — N×N matrix where A[i,j] = 1 if nodes i and j are connected
N(v): Neighborhood — the set of nodes connected to node v
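The components above map directly onto arrays. A minimal sketch in numpy, using a hypothetical three-player chain (CB, CM, ST) as the graph:

```python
import numpy as np

# A tiny 3-node graph: CB -- CM -- ST (a chain). Node order: 0 = CB, 1 = CM, 2 = ST.
A = np.array([
    [0, 1, 0],   # CB connects to CM
    [1, 0, 1],   # CM connects to CB and ST
    [0, 1, 0],   # ST connects to CM
])

# X: node feature matrix (here just x, y pitch position per player)
X = np.array([
    [20.0, 30.0],  # CB
    [50.0, 50.0],  # CM
    [70.0, 50.0],  # ST
])

def neighborhood(A, v):
    """N(v): indices of nodes connected to node v."""
    return np.flatnonzero(A[v])

print(neighborhood(A, 1))  # CM's neighbors: [0 2]
```

Note that A is symmetric here because the graph is undirected; a directed passing network would drop that symmetry.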

Types of Graphs

Undirected Graph

Edges have no direction — if A connects to B, then B connects to A. Example: Players within 10 meters of each other.

Directed Graph

Edges have direction — A→B doesn't mean B→A. Example: Passing network (passes go one direction).

Weighted Graph

Edges have weights representing strength/importance. Example: Number of passes between players.

Dynamic Graph

Structure changes over time. Example: Player positions and interactions evolve during a match.

Football as a Graph
[Figure: Football as a graph. Home and away players (GK, defenders, midfielders, forwards) are nodes, connected by proximity and passes; teammate edges link players within each side, opponent edges link players across sides.]
Football Graph Construction

In football analytics, we typically construct graphs where nodes = players and edges = interactions. Edges can be defined by: (1) spatial proximity (within X meters), (2) passing relationships, (3) marking assignments, or (4) fully connected within teams. Node features include position (x, y), velocity, acceleration, team, and role.
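Construction (1), spatial proximity, can be sketched in a few lines of numpy. The positions and the 10-metre radius below are illustrative assumptions, not values from any real dataset:

```python
import numpy as np

def proximity_graph(positions, radius=10.0):
    """Build an undirected adjacency matrix: edge if two players are within `radius` metres."""
    P = np.asarray(positions, dtype=float)
    # Pairwise Euclidean distances between all players
    diff = P[:, None, :] - P[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    A = (dist <= radius).astype(int)
    np.fill_diagonal(A, 0)  # no self-edges at construction time
    return A

# Four players at (x, y) positions in metres: two close pairs, far apart
pos = [(10, 10), (15, 12), (40, 40), (44, 43)]
A = proximity_graph(pos, radius=10.0)
print(A)
```

The result has edges only within each close pair, giving two disconnected components; the passing-network and marking-assignment constructions would instead fill A from event or tracking data.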

The Core Idea: Message Passing
How GNNs propagate information through the graph

The fundamental operation in GNNs is message passing (also called neighborhood aggregation). The idea is beautifully simple: each node updates its representation by gathering and combining information from its neighbors.

The Key Insight

In a football context: a central midfielder's representation should incorporate information about nearby teammates (passing options), nearby opponents (pressing threats), and the ball position. After message passing, the CM "knows" about its local context — who's around, where the space is, what options exist.

The Three Steps of Message Passing

1. MESSAGE: Create Messages from Neighbors

Each neighbor creates a "message" — a transformed version of its features to send to the target node.

m_j = MESSAGE(h_j, e_ij) — often just h_j or a linear transform W·h_j
2. AGGREGATE: Combine All Neighbor Messages

Combine messages from all neighbors into a single aggregated message. Must be permutation invariant (order doesn't matter).

m_N(i) = AGGREGATE({ m_j : j ∈ N(i) }) — sum, mean, max, or attention
3. UPDATE: Compute New Node Representation

Combine the node's own features with the aggregated neighbor message to produce an updated representation.

h'_i = UPDATE(h_i, m_N(i)) — often σ(W·[h_i || m_N(i)]) or similar
[Figure: Message passing, step by step. Step 1: each node starts with its initial features (e.g., the CM with [x, y, v]). Step 2: neighbor messages (m_CB, m_RB, m_LW, m_RW) are gathered and aggregated (sum / mean / max). Step 3: the node updates via h' = σ(W·[h, m]), combining self and neighbors; the CM now "knows" about its neighbors. One message passing layer gathers info from 1-hop neighbors; stacking L layers means each node knows about its L-hop neighborhood.]

The General Message Passing Formula

In one equation:
h_i^(l+1) = UPDATE(h_i^(l), AGGREGATE({ MESSAGE(h_j^(l), e_ij) : j ∈ N(i) }))
h_i^(l) = Node i's representation at layer l
N(i) = Neighbors of node i
e_ij = Edge features between i and j (optional)
l = Layer index (each layer = one round of message passing)
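The three steps and the general formula can be sketched as a single numpy function. This is a didactic toy, not a library implementation: MESSAGE is the identity, AGGREGATE is a mean, and UPDATE uses two hypothetical weight matrices W_self and W_neigh with a ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)

def message_passing_layer(H, A, W_self, W_neigh):
    """One round of message passing with mean aggregation.

    MESSAGE:   m_j = h_j (identity)
    AGGREGATE: mean over neighbors (permutation invariant)
    UPDATE:    h'_i = ReLU(h_i W_self + m_N(i) W_neigh)
    """
    n = A.shape[0]
    H_new = np.zeros((n, W_self.shape[1]))
    for i in range(n):
        neigh = np.flatnonzero(A[i])
        m = H[neigh].mean(axis=0) if len(neigh) else np.zeros(H.shape[1])
        H_new[i] = np.maximum(0.0, H[i] @ W_self + m @ W_neigh)  # ReLU
    return H_new

# 4 nodes in a cycle, 3 input features mapped to 5 hidden features
A = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]])
H = rng.normal(size=(4, 3))
W_self  = rng.normal(size=(3, 5))
W_neigh = rng.normal(size=(3, 5))
print(message_passing_layer(H, A, W_self, W_neigh).shape)  # (4, 5)
```

Real implementations vectorize the loop (e.g., as sparse matrix products), but the per-node form mirrors the equation above most directly.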
Why Permutation Invariance Matters

The aggregation function must give the same result regardless of the order we process neighbors. If a CM has neighbors [CB, RB, LW, RW], aggregating in order [CB, RB, LW, RW] must equal [LW, CB, RW, RB]. Sum, mean, and max all have this property. This is crucial because graphs have no natural ordering of nodes!
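This property is easy to check numerically. Using the CM's four neighbor messages in two different orders:

```python
import numpy as np

msgs = np.array([[2.0], [5.0], [8.0], [1.0]])   # neighbor messages in one order
shuf = msgs[[2, 0, 3, 1]]                       # same messages, shuffled

# Sum, mean, and max are permutation invariant: neighbor order is irrelevant
assert np.allclose(msgs.sum(axis=0), shuf.sum(axis=0))
assert np.allclose(msgs.mean(axis=0), shuf.mean(axis=0))
assert np.allclose(msgs.max(axis=0), shuf.max(axis=0))

# Concatenation is NOT invariant, which is why it cannot be used here
assert not np.allclose(msgs.ravel(), shuf.ravel())
print("sum/mean/max are order-independent; concatenation is not")
```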

Stacking Layers: The Receptive Field
How deeper networks see further into the graph

After one message passing layer, each node knows about its immediate neighbors (1-hop). But what about nodes further away? By stacking multiple layers, information propagates further through the graph.

[Figure: Stacking GNN layers expands the receptive field. Input features h⁽⁰⁾ are the raw features; after message passing layer 1, h⁽¹⁾ carries 1-hop info; after layer 2, h⁽²⁾ carries 2-hop info; after layer 3 (output), h⁽³⁾ carries 3-hop info. After L layers, the center node has aggregated information from all nodes within L hops.]
Layer 1

Each node aggregates info from 1-hop neighbors. The CM knows about adjacent players.

Layer 2

Each node now has info about 2-hop neighbors. The CM knows about neighbors' neighbors.

Layer L

Each node has aggregated info from all nodes within L hops. The CM understands the broader tactical picture.

The Over-Smoothing Problem

Don't stack too many layers! With many layers, all nodes end up with similar representations because they've all aggregated information from the entire graph. This is called over-smoothing. For most tasks, 2-4 layers is optimal. In football graphs with ~22 nodes, even 2-3 layers often covers the full graph.

Football Analogy

Think of message passing like players communicating during a match. In one "round" (layer), each player talks to their immediate neighbors. After two rounds, information about the whole defensive line has reached the striker via the midfield. After three rounds, everyone has a sense of the full team shape. More rounds just add noise.
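The "information within L hops" claim can be verified with adjacency-matrix powers: entry (i, j) of (A + I)^L is nonzero exactly when j is within L hops of i. A sketch on a hypothetical 5-player chain (ST, CM, CM, CB, GK):

```python
import numpy as np

# A chain of 5 players: ST - CM - CM - CB - GK (indices 0..4)
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
A_hat = A + np.eye(5)  # self-loops: a node always keeps its own information

def reachable_within(A_hat, hops):
    """Boolean mask: entry [i, j] is True iff j is within `hops` hops of i."""
    return np.linalg.matrix_power(A_hat, hops) > 0

# After 1 layer the striker (node 0) only knows itself and its neighbor;
# after 4 layers, information from the GK (node 4) has reached it.
print(reachable_within(A_hat, 1)[0])  # [ True  True False False False]
print(reachable_within(A_hat, 4)[0])  # [ True  True  True  True  True]
```

On a realistic football graph the diameter is much smaller than a chain's, which is why 2-3 layers already cover all 22 players.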

Aggregation Functions
Different ways to combine neighbor information

The choice of aggregation function significantly impacts what the GNN can learn. Different functions have different strengths and weaknesses.

[Figure: Common aggregation functions applied to neighbor messages [2, 5, 8, 1]. SUM: 2+5+8+1 = 16, preserves total info. MEAN: 16/4 = 4, normalized by degree. MAX: max(2, 5, 8, 1) = 8, captures extremes. ATTENTION: Σ αᵢ·mᵢ with learned weights, weights by importance. The choice depends on the task; attention is the most powerful but most expensive.]
SUM Aggregation
m = Σ_{j∈N(i)} h_j

Adds up all neighbor features. Preserves information about how many neighbors have certain features.

Football use: "How much total pressing intensity surrounds this player?"
MEAN Aggregation
m = (1/|N(i)|) × Σ_{j∈N(i)} h_j

Averages neighbor features. Normalizes by node degree, making it robust to varying connectivity.

Football use: "What's the average quality of passing options?"
MAX Aggregation
m = max_{j∈N(i)} h_j (element-wise)

Takes maximum value per feature. Captures the most extreme/important neighbor.

Football use: "What's the most dangerous threat nearby?"
ATTENTION Aggregation
m = Σ_{j∈N(i)} α_ij × h_j (learned α)

Learns importance weights for each neighbor. Most expressive but most expensive.

Football use: "Weight the ball carrier higher than distant teammates."
Which to Choose?

SUM when neighbor count matters (e.g., "surrounded by opponents"). MEAN for general-purpose, stable training. MAX when you care about extremes. ATTENTION when different neighbors have different importance and you have enough data to learn weights.
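All four choices fit one small function. The sketch below reproduces the [2, 5, 8, 1] example from the figure; the attention scores passed in are hypothetical (in a GAT they would be computed from node features, not supplied by hand):

```python
import numpy as np

def aggregate(messages, how="mean", alpha=None):
    """Combine neighbor messages with one of the schemes above."""
    m = np.asarray(messages, dtype=float)
    if how == "sum":
        return m.sum(axis=0)
    if how == "mean":
        return m.mean(axis=0)
    if how == "max":
        return m.max(axis=0)
    if how == "attention":
        a = np.exp(alpha) / np.exp(alpha).sum()   # softmax over raw scores
        return (a[:, None] * m).sum(axis=0) if m.ndim > 1 else (a * m).sum()
    raise ValueError(how)

msgs = [2.0, 5.0, 8.0, 1.0]                 # the scalar example from the figure
print(aggregate(msgs, "sum"))               # 16.0
print(aggregate(msgs, "mean"))              # 4.0
print(aggregate(msgs, "max"))               # 8.0
# Attention with scores favouring the third neighbor pulls the result toward 8
print(aggregate(msgs, "attention", alpha=np.array([0.0, 0.0, 3.0, 0.0])))
```

Note that the attention result lands between the mean and the max: it interpolates between "treat everyone equally" and "only the most important neighbor counts".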

Popular GNN Architectures
The most widely used GNN variants

Different GNN architectures implement message passing with different design choices. Here are the most influential ones:

[Figure: Popular GNN architectures. GCN (Graph Convolutional): normalized mean aggregation; simple, fast, effective. GraphSAGE (Sample & Aggregate): samples neighbors; scales to large graphs. GAT (Graph Attention): learns edge weights; most expressive. GIN (Isomorphism Network): sum + MLP; maximally powerful. Football: use GAT when player importance varies (e.g., ball carrier), GCN for uniform interactions, GraphSAGE for large tracking datasets.]
GCN (Graph Convolutional Network)
Kipf & Welling, 2017

The foundational modern GNN. Uses normalized mean aggregation with self-loops.

H^(l+1) = σ(D̃^(-½) Ã D̃^(-½) H^(l) W^(l))
à = A + I (add self-loops), D̃ = degree matrix of Ã. Symmetric normalization prevents exploding/vanishing features.
GraphSAGE (Sample and Aggregate)
Hamilton et al., 2017

Samples a fixed number of neighbors instead of using all. Enables mini-batch training on huge graphs.

h_i^(l+1) = σ(W · CONCAT(h_i^(l), AGG({h_j : j ∈ sample(N(i))})))
Can use mean, max, or LSTM aggregators. Scales to graphs with millions of nodes.
GAT (Graph Attention Network)
Veličković et al., 2018

Learns attention weights for each edge, allowing the model to focus on important neighbors.

h_i^(l+1) = σ(Σ_{j∈N(i)} α_ij W h_j^(l)), where α_ij = softmax(LeakyReLU(a^T[Wh_i || Wh_j]))
Multi-head attention (like Transformers) provides stability. Most expressive but computationally expensive.
GIN (Graph Isomorphism Network)
Xu et al., 2019

Provably as powerful as the WL graph isomorphism test. Uses sum aggregation with MLP update.

h_i^(l+1) = MLP((1 + ε) · h_i^(l) + Σ_{j∈N(i)} h_j^(l))
ε is a learnable or fixed parameter. Theoretically optimal for graph-level discrimination.
GCN Deep Dive: The Most Common GNN
Understanding the Graph Convolutional Network in detail

The Graph Convolutional Network (GCN) is the most widely used GNN architecture. Let's break down exactly how it works with a step-by-step example.

The GCN Layer Equation

Matrix form:
H^(l+1) = σ(D̃^(-½) Ã D̃^(-½) H^(l) W^(l))
Per-node form:
h_i^(l+1) = σ(W^(l) · Σ_{j∈N(i)∪{i}} (1/√(d_i · d_j)) · h_j^(l))

Symbol Definitions

Ã = A + I — adjacency matrix with self-loops added
D̃ = Degree matrix of Ã (diagonal, D̃_ii = Σ_j Ã_ij)
H^(l) = Node feature matrix at layer l (N × F_l)
W^(l) = Learnable weight matrix (F_l × F_{l+1})
σ = Activation function (typically ReLU)
d_i = Degree of node i in Ã (number of connections + 1)

Worked Example: GCN on a Triangle

Graph: 3 players in a triangle
Nodes: CB, CM, ST (all connected)
Adjacency A =
[0, 1, 1] (CB)
[1, 0, 1] (CM)
[1, 1, 0] (ST)
Node Features (x, y position):
CB: h_1 = [20, 30] (deep position)
CM: h_2 = [50, 50] (central)
ST: h_3 = [70, 50] (forward)
Step 1: Add Self-Loops (Ã = A + I)
à = [1, 1, 1; 1, 1, 1; 1, 1, 1] (all 1s since fully connected + self)
Step 2: Compute Degrees
d_1 = d_2 = d_3 = 3 (each node connects to 2 neighbors + self)
D̃ = diag([3, 3, 3])
Step 3: Symmetric Normalization
D̃^(-½) Ã D̃^(-½) = (1/3) × Ã = [1/3, 1/3, 1/3; ...]
(Each neighbor contributes equally, weighted by 1/√(d_i × d_j) = 1/3)
Step 4: Aggregate for CM (node 2)
h_2^agg = (1/3)×h_1 + (1/3)×h_2 + (1/3)×h_3
h_2^agg = (1/3)×[20,30] + (1/3)×[50,50] + (1/3)×[70,50]
h_2^agg = [6.7+16.7+23.3, 10+16.7+16.7] = [46.7, 43.3]
Step 5: Transform and Activate
h_2^(1) = ReLU(W × [46.7, 43.3])
(W transforms to new feature dimension, ReLU adds non-linearity)
Result: CM's new features now incorporate information about CB and ST positions!
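The whole worked example fits in a short numpy sketch of the GCN propagation rule. Setting W to the identity and skipping the activation lets us read off the aggregation step (Step 4) directly:

```python
import numpy as np

def gcn_layer(H, A, W, activation=True):
    """One GCN layer: sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # Step 1: add self-loops
    d = A_hat.sum(axis=1)                          # Step 2: degrees of A~
    D_inv_sqrt = np.diag(d ** -0.5)
    H_agg = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H    # Steps 3-4: normalize + aggregate
    out = H_agg @ W                                # Step 5: transform
    return np.maximum(0.0, out) if activation else out

# The triangle from the worked example: CB, CM, ST, fully connected
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.array([[20.0, 30.0], [50.0, 50.0], [70.0, 50.0]])

# With W = I and no activation, row 1 is exactly CM's aggregated features
agg = gcn_layer(H, A, np.eye(2), activation=False)
print(agg[1])  # ~[46.67, 43.33], matching Step 4 above
```

In practice W would be a learned matrix mapping to a new feature dimension, and the ReLU would be kept on.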
From Nodes to Graphs: Readout Functions
Making predictions about entire graphs, not just nodes

Sometimes we want to predict properties of individual nodes (e.g., "will this player receive a pass?"). Other times, we want to predict properties of the entire graph (e.g., "will this attacking situation result in a goal?"). For graph-level predictions, we need a readout (or pooling) function.

Types of Graph Tasks

Node-Level

Predict for each node. Output: N predictions.

Example: Predict each player's next position
Edge-Level

Predict for each edge. Output: E predictions.

Example: Predict pass success between player pairs
Graph-Level

Predict for entire graph. Output: 1 prediction.

Example: Predict xG for this attacking situation

Readout Functions

Global readout:
h_G = READOUT({h_i : i ∈ V}) = sum, mean, or max over all nodes
Sum/Mean Readout
h_G = Σ_i h_i or (1/N) Σ_i h_i

Simple aggregation of all node features. Works well for many tasks.

Hierarchical Pooling

Progressively coarsen the graph (cluster nodes, pool, repeat). Captures multi-scale structure.

Examples: DiffPool, SAGPool, MinCutPool
Attention Readout

Learn which nodes are most important for the graph-level prediction.

h_G = Σ_i α_i h_i (learned attention weights)
Set2Set

Uses an LSTM to iteratively attend to nodes, building a summary. More expressive than simple aggregation.

Football Example: xG from Graph

To predict xG for a shot situation: (1) Build graph with shooter + nearby players as nodes, (2) Run GNN layers to let each player "understand" their context, (3) Apply readout (e.g., attention focusing on shooter and goalkeeper), (4) Feed graph representation to MLP → xG probability.
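Steps (3) and (4) of that pipeline can be sketched as follows. Everything here is a shape-level illustration with untrained random weights: the function name, the hard-coded attention bias toward the shooter, and the tiny MLP head are all hypothetical stand-ins for learned components:

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xg_from_graph(H_final, shooter_idx):
    """Graph-level readout + MLP head (untrained weights, shapes only).

    Readout: attention that hard-codes extra weight on the shooter,
    standing in for weights a trained model would learn.
    """
    n, f = H_final.shape
    scores = np.zeros(n)
    scores[shooter_idx] = 2.0                  # bias attention toward the shooter
    alpha = np.exp(scores) / np.exp(scores).sum()
    h_G = alpha @ H_final                      # weighted sum over node embeddings
    W1, W2 = rng.normal(size=(f, 8)), rng.normal(size=(8,))
    hidden = np.maximum(0.0, h_G @ W1)         # small MLP head
    return sigmoid(hidden @ W2)                # squash to a probability

H_final = rng.normal(size=(6, 4))              # 6 players, 4-dim GNN embeddings
xg = xg_from_graph(H_final, shooter_idx=0)
print(0.0 < xg < 1.0)  # True: the output is always a valid probability
```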

Training GNNs
How to learn the right parameters

GNNs are trained using the same principles as other neural networks: forward pass, compute loss, backpropagate, update weights. The main difference is that the forward pass involves the graph structure.

1. Forward pass: Run message passing layers, get node/edge/graph representations
2. Prediction head: Apply MLP to representations to get predictions
3. Compute loss: Compare predictions to ground truth labels
4. Backprop: Compute gradients for all W matrices in all layers
5. Update: Apply optimizer (Adam, SGD) to update weights
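The loop above can be demonstrated end to end on a toy node-regression problem. To keep the sketch self-contained we use a single linear GCN layer (which doubles as the prediction head), random data on a ring graph, and the analytic MSE gradient with plain gradient descent; a real model would use an autodiff framework, nonlinearities, and Adam:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 6 nodes on a ring, 3 input features, one regression target per node
n, f = 6, 3
A = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
A_hat = A + np.eye(n)
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
N_hat = D_inv_sqrt @ A_hat @ D_inv_sqrt        # fixed normalized adjacency

H = rng.normal(size=(n, f))                    # node features
y = rng.normal(size=(n, 1))                    # regression targets
W = rng.normal(size=(f, 1)) * 0.1              # learnable weights
lr = 0.1

losses = []
for step in range(200):
    pred = N_hat @ H @ W                       # 1-2. forward pass (linear GCN layer)
    err = pred - y
    losses.append(float((err ** 2).mean()))    # 3. MSE loss
    grad_W = (2.0 / n) * (N_hat @ H).T @ err   # 4. backprop (analytic gradient)
    W -= lr * grad_W                           # 5. gradient-descent update
print(losses[0], "->", losses[-1])             # loss decreases over training
```

Note that the graph structure enters only through N_hat in the forward pass; everything after that is ordinary supervised learning.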

Common Loss Functions

Node Classification

Cross-entropy loss on labeled nodes

L = -Σ_i y_i log(ŷ_i)
Node Regression

MSE on continuous targets

L = (1/N) Σ_i (ŷ_i - y_i)²
Link Prediction

Binary cross-entropy on edge existence

L = -Σ_{(i,j)} [y_ij log(σ(h_i·h_j)) + (1 - y_ij) log(1 - σ(h_i·h_j))]
Graph Classification

Cross-entropy on graph-level labels

L = -y_G log(ŷ_G)
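These losses are a few lines each in numpy; a minimal sketch (unbatched, no numerical-stability tricks such as log-sum-exp):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """Node/graph classification: -sum y log p, with p a probability vector."""
    return float(-(np.asarray(y_onehot) * np.log(p)).sum())

def mse(y, y_hat):
    """Node regression: mean squared error."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(((y_hat - y) ** 2).mean())

def link_bce(y, h_i, h_j):
    """Link prediction: binary cross-entropy on sigma(h_i . h_j)."""
    p = 1.0 / (1.0 + np.exp(-(h_i @ h_j)))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(mse([1.0, 2.0], [1.5, 2.5]))                                    # 0.25
print(cross_entropy([0, 1, 0], np.array([0.2, 0.7, 0.1])))            # ~0.357
print(link_bce(1, np.array([1.0, 0.5]), np.array([0.8, 0.4])))        # small if dot is large
```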

Practical Training Tips

Use Skip Connections

Add residual connections (h^(l+1) = h^(l+1) + h^(l)) to help gradient flow and reduce over-smoothing.

Layer Normalization

Apply LayerNorm after each GNN layer to stabilize training.

Dropout on Edges

Randomly drop edges during training (DropEdge) to prevent overfitting to graph structure.

2-3 Layers Usually Enough

For small graphs (like 22 players), 2-3 layers typically covers the full graph. More layers can hurt.

Football Applications
GNNs on the pitch

GNNs are particularly well-suited for football analytics because football is fundamentally about relationships between players. Here are the main applications:

Player Trajectory Prediction

Predict where each player will move in the next 1-5 seconds, considering team context and opponent positions.

Graph: 22 players as nodes, edges by proximity | Node features: position, velocity, acceleration | Output: Future (x, y) per player
Pass Success Prediction

Predict the probability of a pass succeeding between any two players, given the current game state.

Graph: All players | Task: Edge classification | Model: GNN + edge-level prediction head
Expected Possession Value (EPV)

Predict the value of the current game state — how likely is the attacking team to score vs. lose possession?

Graph: Current frame snapshot | Task: Graph-level regression | Output: Probability of eventual goal
Formation Recognition

Classify the current tactical formation (4-3-3, 4-4-2, 3-5-2, etc.) from player positions.

Graph: 10 outfield players (one team) | Task: Graph classification | Labels: Formation categories
Defensive Coverage Analysis

Identify gaps in defensive structure by analyzing the graph of defenders and their coverage zones.

Graph: Defending team | Edges: Coverage overlap | Task: Identify weak edges/nodes
Team Style Embedding

Learn a vector representation of team playing style from their passing network structure.

Graph: Aggregated passing network (90 mins) | Task: Graph embedding | Use: Similarity, clustering
Why GNNs Excel in Football

Football involves permutation invariant reasoning — whether we label the CM as player 4 or player 8 shouldn't change predictions. GNNs naturally handle this. They also capture relational reasoning — the striker's value depends on the quality of service from midfielders, the goalkeeper's position, the defensive line height, etc. No other architecture captures these dependencies as naturally.

GNNs vs CNNs vs RNNs
Understanding when to use each architecture

Each architecture is designed for different data structures. Here's how they compare:

Aspect                 | CNN                           | RNN                           | GNN
-----------------------|-------------------------------|-------------------------------|---------------------------
Data structure         | Regular grids (images)        | Sequences (time series)       | Arbitrary graphs
Connectivity           | Fixed (neighbors in grid)     | Fixed (prev → next)           | Flexible (any connections)
Key operation          | Convolution (sliding kernel)  | Recurrence (hidden state)     | Message passing
Football use           | Pitch heatmaps, video frames  | Event sequences, trajectories | Player interactions
Permutation invariance | No                            | No                            | Yes ✓
Combining Architectures

In practice, we often combine these! For spatiotemporal football data: use GNNs for spatial relationships between players at each timestep, and RNNs/Transformers for temporal relationships across timesteps. This combination is called a Spatiotemporal GNN — which is exactly what we'll cover in the next article!

Summary & What's Next
What You Learned
  • ✓ Graphs: nodes, edges, adjacency matrices
  • ✓ Why CNNs/RNNs can't handle graph data
  • ✓ Message passing: aggregate neighbor info
  • ✓ Aggregation functions: sum, mean, max, attention
  • ✓ GNN architectures: GCN, GraphSAGE, GAT, GIN
  • ✓ Stacking layers expands receptive field
  • ✓ Readout for graph-level predictions
  • ✓ Football applications: trajectories, passes, xG
Coming Next in This Series
  • 5. Spatiotemporal GNNs for Football
  • Combining GNNs with RNNs/Transformers for dynamic graph data
  • Advanced football analytics use cases

Excited about Graph Neural Networks? Dive into the code: check out our GNN GitHub repository for implementations, tutorials, and resources to kickstart your hands-on journey.
