โšฝ๐Ÿ’ท๐Ÿ“Š
3. Recurrent Neural Networks (RNNs)
How neural networks learn from sequences โ€” remembering the past to understand the present, from player trajectories to match narratives.
Sequence Modeling · Deep Learning · 45 min read
Why Do We Need RNNs?

In the previous articles, we learned about feedforward neural networks (MLPs) and CNNs. Both process each input independently โ€” they have no memory of previous inputs. But many real-world problems involve sequences where order matters.

The Problem: No Memory

Consider predicting a player's next position. A feedforward network sees only the current position โ€” it doesn't know if the player is running forward, backward, or standing still. Without context from previous time steps, it can't understand motion or intent.

The Insight: Sequential Data Is Everywhere

Football is inherently sequential: a pass leads to a dribble leads to a shot. A player's movement trajectory unfolds over time. Match momentum ebbs and flows. To model these patterns, we need networks that can remember and reason over time.

Examples of Sequential Data

In General
  • โ€ข Text (words in a sentence)
  • โ€ข Speech (audio samples over time)
  • โ€ข Video (frames in sequence)
  • โ€ข Stock prices (values over time)
  • โ€ข Music (notes in order)
In Football
  • โ€ข Player trajectories (x, y positions over time)
  • โ€ข Event sequences (pass โ†’ dribble โ†’ shot)
  • โ€ข Ball movement patterns
  • โ€ข Team formation evolution
  • โ€ข Match momentum/pressure trends

Recurrent Neural Networks solve this by introducing recurrent connections โ€” the network's output at one time step becomes part of its input at the next time step, creating a form of memory.

The Basic RNN Architecture
Adding memory through recurrence

An RNN processes sequences one element at a time, maintaining a hidden state that acts as memory. At each time step, it combines the current input with the previous hidden state to produce an output and a new hidden state.

RNN Cell Structure
[Diagram: RNN cell processing one time step — the input xₜ and the previous hidden state hₜ₋₁ enter the cell, which computes hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b) and emits the output yₜ; a recurrent connection feeds hₜ back in at the next step.]

The RNN Equations

Hidden state update:
hโ‚œ = tanh(Wโ‚• ยท hโ‚œโ‚‹โ‚ + Wโ‚“ ยท xโ‚œ + b)
Output:
yโ‚œ = Wแตง ยท hโ‚œ + bแตง
hโ‚œ: Hidden state at time t (the "memory")
xโ‚œ: Input at time t
Wโ‚•, Wโ‚“, Wแตง: Weight matrices (learned)
tanh: Activation function (squashes to [-1, 1])
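These equations translate almost line for line into code. Here is a minimal NumPy sketch of one time step — the weights are randomly initialized purely for illustration (a real model learns them), and the sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 2  # hypothetical hidden-state and input sizes

# Learned parameters in practice; random here for illustration
W_h = rng.normal(scale=0.5, size=(hidden, hidden))  # hidden-to-hidden
W_x = rng.normal(scale=0.5, size=(hidden, inputs))  # input-to-hidden
b = np.zeros(hidden)
W_y = rng.normal(scale=0.5, size=(1, hidden))       # hidden-to-output
b_y = np.zeros(1)

def rnn_step(x_t, h_prev):
    """One RNN time step: hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b), yₜ = Wᵧ·hₜ + bᵧ."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

x_t = np.array([0.3, -0.1])   # e.g. a player's current (x, y) position
h_prev = np.zeros(hidden)     # initial hidden state h₀
h_t, y_t = rnn_step(x_t, h_prev)
```

Because tanh squashes its input, every component of h_t lies in [-1, 1].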

Unrolling Through Time

To understand how RNNs process sequences, we "unroll" the network through time. Each time step uses the same weights โ€” this is called weight sharing across time, similar to how CNNs share weights across space.
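A short sketch of the unrolled computation makes the weight sharing explicit — the loop below reuses the same W_h and W_x at every step (sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, inputs, T = 4, 2, 6

W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, inputs))
b = np.zeros(hidden)

def run_rnn(xs):
    """Unroll the RNN over a sequence, reusing the SAME weights at every step."""
    h = np.zeros(hidden)  # h₀
    states = []
    for x_t in xs:        # t = 1 .. T
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return states

xs = rng.normal(size=(T, inputs))  # a toy trajectory of T positions
states = run_rnn(xs)               # states[-1] depends on every input seen
```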

Unrolled RNN
[Diagram: Unrolled RNN processing a sequence — at t = 1, 2, 3, …, T the same RNN cell (shared weights W) maps xₜ and hₜ₋₁ to yₜ and hₜ, so the hidden state hₜ carries information from all previous inputs.]

Symbol Definitions

hโ‚œHidden state at time t โ€” the network's memory
xโ‚œInput vector at time t
yโ‚œOutput at time t
WWeight matrices (shared across all time steps)
Football Analogy

Think of the hidden state as a coach's mental model of the game. At each moment, the coach combines what they see now (xโ‚œ = current player positions) with their memory of how play has developed (hโ‚œโ‚‹โ‚). Their updated understanding (hโ‚œ) informs their decisions and predictions about what will happen next.

Types of Sequence Tasks
Different input-output configurations

RNNs are flexible โ€” they can handle various input/output configurations depending on the task:

[Diagram: Four sequence-task configurations — Many-to-One (e.g. sentiment analysis), One-to-Many (e.g. image captioning), Many-to-Many (e.g. POS tagging, player tracking), and Encoder-Decoder (encode context, then decode; e.g. translation, summarization). Football examples: match sequence → final score prediction (many-to-one), player positions → next positions (many-to-many), pass sequence → expected possession path (encoder-decoder). RNNs excel when the order of events matters.]
Many-to-One

Process a sequence, output a single value. Use the final hidden state for classification/regression.

Example: Classify a possession sequence as "goal" or "no goal"
One-to-Many

Single input generates a sequence of outputs. Often used with a "seed" input.

Example: Generate a tactical play from an initial formation
Many-to-Many (Aligned)

Output at each time step. Input and output sequences have the same length.

Example: Predict next position for each player at each frame
Encoder-Decoder

Encode input sequence into a context vector, then decode to output sequence of different length.

Example: Translate event sequence to natural language commentary
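To make the first and third configurations concrete, here is a NumPy sketch (random weights and toy data, purely illustrative) showing how the same recurrence supports both a many-to-one head and an aligned many-to-many head:

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, T = 4, 2, 5
W_h = rng.normal(scale=0.5, size=(H, H))
W_x = rng.normal(scale=0.5, size=(H, D))
w_out = rng.normal(scale=0.5, size=H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_states(xs):
    """Run the recurrence and keep every hidden state."""
    h = np.zeros(H)
    out = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        out.append(h)
    return out

xs = rng.normal(size=(T, D))
hs = hidden_states(xs)

# Many-to-one: a single prediction from the FINAL hidden state
# (e.g. P(goal) for a whole possession sequence)
p_goal = sigmoid(w_out @ hs[-1])

# Many-to-many (aligned): one prediction per time step
# (e.g. a score for every frame)
per_step = [sigmoid(w_out @ h) for h in hs]
```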
The Vanishing Gradient Problem
Why basic RNNs struggle with long sequences

Basic RNNs have a critical flaw: they struggle to learn long-range dependencies. When training with backpropagation through time (BPTT), gradients must flow backward through many time steps. At each step, they get multiplied by the weight matrix and activation derivatives.

[Diagram: The vanishing gradient problem — gradient magnitude falls from strong at t = 7 to nearly zero at t = 1; gradients shrink exponentially as they backpropagate through time.]
The Problem

If the magnitudes of these factors are consistently less than 1, gradients shrink exponentially — this is the vanishing gradient. If consistently greater than 1, they explode. Either way, the network can't learn connections between distant time steps.

The Math Behind Vanishing Gradients

Gradient through T time steps:
โˆ‚L/โˆ‚hโ‚ = โˆ‚L/โˆ‚hโ‚œ ร— โˆโ‚–โ‚Œโ‚‚แต€ (โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚)
Each term โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚ involves:
โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚ = Wโ‚•แต€ ร— diag(tanh'(zโ‚–))
Since tanh'(x) โ‰ค 1 and is often much smaller, repeated multiplication causes gradients to vanish. For a 100-step sequence, even 0.9ยนโฐโฐ โ‰ˆ 0.00003!
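The arithmetic is easy to check directly. Treating each ∂hₖ/∂hₖ₋₁ as a scalar factor of magnitude 0.9 (a simplifying stand-in for the full Jacobian):

```python
# Product of T per-step gradient factors ∂hₖ/∂hₖ₋₁.
# tanh'(z) = 1 - tanh²(z) ≤ 1, so each factor is typically below 1.
factor = 0.9
for T in (10, 50, 100):
    print(T, factor ** T)  # shrinks exponentially with T

g100 = factor ** 100  # the 100-step case from the text, ≈ 3e-5

# Factors slightly above 1 cause the opposite failure: exploding gradients
e100 = 1.1 ** 100
```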
Football Consequence

A basic RNN tracking a 90-minute match would "forget" early events by halftime. It couldn't learn that a first-half tactical change led to a second-half goal. We need architectures that can maintain memory over longer periods.

LSTM: Long Short-Term Memory
Solving the vanishing gradient with gates

LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a clever architecture featuring gates that control information flow. The key innovation is a cell state โ€” a highway that runs through time with minimal modification.

[Diagram: LSTM cell — the cell state cₜ runs along the top as a "memory highway" from cₜ₋₁ to cₜ. Four gates, each reading [hₜ₋₁, xₜ], control it: forget (what to remove from memory), input (what new info to store), candidate (new values to add), and output (what to emit as hₜ). σ = sigmoid (0 to 1), tanh maps to −1 to 1.]

The Four Gates of LSTM

1. Forget Gate (fโ‚œ)

Decides what information to discard from the cell state. Outputs values between 0 (forget completely) and 1 (keep entirely).

fโ‚œ = ฯƒ(Wf ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bf)
2. Input Gate (iโ‚œ)

Decides which new information to store in the cell state.

iโ‚œ = ฯƒ(Wi ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bi)
3. Candidate Values (cฬƒโ‚œ)

Creates new candidate values that could be added to the cell state.

cฬƒโ‚œ = tanh(Wc ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bc)
4. Output Gate (oโ‚œ)

Decides what parts of the cell state to output as the hidden state.

oโ‚œ = ฯƒ(Wo ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bo)

Cell State Update

New cell state:
cโ‚œ = fโ‚œ โŠ™ cโ‚œโ‚‹โ‚ + iโ‚œ โŠ™ cฬƒโ‚œ
(forget old info) + (add new info)
Hidden state output:
hโ‚œ = oโ‚œ โŠ™ tanh(cโ‚œ)
โŠ™ denotes element-wise multiplication (Hadamard product)
Why LSTMs Work

The cell state acts as a highway โ€” information can flow unchanged through many time steps. The forget gate can be close to 1, allowing gradients to flow back without vanishing. The network learns what to remember and forget, rather than forgetting everything exponentially.

Football Analogy

The LSTM is like a football analyst with a notepad. The cell state is the notepad โ€” persistent notes about the match. The forget gate decides "this substitution info is now outdated, cross it out." The input gate decides "this formation change is important, write it down." The output gate decides "for predicting the next event, I need to focus on recent pressing intensity."

GRU: A Simpler Alternative
Fewer gates, similar performance

Gated Recurrent Units (Cho et al., 2014) simplify the LSTM architecture by combining the forget and input gates into a single update gate, and merging the cell state and hidden state.

Update Gate (zโ‚œ)

Controls how much of the past to keep vs. how much new info to add.

zโ‚œ = ฯƒ(Wz ยท [hโ‚œโ‚‹โ‚, xโ‚œ])
Reset Gate (rโ‚œ)

Controls how much of the past hidden state to use when computing the candidate.

rโ‚œ = ฯƒ(Wr ยท [hโ‚œโ‚‹โ‚, xโ‚œ])
Candidate hidden state:
hฬƒโ‚œ = tanh(W ยท [rโ‚œ โŠ™ hโ‚œโ‚‹โ‚, xโ‚œ])
Final hidden state:
hโ‚œ = (1 - zโ‚œ) โŠ™ hโ‚œโ‚‹โ‚ + zโ‚œ โŠ™ hฬƒโ‚œ

LSTM vs GRU

LSTM
  • โ€ข 4 gates, separate cell state
  • โ€ข More parameters (~4x hidden sizeยฒ)
  • โ€ข Often better on very long sequences
  • โ€ข More expressive but slower
GRU
  • โ€ข 2 gates, no separate cell state
  • โ€ข Fewer parameters (~3x hidden sizeยฒ)
  • โ€ข Often sufficient for most tasks
  • โ€ข Faster training and inference
Which to Choose?

In practice, LSTM and GRU perform similarly on most tasks. Start with GRU for faster iteration, switch to LSTM if you need to model very long-range dependencies. For football tracking data at 25 FPS, GRU is often sufficient.

Bidirectional RNNs
Looking both forward and backward

Standard RNNs only see the past โ€” at time t, they've only processed xโ‚ through xโ‚œ. But sometimes context from the future is also useful. Bidirectional RNNs run two parallel RNNs: one forward, one backward.

[Diagram: Bidirectional RNN — a forward chain h→₁ … h→₅ and a backward chain h←₁ … h←₅ process the same inputs x₁ … x₅; the output at each step is the concatenation [h→ ; h←].]
Forward pass:
hโ†’โ‚œ = RNN_forward(xโ‚œ, hโ†’โ‚œโ‚‹โ‚)
Backward pass:
hโ†โ‚œ = RNN_backward(xโ‚œ, hโ†โ‚œโ‚Šโ‚)
Combined output:
hโ‚œ = [hโ†’โ‚œ ; hโ†โ‚œ] (concatenation)
When to Use Bidirectional

Use bidirectional RNNs when you have the complete sequence available before making predictions (offline processing). Don't use for real-time predictions where you can't see the future. In football: great for post-match analysis, not for live predictions.

Football Applications
RNNs on the pitch

RNNs and their variants are widely used in football analytics wherever sequential patterns matter:

Trajectory Prediction

Predict where players and the ball will move next, based on their movement history. Essential for anticipating plays.

Input: Sequence of (x, y, vx, vy) per player | Output: Future positions | Model: LSTM encoder-decoder
Event Sequence Modeling

Model the flow of events (pass, dribble, shot) to predict what action comes next or to assess possession quality.

Input: Sequence of event embeddings | Output: Next event probability, xG | Model: GRU with attention
Match Momentum & Pressure

Track how match dynamics evolve over time โ€” which team is dominating, when momentum shifts occur.

Input: Rolling window of match statistics | Output: Momentum score, win probability | Model: Bidirectional LSTM
Injury Risk from Movement Patterns

Analyze player movement sequences to detect fatigue or abnormal patterns that might indicate injury risk.

Input: Speed, acceleration, direction changes over time | Output: Risk score | Model: LSTM classifier
Tactical Pattern Recognition

Identify recurring tactical patterns in how teams build up play or defend.

Input: Formation sequences, player movement patterns | Output: Pattern clusters, play labels | Model: LSTM autoencoder
The Limitation

RNNs treat sequences as 1D chains โ€” they process one element after another. But football involves 22 players interacting simultaneously. To model these complex interactions, we need architectures that can handle graph structures โ€” that's where Graph Neural Networks come in (next article)!

Training RNNs
Backpropagation through time

RNNs are trained using Backpropagation Through Time (BPTT) โ€” essentially regular backpropagation applied to the unrolled network.

1. Unroll the RNN for T time steps
2. Forward pass: compute outputs y₁, y₂, ..., yₜ
3. Compute loss: sum losses at each time step (if applicable)
4. Backward pass: propagate gradients back through all time steps
5. Accumulate gradients for shared weights across all time steps
6. Update weights using an optimizer (Adam, SGD, etc.)
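These steps can be sketched end to end under heavy simplifying assumptions: a scalar RNN (one recurrent weight w, one input weight u) and a squared-error loss on the final hidden state only. The backward loop accumulates gradients for the shared weights across all time steps, exactly as step 5 describes:

```python
import math

def forward(w, u, xs):
    """Scalar RNN: hₜ = tanh(w·hₜ₋₁ + u·xₜ). Returns all hidden states (h₀ = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, target):
    """BPTT for a many-to-one loss L = (h_T - target)²."""
    hs = forward(w, u, xs)
    grad_w = grad_u = 0.0
    dL_dh = 2.0 * (hs[-1] - target)        # gradient at the final step
    for t in range(len(xs), 0, -1):        # walk backward through time
        dpre = dL_dh * (1.0 - hs[t] ** 2)  # through tanh: tanh'(z) = 1 - tanh²(z)
        grad_w += dpre * hs[t - 1]         # accumulate SHARED-weight gradients
        grad_u += dpre * xs[t - 1]
        dL_dh = dpre * w                   # propagate back to hₜ₋₁
    return grad_w, grad_u

xs, target = [0.5, -1.0, 0.25, 0.8], 0.3
gw, gu = bptt(0.7, 0.4, xs, target)

# Sanity check against a finite-difference gradient
eps = 1e-6
loss = lambda w, u: (forward(w, u, xs)[-1] - target) ** 2
gw_num = (loss(0.7 + eps, 0.4) - loss(0.7 - eps, 0.4)) / (2 * eps)
```

The repeated factor `dpre * w` in the backward loop is exactly where the vanishing/exploding behavior from the previous section comes from.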

Practical Training Tips

Gradient Clipping

Clip gradients to a maximum norm (e.g., 1.0) to prevent exploding gradients.

Truncated BPTT

For very long sequences, backpropagate only through the last k steps.

Layer Normalization

Apply LayerNorm inside LSTM/GRU cells for stable training.

Dropout

Apply recurrent dropout (same mask across time) to prevent overfitting.
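The gradient-clipping tip can be sketched as a small helper, assuming the gradients have been flattened into a list of scalars:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 → rescaled to norm 1
small = clip_by_global_norm([0.1, 0.2], max_norm=1.0)    # already small → unchanged
```

Deep learning frameworks provide this built in (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), so in practice you rarely write it yourself.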

Summary & What's Next
What You Learned
  • โœ“ Why feedforward networks can't handle sequences
  • โœ“ RNN architecture: hidden states as memory
  • โœ“ The vanishing gradient problem
  • โœ“ LSTM: gates to control memory flow
  • โœ“ GRU: simpler alternative to LSTM
  • โœ“ Bidirectional RNNs for full context
  • โœ“ Football applications (trajectories, events)
Coming Next in This Series
  • 4. Graph Neural Networks (GNNs) — modeling relationships between entities
  • 5. Spatiotemporal GNNs for Football — combining graphs with time
Key Takeaway

RNNs give neural networks memory โ€” the ability to reason about sequences and temporal patterns. LSTMs and GRUs solve the vanishing gradient problem through gating mechanisms. But football isn't just a sequence โ€” it's a network of 22 interacting players. To model these complex spatial relationships, we need Graph Neural Networks โ€” coming up next!