โšฝ๐Ÿ’ท๐Ÿ“Š
3. Recurrent Neural Networks (RNNs)
How neural networks learn from sequences โ€” remembering the past to understand the present, from player trajectories to match narratives.
Sequence Modeling · Deep Learning · 45 min read
Why Do We Need RNNs?

In the previous articles, we learned about feedforward neural networks (MLPs) and CNNs. Both process each input independently โ€” they have no memory of previous inputs. But many real-world problems involve sequences where order matters.

The Problem: No Memory

Consider predicting a player's next position. A feedforward network sees only the current position โ€” it doesn't know if the player is running forward, backward, or standing still. Without context from previous time steps, it can't understand motion or intent.

The Insight: Sequential Data Is Everywhere

Football is inherently sequential: a pass leads to a dribble leads to a shot. A player's movement trajectory unfolds over time. Match momentum ebbs and flows. To model these patterns, we need networks that can remember and reason over time.

Examples of Sequential Data

In General
  • โ€ข Text (words in a sentence)
  • โ€ข Speech (audio samples over time)
  • โ€ข Video (frames in sequence)
  • โ€ข Stock prices (values over time)
  • โ€ข Music (notes in order)
In Football
  • โ€ข Player trajectories (x, y positions over time)
  • โ€ข Event sequences (pass โ†’ dribble โ†’ shot)
  • โ€ข Ball movement patterns
  • โ€ข Team formation evolution
  • โ€ข Match momentum/pressure trends

Recurrent Neural Networks solve this by introducing recurrent connections โ€” the network's output at one time step becomes part of its input at the next time step, creating a form of memory.

The Basic RNN Architecture
Adding memory through recurrence

An RNN processes sequences one element at a time, maintaining a hidden state that acts as memory. At each time step, it combines the current input with the previous hidden state to produce an output and a new hidden state.

RNN Cell Structure
[Diagram: RNN cell processing one time step — the input xₜ and the previous hidden state hₜ₋₁ enter the cell, which computes hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b) and emits the output yₜ; a recurrent connection feeds hₜ back in at the next step.]

The RNN Equations

Hidden state update:
hโ‚œ = tanh(Wโ‚• ยท hโ‚œโ‚‹โ‚ + Wโ‚“ ยท xโ‚œ + b)
Output:
yโ‚œ = Wแตง ยท hโ‚œ + bแตง
hโ‚œ: Hidden state at time t (the "memory")
xโ‚œ: Input at time t
Wโ‚•, Wโ‚“, Wแตง: Weight matrices (learned)
tanh: Activation function (squashes to [-1, 1])
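These equations translate almost line for line into code. Here is a minimal NumPy sketch of one time step — the weights are randomly initialized purely for illustration (a real model learns them), and the sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 2  # hypothetical hidden-state and input sizes

# Learned parameters in practice; random here for illustration
W_h = rng.normal(scale=0.5, size=(hidden, hidden))  # hidden-to-hidden
W_x = rng.normal(scale=0.5, size=(hidden, inputs))  # input-to-hidden
b = np.zeros(hidden)
W_y = rng.normal(scale=0.5, size=(1, hidden))       # hidden-to-output
b_y = np.zeros(1)

def rnn_step(x_t, h_prev):
    """One RNN time step: hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b), yₜ = Wᵧ·hₜ + bᵧ."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

x_t = np.array([0.3, -0.1])   # e.g. a player's current (x, y) position
h_prev = np.zeros(hidden)     # initial hidden state h₀
h_t, y_t = rnn_step(x_t, h_prev)
```

Because tanh squashes its input, every component of h_t lies in [-1, 1].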

Unrolling Through Time

To understand how RNNs process sequences, we "unroll" the network through time. Each time step uses the same weights โ€” this is called weight sharing across time, similar to how CNNs share weights across space.
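A short sketch of the unrolled computation makes the weight sharing explicit — the loop below reuses the same W_h and W_x at every step (sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, inputs, T = 4, 2, 6

W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, inputs))
b = np.zeros(hidden)

def run_rnn(xs):
    """Unroll the RNN over a sequence, reusing the SAME weights at every step."""
    h = np.zeros(hidden)  # h₀
    states = []
    for x_t in xs:        # t = 1 .. T
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return states

xs = rng.normal(size=(T, inputs))  # a toy trajectory of T positions
states = run_rnn(xs)               # states[-1] depends on every input seen
```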

Unrolled RNN
[Diagram: Unrolled RNN processing a sequence — at t = 1, 2, 3, …, T the same RNN cell (shared weights W) maps xₜ and hₜ₋₁ to yₜ and hₜ, so the hidden state hₜ carries information from all previous inputs.]

Symbol Definitions

hโ‚œHidden state at time t โ€” the network's memory
xโ‚œInput vector at time t
yโ‚œOutput at time t
WWeight matrices (shared across all time steps)
Football Analogy

Think of the hidden state as a coach's mental model of the game. At each moment, the coach combines what they see now (xโ‚œ = current player positions) with their memory of how play has developed (hโ‚œโ‚‹โ‚). Their updated understanding (hโ‚œ) informs their decisions and predictions about what will happen next.

Types of Sequence Tasks
Different input-output configurations

RNNs are flexible โ€” they can handle various input/output configurations depending on the task:

[Diagram: Four sequence-task configurations — Many-to-One (e.g. sentiment analysis), One-to-Many (e.g. image captioning), Many-to-Many (e.g. POS tagging, player tracking), and Encoder-Decoder (encode context, then decode; e.g. translation, summarization). Football examples: match sequence → final score prediction (many-to-one), player positions → next positions (many-to-many), pass sequence → expected possession path (encoder-decoder). RNNs excel when the order of events matters.]
Many-to-One

Process a sequence, output a single value. Use the final hidden state for classification/regression.

Example: Classify a possession sequence as "goal" or "no goal"
One-to-Many

Single input generates a sequence of outputs. Often used with a "seed" input.

Example: Generate a tactical play from an initial formation
Many-to-Many (Aligned)

Output at each time step. Input and output sequences have the same length.

Example: Predict next position for each player at each frame
Encoder-Decoder

Encode input sequence into a context vector, then decode to output sequence of different length.

Example: Translate event sequence to natural language commentary
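To make the first and third configurations concrete, here is a NumPy sketch (random weights and toy data, purely illustrative) showing how the same recurrence supports both a many-to-one head and an aligned many-to-many head:

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, T = 4, 2, 5
W_h = rng.normal(scale=0.5, size=(H, H))
W_x = rng.normal(scale=0.5, size=(H, D))
w_out = rng.normal(scale=0.5, size=H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_states(xs):
    """Run the recurrence and keep every hidden state."""
    h = np.zeros(H)
    out = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        out.append(h)
    return out

xs = rng.normal(size=(T, D))
hs = hidden_states(xs)

# Many-to-one: a single prediction from the FINAL hidden state
# (e.g. P(goal) for a whole possession sequence)
p_goal = sigmoid(w_out @ hs[-1])

# Many-to-many (aligned): one prediction per time step
# (e.g. a score for every frame)
per_step = [sigmoid(w_out @ h) for h in hs]
```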
The Vanishing Gradient Problem
Why basic RNNs struggle with long sequences

Basic RNNs have a critical flaw: they struggle to learn long-range dependencies. When training with backpropagation through time (BPTT), gradients must flow backward through many time steps. At each step, they get multiplied by the weight matrix and activation derivatives.

[Diagram: The vanishing gradient problem — gradient magnitude falls from strong at t = 7 to nearly zero at t = 1; gradients shrink exponentially as they backpropagate through time.]
The Problem

If the magnitudes of these factors are consistently less than 1, gradients shrink exponentially — this is the vanishing gradient. If consistently greater than 1, they explode. Either way, the network can't learn connections between distant time steps.

The Math Behind Vanishing Gradients

Gradient through T time steps:
โˆ‚L/โˆ‚hโ‚ = โˆ‚L/โˆ‚hโ‚œ ร— โˆโ‚–โ‚Œโ‚‚แต€ (โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚)
Each term โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚ involves:
โˆ‚hโ‚–/โˆ‚hโ‚–โ‚‹โ‚ = Wโ‚•แต€ ร— diag(tanh'(zโ‚–))
Since tanh'(x) โ‰ค 1 and is often much smaller, repeated multiplication causes gradients to vanish. For a 100-step sequence, even 0.9ยนโฐโฐ โ‰ˆ 0.00003!
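The arithmetic is easy to check directly. Treating each ∂hₖ/∂hₖ₋₁ as a scalar factor of magnitude 0.9 (a simplifying stand-in for the full Jacobian):

```python
# Product of T per-step gradient factors ∂hₖ/∂hₖ₋₁.
# tanh'(z) = 1 - tanh²(z) ≤ 1, so each factor is typically below 1.
factor = 0.9
for T in (10, 50, 100):
    print(T, factor ** T)  # shrinks exponentially with T

g100 = factor ** 100  # the 100-step case from the text, ≈ 3e-5

# Factors slightly above 1 cause the opposite failure: exploding gradients
e100 = 1.1 ** 100
```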
Football Consequence

A basic RNN tracking a 90-minute match would "forget" early events by halftime. It couldn't learn that a first-half tactical change led to a second-half goal. We need architectures that can maintain memory over longer periods.

LSTM: Long Short-Term Memory
Solving the vanishing gradient with gates

LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a clever architecture featuring gates that control information flow. The key innovation is a cell state โ€” a highway that runs through time with minimal modification.

[Diagram: LSTM cell — the cell state cₜ runs along the top as a "memory highway" from cₜ₋₁ to cₜ. Four gates, each reading [hₜ₋₁, xₜ], control it: forget (what to remove from memory), input (what new info to store), candidate (new values to add), and output (what to emit as hₜ). σ = sigmoid (0 to 1), tanh maps to −1 to 1.]

The Four Gates of LSTM

1. Forget Gate (fโ‚œ)

Decides what information to discard from the cell state. Outputs values between 0 (forget completely) and 1 (keep entirely).

fโ‚œ = ฯƒ(Wf ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bf)
2. Input Gate (iโ‚œ)

Decides which new information to store in the cell state.

iโ‚œ = ฯƒ(Wi ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bi)
3. Candidate Values (cฬƒโ‚œ)

Creates new candidate values that could be added to the cell state.

cฬƒโ‚œ = tanh(Wc ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bc)
4. Output Gate (oโ‚œ)

Decides what parts of the cell state to output as the hidden state.

oโ‚œ = ฯƒ(Wo ยท [hโ‚œโ‚‹โ‚, xโ‚œ] + bo)

Cell State Update

New cell state:
cโ‚œ = fโ‚œ โŠ™ cโ‚œโ‚‹โ‚ + iโ‚œ โŠ™ cฬƒโ‚œ
(forget old info) + (add new info)
Hidden state output:
hโ‚œ = oโ‚œ โŠ™ tanh(cโ‚œ)
โŠ™ denotes element-wise multiplication (Hadamard product)
Why LSTMs Work

The cell state acts as a highway โ€” information can flow unchanged through many time steps. The forget gate can be close to 1, allowing gradients to flow back without vanishing. The network learns what to remember and forget, rather than forgetting everything exponentially.

Football Analogy

The LSTM is like a football analyst with a notepad. The cell state is the notepad โ€” persistent notes about the match. The forget gate decides "this substitution info is now outdated, cross it out." The input gate decides "this formation change is important, write it down." The output gate decides "for predicting the next event, I need to focus on recent pressing intensity."

GRU: A Simpler Alternative
Fewer gates, similar performance

Gated Recurrent Units (Cho et al., 2014) simplify the LSTM architecture by combining the forget and input gates into a single update gate, and merging the cell state and hidden state.

Update Gate (zโ‚œ)

Controls how much of the past to keep vs. how much new info to add.

zโ‚œ = ฯƒ(Wz ยท [hโ‚œโ‚‹โ‚, xโ‚œ])
Reset Gate (rโ‚œ)

Controls how much of the past hidden state to use when computing the candidate.

rโ‚œ = ฯƒ(Wr ยท [hโ‚œโ‚‹โ‚, xโ‚œ])
Candidate hidden state:
hฬƒโ‚œ = tanh(W ยท [rโ‚œ โŠ™ hโ‚œโ‚‹โ‚, xโ‚œ])
Final hidden state:
hโ‚œ = (1 - zโ‚œ) โŠ™ hโ‚œโ‚‹โ‚ + zโ‚œ โŠ™ hฬƒโ‚œ

LSTM vs GRU

LSTM
  • โ€ข 4 gates, separate cell state
  • โ€ข More parameters (~4x hidden sizeยฒ)
  • โ€ข Often better on very long sequences
  • โ€ข More expressive but slower
GRU
  • โ€ข 2 gates, no separate cell state
  • โ€ข Fewer parameters (~3x hidden sizeยฒ)
  • โ€ข Often sufficient for most tasks
  • โ€ข Faster training and inference
Which to Choose?

In practice, LSTM and GRU perform similarly on most tasks. Start with GRU for faster iteration, switch to LSTM if you need to model very long-range dependencies. For football tracking data at 25 FPS, GRU is often sufficient.

Bidirectional RNNs
Looking both forward and backward

Standard RNNs only see the past โ€” at time t, they've only processed xโ‚ through xโ‚œ. But sometimes context from the future is also useful. Bidirectional RNNs run two parallel RNNs: one forward, one backward.

[Diagram: Bidirectional RNN — a forward chain h→₁ … h→₅ and a backward chain h←₁ … h←₅ process the same inputs x₁ … x₅; the output at each step is the concatenation [h→ ; h←].]
Forward pass:
hโ†’โ‚œ = RNN_forward(xโ‚œ, hโ†’โ‚œโ‚‹โ‚)
Backward pass:
hโ†โ‚œ = RNN_backward(xโ‚œ, hโ†โ‚œโ‚Šโ‚)
Combined output:
hโ‚œ = [hโ†’โ‚œ ; hโ†โ‚œ] (concatenation)
When to Use Bidirectional

Use bidirectional RNNs when you have the complete sequence available before making predictions (offline processing). Don't use for real-time predictions where you can't see the future. In football: great for post-match analysis, not for live predictions.

Football Applications
RNNs on the pitch

RNNs and their variants are widely used in football analytics wherever sequential patterns matter:

Trajectory Prediction

Predict where players and the ball will move next, based on their movement history. Essential for anticipating plays.

Input: Sequence of (x, y, vx, vy) per player | Output: Future positions | Model: LSTM encoder-decoder
Event Sequence Modeling

Model the flow of events (pass, dribble, shot) to predict what action comes next or to assess possession quality.

Input: Sequence of event embeddings | Output: Next event probability, xG | Model: GRU with attention
Match Momentum & Pressure

Track how match dynamics evolve over time โ€” which team is dominating, when momentum shifts occur.

Input: Rolling window of match statistics | Output: Momentum score, win probability | Model: Bidirectional LSTM
Injury Risk from Movement Patterns

Analyze player movement sequences to detect fatigue or abnormal patterns that might indicate injury risk.

Input: Speed, acceleration, direction changes over time | Output: Risk score | Model: LSTM classifier
Tactical Pattern Recognition

Identify recurring tactical patterns in how teams build up play or defend.

Input: Formation sequences, player movement patterns | Output: Pattern clusters, play labels | Model: LSTM autoencoder
The Limitation

RNNs treat sequences as 1D chains โ€” they process one element after another. But football involves 22 players interacting simultaneously. To model these complex interactions, we need architectures that can handle graph structures โ€” that's where Graph Neural Networks come in (next article)!

Training RNNs
Backpropagation through time

RNNs are trained using Backpropagation Through Time (BPTT) โ€” essentially regular backpropagation applied to the unrolled network.

1. Unroll the RNN for T time steps
2. Forward pass: compute outputs y₁, y₂, ..., yₜ
3. Compute loss: sum losses at each time step (if applicable)
4. Backward pass: propagate gradients back through all time steps
5. Accumulate gradients for shared weights across all time steps
6. Update weights using an optimizer (Adam, SGD, etc.)
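These steps can be sketched end to end under heavy simplifying assumptions: a scalar RNN (one recurrent weight w, one input weight u) and a squared-error loss on the final hidden state only. The backward loop accumulates gradients for the shared weights across all time steps, exactly as step 5 describes:

```python
import math

def forward(w, u, xs):
    """Scalar RNN: hₜ = tanh(w·hₜ₋₁ + u·xₜ). Returns all hidden states (h₀ = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, target):
    """BPTT for a many-to-one loss L = (h_T - target)²."""
    hs = forward(w, u, xs)
    grad_w = grad_u = 0.0
    dL_dh = 2.0 * (hs[-1] - target)        # gradient at the final step
    for t in range(len(xs), 0, -1):        # walk backward through time
        dpre = dL_dh * (1.0 - hs[t] ** 2)  # through tanh: tanh'(z) = 1 - tanh²(z)
        grad_w += dpre * hs[t - 1]         # accumulate SHARED-weight gradients
        grad_u += dpre * xs[t - 1]
        dL_dh = dpre * w                   # propagate back to hₜ₋₁
    return grad_w, grad_u

xs, target = [0.5, -1.0, 0.25, 0.8], 0.3
gw, gu = bptt(0.7, 0.4, xs, target)

# Sanity check against a finite-difference gradient
eps = 1e-6
loss = lambda w, u: (forward(w, u, xs)[-1] - target) ** 2
gw_num = (loss(0.7 + eps, 0.4) - loss(0.7 - eps, 0.4)) / (2 * eps)
```

The repeated factor `dpre * w` in the backward loop is exactly where the vanishing/exploding behavior from the previous section comes from.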

Practical Training Tips

Gradient Clipping

Clip gradients to a maximum norm (e.g., 1.0) to prevent exploding gradients.

Truncated BPTT

For very long sequences, backpropagate only through the last k steps.

Layer Normalization

Apply LayerNorm inside LSTM/GRU cells for stable training.

Dropout

Apply recurrent dropout (same mask across time) to prevent overfitting.
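The gradient-clipping tip can be sketched as a small helper, assuming the gradients have been flattened into a list of scalars:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 → rescaled to norm 1
small = clip_by_global_norm([0.1, 0.2], max_norm=1.0)    # already small → unchanged
```

Deep learning frameworks provide this built in (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), so in practice you rarely write it yourself.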

Summary & What's Next
What You Learned
  • โœ“ Why feedforward networks can't handle sequences
  • โœ“ RNN architecture: hidden states as memory
  • โœ“ The vanishing gradient problem
  • โœ“ LSTM: gates to control memory flow
  • โœ“ GRU: simpler alternative to LSTM
  • โœ“ Bidirectional RNNs for full context
  • โœ“ Football applications (trajectories, events)
Coming Next in This Series
  • 4. Graph Neural Networks (GNNs) — modeling relationships between entities
  • 5. Spatiotemporal GNNs for Football — combining graphs with time
Key Takeaway

RNNs give neural networks memory โ€” the ability to reason about sequences and temporal patterns. LSTMs and GRUs solve the vanishing gradient problem through gating mechanisms. But football isn't just a sequence โ€” it's a network of 22 interacting players. To model these complex spatial relationships, we need Graph Neural Networks โ€” coming up next!