In the previous articles, we learned about feedforward neural networks (MLPs) and CNNs. Both process each input independently: they have no memory of previous inputs. But many real-world problems involve sequences where order matters.
Consider predicting a player's next position. A feedforward network sees only the current position; it doesn't know if the player is running forward, backward, or standing still. Without context from previous time steps, it can't understand motion or intent.
Football is inherently sequential: a pass leads to a dribble leads to a shot. A player's movement trajectory unfolds over time. Match momentum ebbs and flows. To model these patterns, we need networks that can remember and reason over time.
Examples of Sequential Data
- Text (words in a sentence)
- Speech (audio samples over time)
- Video (frames in sequence)
- Stock prices (values over time)
- Music (notes in order)
- Player trajectories (x, y positions over time)
- Event sequences (pass → dribble → shot)
- Ball movement patterns
- Team formation evolution
- Match momentum/pressure trends
Recurrent Neural Networks solve this by introducing recurrent connections: the network's output at one time step becomes part of its input at the next time step, creating a form of memory.
An RNN processes sequences one element at a time, maintaining a hidden state that acts as memory. At each time step, it combines the current input with the previous hidden state to produce an output and a new hidden state.
The RNN Equations
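In the standard formulation (writing x_t for the input, h_t for the hidden state, and y_t for the output at time step t), the vanilla RNN update is:

```latex
h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
\qquad
y_t = W_{hy}\, h_t + b_y
```

The same weight matrices W_{xh}, W_{hh}, W_{hy} and biases b_h, b_y are applied at every time step.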
Unrolling Through Time
To understand how RNNs process sequences, we "unroll" the network through time. Each time step uses the same weights; this is called weight sharing across time, similar to how CNNs share weights across space.
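The unrolled computation is just a loop. Here is a minimal NumPy sketch (random weights, illustrative dimensions) that makes the weight sharing explicit: the same three parameters are reused at every step.

```python
import numpy as np

# Minimal unrolled vanilla-RNN forward pass (a sketch: random weights,
# illustrative dimensions). Note that the SAME W_xh, W_hh, b_h are reused
# at every time step -- weight sharing across time.
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 4, 8, 5          # e.g. 4 features per frame, 5 frames

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))        # the input sequence
h = np.zeros(hidden_dim)                    # h_0: initial hidden state
hidden_states = []
for t in range(T):                          # one loop iteration per time step
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)     # shape (T, hidden_dim)
```

Unrolling for T steps gives a T-layer-deep computation graph, but with only one set of parameters to learn.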
Symbol Definitions
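The symbols, in the conventional RNN notation:

```latex
\begin{itemize}
  \item $x_t$ -- input vector at time step $t$ (e.g.\ current player positions)
  \item $h_t$ -- hidden state at time $t$; $h_{t-1}$ is the previous hidden state
  \item $y_t$ -- output at time step $t$
  \item $W_{xh}, W_{hh}, W_{hy}$ -- input-to-hidden, hidden-to-hidden, and
        hidden-to-output weight matrices (shared across all time steps)
  \item $b_h, b_y$ -- bias vectors
\end{itemize}
```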
Think of the hidden state as a coach's mental model of the game. At each moment, the coach combines what they see now (x_t = current player positions) with their memory of how play has developed (h_{t-1}). Their updated understanding (h_t) informs their decisions and predictions about what will happen next.
RNNs are flexible; they can handle various input/output configurations depending on the task:
Many-to-one: process a sequence, output a single value. Use the final hidden state for classification/regression.
One-to-many: a single input generates a sequence of outputs. Often used with a "seed" input.
Many-to-many (aligned): output at each time step. Input and output sequences have the same length.
Sequence-to-sequence: encode the input sequence into a context vector, then decode to an output sequence of a different length.
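As a concrete instance of the many-to-one setup, here is a PyTorch sketch (class name and dimensions are illustrative) that encodes a window of tracking frames and classifies it from the final hidden state:

```python
import torch
import torch.nn as nn

# Many-to-one sketch: encode a window of tracking frames with an RNN and
# classify the window from the final hidden state (dimensions illustrative).
class SequenceClassifier(nn.Module):
    def __init__(self, n_features=4, hidden_dim=32, n_classes=3):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                 # x: (batch, time, features)
        _, h_last = self.rnn(x)           # h_last: (1, batch, hidden_dim)
        return self.head(h_last[-1])      # logits from the final hidden state

model = SequenceClassifier()
window = torch.randn(8, 25, 4)            # 8 sequences of 25 frames, 4 features
logits = model(window)                    # shape: (8, 3)
```

Swapping `nn.RNN` for `nn.LSTM` or `nn.GRU` keeps the same overall structure.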
Basic RNNs have a critical flaw: they struggle to learn long-range dependencies. When training with backpropagation through time (BPTT), gradients must flow backward through many time steps. At each step, they get multiplied by the weight matrix and activation derivatives.
If these multiplied values are consistently less than 1, gradients shrink exponentially; this is the vanishing gradient problem. If they are consistently greater than 1, gradients explode. Either way, the network can't learn connections between distant time steps.
The Math Behind Vanishing Gradients
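As a sketch of why this happens: writing W_{hh} for the hidden-to-hidden weights of a tanh RNN, the BPTT gradient from step t back to step k is a product of per-step Jacobians,

```latex
\frac{\partial \mathcal{L}}{\partial h_k}
  = \frac{\partial \mathcal{L}}{\partial h_t}
    \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}}
  = \operatorname{diag}\!\left(1 - h_i^{2}\right) W_{hh}^{\top}
```

If the largest singular value of W_{hh} (scaled by the tanh derivatives, which are at most 1) stays below 1, the norm of this product shrinks roughly geometrically in t − k (vanishing); if it stays above 1, the product can grow geometrically (exploding).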
A basic RNN tracking a 90-minute match would "forget" early events by halftime. It couldn't learn that a first-half tactical change led to a second-half goal. We need architectures that can maintain memory over longer periods.
LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a clever architecture featuring gates that control information flow. The key innovation is the cell state: a highway that runs through time with minimal modification.
The Four Gates of LSTM
Forget gate: decides what information to discard from the cell state. Outputs values between 0 (forget completely) and 1 (keep entirely).
Input gate: decides which new information to store in the cell state.
Candidate values: creates new candidate values that could be added to the cell state.
Output gate: decides what parts of the cell state to output as the hidden state.
Cell State Update
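In the standard notation (σ is the sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] is concatenation), the full LSTM update is:

```latex
f_t = \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) \quad \text{(forget gate)} \\
i_t = \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) \quad \text{(input gate)} \\
\tilde{C}_t = \tanh\!\left(W_C\,[h_{t-1}, x_t] + b_C\right) \quad \text{(candidate values)} \\
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(cell state update)} \\
o_t = \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) \quad \text{(output gate)} \\
h_t = o_t \odot \tanh(C_t) \quad \text{(new hidden state)}
```

The cell state update is additive: old contents pass through scaled by f_t, and new candidates are mixed in scaled by i_t.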
The cell state acts as a highway: information can flow unchanged through many time steps. The forget gate can be close to 1, allowing gradients to flow back without vanishing. The network learns what to remember and forget, rather than forgetting everything exponentially.
The LSTM is like a football analyst with a notepad. The cell state is the notepad: persistent notes about the match. The forget gate decides "this substitution info is now outdated, cross it out." The input gate decides "this formation change is important, write it down." The output gate decides "for predicting the next event, I need to focus on recent pressing intensity."
Gated Recurrent Units (Cho et al., 2014) simplify the LSTM architecture by combining the forget and input gates into a single update gate, and merging the cell state and hidden state.
Update gate: controls how much of the past to keep vs. how much new info to add.
Reset gate: controls how much of the past hidden state to use when computing the candidate.
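In the same notation as before, one common formulation of the GRU update is (sign conventions for z_t vary between references):

```latex
z_t = \sigma\!\left(W_z\,[h_{t-1}, x_t]\right) \quad \text{(update gate)} \\
r_t = \sigma\!\left(W_r\,[h_{t-1}, x_t]\right) \quad \text{(reset gate)} \\
\tilde{h}_t = \tanh\!\left(W\,[\,r_t \odot h_{t-1},\, x_t]\right) \quad \text{(candidate)} \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```

With only the hidden state to carry, the update gate plays both the forget and input roles of the LSTM at once.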
LSTM vs GRU
LSTM:
- 4 gates, separate cell state
- More parameters (~4× hidden size²)
- Often better on very long sequences
- More expressive but slower
GRU:
- 2 gates, no separate cell state
- Fewer parameters (~3× hidden size²)
- Often sufficient for most tasks
- Faster training and inference
In practice, LSTM and GRU perform similarly on most tasks. Start with GRU for faster iteration, switch to LSTM if you need to model very long-range dependencies. For football tracking data at 25 FPS, GRU is often sufficient.
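For example, a GRU-based trajectory model in PyTorch could look like this sketch (shapes and the prediction head are illustrative assumptions, not a fixed recipe):

```python
import torch
import torch.nn as nn

# GRU sketch for tracking data (illustrative shapes): read a window of
# 25 frames of (x, y) ball positions and predict the next position from
# the last hidden state.
gru = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
head = nn.Linear(64, 2)

window = torch.randn(16, 25, 2)      # 16 sequences, 25 frames, (x, y) each
out, _ = gru(window)                 # out: (16, 25, 64), hidden state per frame
next_pos = head(out[:, -1])          # predicted (x, y), shape (16, 2)
```

Switching to `nn.LSTM(input_size=2, hidden_size=64, batch_first=True)` is a one-line change if longer-range memory turns out to matter.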
Standard RNNs only see the past: at time t, they've only processed x_1 through x_t. But sometimes context from the future is also useful. Bidirectional RNNs run two parallel RNNs: one forward, one backward.
Use bidirectional RNNs when you have the complete sequence available before making predictions (offline processing). Don't use for real-time predictions where you can't see the future. In football: great for post-match analysis, not for live predictions.
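In PyTorch this is a single flag; the sketch below (illustrative dimensions) shows how the forward and backward hidden states are concatenated, doubling the output feature dimension:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM sketch for offline (post-match) sequence labelling.
# With bidirectional=True, PyTorch runs one LSTM forward and one backward
# over the sequence and concatenates their hidden states at each step.
bilstm = nn.LSTM(input_size=10, hidden_size=32,
                 batch_first=True, bidirectional=True)

events = torch.randn(4, 50, 10)      # 4 full event sequences of length 50
out, _ = bilstm(events)              # out: (4, 50, 64) -- 32 forward + 32 backward
```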
RNNs and their variants are widely used in football analytics wherever sequential patterns matter:
Predict where players and the ball will move next, based on their movement history. Essential for anticipating plays.
Model the flow of events (pass, dribble, shot) to predict what action comes next or to assess possession quality.
Track how match dynamics evolve over time โ which team is dominating, when momentum shifts occur.
Analyze player movement sequences to detect fatigue or abnormal patterns that might indicate injury risk.
Identify recurring tactical patterns in how teams build up play or defend.
RNNs treat sequences as 1D chains, processing one element after another. But football involves 22 players interacting simultaneously. To model these complex interactions, we need architectures that can handle graph structures; that's where Graph Neural Networks come in (next article)!
RNNs are trained using Backpropagation Through Time (BPTT), essentially regular backpropagation applied to the unrolled network.
Practical Training Tips
Gradient clipping: clip gradients to a maximum norm (e.g., 1.0) to prevent exploding gradients.
Truncated BPTT: for very long sequences, backpropagate only through the last k steps.
Layer normalization: apply LayerNorm inside LSTM/GRU cells for stable training.
Recurrent dropout: apply dropout with the same mask across time steps to prevent overfitting.
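Gradient clipping in particular is a one-liner in PyTorch. The sketch below shows a single training step (the model and loss are stand-ins, not a real objective): `clip_grad_norm_` rescales all gradients so their global norm never exceeds `max_norm`.

```python
import torch
import torch.nn as nn

# One training step with gradient clipping (model and loss are placeholders).
model = nn.GRU(input_size=2, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 100, 2)                 # a batch of long sequences
out, _ = model(x)
loss = out.pow(2).mean()                   # placeholder loss for the sketch

optimizer.zero_grad()
loss.backward()
# Rescale gradients in-place so their global norm is at most 1.0;
# returns the (pre-clipping) total gradient norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping is applied after `backward()` and before `optimizer.step()`, so the update itself always uses bounded gradients.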
- ✓ Why feedforward networks can't handle sequences
- ✓ RNN architecture: hidden states as memory
- ✓ The vanishing gradient problem
- ✓ LSTM: gates to control memory flow
- ✓ GRU: simpler alternative to LSTM
- ✓ Bidirectional RNNs for full context
- ✓ Football applications (trajectories, events)
- 4. Graph Neural Networks (GNNs)
- → Modeling relationships between entities
- 5. Spatiotemporal GNNs for Football
- → Combining graphs with time
RNNs give neural networks memory: the ability to reason about sequences and temporal patterns. LSTMs and GRUs solve the vanishing gradient problem through gating mechanisms. But football isn't just a sequence; it's a network of 22 interacting players. To model these complex spatial relationships, we need Graph Neural Networks, coming up next!