In the previous article, we learned about fully connected neural networks (MLPs) where every neuron connects to every neuron in the next layer. They work great for tabular data like xG features, but they have a serious problem with images.
Consider a modest 224×224 color image. That's 224 × 224 × 3 = 150,528 input values. If the first hidden layer has 1,000 neurons, we need 150,528 × 1,000 ≈ 150 million weights — just for the first layer! This is computationally wasteful and would overfit badly.
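The arithmetic is easy to sanity-check in a couple of lines:

```python
# Sizes from the example above: a 224x224 RGB image feeding a
# 1,000-neuron fully connected first layer.
inputs = 224 * 224 * 3      # one value per pixel per channel
weights = inputs * 1000     # one weight per input per neuron
print(inputs, weights)      # 150528 150528000 (~150 million)
```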
When you look at a photo of a footballer, you don't examine every pixel independently. You recognize local patterns — an edge here, a curve there, a face, a ball, a goal. These patterns can appear anywhere in the image. A ball in the top-left corner looks the same as a ball in the bottom-right.
Convolutional Neural Networks exploit these two key insights:
Each neuron only looks at a small local region of the input (e.g., a 3×3 patch), not the entire image. This dramatically reduces parameters.
The same filter (weights) is applied everywhere across the image. If we learn an edge detector, it works in all locations.
Convolution is the heart of CNNs. It's a simple operation: slide a small matrix called a kernel (or filter) across the input image, and at each position, compute a weighted sum.
Place the kernel over a small region of the input image (e.g., top-left corner)
Multiply each kernel value by the corresponding input value, then sum all products
Move the kernel to the next position and repeat, producing an output value at each location
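The three steps above can be sketched in a few lines of NumPy (a minimal single-channel sketch with stride 1 and no padding; real libraries vectorize this and handle channels, stride, and padding):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and compute a weighted sum at each
    position (valid padding, stride 1). Note: like most deep learning
    libraries, this is technically cross-correlation — the kernel is
    not flipped — but the name "convolution" has stuck."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # step 1: place the kernel
            out[i, j] = np.sum(patch * kernel)  # step 2: multiply and sum
    return out                                  # step 3: repeat everywhere

# A tiny demo: bright left half, dark right half.
img = np.array([[1., 1., 0., 0.]] * 4)
vertical_edge = np.array([[-1., 1.]])  # "right minus left" 1x2 kernel
print(conv2d(img, vertical_edge))      # nonzero only where brightness changes
```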
The Math Behind Convolution

Formally, convolving an input image I with a kernel K produces an output whose value at position (i, j) is:

(I * K)(i, j) = Σₘ Σₙ I(i + m, j + n) · K(m, n)

Symbol Definitions

- I — the input image (a grid of pixel intensities)
- K — the kernel, a small matrix of learnable weights
- (i, j) — the position of the output value
- (m, n) — indices ranging over the kernel's rows and columns

Worked Example: Edge Detection

Take a 3×3 image region where bright pixels (100) meet darker pixels (50) — a vertical edge:

| 100 | 100 | 50 |
| 100 | 100 | 50 |
| 100 | 100 | 50 |

and a vertical-edge-detection kernel:

| -1 | 0 | +1 |
| -1 | 0 | +1 |
| -1 | 0 | +1 |

Multiplying element-wise and summing all products:

(100×-1) + (100×0) + (50×+1) +
(100×-1) + (100×0) + (50×+1) +
(100×-1) + (100×0) + (50×+1)

Output = -100 + 0 + 50 - 100 + 0 + 50 - 100 + 0 + 50
Output = -150

The large negative value signals a bright-to-dark transition: an edge.
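You can verify the worked example in a couple of lines of NumPy:

```python
import numpy as np

# The 3x3 image region (bright-to-dark vertical edge) and the
# vertical-edge kernel from the worked example.
region = np.array([[100, 100, 50],
                   [100, 100, 50],
                   [100, 100, 50]])
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])
output = np.sum(region * kernel)  # element-wise multiply, then sum
print(output)  # -150
```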
Different kernels detect different features. In traditional image processing, engineers designed these by hand. The magic of CNNs is that they learn optimal kernels automatically from data!
Hierarchical Feature Learning
The real power emerges when we stack multiple convolutional layers. Each layer builds on the previous one, detecting increasingly complex patterns:
Layer 1 (early): Detects simple patterns like horizontal, vertical, and diagonal edges — the building blocks of all visual patterns.
Layer 2: Combines edges to detect textures (grass, jersey patterns), corners, and simple shapes. In football: the hexagonal pattern of a ball.
Layer 3: Detects meaningful parts: a player's head, a foot, the goal frame, jersey numbers. Combines textures into recognizable elements.
Deeper layers: Recognize complete objects: "that's a goalkeeper diving", "that's a corner kick situation", "that's an offside position".
When watching a match, you don't consciously think "I see a vertical edge next to a curved edge forming a circular pattern on green texture" — you just see "ball on the pitch". CNNs work the same way: early layers see edges, later layers see "ball on pitch".
After convolution, we often apply pooling (also called subsampling). Pooling reduces the spatial dimensions of feature maps, which:
- Reduces computation — fewer values to process in subsequent layers
- Provides translation invariance — small shifts in input don't change output much
- Increases receptive field — later layers "see" more of the original image
Types of Pooling

The two most common types are max pooling (keep the maximum value in each window) and average pooling (keep the mean). Max pooling dominates in practice because it preserves the strongest activation: if a kernel is looking for a vertical edge, we care about whether an edge exists in a region, not exactly where. Max pooling says: "There's a strong edge somewhere in this 2×2 area" — the precise location within that region doesn't matter.
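A minimal max-pooling sketch, with a small demo of the translation invariance described above (a one-pixel shift inside a 2×2 window leaves the pooled output unchanged):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling with window `size` and stride equal to the window
    (the common non-overlapping setup). Trailing rows/cols that don't
    fill a window are dropped."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 9.0  # strong activation at one corner
b = np.zeros((4, 4)); b[1, 1] = 9.0  # shifted, but inside the same 2x2 window
print(np.array_equal(max_pool2d(a), max_pool2d(b)))  # True
```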
A complete CNN combines multiple convolutional layers, pooling layers, and finally fully-connected layers for the final prediction. Here's the typical flow:
The Standard Pattern

[Conv → ReLU → Pool] × N → Flatten → Fully Connected → Output

Convolution-and-pooling blocks repeat several times, shrinking the spatial size while growing the number of channels; the flattened features then feed one or more fully-connected layers that produce the final prediction.
Output Size Formula
After convolution, the output size depends on the input size W, kernel size K, stride S, and padding P:

Output size = ⌊(W − K + 2P) / S⌋ + 1
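The formula translates directly to code (integer division gives the floor):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """floor((W - K + 2P) / S) + 1, for one spatial dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(224, 3, stride=1, padding=1))  # 224 ("same" padding)
print(conv_output_size(224, 3, stride=2, padding=1))  # 112 (halved)
print(conv_output_size(5, 3))                         # 3 (valid padding)
```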
Parameter Counting Example

Consider a convolutional layer with 64 filters of size 3×3, applied to an input with 32 channels. Each filter has 3 × 3 × 32 weights plus one bias:

Parameters = (3 × 3 × 32 + 1) × 64
Parameters = 289 × 64 = 18,496
Compare this to a fully-connected layer with the same input: if a 32×32 feature map with 32 channels were flattened (32,768 values) and connected to 64 neurons, that's 32,768 × 64 ≈ 2.1 million parameters! Convolution is far more efficient.
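Both counts can be checked in a few lines (the helper names here are ours, not from any library):

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights per filter x number of filters, plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

def fc_params(n_in, n_out):
    """Weights only, matching the comparison in the text."""
    return n_in * n_out

print(conv_params(3, 3, 32, 64))     # 18496
print(fc_params(32 * 32 * 32, 64))   # 2097152 (~2.1 million)
```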
CNNs are trained exactly like regular neural networks — using backpropagation and gradient descent. The key difference is that gradients flow through convolutional and pooling layers; because each kernel's weights are shared across every position, a weight's gradient is summed over all locations where the kernel was applied.
What Gets Learned?
The kernel values are the learnable parameters. During training, the network discovers which patterns are useful for the task:
Before training: Kernels are initialized randomly; they respond only to meaningless noise patterns.
After training: Kernels have learned to detect edges, textures, and task-relevant features.
You never tell the CNN "look for edges" or "detect circles". You just show it thousands of labeled examples, and it discovers that detecting edges in layer 1, combining them into shapes in layer 2, and recognizing objects in layer 3 is the optimal strategy. The kernels emerge from the data!
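To make "the kernels emerge from the data" concrete, here is a toy NumPy experiment (entirely illustrative: random synthetic images, with targets generated by a known edge detector) in which gradient descent recovers a 3×3 kernel from input/output pairs alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid-mode sliding weighted sum (stride 1, square kernel)."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Pretend a vertical-edge detector generated the "labels".
target_kernel = np.array([[-1., 0., 1.]] * 3)
images = [rng.normal(size=(8, 8)) for _ in range(10)]
targets = [conv2d(img, target_kernel) for img in images]

# Start from random noise and fit the kernel by gradient descent on MSE.
kernel = rng.normal(scale=0.1, size=(3, 3))
lr = 0.005
for _ in range(200):
    grad = np.zeros_like(kernel)
    for img, t in zip(images, targets):
        err = conv2d(img, kernel) - t
        # The kernel is shared, so its gradient sums over all positions.
        for i in range(err.shape[0]):
            for j in range(err.shape[1]):
                grad += err[i, j] * img[i:i + 3, j:j + 3]
    kernel -= lr * grad / len(images)

print(np.round(kernel, 2))  # close to the -1/0/+1 edge detector
```

Nobody told the optimizer "learn an edge detector" — it fell out of minimizing the prediction error, which is exactly what happens (at much larger scale) inside a real CNN.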
Over the years, researchers have designed increasingly sophisticated CNN architectures. Here are the most influential ones:
LeNet-5 (1998): The original CNN by Yann LeCun. Used for handwritten digit recognition (MNIST).
AlexNet (2012): Won ImageNet 2012 by a huge margin. Sparked the deep learning revolution.
VGG (2014): Showed that depth matters. Used only 3×3 kernels, stacked very deep.
ResNet (2015): Introduced skip connections, enabling networks with 100+ layers without degradation.
CNNs have revolutionized football analytics by enabling automated analysis of broadcast footage, tracking data visualization, and tactical pattern recognition.
- Player and ball detection: Detect and track all players, referees, and the ball in broadcast video. Foundation for all video-based analytics.
- Pose estimation: Detect body keypoints (head, shoulders, hips, knees, feet) for biomechanical analysis of technique.
- Action recognition: Classify what's happening: shot, pass, tackle, dribble, foul. Enables automated event tagging.
- Space evaluation: Convert player positions into "heatmap-like" images showing controlled space, then use CNNs to evaluate situations.
Even when you have tracking data (x, y coordinates), you can "rasterize" it into an image-like grid. This lets you use CNNs to detect spatial patterns like "is there a passing lane?", "is the defense compact?", or "is there space behind the defensive line?" — treating tactical situations like image recognition!
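A minimal sketch of this rasterization idea (the pitch dimensions and one-metre grid layout are illustrative choices, not from any particular provider's format):

```python
import numpy as np

def rasterize_positions(positions, pitch=(105.0, 68.0), grid=(68, 105)):
    """Turn (x, y) player coordinates in metres into a binary occupancy
    grid that a CNN can treat like an image.
    `positions`: iterable of (x, y); `pitch` is (length, width) in metres;
    grid rows index pitch width, columns index pitch length."""
    img = np.zeros(grid)
    for x, y in positions:
        col = min(int(x / pitch[0] * grid[1]), grid[1] - 1)
        row = min(int(y / pitch[1] * grid[0]), grid[0] - 1)
        img[row, col] = 1.0
    return img

# Three hypothetical player positions: centre spot, deep left, far right.
players = [(52.5, 34.0), (10.0, 5.0), (90.0, 60.0)]
grid = rasterize_positions(players)
print(grid.shape, grid.sum())  # (68, 105) 3.0 -> three occupied cells
```

In practice you would stack several such channels (teammates, opponents, ball, velocities) the same way an RGB image stacks three color channels.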
While 2D convolutions are the most common, CNNs can operate on data of different dimensions:
- 1D convolutions — for sequential data: time series, audio, text
- 2D convolutions — for spatial data: images, pitch heatmaps
- 3D convolutions — for spatiotemporal data: video (height × width × time)
Standard 2D CNNs process single images — they see what is in a frame but not how things move. For understanding actions in football (a shot, a tackle, a through ball), we need to capture temporal patterns across multiple frames. Enter 3D CNNs.
Instead of a 2D kernel (e.g., 3×3) sliding across height and width, a 3D kernel (e.g., 3×3×3) slides across height, width, and time. This lets the network learn spatiotemporal features — patterns that exist both in space and across time.
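A minimal single-channel NumPy sketch of this operation (valid padding, stride 1; real 3D layers add channels, padding, and striding):

```python
import numpy as np

def conv3d(clip, kernel):
    """Slide a (kT, kH, kW) kernel over a (T, H, W) clip. Each output
    value is a weighted sum over a small volume in space AND time."""
    T, H, W = clip.shape
    kT, kH, kW = kernel.shape
    out = np.zeros((T - kT + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                vol = clip[t:t + kT, i:i + kH, j:j + kW]
                out[t, i, j] = np.sum(vol * kernel)
    return out

clip = np.ones((4, 5, 5))            # 4 frames of a 5x5 "video"
kernel = np.ones((3, 3, 3)) / 27.0   # spatiotemporal averaging kernel
print(conv3d(clip, kernel).shape)    # (2, 3, 3): shrinks in time AND space
```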
2D vs 3D Convolution

A 2D convolution slides a K×K kernel across the height and width of a single frame; a 3D convolution slides a K×K×K kernel across height, width, and time, so every output value summarizes a small volume of the clip rather than a patch of one frame.

The 3D Convolution Operation

(V * K)(t, i, j) = Σₗ Σₘ Σₙ V(t + l, i + m, j + n) · K(l, m, n)

where V is the video clip (time × height × width), K is the 3D kernel, and (l, m, n) range over the kernel's temporal and spatial extent.
What 3D Kernels Learn
Just as 2D kernels learn spatial patterns (edges, textures), 3D kernels learn spatiotemporal patterns:
- Moving edges: Detect edges that move in a specific direction — like a player running or a ball traveling.
- Implicit motion features: Learn motion without explicit optical flow computation — the network discovers motion features automatically.
- Motion primitives: Detect basic motion patterns like "leg swinging forward" or "body rotating" that compose complex actions.
- Periodic patterns: Recognize patterns that repeat over time — like a player's running gait or dribbling rhythm.
Famous 3D CNN Architectures
C3D (2015): First widely successful 3D CNN. Simple architecture using 3×3×3 kernels throughout.
I3D (2017): "Inflates" pretrained 2D ImageNet weights into 3D by copying 2D kernel weights across the temporal dimension.
SlowFast (2019): Two pathways: Slow (few frames, high resolution) for spatial details, Fast (many frames, low resolution) for motion.
Football Applications of 3D CNNs
- Action spotting: Classify events in video clips: shot, pass, cross, tackle, foul, goal, save. Essential for automated match tagging.
- Highlight generation: Score video segments by "excitement level" to automatically generate match highlights without manual editing.
- Technique analysis: Analyze actions like shooting, dribbling, or goalkeeping — assessing quality, not just classifying.
- Anticipation: Predict what will happen next from video context — will the player shoot? Pass? Which direction?
Computational Considerations
Adding the temporal dimension dramatically increases computation and memory. A 3×3×3 kernel has 3× as many weights as a 3×3 kernel (27 vs 9 per channel pair), and processing a 16-frame clip stores roughly 16× more activations than a single frame. Common strategies to manage this:
- Smaller spatial resolution (112×112 instead of 224×224)
- Fewer frames (8–32, sampled sparsely)
- (2+1)D factorization: separate 2D spatial and 1D temporal convolutions
- Mixed precision training (FP16 instead of FP32)
- 2D CNN + LSTM: extract 2D features per frame, then model temporal relationships with RNNs (lighter weight)
- Two-Stream: separate networks for RGB frames and optical flow (explicit motion)
- Transformers: increasingly popular for video understanding (ViViT, TimeSformer)
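A rough sketch of why (2+1)D factorization can shrink a layer, counting weights for a single hypothetical layer with 64 input and 64 output channels (the intermediate width `mid` is a free design knob we pick here for illustration):

```python
def kernel3d_params(c_in, c_out, kt=3, k=3):
    """Weights in one full 3D convolution: kt x k x k per channel pair."""
    return kt * k * k * c_in * c_out

def kernel2plus1d_params(c_in, c_out, mid, kt=3, k=3):
    """(2+1)D split: a 1 x k x k spatial conv into `mid` channels,
    followed by a kt x 1 x 1 temporal conv into `c_out` channels."""
    return k * k * c_in * mid + kt * mid * c_out

print(kernel3d_params(64, 64))               # 110592
print(kernel2plus1d_params(64, 64, mid=64))  # 49152 -> well under half
```

Even when `mid` is chosen so the parameter counts match, the factorized form inserts an extra nonlinearity between the spatial and temporal steps, which tends to help optimization.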
- ✓ Why MLPs fail on images (too many parameters)
- ✓ Convolution: sliding kernels over inputs
- ✓ Kernels detect local patterns (edges, textures)
- ✓ Pooling reduces size, adds invariance
- ✓ Hierarchical feature learning (edges → objects)
- ✓ Famous architectures (LeNet, AlexNet, ResNet)
- ✓ Football applications (tracking, pose, actions)
- 3. Recurrent Neural Networks (RNNs & LSTMs) — how networks learn from sequences over time
- 4. Graph Neural Networks (GNNs)
- 5. Spatiotemporal GNNs for Football
CNNs exploit the spatial structure of images through local connectivity and weight sharing. By stacking convolutional layers, they learn hierarchical representations — from simple edges to complex objects. This same principle applies to any grid-like data, including rasterized football pitch representations. But what about sequential data, like a player's movement over time? That's where RNNs come in — next article!