2. Convolutional Neural Networks (CNNs)
How neural networks learn to "see" — from detecting edges to understanding complex patterns in images, videos, and even football pitch data.
Computer Vision · Deep Learning · 40 min read
Why Do We Need CNNs?

In the previous article, we learned about fully connected neural networks (MLPs) where every neuron connects to every neuron in the next layer. They work great for tabular data like xG features, but they have a serious problem with images.

The Problem: Images Are HUGE

Consider a small 224×224 color image. That's 224 × 224 × 3 = 150,528 input values. If the first hidden layer has 1,000 neurons, we need 150,528 × 1,000 ≈ 150 million weights — just for the first layer! That is enormously wasteful, and a model of that size would massively overfit.

The Insight: Local Patterns Matter

When you look at a photo of a footballer, you don't examine every pixel independently. You recognize local patterns — an edge here, a curve there, a face, a ball, a goal. These patterns can appear anywhere in the image. A ball in the top-left corner looks the same as a ball in the bottom-right.

Convolutional Neural Networks exploit these two key insights:

1. Local Connectivity

Each neuron only looks at a small local region of the input (e.g., a 3×3 patch), not the entire image. This dramatically reduces parameters.

2. Weight Sharing

The same filter (weights) is applied everywhere across the image. If we learn an edge detector, it works in all locations.

The Convolution Operation
The core building block of CNNs

Convolution is the heart of CNNs. It's a simple operation: slide a small matrix called a kernel (or filter) across the input image, and at each position, compute a weighted sum.

1. Position Kernel

Place the kernel over a small region of the input image (e.g., top-left corner)

2. Multiply & Sum

Multiply each kernel value by the corresponding input value, then sum all products

3. Slide & Repeat

Move the kernel to the next position and repeat, producing an output value at each location

Convolution Operation Visualized
A 3×3 kernel slides across a 5×5 input, producing one output value at each position. At the position shown, the input patch is:
| 1 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
The kernel (filter) is:
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
Element-wise multiply, then sum:
(1×1) + (0×0) + (1×-1) + (0×1) + (1×0) + (0×-1) + (1×1) + (0×0) + (1×-1)
= 1 + 0 - 1 + 0 + 0 + 0 + 1 + 0 - 1
= 0
The kernel then slides to the next position, filling in the rest of the output.

The Math Behind Convolution

In words:
For each position (i, j) in the output, overlay the kernel on the input centered at (i, j), multiply corresponding elements, and sum everything up.
In math:
Output(i, j) = Σₘ Σₙ Input(i+m, j+n) × Kernel(m, n)
Where m and n iterate over the kernel dimensions (e.g., -1 to 1 for a 3×3 kernel)

Symbol Definitions

Kernel: Small matrix of learnable weights (e.g., 3×3)
Stride: How many pixels to move the kernel each step
Padding: Extra zeros added around input edges
Feature Map: Output of applying a kernel (also called activation map)

Worked Example: Edge Detection

Input Patch (3×3):
| 100 | 100 | 50 |
| 100 | 100 | 50 |
| 100 | 100 | 50 |
Bright on left, dark on right
Vertical Edge Kernel (3×3):
| -1 | 0 | +1 |
| -1 | 0 | +1 |
| -1 | 0 | +1 |
Detects vertical brightness changes
Calculation:
Output = (100×-1) + (100×0) + (50×+1) +
(100×-1) + (100×0) + (50×+1) +
(100×-1) + (100×0) + (50×+1)

Output = -100 + 0 + 50 - 100 + 0 + 50 - 100 + 0 + 50
Output = -150
Result: Strong negative value (-150) indicates a vertical edge! (bright-to-dark transition from left to right)
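This calculation is easy to verify in code. Here is a minimal NumPy sketch of a "valid" convolution (note: deep-learning "convolution" is technically cross-correlation, since the kernel is not flipped), applied to the patch and kernel above:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (cross-correlation, as used in
    deep learning): slide the kernel over the image and compute a
    weighted sum at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

patch = np.array([[100, 100, 50],
                  [100, 100, 50],
                  [100, 100, 50]])
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

print(conv2d_valid(patch, vertical_edge))  # [[-150.]]
```

The same function works for any input size: a 5×5 input with a 3×3 kernel yields a 3×3 feature map.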
What Do Kernels Detect?
From edges to complex features

Different kernels detect different features. In traditional image processing, engineers designed these by hand. The magic of CNNs is that they learn optimal kernels automatically from data!

Common Filter Types
Vertical Edge:
| -1 | 0 | 1 |
| -1 | 0 | 1 |
| -1 | 0 | 1 |
Detects left-right brightness changes

Horizontal Edge:
| -1 | -1 | -1 |
| 0 | 0 | 0 |
| 1 | 1 | 1 |
Detects top-bottom brightness changes

Blur (Average):
| 1/9 | 1/9 | 1/9 |
| 1/9 | 1/9 | 1/9 |
| 1/9 | 1/9 | 1/9 |
Averages neighbors, smooths noise

Sharpen:
| 0 | -1 | 0 |
| -1 | 5 | -1 |
| 0 | -1 | 0 |
Enhances center vs. neighbors

CNNs learn these filters automatically from data!

Hierarchical Feature Learning

The real power emerges when we stack multiple convolutional layers. Each layer builds on the previous one, detecting increasingly complex patterns:

Layer 1: Edges & Gradients

Detects simple patterns like horizontal, vertical, and diagonal edges. The building blocks of all visual patterns.

Layer 2: Textures & Patterns

Combines edges to detect textures (grass, jersey patterns), corners, and simple shapes. In football: the hexagonal pattern of a ball.

Layer 3: Object Parts

Detects meaningful parts: a player's head, a foot, the goal frame, jersey numbers. Combines textures into recognizable elements.

Layer 4+: Whole Objects & Scenes

Recognizes complete objects: "that's a goalkeeper diving", "that's a corner kick situation", "that's an offside position".

Football Analogy

When watching a match, you don't consciously think "I see a vertical edge next to a curved edge forming a circular pattern on green texture" — you just see "ball on the pitch". CNNs work the same way: early layers see edges, later layers see "ball on pitch".

Pooling: Shrinking While Keeping What Matters
Reducing size and adding translation invariance

After convolution, we often apply pooling (also called subsampling). Pooling reduces the spatial dimensions of feature maps, which:

  • Reduces computation — fewer values to process in subsequent layers
  • Provides translation invariance — small shifts in input don't change output much
  • Increases receptive field — later layers "see" more of the original image
2×2 Max Pooling (stride 2)
Input (4×4):
| 1 | 3 | 2 | 1 |
| 2 | 9 | 1 | 1 |
| 1 | 3 | 2 | 3 |
| 5 | 6 | 1 | 2 |
Take the max from each 2×2 region.
Output (2×2):
| 9 | 2 |
| 6 | 3 |
Reduces size by 2×, keeps the strongest signals.

Types of Pooling

Max Pooling
Takes the maximum value from each pooling window.
MaxPool([1,3,2,9]) = 9
Most common — keeps strongest activations
Average Pooling
Takes the average value from each pooling window.
AvgPool([1,3,2,9]) = 3.75
Smoother — often used in final layers
Why Max Pooling Works

If a kernel is looking for a vertical edge, we care about whether an edge exists in a region, not exactly where. Max pooling says: "There's a strong edge somewhere in this 2×2 area" — the precise location within that region doesn't matter.
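Both pooling types can be sketched in a few lines of NumPy; the 4×4 input below matches the max-pooling example above:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(pool2d(x, mode="max"))  # max of each 2×2 region: [[9, 2], [6, 3]]
print(pool2d(x, mode="avg"))  # mean of each region: [[3.75, 1.25], [3.75, 2.0]]
```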

Complete CNN Architecture
Putting it all together

A complete CNN combines multiple convolutional layers, pooling layers, and finally fully-connected layers for the final prediction. Here's the typical flow:

CNN Architecture Overview
Input (32×32×3) → Conv1 (28×28×32) → Pool1 (14×14×32) → Conv2 (10×10×64) → Pool2 (5×5×64) → Flatten (1600) → FC (128) → Output (3 classes)
Convolution → Pooling → Fully Connected

The Standard Pattern

1. Input: Raw image (e.g., 224×224×3 for RGB)
2. Conv + ReLU: Apply kernels to detect features, add non-linearity
3. Pooling: Reduce spatial size (e.g., 224→112→56→28)
4. Repeat 2-3: Stack more Conv+Pool blocks, increasing depth (channels)
5. Flatten: Convert 3D feature maps to 1D vector
6. Fully Connected: Regular neural network layers for final reasoning
7. Output: Softmax for classification, or other activation for regression

Output Size Formula

After convolution, the output size depends on input size, kernel size, stride, and padding:

Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1
Example: Input=32, Kernel=5, Padding=0, Stride=1
Output = (32 - 5 + 0) / 1 + 1 = 28
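The formula is simple enough to sketch as a helper function and chain through the example architecture (assuming 5×5 kernels with no padding and 2×2 pooling with stride 2, consistent with the 32→28→14 steps in the diagram):

```python
def conv_output_size(inp, kernel, padding=0, stride=1):
    """Spatial output size of a convolution (floor division)."""
    return (inp - kernel + 2 * padding) // stride + 1

# The example from the text: 32x32 input, 5x5 kernel, no padding, stride 1
print(conv_output_size(32, 5))              # 28

# Chaining through the example architecture:
size = 32
size = conv_output_size(size, 5)            # Conv1 -> 28
size = conv_output_size(size, 2, stride=2)  # Pool1 (2x2, stride 2) -> 14
size = conv_output_size(size, 5)            # Conv2 -> 10
size = conv_output_size(size, 2, stride=2)  # Pool2 -> 5
print(size * size * 64)                     # Flatten: 5*5*64 = 1600
```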

Parameter Counting Example

Conv Layer: 3×3 kernel, 32 input channels, 64 output channels
Parameters = (kernel_h × kernel_w × in_channels + 1) × out_channels
Parameters = (3 × 3 × 32 + 1) × 64
Parameters = 289 × 64 = 18,496
The "+1" is for the bias term per output channel

Compare this to a fully-connected layer with the same input/output: if the input were a 32×32 image with 32 channels flattened (32,768 values) connecting to 64 neurons, that's 32,768 × 64 = 2.1 million parameters! Convolution is far more efficient.
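As a sanity check, both counts can be computed directly (ignoring biases in the fully-connected case, as the comparison in the text does):

```python
def conv_params(kh, kw, in_ch, out_ch):
    """Learnable parameters in a conv layer: kernel weights
    plus one bias per output channel."""
    return (kh * kw * in_ch + 1) * out_ch

def fc_params(n_in, n_out):
    """Weights in a fully connected layer (biases ignored here)."""
    return n_in * n_out

print(conv_params(3, 3, 32, 64))   # 18496
print(fc_params(32 * 32 * 32, 64)) # 2097152 (~2.1 million)
```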

Training CNNs
Same principles, spatial awareness

CNNs are trained exactly like regular neural networks — using backpropagation and gradient descent. The key difference is that gradients flow through convolutional and pooling layers, updating the kernel weights.

1. Forward pass: Image flows through Conv→Pool→Conv→Pool→FC→Output
2. Compute loss: Compare prediction to ground truth label
3. Backward pass: Compute gradients for all kernels and FC weights
4. Update: Adjust kernel values to reduce loss
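As a toy illustration of steps 3 and 4 (not a full CNN training loop), here is a single 3×3 kernel learning to become a vertical-edge detector by gradient descent. A hypothetical "teacher" kernel stands in for the labels; in a real CNN the gradients would come from the task loss instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: we want the kernel to discover a vertical-edge
# detector purely from (input patch, desired response) examples.
target_kernel = np.array([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])

kernel = rng.normal(size=(3, 3))  # random init: detects meaningless noise
lr = 0.01

for step in range(2000):
    patch = rng.normal(size=(3, 3))            # random training patch
    y_true = np.sum(patch * target_kernel)     # desired response
    y_pred = np.sum(patch * kernel)            # current response
    # Gradient of Loss = (y_pred - y_true)^2 with respect to the kernel
    kernel -= lr * 2 * (y_pred - y_true) * patch

print(np.round(kernel, 2))  # close to the target vertical-edge kernel
```

Because the convolution response is linear in the kernel weights, this is effectively stochastic gradient descent on a tiny regression problem, and the kernel converges to the edge detector without ever being told what an "edge" is.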

What Gets Learned?

The kernel values are the learnable parameters. During training, the network discovers which patterns are useful for the task:

Before Training

Kernels are initialized randomly. They detect meaningless noise patterns.

After Training

Kernels have learned to detect edges, textures, and task-relevant features.

The Beauty of Learning

You never tell the CNN "look for edges" or "detect circles". You just show it thousands of labeled examples, and it discovers that detecting edges in layer 1, combining them into shapes in layer 2, and recognizing objects in layer 3 is the optimal strategy. The kernels emerge from the data!

Famous CNN Architectures
Giants that shaped computer vision

Over the years, researchers have designed increasingly sophisticated CNN architectures. Here are the most influential ones:

LeNet-5 (1998)
Pioneer

The original CNN by Yann LeCun. Used for handwritten digit recognition (MNIST).

Architecture: Conv→Pool→Conv→Pool→FC→FC→Output | ~60K params
AlexNet (2012)
Breakthrough

Won ImageNet 2012 by a huge margin. Sparked the deep learning revolution.

Key innovations: ReLU activation, dropout, GPU training | ~60M params
VGGNet (2014)
Simplicity

Showed that depth matters. Used only 3×3 kernels, stacked very deep.

VGG-16: 16 weight layers, all 3×3 convolutions | ~138M params
ResNet (2015)
Revolutionary

Introduced skip connections, enabling networks with 100+ layers without degradation.

Key insight: Learn residual F(x) = H(x) - x instead of H(x) directly | ResNet-50: ~25M params
ImageNet Top-5 Error Rate Over Time
2010: 28%
2012: 16%
2014: 7%
2015: 3.6%
Human: ~5%
CNNs now surpass human-level performance on ImageNet classification!
Football Applications
CNNs on the pitch

CNNs have revolutionized football analytics by enabling automated analysis of broadcast footage, tracking data visualization, and tactical pattern recognition.

Player Detection & Tracking

Detect and track all players, referees, and the ball in broadcast video. Foundation for all video-based analytics.

Models: YOLO, Faster R-CNN, SSD | Output: Bounding boxes + player IDs per frame
Pose Estimation

Detect body keypoints (head, shoulders, hips, knees, feet) for biomechanical analysis of technique.

Applications: Shooting technique analysis, injury risk assessment, fatigue detection
Action Recognition

Classify what's happening: shot, pass, tackle, dribble, foul. Enables automated event tagging.

Approach: 3D CNNs or CNN + LSTM to capture temporal patterns in video clips
Pitch Control & Space Analysis

Convert player positions into "heatmap-like" images showing controlled space, then use CNNs to evaluate situations.

Input: Rasterized pitch with player positions | Output: Pass success probability, threat level
Why CNNs for Spatial Football Data?

Even when you have tracking data (x, y coordinates), you can "rasterize" it into an image-like grid. This lets you use CNNs to detect spatial patterns like "is there a passing lane?", "is the defense compact?", or "is there space behind the defensive line?" — treating tactical situations like image recognition!
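As an illustration of rasterization, here is one possible encoding: player coordinates binned into an occupancy grid. The pitch dimensions, grid resolution, example positions, and plain-count encoding are all illustrative assumptions, not a standard:

```python
import numpy as np

def rasterize_positions(positions, pitch=(105.0, 68.0), grid=(52, 34)):
    """Convert player (x, y) coordinates in metres into a 2D occupancy
    grid: an image-like input a CNN can process. Pitch size and grid
    resolution are assumed values for this sketch."""
    gx, gy = grid
    img = np.zeros((gy, gx))  # rows = pitch width, cols = pitch length
    for x, y in positions:
        i = min(int(y / pitch[1] * gy), gy - 1)  # row index
        j = min(int(x / pitch[0] * gx), gx - 1)  # column index
        img[i, j] += 1.0
    return img

# Hypothetical attacking-team positions (x, y in metres)
players = [(52.5, 34.0), (80.0, 20.0), (80.0, 48.0), (95.0, 34.0)]
img = rasterize_positions(players)
print(img.shape, img.sum())  # (34, 52) 4.0
```

In practice you would stack several such channels (home team, away team, ball, velocities) into a multi-channel "pitch image", exactly like the RGB channels of a photo.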

Beyond 2D: Other Convolution Types

While 2D convolutions are the most common, CNNs can operate on data of different dimensions:

1D Convolution

For sequential data: time series, audio, text

Football use: Player speed/acceleration patterns over time
2D Convolution

For spatial data: images, pitch heatmaps

Football use: Frame-by-frame video analysis, pitch control maps
3D Convolution

For spatiotemporal data: video (height × width × time)

Football use: Action recognition in video clips
3D CNNs: Adding the Time Dimension
Understanding motion and temporal patterns in video

Standard 2D CNNs process single images — they see what is in a frame but not how things move. For understanding actions in football (a shot, a tackle, a through ball), we need to capture temporal patterns across multiple frames. Enter 3D CNNs.

The Key Insight

Instead of a 2D kernel (e.g., 3×3) sliding across height and width, a 3D kernel (e.g., 3×3×3) slides across height, width, and time. This lets the network learn spatiotemporal features — patterns that exist both in space and across time.

2D vs 3D Convolution

2D Convolution
Input: Single image (H × W × C)
Kernel: 3 × 3 × C
Slides over: Height and Width
Output: 2D feature map
Captures: "There's a ball at position (x, y)"
3D Convolution
Input: Video clip (T × H × W × C)
Kernel: 3 × 3 × 3 × C
Slides over: Time, Height, and Width
Output: 3D feature map
Captures: "The ball is moving left-to-right"

The 3D Convolution Operation

In words:
Take a 3D kernel (e.g., 3 frames × 3 pixels × 3 pixels) and slide it through the video volume. At each position, multiply corresponding elements and sum — just like 2D, but now including the temporal dimension.
In math:
Output(t, i, j) = Σ_τ Σₘ Σₙ Input(t+τ, i+m, j+n) × Kernel(τ, m, n)
Where τ iterates over the temporal dimension of the kernel (e.g., -1 to 1 for a kernel spanning 3 frames)
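The operation can be sketched just like the 2D case. In this toy example (single channel, a "ball" reduced to one bright pixel per frame — an assumption for illustration), a hand-built kernel whose weights shift one pixel per frame responds strongly to left-to-right motion:

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over (time, height, width),
    single channel."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    ot, oh, ow = T - kt + 1, H - kh + 1, W - kw + 1
    out = np.zeros((ot, oh, ow))
    for t in range(ot):
        for i in range(oh):
            for j in range(ow):
                block = video[t:t + kt, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(block * kernel)
    return out

video = np.zeros((4, 8, 8))      # 4 frames of 8x8 pixels
for t in range(4):
    video[t, 3, t + 2] = 1.0     # "ball" moves one pixel right per frame

kernel = np.zeros((3, 3, 3))     # weights on a diagonal through time:
kernel[0, 1, 0] = 1.0            # frame t:   left
kernel[1, 1, 1] = 1.0            # frame t+1: centre
kernel[2, 1, 2] = 1.0            # frame t+2: right

out = conv3d_valid(video, kernel)
print(out.shape)  # (2, 6, 6)
print(out.max())  # 3.0 where the rightward motion matches the kernel
```

A stationary ball would only ever overlap one of the three kernel weights at a time, so the response would peak at 1.0 instead of 3.0; the kernel is genuinely a motion detector, not just a ball detector.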

What 3D Kernels Learn

Just as 2D kernels learn spatial patterns (edges, textures), 3D kernels learn spatiotemporal patterns:

Motion Edges

Detect edges that move in a specific direction — like a player running or a ball traveling.

Optical Flow Patterns

Learn implicit motion without explicit optical flow computation — the network discovers motion features automatically.

Action Primitives

Detect basic motion patterns like "leg swinging forward" or "body rotating" that compose complex actions.

Temporal Textures

Recognize patterns that repeat over time — like a player's running gait or dribbling rhythm.

Famous 3D CNN Architectures

C3D (2015)
Pioneer

First widely successful 3D CNN. Simple architecture using 3×3×3 kernels throughout.

Input: 16 frames × 112 × 112 | Use: Generic video feature extraction
I3D (2017)
Inflated 2D

"Inflates" pretrained 2D ImageNet weights into 3D. Copies 2D kernel weights across the temporal dimension.

Key insight: Transfer learning from images to video | Boost: Huge accuracy gains
SlowFast (2019)
Dual Pathway

Two pathways: Slow (few frames, high resolution) for spatial details, Fast (many frames, low resolution) for motion.

Slow: 4 frames, 64 channels | Fast: 32 frames, 8 channels | State-of-the-art

Football Applications of 3D CNNs

Action Recognition

Classify events in video clips: shot, pass, cross, tackle, foul, goal, save. Essential for automated match tagging.

Input: 2-5 second clip centered on event | Output: Action class probabilities
Highlight Detection

Score video segments by "excitement level" to automatically generate match highlights without manual editing.

Approach: Predict highlight probability per clip, threshold and stitch together
Skill Assessment

Analyze technique in actions like shooting, dribbling, or goalkeeping — assess quality, not just classify.

Output: Quality score, technique breakdown, comparison to reference
Anticipation & Prediction

Predict what will happen next from video context — will the player shoot? Pass? Which direction?

Use case: Goalkeeper training, defensive positioning analysis

Computational Considerations

3D CNNs Are Expensive!

Adding the temporal dimension dramatically increases computation and memory. A 3×3×3 kernel has 3× more parameters than a 3×3 kernel, and processing 16 frames requires 16× more activations. Common strategies to manage this:

  • Smaller spatial resolution (112×112 instead of 224×224)
  • Fewer frames (8-32 frames, sampled sparsely)
  • (2+1)D factorization: Separate 2D spatial and 1D temporal convolutions
  • Mixed precision training (FP16 instead of FP32)
When to Use 3D CNNs vs. Alternatives
Use 3D CNNs when: You need to understand motion/actions from raw video, have enough compute, and temporal patterns are complex.
Consider alternatives when:
  • 2D CNN + LSTM: Extract 2D features per frame, then model temporal relationships with RNNs (lighter weight)
  • Two-Stream: Separate networks for RGB frames and optical flow (explicit motion)
  • Transformers: Increasingly popular for video understanding (ViViT, TimeSformer)
Summary & What's Next
What You Learned
  • ✓ Why MLPs fail on images (too many parameters)
  • ✓ Convolution: sliding kernels over inputs
  • ✓ Kernels detect local patterns (edges, textures)
  • ✓ Pooling reduces size, adds invariance
  • ✓ Hierarchical feature learning (edges → objects)
  • ✓ Famous architectures (LeNet, AlexNet, ResNet)
  • ✓ Football applications (tracking, pose, actions)
Coming Next in This Series
  • 3. Recurrent Neural Networks (RNNs & LSTMs): how networks learn from sequences over time
  • 4. Graph Neural Networks (GNNs)
  • 5. Spatiotemporal GNNs for Football
Key Takeaway

CNNs exploit the spatial structure of images through local connectivity and weight sharing. By stacking convolutional layers, they learn hierarchical representations — from simple edges to complex objects. This same principle applies to any grid-like data, including rasterized football pitch representations. But what about sequential data, like a player's movement over time? That's where RNNs come in — next article!