1. Intro to Neural Networks
A comprehensive, beginner-friendly guide to understanding neural networks from the ground up — with intuitive explanations, mathematical foundations, and football examples.
Deep Learning · Beginner Friendly · 45 min read
What is a Neural Network?

A neural network is a computer system inspired by how the human brain works. Just as your brain uses billions of interconnected neurons to process information, recognize faces, and make decisions, artificial neural networks use mathematical functions connected together to learn patterns from data.

Think of it like this:

Imagine teaching a child to recognize a dog. You show them hundreds of pictures of dogs, pointing out features like four legs, fur, a tail, and floppy ears. Eventually, they learn to recognize dogs they have never seen before. Neural networks learn the same way — by seeing many examples and gradually figuring out the patterns that define each category.

More formally, a neural network is a parametric function — a mathematical formula with adjustable numbers (called parameters) that can be tuned to produce desired outputs. We write this as:

output = f(input; parameters)
ŷ = f(x; θ)
x = input data (e.g., shot distance, angle, pressure)
θ (theta) = learnable parameters (weights and biases)
ŷ (y-hat) = predicted output (e.g., probability of goal)

In football analytics, neural networks can learn to predict outcomes like:

  • Expected Goals (xG): Given shot distance, angle, and defender pressure, predict the probability of scoring
  • Trajectory Prediction: Given a player's past positions, predict where they'll be in 2 seconds
  • Match Outcome: Given team statistics, predict home win, draw, or away win probabilities
The Building Block: A Single Neuron
Understanding the fundamental unit of neural networks

Every neural network is built from simple units called neurons (also called nodes or units). A single neuron does something remarkably simple — it takes numbers in, does some basic math, and produces a number out.

1. Receive Inputs

Takes in one or more numbers (like shot distance, angle, defender pressure)

2. Weight & Sum

Multiplies each input by a weight (importance), then adds them all together

3. Activate

Passes the sum through a function to produce the final output

Single Neuron Architecture
(Diagram: inputs x1 … xn are each multiplied by a weight w1 … wn, summed together with the bias b, and passed through an activation function to produce the output y.)

The Math Behind a Neuron

Here's what a single neuron computes, shown in both plain English and mathematical notation:

Step 1: Weighted Sum
In words: Multiply each input by its weight, add them all up, then add the bias
z = (w1 × x1) + (w2 × x2) + ... + (wn × xn) + b
Step 2: Activation
In words: Pass the sum through an activation function to get the output
y = activation_function(z)

Symbol Definitions

x — input values (features like distance, angle)
w — weights (how important each input is)
b — bias (baseline adjustment term)
z — weighted sum (before activation)
φ — activation function (adds non-linearity)
y — output (the neuron's final answer)

Worked Example: Predicting if a Shot is On Target

Our Inputs (x):
x1 = 15 (shot distance in meters)
x2 = 25 (shot angle in degrees)
x3 = 0.3 (defender pressure, 0-1)
Our Parameters (learned from data):
w1 = -0.05 (further = less accurate)
w2 = +0.02 (better angle helps)
w3 = -0.80 (pressure hurts a lot!)
b = +1.20 (bias/baseline)
Step 1: Calculate the Weighted Sum (z)
z = (w1 × x1) + (w2 × x2) + (w3 × x3) + b
z = (-0.05 × 15) + (0.02 × 25) + (-0.80 × 0.3) + 1.20
z = -0.75 + 0.50 - 0.24 + 1.20
z = 0.71
Step 2: Apply Sigmoid Activation
y = sigmoid(z) = 1 / (1 + e^(-z))
y = 1 / (1 + e^(-0.71))
y = 1 / (1 + 0.49)
y = 1 / 1.49
y = 0.67
Result: 67% probability the shot is on target!
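The arithmetic above can be checked in a few lines of Python. The weights here are the illustrative hand-picked values from the example, not parameters learned from real data:

```python
import math

# Inputs: shot distance (m), shot angle (deg), defender pressure (0-1)
x = [15, 25, 0.3]
# Hand-picked parameters from the worked example (illustrative, not learned)
w = [-0.05, 0.02, -0.80]
b = 1.20

# Step 1: weighted sum
z = sum(wi * xi for wi, xi in zip(w, x)) + b
# Step 2: sigmoid activation
y = 1 / (1 + math.exp(-z))

print(round(z, 2), round(y, 2))  # 0.71 0.67
```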
Weights and Biases: The Learnable Parameters

The magic of neural networks lies in their ability to learn the right weights and biases from data. Initially, these start as random numbers. Through training, the network adjusts them to make better predictions.

Weights (w)

Weights determine how important each input is:

  • Large positive: Input strongly increases output
  • Large negative: Input strongly decreases output
  • Near zero: Input barely matters
Bias (b)

The bias is like a baseline or starting point:

  • Shifts the output up or down
  • Works regardless of input values
  • The neuron's default tendency
Football Analogy

Imagine you're a scout evaluating strikers. You might weight finishing ability heavily (w=0.9), pace moderately (w=0.5), and heading lightly (w=0.2). Over time, you adjust these weights based on which strikers actually perform well. That's exactly what neural networks do!

Activation Functions: Adding Non-Linearity
Why neurons need more than just weighted sums

If neurons only computed weighted sums, stacking layers would be pointless — the whole network would just be one linear equation. Activation functions add non-linearity, allowing networks to learn complex, curved patterns.

ReLU (Rectified Linear Unit)
If negative, output zero. Otherwise, pass through.
ReLU(z) = max(0, z)
Simple, fast, most popular choice
Sigmoid
Squashes any number to a probability (0 to 1).
σ(z) = 1 / (1 + e^(-z))
Great for binary classification output
Tanh
Like sigmoid, but ranges from -1 to 1.
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Zero-centered, often trains faster
Softmax
Converts a list of numbers into probabilities that sum to 1.
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
Perfect for multi-class (win/draw/lose)
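All four activations are only a line or two of Python, using nothing beyond the standard library; the formulas match the definitions above:

```python
import math

def relu(z):
    """If negative, output zero; otherwise pass through."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def tanh(z):
    """Like sigmoid, but zero-centered in (-1, 1)."""
    return math.tanh(z)

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]     # the result is mathematically unchanged
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))                   # 0.0 3.0
print(round(sigmoid(0.0), 2))                  # 0.5
print(round(sum(softmax([2.0, 1.0, 0.1])), 6)) # 1.0
```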
Layers: From Simple to Deep Networks
How stacking neurons creates powerful models

A single neuron can only learn simple patterns. The real power comes from connecting many neurons together in layers. Each layer transforms its inputs, passing results to the next layer.

Feedforward Neural Network Architecture
(Diagram: a feedforward network with an input layer of four neurons x1–x4, two hidden layers (h1–h3 and h1–h2) connected by weight matrices W1, W2, W3, and a single output neuron y.)
Input Layer

Receives raw data. Each neuron = one feature (distance, angle, etc.). Just passes data forward.

Hidden Layers

Where the magic happens! Early layers detect simple patterns, later layers combine them into complex concepts. More layers = deeper network = can learn more complex patterns.

Output Layer

Produces the final prediction. For xG: 1 neuron with sigmoid. For match outcome: 3 neurons with softmax.
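A forward pass through stacked layers is just the single-neuron computation repeated: each layer is a list of neurons, and each neuron reads every output of the previous layer. A minimal sketch with made-up weights (a real network would learn them from data):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases, activation):
    """One dense layer: each row of `weights` belongs to one neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0]                                                 # 2 input features
h = layer(x, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1], math.tanh)  # hidden layer, 2 neurons
y = layer(h, [[0.7, -0.5]], [0.2], sigmoid)                     # output layer, 1 neuron
print(y)  # a single probability, as in an xG-style model
```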

How Neural Networks Learn: The Training Loop
The process of adjusting weights to make better predictions

Training is like teaching a student through practice tests. You show examples, check answers, explain mistakes, and repeat. This happens through a four-step cycle:

1
Forward Pass

Feed input through network to get prediction

2
Calculate Loss

Measure how wrong the prediction was

3
Backpropagation

Find how each weight contributed to error

4
Gradient Descent

Adjust weights to reduce error, repeat!

The Training Loop

These four steps repeat thousands of times. Each cycle = one iteration. One pass through entire dataset = one epoch. With each iteration, weights gradually shift toward values that produce accurate predictions.

Step 2: Loss Functions — Measuring Error

The loss function measures how wrong our predictions are. Higher = worse, lower = better. Training tries to minimize this number.

Loss = L(prediction, actual)
L = L(ŷ, y)
Mean Squared Error (MSE)
For regression (continuous values like xG)
L = (1/N) × Σ(ŷ - y)²
Average of squared differences
Binary Cross-Entropy (BCE)
For binary classification (goal/no goal)
L = -[y log(ŷ) + (1-y) log(1-ŷ)]
Penalizes confident wrong predictions heavily
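Both loss functions can be written directly from their formulas. The sample predictions below are illustrative; the small `eps` is a common practical guard against taking log(0):

```python
import math

def mse(y_pred, y_true):
    """Mean squared error for regression targets."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def bce(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(y_pred, y_true)) / len(y_true)

print(round(mse([0.2, 0.9], [0.0, 1.0]), 3))  # 0.025
# A confident wrong prediction (0.9 for a non-goal) is punished heavily:
print(round(bce([0.9], [0.0]), 2))            # 2.3
```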
Step 3: Backpropagation — Finding Who's Responsible

We know the error (loss), but the network has thousands of weights. Which weights caused the error? By how much? Backpropagation answers this using the chain rule from calculus.

The Key Insight

Backprop works backwards from output to input. For each weight, it calculates: "If I nudge this weight slightly, how much does the loss change?" This is called the gradient — it tells us the direction and magnitude of change needed.

The Chain Rule

If y = f(g(x)), then:
dy/dx = (dy/dg) × (dg/dx)
We chain derivatives through each layer from output to input
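A quick numerical sanity check of the chain rule: take y = sin(x²), so g(x) = x² and f(g) = sin(g). The analytic derivative from the chain rule can be compared against a finite-difference estimate (this is purely illustrative, not part of any training code):

```python
import math

def y(x):
    return math.sin(x ** 2)

x = 1.5
# Chain rule: dy/dx = (dy/dg) * (dg/dx) = cos(x^2) * 2x
analytic = math.cos(x ** 2) * 2 * x
# Central finite-difference approximation of the same derivative
h = 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-6)  # True: the two agree
```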

What Backprop Computes

For each weight w, backprop calculates the gradient: ∂L/∂w (how loss changes when we change w). This tells us:

  • Positive gradient: Increasing w increases loss → we should decrease w
  • Negative gradient: Increasing w decreases loss → we should increase w
  • Large gradient: This weight has big impact on the error
  • Small gradient: This weight barely affects the error
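For a single neuron with a sigmoid output and binary cross-entropy loss, the gradients have a famously clean closed form: ∂L/∂w_i = (ŷ − y) · x_i and ∂L/∂b = ŷ − y. A sketch reusing the worked-example numbers, where we assume for illustration that the shot really was on target (y = 1):

```python
import math

x = [15, 25, 0.3]          # features from the worked example
w = [-0.05, 0.02, -0.80]
b = 1.20
y_true = 1.0               # assumed label: the shot was on target

z = sum(wi * xi for wi, xi in zip(w, x)) + b
y_hat = 1 / (1 + math.exp(-z))          # about 0.67

# For sigmoid + binary cross-entropy: dL/dz = y_hat - y_true
error = y_hat - y_true                  # negative: the prediction was too low
grads_w = [error * xi for xi in x]      # dL/dw_i = (y_hat - y_true) * x_i
grad_b = error                          # dL/db   = y_hat - y_true

# All gradients are negative, so gradient descent will *increase* these weights
print([round(g, 2) for g in grads_w])
```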
Step 4: Gradient Descent — Walking Downhill

Now we know how each weight contributes to the error (from backprop). Gradient descent uses this information to update the weights, nudging them in the direction that reduces loss.

Gradient Descent Visualization
(Diagram: a curved loss surface over the weights and biases. From a random high-loss starting point, repeated gradient steps move downhill until the minimum-loss point, the optimal weights, is reached.)
The Intuition

Imagine you're blindfolded on a hilly landscape, trying to reach the lowest valley (minimum loss). You can feel the slope beneath your feet (the gradient). Gradient descent says: "Always step in the direction that goes most steeply downhill." Repeat until you reach the bottom!

The Update Rule

new_weight = old_weight - learning_rate × gradient
w_new = w_old - α × (∂L/∂w)
α (alpha) = learning rate (step size, typically 0.001 to 0.01)
∂L/∂w = gradient (from backpropagation)
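The update rule is easiest to see on a toy one-dimensional loss. Here we minimize the bowl-shaped L(w) = (w − 3)², whose minimum is at w = 3 and whose gradient dL/dw = 2(w − 3) we can write by hand:

```python
# Gradient descent on L(w) = (w - 3)^2, minimum at w = 3
w = 0.0          # starting point (real networks start from small random values)
alpha = 0.1      # learning rate

for step in range(100):
    grad = 2 * (w - 3)       # gradient (in a network, this comes from backprop)
    w = w - alpha * grad     # the update rule: step downhill

print(round(w, 4))  # 3.0 -- converged to the minimum
```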

How Backprop and Gradient Descent Work Together

1. Forward pass: compute the prediction ŷ from input x
2. Loss: calculate how wrong we were: L = L(ŷ, y)
3. Backprop: compute the gradient ∂L/∂w for every weight
4. Gradient descent: update each weight: w = w − α × ∂L/∂w
5. Repeat thousands of times until the loss is minimized!
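Putting all the steps together, here is a complete (toy) training loop that fits a single sigmoid neuron to four made-up shots. The dataset, labels, and hyperparameters are all invented for illustration:

```python
import math

# Toy dataset: [distance/10, pressure] -> 1 if on target, 0 if not (invented)
data = [([0.5, 0.1], 1.0), ([2.5, 0.9], 0.0),
        ([0.8, 0.2], 1.0), ([2.0, 0.8], 0.0)]

w, b = [0.0, 0.0], 0.0       # start simple (real code uses small random values)
alpha = 0.5                  # learning rate

for epoch in range(2000):                                      # repeat many times
    for x, y_true in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b           # 1. forward pass
        y_hat = 1 / (1 + math.exp(-z))
        error = y_hat - y_true                                 # 2+3. loss gradient (sigmoid + BCE)
        w = [wi - alpha * error * xi for wi, xi in zip(w, x)]  # 4. gradient descent
        b = b - alpha * error

def predict(x):
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# After training: close unpressured shots score high, far pressured shots low
print(predict([0.6, 0.1]) > 0.8, predict([2.4, 0.9]) < 0.2)  # True True
```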

Learning Rate Effects

α too large

Overshoots minimum, loss oscillates or explodes

α just right

Converges smoothly to minimum

α too small

Converges very slowly, may get stuck
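All three regimes can be seen on the same bowl-shaped loss L(w) = w², whose gradient is 2w. Each update multiplies w by (1 − 2α), so α = 0.1 decays smoothly, α = 1.1 flips sign and grows each step, and α = 0.001 barely moves:

```python
def run(alpha, steps=30, w0=1.0):
    """Gradient descent on L(w) = w^2 (gradient 2w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w   # update rule; equivalent to w *= (1 - 2*alpha)
    return w

print(abs(run(0.1)) < 0.01)    # True: just right, converges toward 0
print(abs(run(1.1)) > 10.0)    # True: too large, |w| oscillates and explodes
print(abs(run(0.001)) > 0.9)   # True: too small, barely moved after 30 steps
```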

Preventing Overfitting: Regularization

Networks can "memorize" training data instead of learning general patterns. This is called overfitting. Regularization techniques prevent this.

L2 Regularization (Weight Decay)

Add penalty for large weights to loss function. Encourages smaller, more distributed weights.

Dropout

Randomly "turn off" neurons during training. Forces network to not rely on any single neuron.

Early Stopping

Stop training when validation loss starts increasing, before overfitting occurs.

Batch Normalization

Normalize layer inputs. Stabilizes training and acts as mild regularizer.
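L2 regularization is the easiest of these to show in code: a penalty λ·Σw² is added to the loss, so larger weights cost more. The λ value here is an illustrative hyperparameter, not a recommended setting:

```python
def l2_regularized_loss(base_loss, weights, lam=0.01):
    """Total loss = data loss + lambda * sum of squared weights."""
    return base_loss + lam * sum(w ** 2 for w in weights)

# The penalty grows with weight size, nudging training toward smaller weights
print(round(l2_regularized_loss(0.5, [1.0, -2.0]), 2))    # 0.55
print(round(l2_regularized_loss(0.5, [10.0, -20.0]), 2))  # 5.5
```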

Football Applications
Expected Goals (xG)

Predict probability of goal from shot features: distance, angle, body part, assist type, defender positions.

Architecture: MLP [10, 64, 32, 1] + sigmoid | Loss: Binary cross-entropy
Player Trajectory Prediction

Given past positions, predict where player will be in 1-5 seconds. Used for tactical analysis.

Architecture: LSTM or Transformer | Loss: MSE on coordinates
Match Outcome Prediction

Predict home win, draw, or away win probability based on team form, head-to-head history, and player availability.

Architecture: MLP with softmax output (3 classes) | Loss: Categorical cross-entropy
Summary & What's Next
What You Learned
  • ✓ Neurons compute weighted sums + activation
  • ✓ Weights and biases are learned from data
  • ✓ Activation functions add non-linearity
  • ✓ Layers stack to form deep networks
  • ✓ Loss functions measure prediction error
  • ✓ Backprop finds how weights affect error
  • ✓ Gradient descent updates weights to reduce error
Coming Next in This Series
  • 2. Convolutional Neural Networks (CNNs)
  • 3. Recurrent Neural Networks (RNNs & LSTMs)
  • 4. Graph Neural Networks (GNNs)
  • 5. Spatiotemporal GNNs for Football
Key Takeaway

Neural networks are just functions with learnable parameters. Training adjusts these parameters to minimize prediction error. The magic comes from stacking simple operations (weighted sums + activations) into deep architectures that can learn incredibly complex patterns — like predicting the probability of a goal from dozens of contextual features.