Case Study

Understanding the SkillCorner STGNN: A Deep Dive

Why every design decision matters — from coordinate flipping to physics-informed loss functions.

~40 min read · Original work by Zach Cochran

Attribution: This analysis is based on Zach Cochran's submission to the SkillCorner Analytics Cup. Rather than just showing his code, we'll explain why each design decision was made and what would happen if you did it differently.

The Question: Where Should a Player Run?

Imagine you're a striker making an off-ball run. Your team has possession in midfield. Should you run behind the defense? Come short for a pass? Drift wide to stretch the defense?

This decision depends on everything happening simultaneously:

Spatial Context
  • Where are the defenders?
  • Where is the ball?
  • Where are the passing lanes?
  • Where is space opening up?

Temporal Context
  • How fast is the attack developing?
  • Are defenders recovering?
  • Is the ball moving quickly?
  • What happened 1 second ago?

Relational Context
  • Who is marking me?
  • Can my teammate see me?
  • Are other teammates running too?
  • Is the defense organized?

This is why we need Spatiotemporal Graph Neural Networks — they can model all three dimensions simultaneously.

The Data: What SkillCorner Provides

Before diving into the model, let's understand what data we're working with. SkillCorner's open dataset contains tracking data and event annotations for several matches.

Tracking Data (10 FPS)

Every 0.1 seconds, we get the (x, y) position of:

  • All 22 players on the pitch
  • The ball (x, y, z)
  • Ball state (in play, dead ball, etc.)

~6,000 frames per half = ~12,000 frames per match

Run Event Annotations

For each off-ball run, SkillCorner provides:

  • Run type (behind, overlap, support, etc.)
  • Start/end frames and positions
  • Whether the run was targeted or received
  • Whether the possession led to a shot/goal
  • xThreat and xPass completion probability

Critical: Coordinate Normalization

Raw tracking data has the home team attacking one way in the first half, and the opposite way in the second half. The away team is reversed.

Zach's solution: Transform all data so every attacking team moves left → right. This means:

  • Use kloppy with the static_home_away orientation
  • Flip (x, y) coordinates for away-team runs: x = -x, y = -y

Without this, "run toward goal" would mean +x for some runs and -x for others!
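
The flip amounts to a 180° rotation about the pitch centre. A minimal sketch, assuming pitch-centred coordinates with (0, 0) at the centre spot; `flip_to_left_right` is an illustrative helper, not the repo's actual function:

```python
import numpy as np

def flip_to_left_right(xy: np.ndarray, attacking_right: bool) -> np.ndarray:
    """Mirror (x, y) through the pitch centre so the attacking team
    always moves left -> right. Assumes pitch-centred coordinates
    (hypothetical convention, not necessarily the repo's)."""
    if attacking_right:
        return xy          # already attacking left -> right
    return -xy             # x = -x, y = -y: a 180-degree rotation

# An away-team run recorded while attacking right -> left:
run = np.array([[-10.0, 5.0], [-20.0, 8.0]])
flipped = flip_to_left_right(run, attacking_right=False)
print(flipped)  # [[10. -5.] [20. -8.]]
```

After this transform, "+x" always means "toward the goal being attacked", for every run in the dataset.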

Why Extract ±2 Seconds Around Each Run?

A run doesn't happen in isolation. The 2 seconds before show the buildup — where defenders were positioned, how the attack developed. The 2 seconds after show the outcome — did the run create space? Did it lead to a pass?

At 10 FPS, this means ±20 frames of context around each run event.
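
The windowing arithmetic is simple; a small sketch (the function name is hypothetical):

```python
FPS = 10
CONTEXT_S = 2.0  # seconds of context on each side of the run

def window_frames(run_start_frame: int, run_end_frame: int) -> range:
    """Frame ids covering [start - 2s, end + 2s] at 10 FPS."""
    pad = int(CONTEXT_S * FPS)  # 20 frames each side
    return range(run_start_frame - pad, run_end_frame + pad + 1)

frames = window_frames(100, 130)
print(frames[0], frames[-1], len(frames))  # 80 150 71
```

A run spanning frames 100–130 (31 frames) thus yields a 71-frame clip: 31 run frames plus 20 frames of context on each side.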

Design Decision #1: Why Represent Players as a Graph?

What if we didn't use graphs?

❌ Flat Vector Approach

Concatenate all 22 player positions into one vector: [x₁, y₁, x₂, y₂, ...]

Problem: Player ordering is arbitrary. Swapping player 3 and player 7 in your vector would look like a completely different situation to the model, even if nothing changed on the pitch.

❌ Image/CNN Approach

Render players as pixels on a 2D grid and use convolutional neural networks.

Problem: Loses precise player identities. A CNN sees "something is in this region" but can't distinguish "Player A is here, and he's on my team, and he's sprinting at 8 m/s".

Why graphs are the right abstraction

In a graph, each player is a node with their own features (position, velocity, team). The relationships between players are edges.

  • Permutation invariant: node order doesn't matter
  • Rich node features: each player keeps their identity
  • Edge features: relationships are encoded explicitly

Zach's Graph Design

Nodes (23 total)
  • 22 players — each with position, velocity, acceleration, and team
  • 1 ball — treated as a special node with its own features
  • Runner flag — a node feature (not an extra node) marking which player is making the run we're predicting
Edges (506 total)
  • Fully connected — every node connects to every other node
  • No self-loops — a player doesn't connect to themselves
  • Why? Let the model learn which connections matter via attention

Key Insight: Let the Model Learn Distance

You might think "only connect nearby players" is smarter. But a 60-meter diagonal pass to a far player does happen. By using a fully connected graph with distance as an edge feature, the model learns that far connections should usually get low attention — but can still use them when relevant.
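
Building the fully connected, directed edge list is a one-line comprehension, and it confirms the 506-edge count (23 × 22):

```python
N = 23  # 22 players + the ball
# Fully connected, directed, no self-loops: N * (N - 1) = 506 edges
edges = [(i, j) for i in range(N) for j in range(N) if i != j]
print(len(edges))  # 506

# In PyTorch Geometric this becomes the [2, 506] edge_index tensor:
# edge_index = torch.tensor(list(zip(*edges)), dtype=torch.long)
```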

Design Decision #2: What Should Each Node Know?

Each node needs features that capture both what a player is and what they're doing. Here's what Zach included and why:

Feature — What It Captures — Why It Matters
  • x, y — position on the pitch (meters) — where is this player right now?
  • dx, dy — velocity components (m/s) — which direction are they moving, and how fast?
  • speed — velocity magnitude — overall quickness: are they sprinting or jogging?
  • direction — heading angle (radians) — which way are they facing/moving?
  • acceleration — change in speed — are they speeding up (attacking) or slowing down (recovering)?
  • acc_direction — acceleration heading — is the acceleration along their movement direction, or a change of direction?
  • is_runner — binary flag — which player's trajectory are we predicting?
  • is_ball — binary flag — is this the ball node? (The ball has position/velocity but no team.)

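
Assembling the 10-dimensional node vector from raw kinematics might look like this; `node_features` is an illustrative helper, not the repo's API:

```python
import math

def node_features(x, y, dx, dy, ax, ay, is_runner, is_ball):
    """Build the 10-dim node feature vector described above from
    position (x, y), velocity (dx, dy) and acceleration (ax, ay)."""
    speed = math.hypot(dx, dy)            # velocity magnitude
    direction = math.atan2(dy, dx)        # heading angle, radians
    acceleration = math.hypot(ax, ay)     # acceleration magnitude
    acc_direction = math.atan2(ay, ax)    # acceleration heading
    return [x, y, dx, dy, speed, direction,
            acceleration, acc_direction, float(is_runner), float(is_ball)]

feats = node_features(30.0, 15.0, 3.0, 4.0, 0.5, 0.0, True, False)
print(len(feats), feats[4])  # 10 features; speed = 5.0
```
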
How Velocity Is Computed

Raw tracking data only gives positions. Zach computes kinematics from frame differences:

# At 10 FPS, each frame is 0.1 seconds
velocity_x = (x[t] - x[t-1]) / 0.1   # m/s
velocity_y = (y[t] - y[t-1]) / 0.1
speed = math.hypot(velocity_x, velocity_y)      # = √(vx² + vy²)
direction = math.atan2(velocity_y, velocity_x)  # radians

# Acceleration is the change in velocity
accel_x = (velocity_x[t] - velocity_x[t-1]) / 0.1
accel_y = (velocity_y[t] - velocity_y[t-1]) / 0.1

Why Include Velocity and Acceleration?

Position alone is a snapshot. But football is about motion. A defender standing at position (30, 15) is very different from a defender sprinting through position (30, 15) at 9 m/s toward the ball.

Velocity tells the model intent. Acceleration tells it effort. Together, they let the model understand: "this defender is recovering hard" vs "this defender is ball-watching."

Why is the ball a node, not just a feature?

The ball has its own trajectory independent of any player. By making it a node:

  • Every player can "attend" to the ball through graph attention
  • The ball's velocity (is it a long ball? a short pass?) informs all players
  • Edge features can encode "is this an edge to/from the ball?"

Design Decision #3: What Should Edges Know?

The connection between two players carries crucial information. Zach encodes four features on each edge:

distance

Euclidean distance between the two players. This is the most important edge feature — nearby players influence each other more.

dist = √[(xⱼ - xᵢ)² + (yⱼ - yᵢ)²]
same_team

Binary: are these two players teammates? This lets the model learn different interaction patterns for teammates (passing) vs opponents (marking).

same_team = (teamᵢ == teamⱼ) && (teamᵢ ≠ ball)
ball_edge

Binary: is either node the ball? Player-to-ball edges are fundamentally different from player-to-player edges.

ball_edge = is_ballᵢ || is_ballⱼ
rel_speed

Relative velocity projected onto the edge direction. Are they moving toward each other or away?

rel_speed = (vⱼ - vᵢ) · d̂ᵢⱼ

Why Relative Speed Matters

If a defender and attacker are both at distance 5m, but the defender is closing the gap at 3 m/s, that's very different from if they're maintaining distance. Relative speed captures: "is this gap about to close or open?"
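
All four edge features can be computed from positions, velocities, and team labels. A sketch with a hypothetical helper, using the (vⱼ − vᵢ)·d̂ᵢⱼ convention above, under which a negative value means the gap is closing:

```python
import math

def edge_features(pi, pj, vi, vj, team_i, team_j):
    """Four features for the directed edge i -> j (illustrative helper).
    pi/pj and vi/vj are (x, y) tuples; team is 'home', 'away' or 'ball'."""
    dx, dy = pj[0] - pi[0], pj[1] - pi[1]
    dist = math.hypot(dx, dy)
    same_team = float(team_i == team_j and team_i != "ball")
    ball_edge = float(team_i == "ball" or team_j == "ball")
    # Relative velocity projected onto the unit vector from i to j;
    # negative rel_speed = the two nodes are converging.
    ux, uy = (dx / dist, dy / dist) if dist > 0 else (0.0, 0.0)
    rel_speed = (vj[0] - vi[0]) * ux + (vj[1] - vi[1]) * uy
    return [dist, same_team, ball_edge, rel_speed]

# Opponent 5 m away, with node i closing the gap at 3 m/s:
f = edge_features((0, 0), (5, 0), (3, 0), (0, 0), "home", "away")
print(f)  # [5.0, 0.0, 0.0, -3.0]
```
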

Design Decision #4: Why TransformerConv Over GAT?

There are many GNN layer types. Zach chose TransformerConv from PyTorch Geometric. Here's why:

GCN

Aggregates neighbors with fixed weights based on node degree.

❌ Can't use edge features at all

GAT

Learns attention weights based on node features only.

⚠️ Ignores edge features like distance

TransformerConv

Attention depends on node AND edge features.

✓ Distance affects attention directly

TransformerConv Attention Formula
αᵢⱼ = softmax((Wq·hᵢ)ᵀ · (Wk·hⱼ + Wₑ·eᵢⱼ) / √d)
↑ Edge features are added to keys

The term Wₑ·eᵢⱼ means the edge features (distance, same_team, etc.) directly influence how much attention player i pays to player j. A far-away opponent drifting further away will typically get less attention than a nearby teammate closing the gap (under the rel_speed convention above, a closing gap gives a negative value).

Architecture Details
  • 2 TransformerConv layers — enough to aggregate 2-hop neighborhood information
  • 4 attention heads — different heads can learn different interaction types
  • concat=False — average the heads instead of concatenating
  • 10% dropout — regularization to prevent overfitting
Design Decision #5: Modeling Time with Transformers

After the GNN processes each frame, we have a sequence of runner embeddings — one per frame. Now we need to model how the run evolves over time.

Why Not LSTM?

LSTMs process sequences step by step. Frame 50 can only access frame 49's hidden state, which summarizes frames 1-49. Information from early frames gets diluted.

Also: LSTMs are hard to parallelize during training.

Why Transformer Encoder?

Transformers use self-attention — every frame can directly attend to every other frame. Frame 50 can look directly at frame 5 if relevant.

Parallelizes perfectly: all attention computed at once.

Critical: Causal Masking

Without masking, a Transformer can "cheat" by looking at future frames when predicting the current position. But in reality, a player at frame 10 doesn't know what will happen at frame 20.

# Causal mask: frame t can only attend to frames 0...t
causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()

# Example for T=4:
# [[0, 1, 1, 1],   <- frame 0 sees only itself
#  [0, 0, 1, 1],   <- frame 1 sees 0, 1
#  [0, 0, 0, 1],   <- frame 2 sees 0, 1, 2
#  [0, 0, 0, 0]]   <- frame 3 sees all
Learned Positional Encoding

Transformers have no built-in sense of order. We add learned positional embeddings so the model knows "this is frame 5" vs "this is frame 50".

Why learned instead of sinusoidal? Football runs have specific temporal patterns (acceleration phase, steady state, deceleration) that learned embeddings can capture.
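
The temporal stage, combining learned positional embeddings with the causal mask, can be sketched in plain PyTorch (the class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Learned positional embeddings + causally-masked TransformerEncoder."""
    def __init__(self, hidden=64, max_len=100, heads=4, layers=2):
        super().__init__()
        self.pos = nn.Embedding(max_len, hidden)  # learned, not sinusoidal
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, layers)

    def forward(self, seq):                      # seq: [B, T, hidden]
        T = seq.size(1)
        seq = seq + self.pos(torch.arange(T, device=seq.device))
        # Upper-triangular mask: frame t may only attend to frames 0..t
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.enc(seq, mask=mask)

out = TemporalEncoder()(torch.randn(2, 40, 64))  # batch of 2, 40 frames
print(out.shape)  # torch.Size([2, 40, 64])
```
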

Design Decision #6: Physics-Informed Loss

The naive approach: minimize mean squared error between predicted and actual positions. But this can produce physically impossible trajectories.

What Goes Wrong with Position-Only Loss?

Imagine the model predicts: frame 1 at (10, 10), frame 2 at (10, 15), frame 3 at (10, 10). That's 5 meters in 0.1 seconds = 50 m/s = 180 km/h. Physically impossible — Usain Bolt peaks at ~12 m/s.

The model might find these "shortcuts" because they minimize position error, even though no human could actually run that way.

Zach's Solution: Multi-Component Loss

ℒ = ℒ_pos + 1.0·ℒ_vel + 0.1·ℒ_acc + 0.5·ℒ_speed
  • ℒ_pos — position MSE
  • ℒ_vel — velocity MSE
  • ℒ_acc — acceleration MSE
  • ℒ_speed — penalty on predicted speeds above 9.5 m/s

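
A sketch of how the four components might be combined. The weights follow the formula above; implementing the speed term as a squared hinge above 9.5 m/s is an assumption about its exact form:

```python
import torch
import torch.nn.functional as F

FPS, MAX_SPEED = 10.0, 9.5  # 10 FPS tracking; 9.5 m/s speed ceiling

def physics_loss(pred, target):
    """Multi-component trajectory loss (sketch).
    pred/target: [B, T, 2] (x, y) trajectories."""
    def diff(x):  # finite differences scaled to per-second units
        return (x[:, 1:] - x[:, :-1]) * FPS
    v_pred, v_true = diff(pred), diff(target)      # velocities
    a_pred, a_true = diff(v_pred), diff(v_true)    # accelerations
    l_pos = F.mse_loss(pred, target)
    l_vel = F.mse_loss(v_pred, v_true)
    l_acc = F.mse_loss(a_pred, a_true)
    # Penalize only the portion of predicted speed above the ceiling
    l_speed = F.relu(v_pred.norm(dim=-1) - MAX_SPEED).pow(2).mean()
    return l_pos + 1.0 * l_vel + 0.1 * l_acc + 0.5 * l_speed

loss = physics_loss(torch.zeros(1, 5, 2), torch.zeros(1, 5, 2))
print(loss.item())  # 0.0 for a perfect, stationary prediction
```

Because velocity and acceleration are derived from position by finite differences, the impossible zig-zag trajectories described above incur large velocity, acceleration, and speed penalties even when their average position error is small.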
Outcome-Based Sample Weighting

Not all runs are equally valuable. Runs that led to shots or goals are exactly the patterns we want the model to learn.

sample_weight = 1.0                    # base
             + 2.0 * shot_label       # +2 if led to shot
             + 3.0 * goal_label       # +3 if led to goal

# Possible weights: 1.0, 3.0, 4.0, or 6.0
Putting It All Together: The Full Pipeline

Here's how data flows through the entire system, from raw tracking to predicted trajectory:

1. Data Ingestion: load tracking + events from SkillCorner; flip coordinates so all attacks go left→right.
2. Graph Construction: for each frame, 23 nodes (22 players + ball) and 506 fully connected edges, with 10 node features and 4 edge features.
3. Spatial Encoding: two TransformerConv layers with 4 attention heads; each player's embedding now encodes their spatial context.
4. Runner Extraction: extract just the runner's embedding from each frame and stack into a sequence of shape [T, hidden_dim].
5. Temporal Encoding: add learned positional embeddings, then pass through a 2-layer Transformer encoder with causal masking.
6. Position Prediction: a linear layer outputs (x, y) for each frame; train with the physics-informed multi-component loss.

Key Takeaways

Graphs Respect Football Structure

Players as nodes, relationships as edges — this matches how football actually works, unlike flat vectors or images.

Edge Features Are Critical

TransformerConv lets distance and team relationships directly influence attention weights — GAT can't do this.

Causal Masking Prevents Cheating

The temporal transformer can only see past frames, matching real-world decision-making constraints.

Physics Constraints Matter

Without velocity/acceleration loss and speed penalties, the model generates impossible trajectories.

Coordinate Normalization Is Essential

All attacks must go left→right, or the model sees 'run toward goal' as two different things.

Weight High-Value Runs

Runs that led to shots/goals get higher loss weight, biasing the model toward learning effective patterns.

Try It Yourself
# 1. Install dependencies
pip install pandas numpy matplotlib torch torch-geometric kloppy tqdm

# 2. Clone Zach's repo
git clone https://github.com/zcochran4275/skill_corner_analytics_cup_tracking_data_research
cd skill_corner_analytics_cup_tracking_data_research

# 3. Load data from a match
from scripts.get_data import collect_data_from_matches
matches = [1886347]  # SkillCorner open data match ID
possessions, run_features, tracking, player_to_team, merged = \
    collect_data_from_matches(matches)

# 4. Build a graph from a single frame
from scripts.model_building import build_graph_from_frame
run = merged.iloc[0]
frame_df = tracking[
    (tracking.run_id == run.event_id) & 
    (tracking.frame_id == run.frame_start)
]
graph = build_graph_from_frame(frame_df, run.player_id, player_to_team)
print(graph)  # Data(x=[23, 10], edge_index=[2, 506], ...)

# 5. Visualize the spatial graph
from scripts.visualization_tools import plot_spatial_graph
plot_spatial_graph(graph, title="Single Frame Graph")

# 6. Train the model
from scripts.model_building import TemporalRunnerGNN, TemporalRunnerDataset, train_model
from torch.utils.data import DataLoader

model = TemporalRunnerGNN(node_feat_dim=10, edge_dim=4, gnn_hidden_dim=64)
# ... create dataset and dataloader ...
model = train_model(model, device, dataloader, num_epochs=10)
What's Next?

This implementation is a strong foundation. Here are some directions to explore:

  • Multi-agent prediction: Predict trajectories for multiple players simultaneously
  • Counterfactual runs: "What if the player had run behind instead of coming short?"
  • Defensive modeling: Same architecture, but predict optimal defensive positioning
  • Real-time inference: Deploy for live match analysis