Understanding the SkillCorner STGNN: A Deep Dive
Why every design decision matters — from coordinate flipping to physics-informed loss functions.
Attribution: This analysis is based on Zach Cochran's submission to the SkillCorner Analytics Cup. Rather than just showing his code, we'll explain why each design decision was made and what would happen if you did it differently.
Imagine you're a striker making an off-ball run. Your team has possession in midfield. Should you run behind the defense? Come short for a pass? Drift wide to stretch the defense?
This decision depends on everything happening simultaneously:
- • Where are the defenders?
- • Where is the ball?
- • Where are passing lanes?
- • Where is space opening up?
- • How fast is the attack developing?
- • Are defenders recovering?
- • Is the ball moving quickly?
- • What happened 1 second ago?
- • Who is marking me?
- • Can my teammate see me?
- • Are other teammates running too?
- • Is the defense organized?
This is why we need Spatiotemporal Graph Neural Networks (STGNNs): they can model the spatial, relational, and temporal dimensions of the game simultaneously.
Before diving into the model, let's understand what data we're working with. SkillCorner's open dataset contains tracking data and event annotations for several matches.
Every 0.1 seconds, we get the (x, y) position of:
- • All 22 players on the pitch
- • The ball (x, y, z)
- • Ball state (in play, dead ball, etc.)
At 10 frames per second, that's roughly 27,000 frames per 45-minute half (over 50,000 per match).
For each off-ball run, SkillCorner provides:
- • Run type (behind, overlap, support, etc.)
- • Start/end frames and positions
- • Whether the run was targeted or received
- • Whether the possession led to shot/goal
- • xThreat and xPass completion probability
Critical: Coordinate Normalization
Raw tracking data has the home team attacking one way in the first half, and the opposite way in the second half. The away team is reversed.
Zach's solution: Transform all data so every attacking team moves left → right. This means:
- • Use kloppy with the static_home_away orientation
- • Flip (x, y) coordinates for away-team runs: x = -x, y = -y
Without this, "run toward goal" would mean +x for some runs and -x for others!
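As a minimal sketch (the column names and the `attacking_left` flag are illustrative, not Zach's actual API), the flip is a 180° rotation about the pitch center:

```python
import pandas as pd

def normalize_left_to_right(frames: pd.DataFrame, attacking_left: bool) -> pd.DataFrame:
    """Mirror coordinates so the attacking team always moves left -> right.

    Assumes pitch-centered coordinates in meters in columns 'x' and 'y';
    `attacking_left` says whether this team's raw data attacks toward -x.
    (Illustrative sketch, not the repo's actual function.)
    """
    out = frames.copy()
    if attacking_left:
        # 180-degree rotation about the pitch center: x -> -x, y -> -y
        out["x"] = -out["x"]
        out["y"] = -out["y"]
    return out

df = pd.DataFrame({"x": [-30.0, 10.0], "y": [5.0, -2.0]})
flipped = normalize_left_to_right(df, attacking_left=True)
print(flipped["x"].tolist())  # [30.0, -10.0]
```

The same rotation must also be applied to velocity components if they were computed before the flip.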
A run doesn't happen in isolation. The 2 seconds before show the buildup — where defenders were positioned, how the attack developed. The 2 seconds after show the outcome — did the run create space? Did it lead to a pass?
At 10 FPS, this means ±20 frames of context around each run event.
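Extracting that window is straightforward; a sketch assuming a tracking DataFrame with a `frame_id` column (names are illustrative):

```python
import pandas as pd

FPS = 10          # SkillCorner tracking rate
CONTEXT_S = 2.0   # seconds of context on each side
PAD = int(FPS * CONTEXT_S)  # = 20 frames

def run_window(tracking: pd.DataFrame, start_frame: int, end_frame: int) -> pd.DataFrame:
    """Frames from 2s before the run starts to 2s after it ends (inclusive)."""
    lo, hi = start_frame - PAD, end_frame + PAD
    return tracking[tracking["frame_id"].between(lo, hi)]
```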
What if we didn't use graphs?
Attempt 1: Flat vectors. Concatenate all 22 player positions into one vector: [x₁, y₁, x₂, y₂, ...]
Problem: Player ordering is arbitrary. Swapping player 3 and player 7 in your vector would look like a completely different situation to the model, even if nothing changed on the pitch.
Attempt 2: Pitch images. Render players as pixels on a 2D grid and use convolutional neural networks.
Problem: Loses precise player identities. A CNN sees "something is in this region" but can't distinguish "Player A is here, he's on my team, and he's sprinting at 8 m/s".
Why graphs are the right abstraction
In a graph, each player is a node with their own features (position, velocity, team). The relationships between players are edges.
Zach's Graph Design
- • 22 players — each with position, velocity, acceleration, and team
- • 1 ball — treated as a special node with its own features
- • Runner flag — marks which player is making the run we're predicting
- • Fully connected — every node connects to every other node
- • No self-loops — a player doesn't connect to itself
- • Why? Let the model learn which connections matter via attention
Key Insight: Let the Model Learn Distance
You might think "only connect nearby players" is smarter. But a 60-meter diagonal pass to a far player does happen. By using a fully connected graph with distance as an edge feature, the model learns that far connections should usually get low attention — but can still use them when relevant.
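Building that fully connected, self-loop-free edge list takes a few lines of PyTorch (a sketch; PyTorch Geometric also ships utilities for this):

```python
import torch

def fully_connected_edge_index(num_nodes: int) -> torch.Tensor:
    """All directed edges i -> j with i != j, shaped [2, N*(N-1)]."""
    idx = torch.arange(num_nodes)
    src = idx.repeat_interleave(num_nodes)  # [0,0,...,0, 1,1,...,1, ...]
    dst = idx.repeat(num_nodes)             # [0,1,...,N-1, 0,1,...,N-1, ...]
    mask = src != dst                       # drop self-loops
    return torch.stack([src[mask], dst[mask]])

edge_index = fully_connected_edge_index(23)  # 22 players + ball
print(edge_index.shape)  # torch.Size([2, 506])  (23 * 22 = 506 directed edges)
```

The 506 directed edges match the `edge_index=[2, 506]` shape printed in the quickstart below.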
Each node needs features that capture both what a player is and what they're doing. Here's what Zach included and why:
| Feature | What It Captures | Why It Matters |
|---|---|---|
| x, y | Position on pitch (meters) | Where is this player right now? |
| dx, dy | Velocity components (m/s) | Which direction are they moving? How fast? |
| speed | Velocity magnitude | Overall quickness — are they sprinting or jogging? |
| direction | Heading angle (radians) | Which way are they facing/moving? |
| acceleration | Change in speed | Are they speeding up (attacking) or slowing down (recovering)? |
| acc_direction | Acceleration heading | Is the acceleration in their movement direction or a direction change? |
| is_runner | Binary flag | Which player's trajectory are we predicting? |
| is_ball | Binary flag | Is this the ball node? (Ball has position/velocity but no "team") |
Raw tracking data only gives positions. Zach computes kinematics from frame differences:
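The exact computation isn't reproduced here, but finite differences over consecutive frames look roughly like this (a sketch, assuming per-frame positions in meters at 10 FPS):

```python
import numpy as np

DT = 0.1  # seconds between frames (10 FPS)

def kinematics(x: np.ndarray, y: np.ndarray):
    """Finite-difference kinematics from per-frame positions.
    A sketch of the idea, not Zach's exact implementation."""
    dx = np.gradient(x, DT)          # velocity components (m/s)
    dy = np.gradient(y, DT)
    speed = np.hypot(dx, dy)         # velocity magnitude (m/s)
    direction = np.arctan2(dy, dx)   # heading angle (radians)
    acc = np.gradient(speed, DT)     # change in speed (m/s^2)
    return dx, dy, speed, direction, acc
```

In practice the raw differences are often smoothed (e.g. with a rolling mean) before differentiating again, since tracking noise is amplified by each derivative.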
Why Include Velocity and Acceleration?
Position alone is a snapshot. But football is about motion. A defender standing at position (30, 15) is very different from a defender sprinting through position (30, 15) at 9 m/s toward the ball.
Velocity tells the model intent. Acceleration tells it effort. Together, they let the model understand: "this defender is recovering hard" vs "this defender is ball-watching."
The ball has its own trajectory independent of any player. By making it a node:
- • Every player can "attend" to the ball through graph attention
- • The ball's velocity (is it a long ball? short pass?) informs all players
- • Edge features can encode "is this an edge to/from the ball?"
The connection between two players carries crucial information. Zach encodes four features on each edge:
- • Distance: Euclidean distance between the two players. This is the most important edge feature — nearby players influence each other more.
- • Same team: Binary — are these two players teammates? This lets the model learn different interaction patterns for teammates (passing) vs opponents (marking).
- • Ball involvement: Binary — is either node the ball? Player-to-ball edges are fundamentally different from player-to-player edges.
- • Relative speed: Relative velocity projected onto the edge direction. Are they moving toward each other or away?
Why Relative Speed Matters
If a defender and attacker are both at distance 5m, but the defender is closing the gap at 3 m/s, that's very different from if they're maintaining distance. Relative speed captures: "is this gap about to close or open?"
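Putting the four features together for one directed edge (an illustrative sketch, not Zach's code; with this sign convention, positive relative speed means the gap is closing):

```python
import numpy as np

def edge_features(pi, pj, vi, vj, same_team: bool, involves_ball: bool) -> np.ndarray:
    """The four edge features for one directed edge i -> j.
    Positions/velocities are 2-vectors in meters and m/s (illustrative)."""
    diff = np.asarray(pj, float) - np.asarray(pi, float)
    dist = float(np.linalg.norm(diff))
    # Relative velocity projected onto the i -> j direction:
    # positive = i is closing the gap to j, negative = the gap is opening.
    unit = diff / dist if dist > 0 else np.zeros(2)
    rel_speed = float(np.dot(np.asarray(vi, float) - np.asarray(vj, float), unit))
    return np.array([dist, float(same_team), float(involves_ball), rel_speed])
```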
There are many GNN layer types. Zach chose TransformerConv from PyTorch Geometric. Here's why:
GCNConv: Aggregates neighbors with fixed weights based on node degree.
❌ Can't use edge features at all
GATConv: Learns attention weights based on node features only.
⚠️ Ignores edge features like distance
TransformerConv: Attention depends on node AND edge features.
✓ Distance affects attention directly
The term Wₑ·eᵢⱼ means the edge features (distance, same_team, etc.) directly influence how much attention player i pays to player j. A far-away opponent with negative relative speed (moving away) will get less attention than a nearby teammate with positive relative speed (approaching).
- • 2 TransformerConv layers — enough to aggregate 2-hop neighborhood information
- • 4 attention heads — different heads can learn different interaction types
- • concat=False — average the heads instead of concatenating
- • 10% dropout — regularization to prevent overfitting
After the GNN processes each frame, we have a sequence of runner embeddings — one per frame. Now we need to model how the run evolves over time.
LSTMs process sequences step by step. Frame 50 can only access frame 49's hidden state, which summarizes frames 1-49. Information from early frames gets diluted.
Also: LSTMs are hard to parallelize during training.
Transformers use self-attention — every frame can directly attend to every other frame. Frame 50 can look directly at frame 5 if relevant.
Parallelizes perfectly: all attention computed at once.
Without masking, a Transformer can "cheat" by looking at future frames when predicting the current position. But in reality, a player at frame 10 doesn't know what will happen at frame 20.
# Causal mask: frame t can only attend to frames 0...t
causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()

# Example for T=4 (1 = masked out):
# [[0, 1, 1, 1],   <- frame 0 sees only itself
#  [0, 0, 1, 1],   <- frame 1 sees 0, 1
#  [0, 0, 0, 1],   <- frame 2 sees 0, 1, 2
#  [0, 0, 0, 0]]   <- frame 3 sees all
Transformers have no built-in sense of order. We add learned positional embeddings so the model knows "this is frame 5" vs "this is frame 50".
Why learned instead of sinusoidal? Football runs have specific temporal patterns (acceleration phase, steady state, deceleration) that learned embeddings can capture.
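A sketch combining learned positional embeddings with the causal mask, using PyTorch's built-in encoder (class name and sizes are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Transformer over the per-frame runner embeddings (illustrative sketch)."""
    def __init__(self, d_model: int = 64, max_len: int = 200,
                 heads: int = 4, layers: int = 2):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)  # learned "this is frame t"
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:  # seq: [B, T, d_model]
        T = seq.size(1)
        seq = seq + self.pos(torch.arange(T, device=seq.device))
        # True entries are blocked: frame t attends to frames 0..t only.
        mask = torch.triu(torch.ones(T, T, device=seq.device), diagonal=1).bool()
        return self.encoder(seq, mask=mask)
```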
The naive approach: minimize mean squared error between predicted and actual positions. But this can produce physically impossible trajectories.
Imagine the model predicts: frame 1 at (10, 10), frame 2 at (10, 15), frame 3 at (10, 10). That's 5 meters in 0.1 seconds = 50 m/s = 180 km/h. Physically impossible — Usain Bolt peaks at ~12 m/s.
The model might find these "shortcuts" because they minimize position error, even though no human could actually run that way.
Zach's Solution: Multi-Component Loss
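Zach's exact loss terms aren't reproduced here, but a physics-informed loss along these lines (position error plus finite-difference velocity and acceleration terms, plus an over-speed penalty; the weights and the 12 m/s cap are illustrative) captures the idea:

```python
import torch
import torch.nn.functional as F

DT, MAX_SPEED = 0.1, 12.0  # frame interval (s), rough human sprint ceiling (m/s)

def physics_loss(pred, target, w_vel=1.0, w_acc=0.5, w_speed=1.0):
    """pred/target: [T, 2] trajectories in meters. Illustrative weights."""
    pos = F.mse_loss(pred, target)
    # Finite-difference velocities and accelerations must match too,
    # so the model can't hit positions via impossible zig-zags.
    v_p, v_t = torch.diff(pred, dim=0) / DT, torch.diff(target, dim=0) / DT
    vel = F.mse_loss(v_p, v_t)
    a_p, a_t = torch.diff(v_p, dim=0) / DT, torch.diff(v_t, dim=0) / DT
    acc = F.mse_loss(a_p, a_t)
    # Penalize any predicted speed beyond what a human can run.
    speed_pen = torch.relu(v_p.norm(dim=-1) - MAX_SPEED).mean()
    return pos + w_vel * vel + w_acc * acc + w_speed * speed_pen
```

The teleporting prediction from the example above (5 m in 0.1 s) would incur a large velocity error and a speed penalty even if its endpoints were accurate.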
Not all runs are equally valuable. Runs that led to shots or goals are exactly the patterns we want the model to learn.
sample_weight = (1.0                  # base
                 + 2.0 * shot_label   # +2 if led to shot
                 + 3.0 * goal_label)  # +3 if led to goal
# Possible weights: 1.0, 3.0, 4.0, or 6.0

Here's how data flows through the entire system, from raw tracking to predicted trajectory:
Graphs Respect Football Structure
Players as nodes, relationships as edges — this matches how football actually works, unlike flat vectors or images.
Edge Features Are Critical
TransformerConv lets distance and team relationships directly influence attention weights — GAT can't do this.
Causal Masking Prevents Cheating
The temporal transformer can only see past frames, matching real-world decision-making constraints.
Physics Constraints Matter
Without velocity/acceleration loss and speed penalties, the model generates impossible trajectories.
Coordinate Normalization Is Essential
All attacks must go left→right, or the model sees 'run toward goal' as two different things.
Weight High-Value Runs
Runs that led to shots/goals get higher loss weight, biasing the model toward learning effective patterns.
# 1. Install dependencies
pip install pandas numpy matplotlib torch torch-geometric kloppy tqdm
# 2. Clone Zach's repo
git clone https://github.com/zcochran4275/skill_corner_analytics_cup_tracking_data_research
cd skill_corner_analytics_cup_tracking_data_research
# 3. Load data from a match
from scripts.get_data import collect_data_from_matches
matches = [1886347] # SkillCorner open data match ID
possessions, run_features, tracking, player_to_team, merged = \
collect_data_from_matches(matches)
# 4. Build a graph from a single frame
from scripts.model_building import build_graph_from_frame
run = merged.iloc[0]
frame_df = tracking[
(tracking.run_id == run.event_id) &
(tracking.frame_id == run.frame_start)
]
graph = build_graph_from_frame(frame_df, run.player_id, player_to_team)
print(graph) # Data(x=[23, 10], edge_index=[2, 506], ...)
# 5. Visualize the spatial graph
from scripts.visualization_tools import plot_spatial_graph
plot_spatial_graph(graph, title="Single Frame Graph")
# 6. Train the model
from scripts.model_building import TemporalRunnerGNN, TemporalRunnerDataset, train_model
from torch.utils.data import DataLoader
model = TemporalRunnerGNN(node_feat_dim=10, edge_dim=4, gnn_hidden_dim=64)
# ... create dataset and dataloader ...
model = train_model(model, device, dataloader, num_epochs=10)

This implementation is a strong foundation. Here are some directions to explore:
- • Multi-agent prediction: Predict trajectories for multiple players simultaneously
- • Counterfactual runs: "What if the player had run behind instead of coming short?"
- • Defensive modeling: Same architecture, but predict optimal defensive positioning
- • Real-time inference: Deploy for live match analysis