Probaballer - Football Analytics & Betting Insights

Research Deep Dive · icSPORTS 2024

SportsNGEN: Sustained Generation of Realistic Multi-Player Sports Gameplay

A transformer-based simulation engine that generates realistic, sustained gameplay by learning from real player and ball tracking data — capable of simulating entire matches.

Sustained Simulation · Player Customization · Counterfactual Analysis · 35 min read

Authors: Lachlan Thorpe, Lewis Bawden, Karanjot Vendal (Hawk-Eye Innovations), John Bronskill, Richard E. Turner (University of Cambridge)

What's This Paper Actually About?

The simple version: Most trajectory prediction models can only predict a few seconds into the future before things go wrong. The errors compound, the ball starts flying unrealistically, and players move in ways that don't make sense.

SportsNGEN solves this by building a sustained simulation engine — one that can simulate an entire tennis rally or a long football passing sequence while maintaining realism. It learns how real players make decisions from millions of tracking data sequences, then uses that knowledge to generate new gameplay that looks and feels like the real thing.

Key innovation: Instead of just predicting where players will go, SportsNGEN captures the complete distribution of player decision-making. It knows that from a given position, a tennis player might hit cross-court 60% of the time and down-the-line 40% of the time — and it samples from these possibilities to create diverse, realistic simulations.

The Four Goals of Sports Simulation

What makes a good sports simulation engine?

The authors argue that a good sports simulation should satisfy four key requirements:

Realistic

Simulations must capture the complete distribution of real player behavior — not just average movements, but the full range of decisions players actually make.

Sustained

Simulations should run for the duration between natural breaks in gameplay — an entire rally in tennis, a full passage of play in football, not just a few seconds.

Customizable

The model should be fine-tunable to emulate specific players or teams. A simulation of Nadal should play like Nadal, not like a generic player.

Measurable

There must be quantitative metrics to evaluate simulation quality — not just "does it look good?" but measurable statistics that can be optimized.

Why These Matter

Previous work achieved some of these goals but not all together. Baller2vec could simulate realistic short trajectories but not sustained gameplay. RL-based approaches could sustain gameplay but didn't capture real player behavior distributions. SportsNGEN aims to achieve all four simultaneously.

The Key Insight: Treat Movement as Classification

Why discretizing space makes simulation more stable

Two Ways to Predict Movement

When predicting where a player or ball moves next, you have two fundamental choices:

❌ Regression: Predict Exact Coordinates

Output continuous (x, y, z) values directly. Seems natural but has problems:

• Small prediction errors accumulate over time
• Hard to bound outputs to physically possible values
• Model can predict any value, including impossible movements

✓ Classification: Predict Which Bin

Divide possible movement into discrete bins, predict which bin:

• Naturally bounds predictions to valid movement ranges
• Easier to learn — classification is well-understood
• Can sample from probability distribution over bins
• Forces physical constraints (max speed, etc.)

How the Grid Works

SportsNGEN divides the space of possible next positions into a grid of bins:

For players (2D grid):

61 × 61 = 3,721 possible movement bins. Each bin represents a small displacement from current position.

For the ball (3D grid):

61 × 61 × 61 = 226,981 possible movement bins. More bins because the ball moves in 3D and at higher speeds.

The model outputs a probability for each bin. To get the actual position, we sample a bin, then sample uniformly within that bin for the final coordinate.
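
The two-stage sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the row-major bin layout, and the `max_disp` parameter (the largest displacement covered by the grid along one axis) are all assumptions.

```python
import random

GRID = 61  # bins per axis, as stated in the text


def sample_displacement(probs, max_disp, rng):
    """Sample a 2D player displacement from a flat list of GRID*GRID probabilities.

    probs is assumed to be row-major (x index first, then y index).
    max_disp is a hypothetical per-step displacement limit along one axis.
    """
    bin_width = 2.0 * max_disp / GRID
    # Stage 1: pick a bin according to the model's probabilities.
    idx = rng.choices(range(GRID * GRID), weights=probs, k=1)[0]
    ix, iy = divmod(idx, GRID)
    # Stage 2: sample uniformly inside the chosen bin.
    dx = -max_disp + (ix + rng.random()) * bin_width
    dy = -max_disp + (iy + rng.random()) * bin_width
    return dx, dy
```

Because every output lies inside the grid, no single step can exceed `max_disp` along either axis, which is the bounding property the classification view relies on.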

Why This Helps Sustained Simulation

In regression, if the model predicts a velocity that's slightly too high, the next input is slightly wrong, which leads to a slightly worse prediction, and so on: the errors compound at every step. With classification, each step is forced into a bin inside the valid movement range, so a single bad prediction stays bounded and longer simulations become feasible.

Input Representation: Object Tokens

How players and ball are represented for the model

Each player and the ball at each timestep is represented as an "object token" containing rich information about their state:

Object Token Components

Position (p_x, p_y, p_z)

Current location. Players are 2D (on court/pitch), ball is 3D.

Velocity (v_x, v_y, v_z)

Current movement speed and direction.

Distance to Ball (d_x, d_y, d_z)

Relative position of ball — helps model player-ball interactions.

Player Identity (I)

A learned embedding vector that captures individual playing style.

Elapsed Time (e)

Time into the game/rally — models fatigue and strategy changes.
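
One way to picture an object token is as a small record that gets flattened into a feature vector before it is embedded. The sketch below is illustrative only: the class name, the field order, and the 20-dimensional identity vector are assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectToken:
    """State of one player or the ball at one timestep (hypothetical layout)."""
    position: Tuple[float, float, float]      # p_x, p_y, p_z (players: p_z = 0)
    velocity: Tuple[float, float, float]      # v_x, v_y, v_z
    dist_to_ball: Tuple[float, float, float]  # d_x, d_y, d_z
    identity: List[float]                     # learned identity embedding
    elapsed: float                            # time into the rally

    def to_vector(self) -> List[float]:
        # Concatenate all components into one flat feature vector.
        return [*self.position, *self.velocity, *self.dist_to_ball,
                *self.identity, self.elapsed]


player = ObjectToken(
    position=(1.0, -11.5, 0.0),
    velocity=(0.4, 0.0, 0.0),
    dist_to_ball=(2.0, 9.0, 1.1),
    identity=[0.0] * 20,  # assumed embedding size, for illustration
    elapsed=3.2,
)
features = player.to_vector()
```

With the assumed 20-dimensional identity vector, the flattened token has 3 + 3 + 3 + 20 + 1 = 30 features.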

Context Tokens

In addition to object tokens, SportsNGEN uses context tokens that provide match-level information:

Court/Surface Type

Hard, clay, or grass court affects ball bounce behavior significantly.

First/Second Serve

Players hit second serves slower and safer to avoid double faults.

Tournament

Different venues may have different court speeds and conditions.

Critical Detail: Ball Noise During Training

A crucial detail: the authors add small random noise (±25 mm in x, ±12.5 mm in y and z) to ball positions during training. Without this, the model only sees "perfect" ball trajectories and can't handle its own prediction errors during simulation. With noise, the model learns to correct small errors, which enables sustained rollouts.
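
A minimal sketch of this kind of augmentation is shown below. It assumes positions are in metres and that the noise is uniform and independent per axis; the source only gives the ± ranges, so the distribution and the function name are assumptions.

```python
import random

NOISE_X_M = 0.025    # ±25 mm in x
NOISE_YZ_M = 0.0125  # ±12.5 mm in y and z


def jitter_ball(pos, rng):
    """Return a ball position with small random noise added (training only)."""
    x, y, z = pos
    return (
        x + rng.uniform(-NOISE_X_M, NOISE_X_M),
        y + rng.uniform(-NOISE_YZ_M, NOISE_YZ_M),
        z + rng.uniform(-NOISE_YZ_M, NOISE_YZ_M),
    )


rng = random.Random(0)
noisy = jitter_ball((0.0, 5.0, 1.2), rng)
```

The jitter would be applied to the inputs the model sees during training, so that slightly-off ball positions at simulation time look familiar rather than out of distribution.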

Architecture: Extended Baller2vec

Building on prior work with key extensions

SportsNGEN builds on baller2vec, a transformer architecture designed for multi-agent spatiotemporal modeling, with several crucial extensions:

The Core Architecture

Transformer Decoder: 4 layers, 2048 embedding dimension, 8 attention heads

Input MLP: 3 layers (30 → 256 → 512 → 2048) to embed object tokens

Player Output Head: Linear layer → 61×61 = 3,721 bins

Ball Output Head: Linear layer → 61×61×61 = 226,981 bins
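
To get a feel for where the parameters sit, the layer sizes listed above can be turned into rough weight counts. The sketch assumes every layer is a plain dense layer with a bias term, which the source does not confirm, and it ignores the transformer layers themselves.

```python
EMBED_DIM = 2048
PLAYER_BINS = 61 ** 2  # 3,721
BALL_BINS = 61 ** 3    # 226,981


def linear_params(n_in, n_out, bias=True):
    """Weights (plus optional biases) of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


input_mlp = sum(linear_params(a, b)
                for a, b in [(30, 256), (256, 512), (512, 2048)])
player_head = linear_params(EMBED_DIM, PLAYER_BINS)
ball_head = linear_params(EMBED_DIM, BALL_BINS)

print(f"input MLP:   {input_mlp:,}")
print(f"player head: {player_head:,}")
print(f"ball head:   {ball_head:,}")
```

Under these assumptions the ball head alone comes to roughly 465 million parameters, far more than the input MLP (about 1.2 million) or the player head (about 7.6 million).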

Key Extension: Simultaneous Ball + Player Modeling

The original baller2vec modeled either the players or the ball, not both at once. SportsNGEN models both simultaneously, which is essential for realistic simulation because player and ball movements are tightly coupled.

Attention Mask

Each object token can attend to:

• All context tokens (match info, court type, etc.)
• All object tokens up to and including its own timestep
• NOT future timesteps (causal masking)

This means at each timestep, all players and the ball can be updated simultaneously, as each has access to the current state of all others.
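
The masking rules above can be written down as a boolean matrix. The sketch assumes a token layout of context tokens first, followed by object tokens grouped by timestep; it also assumes context tokens attend only to other context tokens, which the text does not specify.

```python
def build_attention_mask(n_context, n_objects, n_steps):
    """Return mask[i][j] == True when token i may attend to token j."""
    total = n_context + n_objects * n_steps

    def step_of(k):
        # Context tokens have no timestep; object tokens are grouped by step.
        return None if k < n_context else (k - n_context) // n_objects

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            si, sj = step_of(i), step_of(j)
            if sj is None:
                mask[i][j] = True      # every token sees the context tokens
            elif si is not None:
                mask[i][j] = sj <= si  # same or earlier timestep only
    return mask


# Tennis-sized toy example: 2 context tokens, 3 objects (2 players + ball), 2 steps.
mask = build_attention_mask(n_context=2, n_objects=3, n_steps=2)
```

Note that the mask is block-causal rather than strictly causal: tokens within the same timestep can see each other, which is what allows all entities to be updated together.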

Nucleus Sampling

A key insight: how you sample from the output distribution matters enormously. SportsNGEN uses nucleus sampling (top-p sampling) from NLP:

What is nucleus/top-p sampling?

Instead of sampling from ALL bins or just taking the most likely, nucleus sampling considers only the smallest set of bins whose cumulative probability exceeds a threshold p (e.g., 0.9).

Why does p matter?

• p too high (→1.0): Samples from the low-probability tail, creating unrealistic movements
• p too low (→0.1): Nearly always picks the most likely bin, losing diversity
• Sweet spot (0.8-0.9): Balances realism with variety
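
For readers unfamiliar with top-p sampling, here is a small stand-alone sketch over a list of bin probabilities. It is a generic implementation of the technique, not code from the paper, and the function names are made up for this example.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of most-likely bins whose total mass reaches p,
    zero out the rest, and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def nucleus_sample(probs, p, rng):
    """Draw one bin index from the top-p filtered distribution."""
    filtered = nucleus_filter(probs, p)
    return rng.choices(range(len(probs)), weights=filtered, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
wide = nucleus_filter(probs, 0.9)    # keeps the three most likely bins
narrow = nucleus_filter(probs, 0.1)  # keeps only the single most likely bin
```

With p = 0.9 the 5% tail bin is never sampled, while with p = 0.1 the sampler collapses to the single most likely bin, which mirrors the trade-off described in the bullets above.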

The Simulation Loop

How a complete tennis match is simulated

SportsNGEN combines the trajectory model with additional components to simulate entire matches:

Initialize from Historical Data

Sample initial conditions (serve position, player positions) from real historical matches between the specific players being simulated.

Autoregressive Rollout

Run the model step-by-step: predict next positions → update state → repeat. Uses a rolling window of T tokens as input context.

Detect Rally End

Simple logic checks: ball goes out of bounds, bounces twice on one side, gets stuck near the net, or passes a player without being returned.

Event Classification

A separate classifier model (same architecture, no causal mask) analyzes the rally to determine: shot types (groundstroke, volley), direction (cross-court, down-the-line), outcome (winner, error, continuation).

Update Match State

Based on who won the point, update the score. Determine who serves next, from which side. Initialize the next rally.
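
The rollout and rally-end steps can be sketched as a simple loop. Everything here is schematic: `step_fn` stands in for the trained model plus sampling, `rally_over` for the rule-based end-of-rally checks, and the `window` and `max_steps` values are arbitrary placeholders.

```python
from collections import deque


def simulate_rally(initial_states, step_fn, rally_over, window=4, max_steps=1000):
    """Roll a model forward one step at a time until the rally ends.

    step_fn(context) -> next state, given the most recent states.
    rally_over(state) -> True when the rally should stop.
    Both callables are hypothetical stand-ins for illustration.
    """
    context = deque(initial_states, maxlen=window)  # rolling input window
    trajectory = list(initial_states)
    for _ in range(max_steps):
        nxt = step_fn(list(context))
        context.append(nxt)   # oldest state drops out automatically
        trajectory.append(nxt)
        if rally_over(nxt):
            break
    return trajectory


# Toy usage: the "state" is just a number that grows by 1 each step.
toy = simulate_rally([0], step_fn=lambda ctx: ctx[-1] + 1,
                     rally_over=lambda s: s >= 5)
```

In a full match simulator, the returned trajectory would then be passed to the event classifier and the score-tracking logic before the next rally is initialized.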

Complete Match Simulation

By combining these components with score-tracking logic, SportsNGEN can simulate an entire best-of-3 tennis match — hundreds of rallies with realistic gameplay throughout. This is unprecedented for learned (non-physics-based) simulators.

Counterfactual Analysis: "What If?" Scenarios

Using simulation for coaching insights

One of the most powerful applications of SportsNGEN is counterfactual analysis — asking "what would have happened if the player made a different decision?"

Example: Evaluating Shot Choices

Step 1: Take a real rally at the moment a player is about to hit (the "branch point")

Step 2: The model's output shows two probability peaks — one for cross-court, one for down-the-line

Step 3: Force sampling from each mode separately, run 100 simulations for each choice

Step 4: Calculate win percentage for each shot choice
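
The four steps above amount to a Monte Carlo estimate per shot choice. In this sketch, `simulate_from` is a hypothetical callable that continues the rally from the branch point with the given choice forced and returns whether the player wins the point; the toy simulator and its probabilities are made up and do not come from the paper.

```python
import random


def estimate_win_rates(simulate_from, choices, n_sims=100, seed=0):
    """Run n_sims simulations per shot choice and return the win fraction."""
    rng = random.Random(seed)
    rates = {}
    for choice in choices:
        wins = sum(1 for _ in range(n_sims) if simulate_from(choice, rng))
        rates[choice] = wins / n_sims
    return rates


# Made-up stand-in for the real simulator, for illustration only.
TOY_WIN_PROB = {"cross_court": 0.6, "down_the_line": 0.4}


def toy_simulator(choice, rng):
    return rng.random() < TOY_WIN_PROB[choice]


rates = estimate_win_rates(toy_simulator, ["cross_court", "down_the_line"])
```

With only 100 simulations per choice the estimates carry noticeable sampling noise, so small differences between options should be read with caution.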

Real Example from Paper

In one analyzed rally, the player actually hit down the middle. Simulated win percentages for each option:

• Down the middle (actual): <50%
• Cross-court wide: 58%
• Straight wide: 58%

The counterfactual analysis reveals that pushing the opponent wider would have been a stronger tactical choice — valuable coaching insight.

Applications

• Coaching: Identify suboptimal shot selection patterns
• Broadcast: Show "what if" scenarios during replays
• Strategy: Test different tactical approaches against specific opponents
• Training: Help players understand decision-making consequences

Player-Specific Customization

Fine-tuning the model for individual players

SportsNGEN can be customized to emulate specific players through transfer learning:

Generic Model

First, train on all players with a single "generic player" identity vector. This model learns general tennis behavior patterns.

Fine-tuned Model

Then fine-tune on matches containing a specific player, learning a new player identity vector that captures their unique style.

How Much Data is Needed?

The paper shows metrics improving as fine-tuning data increases:

• Serve metrics (low variability) — converge quickly with few samples
• Groundstroke patterns (high variability) — need more data, continue improving up to 6,000+ sequences

This suggests that for players with limited data, serve behavior can be captured quickly, but rally patterns require more extensive match history.

Evaluation: Measuring Simulation Quality

How do you know if simulations are realistic?

Unlike trajectory prediction (which can measure error vs ground truth), simulation quality requires comparing statistical distributions — do simulated matches have similar statistics to real matches?

Physical Metrics

For each metric, the distribution in simulations is compared with the distribution in real data using the Wasserstein distance (a measure of how far apart two distributions are):

Toss Contact Height

Where the serve is hit

1st/2nd Serve Speed

Maximum serve velocities

Return Speed

Maximum return velocities

Groundstroke Speed

Shot speeds during rallies
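
For scalar metrics like these, the 1D Wasserstein distance between two samples can be computed as the area between their empirical CDFs. The sketch below is a generic implementation under that assumption; the exact variant used in the paper is not stated here, and the sample values are made up.

```python
def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between two empirical samples,
    computed as the integral of |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    dist, ia, ib = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance each pointer to count samples <= left.
        while ia < len(a) and a[ia] <= left:
            ia += 1
        while ib < len(b) and b[ib] <= left:
            ib += 1
        dist += abs(ia / len(a) - ib / len(b)) * (right - left)
    return dist


# Made-up serve speeds (km/h), purely to show the call.
real_speeds = [182.0, 190.5, 176.0, 201.0]
sim_speeds = [180.0, 188.0, 179.5, 198.0, 193.0]
gap = wasserstein_1d(real_speeds, sim_speeds)
```

A distance of zero means the two samples have identical empirical distributions, and the value is in the same units as the metric itself, which makes it easy to interpret.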

Statistical Metrics

Aggregate statistics compared between real and simulated matches:

First Serve %

% of 1st serves in

Double Fault %

% of 2nd serves out

1st/2nd Serve Win %

Points won on serve

Ace %

% of serves untouched

Service Points Won %

Overall serve success

Win Percentage Calibration

Perhaps the most important test: if SportsNGEN predicts a player has a 90% chance of winning from a certain game state, they should actually win ~90% of the time in real data.

Result: The model is well-calibrated — predicted win percentages closely match observed outcomes across the full range (0-100%).
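
A calibration check of this kind can be sketched by grouping predictions into probability bins and comparing the average prediction with the observed win rate in each bin. The function name and the choice of ten equal-width bins are assumptions made for illustration.

```python
def calibration_table(predicted, outcomes, n_bins=10):
    """Return (mean predicted probability, observed win rate, count)
    for each non-empty probability bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, won in zip(predicted, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b][0] += p
        sums[b][1] += int(won)
        sums[b][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2] > 0]


# Toy data: ten predictions of 0.9, nine of which were actual wins.
table = calibration_table([0.9] * 10, [True] * 9 + [False])
```

For a well-calibrated model, the first two numbers in each row are close to each other; plotting them against each other gives the usual reliability diagram.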

Key Experimental Results

Dataset

• ~15,000 professional tennis matches
• 7.6 million rally sequences
• Player and ball center-of-mass (COM) positions at 25 Hz
• Rich metadata: players, tournament, court type, shot labels, etc.

Top-p Sensitivity

The sampling parameter p has a significant impact:

p = 0.1

Too low: lacks variety, high rejection rate

p = 0.8-0.9

Sweet spot: realistic + diverse

p = 1.0

Too high: unrealistic movements, high rejection

Ablation: What Matters?

✓ Components That Help

• Velocity, distance-to-ball, elapsed time in tokens
• Ball noise during training (critical for stability)
• Context tokens for surface type
• Player identity vectors

Impact of Player ID Size

Increasing the player ID embedding dimension improves metrics up to a size of about 20, after which returns diminish. Larger embeddings can capture more nuanced player styles, but the benefit levels off beyond that point.

Surface Type Modeling

The model correctly learns that clay courts have slower bounces than hard/grass courts. The distribution of "bounce speed ratio" (speed after ÷ speed before bounce) differs by surface in simulations just as it does in real data — without being explicitly programmed.
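
The bounce speed ratio itself is a simple quantity. A minimal sketch, assuming the ball's velocity vectors just before and just after the bounce are already available (the example values are made up):

```python
import math


def speed(v):
    """Magnitude of a velocity vector."""
    return math.sqrt(sum(c * c for c in v))


def bounce_speed_ratio(v_before, v_after):
    """Speed after the bounce divided by speed before the bounce."""
    return speed(v_after) / speed(v_before)


ratio = bounce_speed_ratio((3.0, 0.0, -4.0), (1.5, 0.0, 2.0))
```

Collecting this ratio over many simulated and real bounces per surface gives the distributions that are compared.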

Football: Qualitative Demonstration

While the quantitative evaluation focuses on tennis, the authors also demonstrate SportsNGEN on football (soccer) with 23 entities (22 players + ball):

What Works

• Sustained passing sequences
• Players maintain reasonable formations
• Ball movement looks realistic
• Same architecture generalizes across sports

Challenges

• 23 entities vs 3 — much longer sequences
• Trade-off between sample rate and compute
• Harder to define natural break points
• More complex coordination patterns

Limitations

Out-of-Distribution

The model can't handle unconventional situations or unseen players well — it falls back to "generic" behavior. Novel tactics or unusual plays may not be captured.

Computational Cost

Training requires 2 days on an NVIDIA A100 GPU. The large number of output bins (226K for ball) makes the model expensive.

Sport-Specific Testing

While demonstrated on tennis and football, other sports may introduce unique challenges not addressed in this work.

No Physics Model

The model learns physics implicitly from data and has no explicit physical constraints, so it occasionally produces physically impossible trajectories.

How Does SportsNGEN Compare?

Aspect	RL Approaches	Baller2vec	SportsNGEN
Training approach	Learns from scratch via rewards	Learns from tracking data	Learns from tracking data
Captures real player behavior?	❌ No — learns own strategy	✓ Yes	✓ Yes
Sustained simulation?	✓ Yes	❌ Short only	✓ Yes (entire matches)
Player customization?	❌ Difficult	⚠️ Limited	✓ Via fine-tuning
Models ball + players together?	Varies	❌ Separate	✓ Simultaneous
Counterfactual analysis?	⚠️ Possible but different meaning	⚠️ Limited	✓ Natural application

Summary: What SportsNGEN Does and Doesn't Do

✓ What It DOES Well

• Sustained simulation: Entire tennis matches, long football sequences
• Realistic behavior: Captures full distribution of player decisions
• Customizable: Fine-tune for specific players
• Counterfactuals: Evaluate alternative shot choices
• Measurable: Clear metrics to optimize
• Multi-sport: Same architecture works for tennis and football
• Well-calibrated: Win predictions match reality

❌ What It DOESN'T Do

• Handle novelty: Unusual situations fall back to generic behavior
• Explicit physics: Learns physics implicitly, occasionally violates it
• Real-time: Computationally expensive, not designed for live use
• Unseen players: Needs data to customize to new players
• Cover all sports: Only demonstrated on tennis and football

Key Takeaways

Classification > Regression

Discretizing movement into bins with bounded ranges prevents error accumulation and enables sustained simulation.

Ball Noise is Critical

Adding small noise to ball positions during training teaches the model to correct errors, essential for long rollouts.

Top-p Matters

The sampling parameter dramatically affects realism. Too high = unrealistic, too low = no diversity. Sweet spot ~0.8-0.9.

Player ID Embeddings

Learned identity vectors capture individual playing styles, enabling player-specific simulations via fine-tuning.

Context Changes Behavior

Surface type, serve number, and tournament all affect simulated behavior appropriately without explicit programming.

Counterfactuals for Coaching

The ability to branch simulations at decision points enables 'what if' analysis for coaching and strategy.

Resources & Further Reading

Read the Paper

icSPORTS 2024

Tennis Demo

Simulated rally video

Football Demo

Passing sequence video