The simple version: Most trajectory prediction models can only predict a few seconds into the future before things go wrong. The errors compound, the ball starts flying unrealistically, and players move in ways that don't make sense.
SportsNGEN solves this by building a sustained simulation engine — one that can simulate an entire tennis rally or a long football passing sequence while maintaining realism. It learns how real players make decisions from millions of tracking data sequences, then uses that knowledge to generate new gameplay that looks and feels like the real thing.
Key innovation: Instead of just predicting where players will go, SportsNGEN captures the complete distribution of player decision-making. It knows that from a given position, a tennis player might hit cross-court 60% of the time and down-the-line 40% of the time — and it samples from these possibilities to create diverse, realistic simulations.
The authors argue that a good sports simulation should satisfy four key requirements:
Simulations must capture the complete distribution of real player behavior — not just average movements, but the full range of decisions players actually make.
Simulations should run for the duration between natural breaks in gameplay — an entire rally in tennis, a full passage of play in football, not just a few seconds.
The model should be fine-tunable to emulate specific players or teams. A simulation of Nadal should play like Nadal, not like a generic player.
There must be quantitative metrics to evaluate simulation quality — not just "does it look good?" but measurable statistics that can be optimized.
Previous work achieved some of these goals but not all together. Baller2vec could simulate realistic short trajectories but not sustained gameplay. RL-based approaches could sustain gameplay but didn't capture real player behavior distributions. SportsNGEN aims to achieve all four simultaneously.
Two Ways to Predict Movement
When predicting where a player or ball moves next, you have two fundamental choices:
Regression: output continuous (x, y, z) values directly. This seems natural but has problems:
- Small prediction errors accumulate over time
- Hard to bound outputs to physically possible values
- Model can predict any value, including impossible movements
Classification: divide possible movement into discrete bins and predict which bin:
- Naturally bounds predictions to valid movement ranges
- Easier to learn: classification is well understood
- Can sample from a probability distribution over bins
- Enforces physical constraints (max speed, etc.)
SportsNGEN divides the space of possible next positions into a grid of bins:
Players (2D): 61 × 61 = 3,721 possible movement bins. Each bin represents a small displacement from the current position.
Ball (3D): 61 × 61 × 61 = 226,981 possible movement bins. More bins are needed because the ball moves in 3D and at higher speeds.
The model outputs a probability for each bin. To get the actual position, we sample a bin, then sample uniformly within that bin for the final coordinate.
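The bin arithmetic above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the maximum per-axis displacement `MAX_STEP_M`, the row-major flat index layout, and all function names are assumptions made for the example. Only the 61 bins per axis and the "sample a bin, then sample uniformly inside it" step come from the text.

```python
import random

N_BINS = 61        # bins per axis, as stated in the text
MAX_STEP_M = 0.6   # ASSUMPTION: largest per-axis displacement per timestep (metres)
BIN_WIDTH = 2 * MAX_STEP_M / N_BINS


def displacement_to_bin(d):
    """Map a displacement to a bin index in 0..N_BINS-1, clipping out-of-range values."""
    clipped = max(-MAX_STEP_M, min(MAX_STEP_M, d))
    idx = int((clipped + MAX_STEP_M) / BIN_WIDTH)
    return min(idx, N_BINS - 1)  # the upper edge belongs to the last bin


def sample_within_bin(idx, rng):
    """Draw a displacement uniformly from the interval covered by bin idx."""
    low = -MAX_STEP_M + idx * BIN_WIDTH
    return rng.uniform(low, low + BIN_WIDTH)


def flat_to_xy(flat_idx):
    """Split a flat player-bin index (0..3720) into per-axis indices (row-major, assumed)."""
    return divmod(flat_idx, N_BINS)


rng = random.Random(0)
x_bin, y_bin = flat_to_xy(1860)          # the central bin of the 61 x 61 grid
dx = sample_within_bin(x_bin, rng)       # a small displacement near zero
dy = sample_within_bin(y_bin, rng)
```

Because every sampled displacement lies inside `[-MAX_STEP_M, MAX_STEP_M]`, no single step can move an object further than the grid allows, which is the bounding property the text describes.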
In regression, a predicted velocity that is slightly too high makes the next input slightly wrong, which makes the next prediction more wrong, and so on: the errors compound over the rollout. With classification, the model must pick from a bounded set of valid bins, which limits runaway errors and enables longer simulations.
Each player and the ball at each timestep is represented as an "object token" containing rich information about their state:
Position: current location. Players are 2D (on court/pitch), the ball is 3D.
Velocity: current movement speed and direction.
Distance to ball: relative position of the ball, which helps model player-ball interactions.
Player identity: a learned embedding vector that captures individual playing style.
Elapsed time: time into the game/rally, which models fatigue and strategy changes.
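One way to picture an object token is as a small record whose fields are flattened into a single input vector. The sketch below is illustrative only: the field names, the plain concatenation, and the 3-dimensional identity vector are assumptions, and the real model learns the identity embedding rather than receiving fixed numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectToken:
    """One player or the ball at one timestep (field names are hypothetical)."""
    position: Tuple[float, ...]        # (x, y) for players, (x, y, z) for the ball
    velocity: Tuple[float, ...]        # same dimensionality as position
    offset_to_ball: Tuple[float, ...]  # ball position minus object position
    identity: Tuple[float, ...]        # stands in for the learned embedding
    elapsed_s: float                   # time since the rally or game started

    def features(self) -> List[float]:
        """Concatenate all fields into one flat vector (an assumed encoding)."""
        return [*self.position, *self.velocity, *self.offset_to_ball,
                *self.identity, self.elapsed_s]


player = ObjectToken(
    position=(1.0, -11.5),
    velocity=(0.4, 0.0),
    offset_to_ball=(-1.0, 11.5),
    identity=(0.1, -0.3, 0.7),
    elapsed_s=2.4,
)
vector = player.features()  # 2 + 2 + 2 + 3 + 1 = 10 numbers
```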
Context Tokens
In addition to object tokens, SportsNGEN uses context tokens that provide match-level information:
Surface type: hard, clay, or grass court affects ball bounce behavior significantly.
Serve number: players hit second serves slower and safer to avoid double faults.
Tournament: different venues may have different court speeds and conditions.
A crucial insight: the authors add small random noise (±25 mm in x, ±12.5 mm in y and z) to ball positions during training. Without this, the model only sees "perfect" ball trajectories and cannot recover from its own prediction errors during simulation. With noise, the model learns to correct small errors, which enables sustained rollouts.
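The noise step might look like the following sketch. The magnitudes come from the text; the uniform distribution and the function name `jitter_ball` are assumptions, since the summary does not say how the noise is drawn.

```python
import random

X_NOISE_M = 0.025    # +/- 25 mm in x, from the text
YZ_NOISE_M = 0.0125  # +/- 12.5 mm in y and z, from the text


def jitter_ball(position, rng):
    """Return a ball position perturbed per axis (uniform noise is an ASSUMPTION)."""
    x, y, z = position
    return (
        x + rng.uniform(-X_NOISE_M, X_NOISE_M),
        y + rng.uniform(-YZ_NOISE_M, YZ_NOISE_M),
        z + rng.uniform(-YZ_NOISE_M, YZ_NOISE_M),
    )


rng = random.Random(42)
clean = (3.2, -8.1, 1.05)         # metres; a made-up training sample
noisy = jitter_ball(clean, rng)   # applied to training inputs only
```

The perturbation is applied to training inputs, so the model repeatedly sees slightly "wrong" ball positions paired with the correct next movement.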
SportsNGEN builds on baller2vec, a transformer architecture designed for multi-agent spatiotemporal modeling, with several crucial extensions:
Key Extension: Simultaneous Ball + Player Modeling
The original baller2vec only modeled either players OR the ball. SportsNGEN models both simultaneously — this is essential for realistic simulation where player and ball movements are tightly coupled.
Each object token can attend to:
- All context tokens (match info, court type, etc.)
- All object tokens up to and including its own timestep
- NOT future timesteps (causal masking)
This means at each timestep, all players and the ball can be updated simultaneously, as each has access to the current state of all others.
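The attention rules above can be written down as a boolean mask. This is a sketch under assumptions: the token ordering (context tokens first, then objects grouped by timestep), the function name, and the choice that context tokens attend only to other context tokens are not specified in the text.

```python
def build_attention_mask(n_context, n_steps, n_objects):
    """mask[i][j] is True when token i may attend to token j."""
    # Timestep of each token; None marks a context token.
    steps = [None] * n_context + [t for t in range(n_steps) for _ in range(n_objects)]
    n_tokens = len(steps)
    mask = [[False] * n_tokens for _ in range(n_tokens)]
    for i, step_i in enumerate(steps):
        for j, step_j in enumerate(steps):
            if step_j is None:
                mask[i][j] = True              # every token sees the context tokens
            elif step_i is not None:
                mask[i][j] = step_j <= step_i  # same or earlier timestep only
            # ASSUMPTION: context tokens do not attend to object tokens
    return mask


# Tennis layout: 2 context tokens, 2 timesteps, 3 objects (two players + ball).
mask = build_attention_mask(n_context=2, n_steps=2, n_objects=3)
```

Tokens that share a timestep can see each other, which is why all players and the ball can be updated in one pass rather than one object at a time.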
Nucleus Sampling
A key insight: how you sample from the output distribution matters enormously. SportsNGEN uses nucleus sampling (top-p sampling) from NLP:
Instead of sampling from ALL bins or just taking the most likely, nucleus sampling considers only the smallest set of bins whose cumulative probability exceeds a threshold p (e.g., 0.9).
- p too high (→1.0): Samples unlikely bins, creates unrealistic movements
- p too low (→0.1): Always picks most likely bin, loses diversity
- Sweet spot (0.8-0.9): Balances realism with variety
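Nucleus sampling itself is short enough to sketch directly. The version below follows the standard definition (keep the most probable bins until their cumulative probability reaches p, renormalise, then sample); the function names and the toy four-bin distribution are made up for illustration.

```python
import random


def nucleus_filter(probs, p):
    """Return {bin_index: renormalised probability} for the top-p nucleus."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= p:   # smallest set whose cumulative mass reaches p
            break
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs, p, rng):
    """Sample one bin index from the renormalised nucleus."""
    nucleus = nucleus_filter(probs, p)
    bins = list(nucleus)
    return rng.choices(bins, weights=[nucleus[i] for i in bins], k=1)[0]


toy_probs = [0.5, 0.3, 0.15, 0.05]   # made-up distribution over four bins
rng = random.Random(0)
choice = nucleus_sample(toy_probs, p=0.9, rng=rng)  # never returns bin 3
```

With `p=0.9` the lowest-probability bin is cut off entirely; with a small `p` such as 0.1 only the single most likely bin survives, which matches the loss of diversity described above.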
SportsNGEN combines the trajectory model with additional components to simulate entire matches:
Sample initial conditions (serve position, player positions) from real historical matches between the specific players being simulated.
Run the model step-by-step: predict next positions → update state → repeat. Uses a rolling window of T tokens as input context.
Simple logic checks: ball goes out of bounds, bounces twice on one side, gets stuck near the net, or passes a player without being returned.
A separate classifier model (same architecture, no causal mask) analyzes the rally to determine: shot types (groundstroke, volley), direction (cross-court, down-the-line), outcome (winner, error, continuation).
Based on who won the point, update the score. Determine who serves next, from which side. Initialize the next rally.
By combining these components with score-tracking logic, SportsNGEN can simulate an entire best-of-3 tennis match — hundreds of rallies with realistic gameplay throughout. This is unprecedented for learned (non-physics-based) simulators.
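The rollout and termination steps can be sketched as a generic autoregressive loop. Everything here is a stand-in: `predict_next` and `is_finished` are hypothetical callables, the window counts whole states rather than the T tokens used by the model, and the toy "model" just moves a ball along one axis.

```python
from collections import deque


def simulate_rally(initial_states, predict_next, is_finished, window=4, max_steps=1000):
    """Autoregressive rollout: predict, append, repeat until a stop rule fires."""
    context = deque(initial_states, maxlen=window)  # rolling window of recent states
    trajectory = list(initial_states)
    for _ in range(max_steps):
        state = predict_next(list(context))  # the model only sees the window
        context.append(state)                # oldest state drops out automatically
        trajectory.append(state)
        if is_finished(trajectory):          # e.g. ball out, double bounce
            break
    return trajectory


# Toy stand-ins: the "state" is just the ball's x coordinate in metres.
toy_model = lambda context: context[-1] + 3.0
ball_is_out = lambda trajectory: trajectory[-1] > 10.0

rally = simulate_rally([0.0], toy_model, ball_is_out)
# rally == [0.0, 3.0, 6.0, 9.0, 12.0]
```

A full match simulator would wrap this loop with the shot classifier and score-keeping logic described above, starting a new rally each time the loop ends.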
One of the most powerful applications of SportsNGEN is counterfactual analysis — asking "what would have happened if the player made a different decision?"
In one analyzed rally, the player's actual shot went down the middle. Simulating the alternatives from that decision point showed that pushing the opponent wider would have been a stronger tactical choice, which is valuable coaching insight.
- Coaching: Identify suboptimal shot selection patterns
- Broadcast: Show "what if" scenarios during replays
- Strategy: Test different tactical approaches against specific opponents
- Training: Help players understand decision-making consequences
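Counterfactual analysis of this kind amounts to a Monte Carlo estimate: branch at the decision point, force each candidate shot, run many simulated continuations, and count how often the player wins. The sketch below shows only that bookkeeping. `toy_simulator` is a placeholder for a full rollout, and its win rates are made-up numbers, not results from the paper.

```python
import random


def estimate_win_probability(simulate_point, state, shot, n_rollouts, rng):
    """Fraction of simulated continuations won after forcing `shot` at `state`."""
    wins = sum(1 for _ in range(n_rollouts) if simulate_point(state, shot, rng))
    return wins / n_rollouts


def toy_simulator(state, shot, rng):
    """Placeholder for a real rollout; the rates below are MADE UP for illustration."""
    made_up_win_rate = {"middle": 0.45, "wide": 0.55}[shot]
    return rng.random() < made_up_win_rate


rng = random.Random(0)
estimates = {
    shot: estimate_win_probability(toy_simulator, state=None, shot=shot,
                                   n_rollouts=2000, rng=rng)
    for shot in ("middle", "wide")
}
```

The more rollouts per option, the tighter the estimate; comparing options is only meaningful when the gap between them is larger than the sampling noise.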
SportsNGEN can be customized to emulate specific players through transfer learning:
First, train on all players with a single "generic player" identity vector. This model learns general tennis behavior patterns.
Then fine-tune on matches containing a specific player, learning a new player identity vector that captures their unique style.
The paper shows metrics improving as fine-tuning data increases:
- Serve metrics (low variability): converge quickly with few samples
- Groundstroke patterns (high variability): need more data, continue improving up to 6000+ sequences
This suggests that for players with limited data, serve behavior can be captured quickly, but rally patterns require more extensive match history.
Unlike trajectory prediction (which can measure error vs ground truth), simulation quality requires comparing statistical distributions — do simulated matches have similar statistics to real matches?
Physical Metrics
For each metric, the authors compare the distribution in simulations against real data using the Wasserstein distance, which measures how far apart two distributions are:
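For one-dimensional samples such as serve speeds, the 1-Wasserstein distance is the area between the two empirical cumulative distribution functions. The pure-Python sketch below computes that area; it illustrates the metric in general and is not taken from the paper, and the speed values are invented.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two empirical 1-D samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    distance, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so i/len(a) and j/len(b) are the CDF values at `left`.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        distance += abs(i / len(a) - j / len(b)) * (right - left)
    return distance


real_speeds = [180.0, 190.0, 200.0]       # km/h, made-up values
simulated_speeds = [185.0, 195.0, 205.0]  # the same shape shifted by 5 km/h
gap = wasserstein_1d(real_speeds, simulated_speeds)  # 5.0 up to float rounding
```

A distance of zero means the two samples have identical distributions; here the simulated serves are uniformly 5 km/h too fast, and the metric reports exactly that shift. In practice a library routine such as `scipy.stats.wasserstein_distance` does the same job.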
Where the serve is hit
Maximum serve velocities
Maximum return velocities
Shot speeds during rallies
Statistical Metrics
Aggregate statistics compared between real and simulated matches:
% of 1st serves in
% of 2nd serves out
Points won on serve
% of serves untouched
Overall serve success
Perhaps the most important test: if SportsNGEN predicts a player has a 90% chance of winning from a certain game state, that player should actually win ~90% of the time in real data.
Result: The model is well-calibrated — predicted win percentages closely match observed outcomes across the full range (0-100%).
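A calibration check like this can be sketched by grouping predictions into probability bins and comparing each bin's mean prediction with the observed win rate. The function name, the ten equal-width bins, and the toy data are assumptions for illustration; the summary does not describe the paper's exact procedure.

```python
def calibration_table(predicted, outcomes, n_bins=10):
    """Return (mean predicted, observed win rate, count) for each non-empty bin."""
    buckets = [[] for _ in range(n_bins)]
    for p, won in zip(predicted, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        buckets[idx].append((p, won))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            rate = sum(won for _, won in bucket) / len(bucket)
            rows.append((mean_p, rate, len(bucket)))
    return rows


# Made-up data: ten states predicted at 0.9 (nine won), five at 0.2 (one won).
predicted = [0.9] * 10 + [0.2] * 5
outcomes = [1] * 9 + [0] + [1, 0, 0, 0, 0]
rows = calibration_table(predicted, outcomes)
```

For a well-calibrated model the first two numbers in every row are close to each other, which is the property the paper reports across the full 0-100% range.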
Dataset
- ~15,000 professional tennis matches
- 7.6 million rally sequences
- Player and ball center-of-mass (COM) positions at 25 Hz
- Rich metadata: players, tournament, court type, shot labels, etc.
Top-p Sensitivity
The sampling parameter p has a significant impact:
Too low: lacks variety, high rejection rate
Sweet spot: realistic + diverse
Too high: unrealistic movements, high rejection
Ablation: What Matters?
- Velocity, distance-to-ball, elapsed time in tokens
- Ball noise during training (critical for stability)
- Context tokens for surface type
- Player identity vectors
Increasing the player ID embedding dimension improves metrics up to size ~20, then shows diminishing returns. Larger embeddings capture more nuanced player styles.
Surface Type Modeling
The model correctly learns that clay courts have slower bounces than hard/grass courts. The distribution of "bounce speed ratio" (speed after ÷ speed before bounce) differs by surface in simulations just as it does in real data — without being explicitly programmed.
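The bounce speed ratio is simple to compute from the ball's velocity just before and just after a bounce. The sketch below uses made-up velocity vectors and hypothetical function names; how the paper extracts the two velocities from tracking data is not described in this summary.

```python
import math


def speed(velocity):
    """Magnitude of a velocity vector."""
    return math.sqrt(sum(c * c for c in velocity))


def bounce_speed_ratio(v_before, v_after):
    """Speed after the bounce divided by speed before it."""
    return speed(v_after) / speed(v_before)


# Made-up example: the ball arrives at 5 m/s and leaves at 4 m/s.
ratio = bounce_speed_ratio((3.0, 0.0, -4.0), (2.4, 0.0, 3.2))  # about 0.8
```

Collecting this ratio over many simulated bounces per surface, and doing the same for real data, gives the two distributions the text compares.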
Although the paper's quantitative evaluation focuses on tennis, the authors also demonstrate SportsNGEN on football (soccer) with 23 entities (22 players + ball):
- Sustained passing sequences
- Players maintain reasonable formations
- Ball movement looks realistic
- Same architecture generalizes across sports
Football also raises new challenges:
- 23 entities vs 3, so sequences are much longer
- Trade-off between sample rate and compute
- Harder to define natural break points
- More complex coordination patterns
The model can't handle unconventional situations or unseen players well — it falls back to "generic" behavior. Novel tactics or unusual plays may not be captured.
Training requires 2 days on an NVIDIA A100 GPU. The large number of output bins (226K for ball) makes the model expensive.
While demonstrated on tennis and football, other sports may introduce unique challenges not addressed in this work.
The model learns physics implicitly from data and has no explicit physical constraints, so it occasionally produces physically impossible trajectories.
| Aspect | RL Approaches | Baller2vec | SportsNGEN |
|---|---|---|---|
| Training approach | Learns from scratch via rewards | Learns from tracking data | Learns from tracking data |
| Captures real player behavior? | ❌ No — learns own strategy | ✓ Yes | ✓ Yes |
| Sustained simulation? | ✓ Yes | ❌ Short only | ✓ Yes (entire matches) |
| Player customization? | ❌ Difficult | ⚠️ Limited | ✓ Via fine-tuning |
| Models ball + players together? | Varies | ❌ Separate | ✓ Simultaneous |
| Counterfactual analysis? | ⚠️ Possible but different meaning | ⚠️ Limited | ✓ Natural application |
- Sustained simulation: Entire tennis matches, long football sequences
- Realistic behavior: Captures full distribution of player decisions
- Customizable: Fine-tune for specific players
- Counterfactuals: Evaluate alternative shot choices
- Measurable: Clear metrics to optimize
- Multi-sport: Same architecture works for tennis and football
- Well-calibrated: Win predictions match reality
Limitations:
- Novel situations: Unusual play falls back to generic behavior
- Explicit physics: Not enforced; physics is learned implicitly and occasionally violated
- Real-time use: Computationally expensive, not designed for live use
- Unseen players: Needs data to customize to new players
- Sport coverage: Only demonstrated on tennis and football
Classification > Regression
Discretizing movement into bins with bounded ranges prevents error accumulation and enables sustained simulation.
Ball Noise is Critical
Adding small noise to ball positions during training teaches the model to correct errors, essential for long rollouts.
Top-p Matters
The sampling parameter dramatically affects realism. Too high = unrealistic, too low = no diversity. Sweet spot ~0.8-0.9.
Player ID Embeddings
Learned identity vectors capture individual playing styles, enabling player-specific simulations via fine-tuning.
Context Changes Behavior
Surface type, serve number, and tournament all affect simulated behavior appropriately without explicit programming.
Counterfactuals for Coaching
The ability to branch simulations at decision points enables 'what if' analysis for coaching and strategy.
- Baller2vec (Alcorn & Nguyen, 2021): Foundation transformer for multi-agent spatiotemporal modeling
- Nucleus Sampling (Holtzman et al., 2020): Top-p sampling technique from NLP
- Google Research Football (Kurach et al., 2020): RL environment for football