Understanding trajectories in multi-agent scenarios requires addressing various tasks, including predicting future movements, imputing missing observations, inferring the status of unseen agents, and classifying different global states. Traditional data-driven approaches often handle these tasks separately with specialized models.
TranSPORTmer introduces a unified transformer-based framework capable of addressing all these tasks, showcasing its application to the intricate dynamics of multi-agent sports scenarios like soccer and basketball. Using Set Attention Blocks, TranSPORTmer effectively captures temporal dynamics and social interactions in an equivariant manner.
The model's tasks are guided by an input mask that conceals missing or yet-to-be-predicted observations. Additionally, an extra [CLS] agent is introduced to classify states along soccer trajectories, including passes, possessions, uncontrolled states, and out-of-play intervals.
Before TranSPORTmer, sports analytics teams faced a frustrating reality: they needed to build, train, and maintain separate models for every task. Each model had its own architecture, hyperparameters, training pipeline, and failure modes.
**Trajectory Forecasting:** Predict where players will move in the next 1-4 seconds.
Example: Will the striker run behind the defense or come short for a pass?
**Imputation:** Fill in missing/occluded player positions in tracking data.
Example: A player was blocked by another in the camera view for 2 seconds — where were they?
**Ball Inference:** Infer ball position when ball tracking is unavailable or unreliable.
Example: Broadcast footage doesn't track the ball — can we infer it from player movements?
**State Classification:** Classify what's happening: pass, possession, dead ball, etc.
Example: Is this moment a completed pass, a loose ball, or out of play?
Why Is This Problematic?
Each model learns similar things (player dynamics, spatial relationships) from scratch. Training 4 models means doing similar work 4 times. If you have limited data (common in sports), you're splitting it across multiple models instead of pooling it.
Your forecasting model learns that "when a player accelerates toward goal, they often continue running." But your imputation model learns this separately. Insights don't transfer between tasks.
In production, you need to manage 4+ models, each with different input/output formats, update cycles, and potential failure modes. One unified model is far simpler to deploy and maintain.
All these tasks share a fundamental requirement: understanding spatial relationships between agents (who is near whom, who is marking whom) and temporal dynamics (how the play is evolving). Instead of learning these representations four times, learn them once and apply them to all tasks via different masking strategies.
Before understanding the architecture, let's clarify what data TranSPORTmer processes. At each timestep, we have a snapshot of the game:
Concrete Example: Soccer Sequence
Imagine a 3-second clip at 10 FPS = 30 frames. At each frame, we record:
For each of the 22 players:
- • Position: (x, y) in meters
- • Velocity: (vx, vy) in m/s
- • Team ID: 0 or 1
- • Optional: acceleration, orientation

For the ball:
- • Position: (x, y) in meters
- • Velocity: (vx, vy) in m/s
- • Height: z (for aerial balls)
- • Special "ball" type flag
Total: 30 frames × 23 agents × ~6 features = ~4,140 values describing this 3-second sequence.
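As a sanity check on that count, here is a minimal numpy sketch of the input tensor. The (frames, agents, features) layout and the exact 6-feature set are illustrative assumptions, not the paper's precise schema:

```python
import numpy as np

T, N_PLAYERS = 30, 22          # 3 seconds at 10 FPS; 22 players
N = N_PLAYERS + 1              # plus the ball = 23 agents
F = 6                          # e.g. x, y, vx, vy, team_id, is_ball (assumed feature set)

# One training sequence as a dense (T, N, F) tensor.
sequence = np.zeros((T, N, F), dtype=np.float32)

assert sequence.size == 4140   # matches the 30 x 23 x 6 count above
```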
TranSPORTmer adds a 24th "agent" — the [CLS] token. This isn't a real player; it's a learnable embedding that attends to all other agents and learns to represent the global game state. After processing, the CLS token's output is used for classification (pass/possession/out-of-play).
This is borrowed from NLP transformers like BERT, where [CLS] represents the "meaning" of the whole sentence. Here, it represents the "state" of the whole game moment.
TranSPORTmer's architecture has two main stages: spatial encoding (understanding relationships at each moment) and temporal encoding (understanding how the game evolves over time).
Stage 1: Spatial Encoding with Set Attention
At each timestep t, all agents attend to each other using self-attention. This is where the model learns "who matters to whom" at this moment.
Q_i = W_Q · agent_i (What am I looking for?)
K_i = W_K · agent_i (What do I offer?)
V_i = W_V · agent_i (What info do I carry?)
α_ij = softmax_j(Q_i · K_j / √d)   # How relevant is agent j to agent i? Weights sum to 1 for each agent
output_i = Σ_j α_ij · V_j          # Each agent's new embedding incorporates info from relevant others
Consider a striker with the ball. The Set Attention computes:
- • High attention to nearby defender (α = 0.35) — immediate threat
- • High attention to winger making a run (α = 0.30) — passing option
- • Medium attention to goalkeeper (α = 0.15) — shooting angle
- • Low attention to far-side fullback (α = 0.05) — not immediately relevant
- • Low attention to own goalkeeper (α = 0.02) — not relevant to attacking
The striker's output embedding now encodes: "I have pressure from one defender, a good option on the left wing, and a shooting angle past the keeper."
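A minimal numpy sketch of one set-attention step over the agents at a single timestep; the dimensions and random projection matrices are made up for illustration:

```python
import numpy as np

def agent_self_attention(X, Wq, Wk, Wv):
    """One set-attention step over N agents at one timestep.

    X: (N, d_in) agent features; Wq/Wk/Wv: (d_in, d) shared projections.
    Returns (N, d) updated embeddings and the (N, N) attention weights.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # relevance of agent j to agent i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d_in, d = 5, 6, 4
X = rng.normal(size=(N, d_in))
Wq, Wk, Wv = (rng.normal(size=(d_in, d)) for _ in range(3))
out, w = agent_self_attention(X, Wq, Wk, Wv)
assert np.allclose(w.sum(axis=-1), 1.0)  # attention weights sum to 1 per agent
```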
Stage 2: Temporal Encoding
After spatial encoding, each agent has an embedding at each timestep. Now we process these across time to understand how the play develops.
Agent_i_sequence = [h_i^(1), ..., h_i^(T)]   # T embeddings, one per frame
Agent_i_sequence = Agent_i_sequence + PE     # PE = learnable or sinusoidal position embedding
output = TemporalSelfAttention(Agent_i_sequence)
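For the sinusoidal variant of PE, here is a short numpy sketch using the standard formulation from "Attention Is All You Need"; the embedding width is arbitrary:

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(T)[:, None]                  # (T, 1) frame indices
    i = np.arange(d // 2)[None, :]               # (1, d/2) frequency indices
    angle = pos / (10000 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle)                  # even dims: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dims: cosine
    return pe

pe = sinusoidal_pe(T=30, d=8)
agent_seq = np.random.default_rng(1).normal(size=(30, 8))
agent_seq = agent_seq + pe   # temporal order is injected here, never across agents
```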
Consider 30 frames (3 seconds) where an attack develops:
- • Frames 1-10: Ball in midfield, players in initial positions
- • Frames 11-20: Midfielder plays through ball, striker starts run
- • Frames 21-30: Striker receives ball, defenders scramble
When predicting frame 31+, the temporal attention allows frame 30 to "look back" at frames 11-20 and recognize: "This is the continuation of a through-ball run that started 1.5 seconds ago."
Stage 3: Task-Specific Outputs
The unified encoder produces rich embeddings. Different "heads" decode these for different tasks:
The magic ingredient: an input mask tells the model which observations are "available" and which are "hidden." Different masks → different tasks.
This is perhaps the most important technical concept in TranSPORTmer. Standard sequence models (LSTMs, regular transformers) assume your data has a meaningful order. But players on a pitch don't have an order — there's no "first" player.
Suppose you always order players by jersey number. Your data looks like:
But jersey numbers are arbitrary! If you reordered by position or alphabetically, you'd get the same game state represented differently — and a standard model would produce different outputs.
Set Attention treats inputs as an unordered set:
Any permutation of players produces the same output (up to the corresponding permutation of outputs).
MATHEMATICAL DEFINITION: EQUIVARIANCE
A function f is permutation equivariant if permuting the inputs permutes the outputs in the same way: for any permutation π, f(π · X) = π · f(X).
In plain English: If I swap players 3 and 7 in the input, the outputs for players 3 and 7 are swapped too — but the values of those outputs don't change.
Scenario: Messi is in position (35, 20), Pedri is in position (40, 25).
The predictions for each player are identical regardless of input order. Only the order of outputs changes to match the input order.
HOW SET ATTENTION ACHIEVES EQUIVARIANCE
Self-attention is naturally equivariant because it's symmetric in how it treats inputs:
- Each agent computes Q, K, V using the same weight matrices
- Attention is computed between all pairs — no special treatment for "first" or "last"
- Each agent's output is a weighted sum of all values — symmetric operation
Key insight: no positional encoding is applied in the spatial (agent) dimension. Positional encoding is only added in the temporal dimension, where order does matter.
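This property is easy to verify numerically: shuffling the agents before attention yields exactly the shuffled outputs. A self-contained check with random weights (no claim about TranSPORTmer's actual parameters):

```python
import numpy as np

def set_attention(X, W):
    """Plain self-attention over a set of agents; W = (Wq, Wk, Wv)."""
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    s = Q @ K.T / np.sqrt(K.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ V

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))          # 6 agents, 4 features each
W = rng.normal(size=(3, 4, 4))
perm = rng.permutation(6)

# Permute then attend == attend then permute: permutation equivariance.
assert np.allclose(set_attention(X[perm], W), set_attention(X, W)[perm])
```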
This is TranSPORTmer's elegant solution: all trajectory tasks are really "predict the masked parts" problems. Change what you mask, change the task.
**Trajectory Forecasting**
What's masked: All future frames
What's visible: Past and present frames
Goal: Predict where each player will be in 1-4 seconds

**Imputation**
What's masked: Random player positions (simulating occlusion)
What's visible: Other players + other timesteps of the same player
Goal: Reconstruct the missing observations

**Ball Inference**
What's masked: The entire ball trajectory
What's visible: All player positions (both teams)
Goal: Infer where the ball is from player movements

**State Classification**
What's masked: Nothing (full context available)
What's visible: All players + ball
Goal: Classify each moment as pass/possession/uncontrolled/out-of-play
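The four masking regimes can be sketched as boolean (T, N) masks over the timestep-by-agent grid. The agent indexing, observed-frame count, and dropout ratio below are illustrative assumptions:

```python
import numpy as np

T, N = 30, 23        # frames, agents; assume agent index 22 is the ball
BALL = 22
rng = np.random.default_rng(3)

# Convention here: True = hidden from the model, False = visible.
forecast_mask = np.zeros((T, N), dtype=bool)
forecast_mask[20:, :] = True                    # hide all future frames

impute_mask = rng.random((T, N)) < 0.15         # hide random observations (occlusion)

ball_mask = np.zeros((T, N), dtype=bool)
ball_mask[:, BALL] = True                       # hide the entire ball trajectory

classify_mask = np.zeros((T, N), dtype=bool)    # nothing hidden: full context
```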
1. Regularization: Training on multiple objectives prevents overfitting to any single task.
2. Data efficiency: The encoder sees more varied examples, learning more robust representations.
3. Transfer: Solving imputation helps forecasting — both require understanding motion patterns.
4. Implicit knowledge: Classification teaches the model "what's happening" which helps predict "what will happen."
TranSPORTmer was evaluated against state-of-the-art task-specific models. The unified model outperforms specialists on most metrics.
Datasets Used
**Soccer:**
- • Professional match tracking data
- • 22 players + ball at ~10-25 FPS
- • Used for all four tasks
- • Train/val/test split by matches

**Basketball (NBA):**
- • SportVU optical tracking
- • 10 players + ball at 25 FPS
- • Primarily forecasting evaluation
- • Smaller court = denser interactions
Key Results
**Trajectory Forecasting:** Outperforms Social-STGCNN, Trajectron++, and other trajectory prediction baselines.
• Lower ADE (Average Displacement Error) at all horizons
• Lower FDE (Final Displacement Error) for endpoint accuracy
• Particularly strong at 2-4 second horizons
**Imputation:** Handles real-world noisy data where some observations are missing.
• Most models can't do both simultaneously
• TranSPORTmer naturally handles partial observations
• Critical for production deployment with imperfect tracking
**Ball Inference:** Accurately locates the ball from player movements alone.
• Mean error ~1-2 meters when ball is with a player
• Higher error during long passes (expected — ball in flight)
• Enables analysis of broadcast video without ball tracking
**State Classification:** The CLS token accurately identifies game states.
• High accuracy for possession vs out-of-play
• Pass detection enables automatic event annotation
• State info improves trajectory predictions
Adding the state classification task improved trajectory prediction accuracy. This wasn't obvious beforehand — you might expect adding tasks to hurt performance. But teaching the model to recognize "this is a pass being made" helps it predict "the receiving player will control the ball here."
L_forecast: MSE between predicted and actual future positions
L_impute: MSE between reconstructed and actual masked positions
L_ball: MSE for inferred ball position
L_classify: Cross-entropy for state classification
The total objective is a weighted sum, L_total = λ₁·L_forecast + λ₂·L_impute + λ₃·L_ball + λ₄·L_classify. The λ weights are tuned to balance task contributions — typically around 1.0 for position tasks and 0.5-1.0 for classification.
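A sketch of combining such a weighted multi-task loss, with the forecasting and classification terms shown (the imputation and ball terms are analogous masked MSEs). Shapes and λ values here are illustrative, not the paper's:

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE computed only over hidden (masked) entries; mask is boolean, True = hidden."""
    return ((pred - target) ** 2)[mask].mean()

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits (B, C) and integer labels (B,)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(6)
pred, target = rng.normal(size=(30, 23, 2)), rng.normal(size=(30, 23, 2))
fmask = np.zeros((30, 23, 2), dtype=bool)
fmask[20:] = True                                  # future frames hidden (forecasting)
logits = rng.normal(size=(30, 4))                  # 4 states per frame
labels = rng.integers(0, 4, size=30)

lam = {"forecast": 1.0, "classify": 0.5}           # hypothetical weights
total = (lam["forecast"] * masked_mse(pred, target, fmask)
         + lam["classify"] * cross_entropy(logits, labels))
```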
TranSPORTmer establishes a strong foundation, but there are clear opportunities for improvement. Here's a detailed analysis of limitations and potential solutions:
Current Limitations
TranSPORTmer outputs a single (x, y) prediction per agent per timestep. But player trajectories are inherently multi-modal — a striker might run behind OR come short.
The model can predict physically impossible movements — a player accelerating at 15 m/s² or sustaining 12 m/s for 5 seconds (faster than Usain Bolt).
The state classification (pass/possession) is discrete, but there's no explicit modeling of when events occur or who is involved.
Self-attention over 23 agents × T timesteps = O(23² × T + 23 × T²) complexity. For long sequences or real-time inference, this becomes expensive.
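Plugging in concrete numbers makes the asymmetry visible; for a 20-second clip the temporal term dominates the spatial one by almost an order of magnitude:

```python
N, T = 23, 200   # agents; frames (20 s at 10 FPS)

spatial = N * N * T    # per-frame attention over all agent pairs
temporal = N * T * T   # per-agent attention over all frame pairs

print(spatial, temporal)   # 105800 920000
```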
Proposed Improvements
Instead of predicting a single (x, y), predict a distribution over possible futures.
Predict K Gaussian components. Each represents a possible trajectory mode. π gives the probability of each mode.
Sample latent code z from learned distribution. Different z values give different plausible trajectories.
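A toy sketch of sampling from a K-component mixture head; the mode probabilities and endpoint coordinates below are made up to mirror the run-behind vs. come-short example:

```python
import numpy as np

def sample_future(pi, mu, sigma, rng):
    """Draw one future endpoint from a K-mode Gaussian mixture (hypothetical head output)."""
    k = rng.choice(len(pi), p=pi)   # pick a mode, e.g. "run behind" vs "come short"
    return mu[k] + sigma[k] * rng.normal(size=mu.shape[-1])

rng = np.random.default_rng(7)
pi = np.array([0.6, 0.4])                        # mode probabilities (made up)
mu = np.array([[45.0, 20.0], [38.0, 22.0]])      # run-behind vs come-short endpoints (m)
sigma = np.full((2, 2), 1.5)                     # per-mode positional spread (m)

samples = np.array([sample_future(pi, mu, sigma, rng) for _ in range(1000)])
```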
Add soft constraints that penalize unrealistic movements without breaking differentiability.
Speed penalty: Penalize velocities exceeding 9.5 m/s (elite sprint speed)
Acceleration penalty: Penalize accelerations exceeding 5 m/s² (human biomechanical limit)
Jerk penalty: Optionally penalize sudden acceleration changes for smoother trajectories
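A sketch of the speed and acceleration penalties using the thresholds quoted above; numpy stands in here for an autograd framework, where the ReLU-style hinge keeps the penalty differentiable:

```python
import numpy as np

def physics_penalty(traj, dt=0.1, v_max=9.5, a_max=5.0):
    """Soft penalties on speed and acceleration for one (T, 2) trajectory in meters."""
    vel = np.diff(traj, axis=0) / dt               # (T-1, 2) finite-difference velocities
    acc = np.diff(vel, axis=0) / dt                # (T-2, 2) accelerations
    speed = np.linalg.norm(vel, axis=-1)
    accel = np.linalg.norm(acc, axis=-1)
    # Hinge: zero inside the physical envelope, linear outside it.
    return (np.maximum(speed - v_max, 0.0).sum()
            + np.maximum(accel - a_max, 0.0).sum())

# A constant 8 m/s run violates neither limit; a 12 m/s run violates the speed cap.
t = np.arange(10)[:, None] * 0.1
assert physics_penalty(np.hstack([8.0 * t, np.zeros_like(t)])) == 0.0
assert physics_penalty(np.hstack([12.0 * t, np.zeros_like(t)])) > 0.0
```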
Add a dedicated output for predicting discrete events alongside trajectories.
At each frame, predict: P(pass), P(shot), P(dribble), P(nothing)
If pass: which player sends, which receives? (pairwise attention scores)
Process time at multiple scales: frame-level (0.1s), phase-level (1-2s), and possession-level (5-15s).
Level 1: Individual frames (fine-grained motion)
Level 2: 1-second windows pooled (tactical adjustments)
Level 3: Possession segments (strategic patterns)
Benefit: Long-range dependencies (e.g., "this is a counter-attack that started 10 seconds ago") are captured without O(T²) attention over all frames.
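A minimal sketch of the pooling step that would build the coarser levels; window sizes and embedding dimensions are illustrative:

```python
import numpy as np

def pool_windows(seq, window):
    """Mean-pool a (T, d) sequence into (T // window, d) coarser timesteps."""
    T, d = seq.shape
    return seq[: T - T % window].reshape(-1, window, d).mean(axis=1)

frames = np.random.default_rng(5).normal(size=(300, 16))  # 30 s at 10 FPS, frame level
phase = pool_windows(frames, 10)        # ~1 s windows  -> (30, 16)
possession = pool_windows(phase, 10)    # ~10 s segments -> (3, 16)
```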
Replace full attention with more efficient variants for real-time inference.
Linear attention (e.g., Performer): O(N) instead of O(N²) via kernel approximation
Sparse/local attention: Only attend to nearby players + ball
FlashAttention: Memory-efficient exact attention
Build a "foundation model" for multi-agent sports by pre-training on multiple sports simultaneously.
Insight: Basic motion patterns (acceleration, direction changes, spacing) are shared across soccer, basketball, hockey, American football.
Approach: Pre-train on combined data with sport-specific tokens, then fine-tune on target sport.
Benefit: More data → better representations. Teams with limited data can leverage pre-trained model.
Beyond prediction: generate "what if" scenarios for tactical analysis.
Question: "What if the striker had run behind instead of coming short?"
Approach: Condition the model on hypothetical striker movement, predict how defenders would respond.
Application: Post-match analysis, training sessions, tactical planning.
Combine trajectory modeling with visual features from broadcast video.
Current limitation: TranSPORTmer uses only positional data.
Opportunity: Visual features could provide: body orientation, gaze direction, ball possession confidence, player identification.
Architecture: Add a visual encoder (CNN/ViT) that produces per-player visual embeddings, concatenate with positional embeddings.
If you've read our SkillCorner STGNN Implementation, you might wonder how it relates to TranSPORTmer. Here's a comparison:
| Aspect | SkillCorner STGNN | TranSPORTmer |
|---|---|---|
| Spatial Encoding | TransformerConv (GNN with edge features) | Set Attention Blocks (pure attention) |
| Temporal Encoding | Transformer Encoder (causal) | Temporal Transformer (causal or bidirectional) |
| Tasks | Trajectory forecasting only | Forecasting + Imputation + Ball + Classification |
| Edge Features | Yes (distance, team, relative velocity) | Implicit via attention (learned) |
| Physics Constraints | Yes (speed/velocity loss terms) | No (pure data-driven) |
| Focus | Single runner trajectory | All agents simultaneously |
The best of both worlds: use TranSPORTmer's multi-task unified architecture with the SkillCorner STGNN's explicit edge features (distance, same_team) and physics-informed loss. This would combine elegant unified learning with domain knowledge about football structure.
• "Attention Is All You Need" (Vaswani et al., 2017) — The original Transformer
• "Set Transformer" (Lee et al., 2019) — Set attention for unordered inputs
• "Social-STGCNN" (Mohamed et al., 2020) — Trajectory prediction with social graphs
• "Trajectron++" (Salzmann et al., 2020) — Probabilistic trajectory forecasting