Research Deep Dive · arXiv, October 2024
TranSPORTmer: A Holistic Approach to Trajectory Understanding
A unified transformer framework for multi-agent sports that handles trajectory prediction, imputation, ball inference, and state classification — all in one model.
Set Attention Blocks · Multi-Task Learning · Permutation Equivariance · 35 min read
Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo, Francesc Moreno-Noguer
Paper Abstract

Understanding trajectories in multi-agent scenarios requires addressing various tasks, including predicting future movements, imputing missing observations, inferring the status of unseen agents, and classifying different global states. Traditional data-driven approaches often handle these tasks separately with specialized models.

TranSPORTmer introduces a unified transformer-based framework capable of addressing all these tasks, showcasing its application to the intricate dynamics of multi-agent sports scenarios like soccer and basketball. Using Set Attention Blocks, TranSPORTmer effectively captures temporal dynamics and social interactions in an equivariant manner.

The model's tasks are guided by an input mask that conceals missing or yet-to-be-predicted observations. Additionally, a [CLS] extra agent is introduced to classify states along soccer trajectories, including passes, possessions, uncontrolled states, and out-of-play intervals.

The Problem: Fragmented Sports Analytics

Before TranSPORTmer, sports analytics teams faced a frustrating reality: they needed to build, train, and maintain separate models for every task. Each model had its own architecture, hyperparameters, training pipeline, and failure modes.

❌ Player Forecasting

Predict where players will move in the next 1-4 seconds.

Example: Will the striker run behind the defense or come short for a pass?

❌ Player Imputation

Fill in missing/occluded player positions in tracking data.

Example: A player was blocked by another in the camera view for 2 seconds — where were they?

❌ Ball Inference

Infer ball position when ball tracking is unavailable or unreliable.

Example: Broadcast footage doesn't track the ball — can we infer it from player movements?

❌ State Classification

Classify what's happening: pass, possession, dead ball, etc.

Example: Is this moment a completed pass, a loose ball, or out of play?

Why Is This Problematic?

1. Wasted Computation & Data

Each model learns similar things (player dynamics, spatial relationships) from scratch. Training 4 models means doing similar work 4 times. If you have limited data (common in sports), you're splitting it across multiple models instead of pooling it.

2. No Knowledge Transfer

Your forecasting model learns that "when a player accelerates toward goal, they often continue running." But your imputation model learns this separately. Insights don't transfer between tasks.

3. Deployment Complexity

In production, you need to manage 4+ models, each with different input/output formats, update cycles, and potential failure modes. One unified model is far simpler to deploy and maintain.

TranSPORTmer's Key Insight

All these tasks share a fundamental requirement: understanding spatial relationships between agents (who is near whom, who is marking whom) and temporal dynamics (how the play is evolving). Instead of learning these representations 4 times, learn them once and apply them to all tasks via different masking strategies.

Input Representation: How Data Enters the Model
Understanding the data format before diving into architecture

Before understanding the architecture, let's clarify what data TranSPORTmer processes. At each timestep, we have a snapshot of the game:

Single Frame Input (t = some moment)
# For each of the 23 agents (22 players + 1 ball):
agent_i = [x, y, vx, vy, ...] # position + velocity + optional features
# Stacked into a matrix:
X^t ∈ ℝ^(N_agents × d_features)
# Example: 23 agents × 4 features = 23×4 matrix
Sequence Input (full trajectory)
# Stack T timesteps together:
X = [X^1, X^2, ..., X^T]
X ∈ ℝ^(T × N_agents × d_features)
# Example: 50 frames × 23 agents × 4 features
# At 10 FPS, 50 frames = 5 seconds of tracking data

Concrete Example: Soccer Sequence

Imagine a 3-second clip at 10 FPS = 30 frames. At each frame, we record:

22 Players
  • Position: (x, y) in meters
  • Velocity: (vx, vy) in m/s
  • Team ID: 0 or 1
  • Optional: acceleration, orientation
1 Ball
  • Position: (x, y) in meters
  • Velocity: (vx, vy) in m/s
  • Height: z (for aerial balls)
  • Special "ball" type flag

Total: 30 frames × 23 agents × ~6 features = ~4,140 values describing this 3-second sequence.
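The shapes above can be sketched directly. This is a minimal NumPy illustration with made-up values; the exact feature layout (which column holds team ID, which flags the ball) is an assumption for illustration, not the paper's specification.

```python
import numpy as np

# Input tensor per the convention above: (T frames, N agents, d features).
T, N_AGENTS, D_FEATURES = 30, 23, 6  # 3 s at 10 FPS, 22 players + 1 ball

rng = np.random.default_rng(0)
# Hypothetical per-agent layout: [x, y, vx, vy, team_id, is_ball]
X = rng.normal(size=(T, N_AGENTS, D_FEATURES))
X[:, :, 4] = np.repeat([0, 1, 2], [11, 11, 1])  # team 0, team 1, ball marker
X[:, :, 5] = 0.0
X[:, 22, 5] = 1.0  # last agent is the ball

print(X.shape)  # (30, 23, 6)
print(X.size)   # 30 * 23 * 6 = 4140 values, matching the count above
```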

The CLS Agent: A Clever Addition

TranSPORTmer adds a 24th "agent" — the [CLS] token. This isn't a real player; it's a learnable embedding that attends to all other agents and learns to represent the global game state. After processing, the CLS token's output is used for classification (pass/possession/out-of-play).

This is borrowed from NLP transformers like BERT, where [CLS] represents the "meaning" of the whole sentence. Here, it represents the "state" of the whole game moment.
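A sketch of how such a token can be prepended as an extra "agent" per frame; the `d_model` value and the broadcasting scheme are illustrative choices, not taken from the paper.

```python
import numpy as np

T, N, d_model = 30, 23, 128

agent_embeddings = np.zeros((T, N, d_model))     # output of the input projection
cls_embedding = np.full((1, 1, d_model), 0.01)   # a learnable parameter in practice

# Broadcast the single CLS vector to every frame and prepend it.
cls_tiled = np.broadcast_to(cls_embedding, (T, 1, d_model))
with_cls = np.concatenate([cls_tiled, agent_embeddings], axis=1)

print(with_cls.shape)  # (30, 24, 128): 23 real agents + 1 CLS per frame
```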

Core Architecture: Set Attention Blocks
The key innovation that makes unified modeling possible

TranSPORTmer's architecture has two main stages: spatial encoding (understanding relationships at each moment) and temporal encoding (understanding how the game evolves over time).

Stage 1: Spatial Encoding with Set Attention

1. Set Attention Blocks (Per-Frame Processing)

At each timestep t, all agents attend to each other using self-attention. This is where the model learns "who matters to whom" at this moment.

How It Works (Step by Step):

1. Project each agent to Q, K, V:
Q_i = W_Q · agent_i (What am I looking for?)
K_i = W_K · agent_i (What do I offer?)
V_i = W_V · agent_i (What info do I carry?)

2. Compute attention scores between ALL pairs:
score(i,j) = Q_i · K_j^T / √d
# How relevant is agent j to agent i?

3. Softmax to get attention weights:
α_ij = softmax(score(i, :))
# Weights sum to 1 for each agent

4. Weighted sum of values:
output_i = Σ_j α_ij · V_j
# Each agent's new embedding incorporates info from relevant others
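The four steps above can be condensed into a single-head attention function. This is a sketch with untrained placeholder weights, not the paper's implementation.

```python
import numpy as np

def set_attention(agents, W_Q, W_K, W_V):
    """Single-head self-attention over a set of agent embeddings.

    agents: (N, d) array, one row per agent; the rows form an unordered set.
    """
    Q, K, V = agents @ W_Q, agents @ W_K, agents @ W_V   # step 1: projections
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # step 2: pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # step 3: row-wise softmax
    return weights @ V                                   # step 4: weighted sum

rng = np.random.default_rng(1)
N, d = 23, 16
agents = rng.normal(size=(N, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = set_attention(agents, *W)
print(out.shape)  # (23, 16): one updated embedding per agent
```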
🎯 Concrete Example: Striker Attending to Others

Consider a striker with the ball. The Set Attention computes:

  • High attention to nearby defender (α = 0.35) — immediate threat
  • High attention to making-run winger (α = 0.30) — passing option
  • Medium attention to goalkeeper (α = 0.15) — shooting angle
  • Low attention to far-side fullback (α = 0.05) — not immediately relevant
  • Low attention to own goalkeeper (α = 0.02) — not relevant to attacking

The striker's output embedding now encodes: "I have pressure from one defender, a good option on the left wing, and a shooting angle past the keeper."

Stage 2: Temporal Encoding

2. Temporal Transformer (Cross-Time Processing)

After spatial encoding, each agent has an embedding at each timestep. Now we process these across time to understand how the play develops.

How It Works:

1. For each agent, stack embeddings across time:
Agent_i_sequence = [H_i^1, H_i^2, ..., H_i^T]
# T embeddings, one per frame

2. Add positional encoding (so the model knows which frame is which):
H_i^t = H_i^t + PE(t)
# PE = learnable or sinusoidal position embedding

3. Apply self-attention across timesteps:
# Frame 10 can attend to frames 1-9 (causal) or all frames (bidirectional)
output = TemporalSelfAttention(Agent_i_sequence)
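Steps 2 and 3 can be sketched as follows, using the standard sinusoidal encoding from the original Transformer; whether TranSPORTmer uses learnable or sinusoidal PE, and these exact mask details, are assumptions here.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal PE along the temporal axis only (step 2 above)."""
    pos = np.arange(T)[:, None]          # frame index t
    i = np.arange(d_model)[None, :]      # embedding dimension index
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

T, d_model = 50, 128
H_i = np.zeros((T, d_model))             # one agent's per-frame embeddings
H_i = H_i + positional_encoding(T, d_model)
print(H_i.shape)  # (50, 128)

# A causal mask (forecasting mode) keeps frame t from seeing frames t' > t:
causal_mask = np.tril(np.ones((T, T), dtype=bool))
print(causal_mask[0].sum(), causal_mask[-1].sum())  # 1 50
```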
🎯 Concrete Example: Understanding a Developing Attack

Consider 30 frames (3 seconds) where an attack develops:

  • Frames 1-10: Ball in midfield, players in initial positions
  • Frames 11-20: Midfielder plays through ball, striker starts run
  • Frames 21-30: Striker receives ball, defenders scramble

When predicting frame 31+, the temporal attention allows frame 30 to "look back" at frames 11-20 and recognize: "This is the continuation of a through-ball run that started 1.5 seconds ago."

Stage 3: Task-Specific Outputs

3. Output Heads (Task-Specific Decoders)

The unified encoder produces rich embeddings. Different "heads" decode these for different tasks:

📍 Position Head
Linear(d_model → 2)
Output: (x, y) coordinates
🎭 CLS Classification Head
Linear(d_model → 4)
Output: [pass, possession, uncontrolled, out_of_play]
⚽ Ball Position Head
Linear(d_model → 2)
Output: inferred ball (x, y)
🔮 Velocity Head (optional)
Linear(d_model → 2)
Output: predicted (vx, vy)
4. Input Mask: The Task Controller

The magic ingredient: an input mask tells the model which observations are "available" and which are "hidden." Different masks → different tasks:

# Forecasting: Mask future frames
mask = [1,1,1,...,1,0,0,0] # 1=visible, 0=predict
# Imputation: Mask random positions
mask = [1,0,1,1,0,1,1,1] # 0s are missing observations
# Ball inference: Mask all ball positions
mask[ball_agent] = 0 # Only ball is hidden
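The three mask patterns can be made concrete on a (T, N) visibility grid, where 1 = observed and 0 = hidden. The frame ranges and the ball's slot in the agent ordering are illustrative.

```python
import numpy as np

T, N = 50, 23
BALL = 22  # assume the ball is the last agent in the ordering

# Forecasting: hide the last 20 frames for everyone.
forecast = np.ones((T, N))
forecast[30:, :] = 0

# Imputation: player 7 occluded during frames 15-25.
imputation = np.ones((T, N))
imputation[15:26, 7] = 0

# Ball inference: hide the entire ball trajectory.
ball_inference = np.ones((T, N))
ball_inference[:, BALL] = 0

print(int(forecast.sum()), int(imputation.sum()), int(ball_inference.sum()))
# 690 1139 1100 visible entries respectively
```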
Permutation Equivariance: Why Set Attention Matters
The mathematical property that makes this work for sports

This is perhaps the most important technical concept in TranSPORTmer. Standard sequence models (LSTMs, regular transformers) assume your data has a meaningful order. But players on a pitch don't have an order — there's no "first" player.

❌ The Ordering Problem

Suppose you always order players by jersey number. Your data looks like:

[Player_1, Player_7, Player_10, Player_11, ...]

But jersey numbers are arbitrary! If you reordered by position or alphabetically, you'd get the same game state represented differently — and a standard model would produce different outputs.

✓ Set Attention Solution

Set Attention treats inputs as an unordered set:

{Player_1, Player_7, Player_10} = {Player_10, Player_1, Player_7}

Any permutation of players produces the same output (up to the corresponding permutation of outputs).

Mathematical Definition: Equivariance

A function f is permutation equivariant if permuting the inputs permutes the outputs in the same way:

f(π(X)) = π(f(X)) for any permutation π

In plain English: If I swap players 3 and 7 in the input, the outputs for players 3 and 7 are swapped too — but the values of those outputs don't change.

🎯 Concrete Example: Why This Matters

Scenario: Messi is in position (35, 20), Pedri is in position (40, 25).

Input Order A:
[Messi, Pedri, ...]
Output: Messi → (37, 22), Pedri → (42, 28)
Input Order B:
[Pedri, Messi, ...]
Output: Pedri → (42, 28), Messi → (37, 22)

The predictions for each player are identical regardless of input order. Only the order of outputs changes to match the input order.

How Set Attention Achieves Equivariance

Self-attention is naturally equivariant because it's symmetric in how it treats inputs:

  1. Each agent computes Q, K, V using the same weight matrices
  2. Attention is computed between all pairs — no special treatment for "first" or "last"
  3. Each agent's output is a weighted sum of all values — symmetric operation

Key insight: no positional encoding is applied along the spatial (agent) dimension. Positional encoding is added only along the temporal dimension, where order does matter.
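The property f(π(X)) = π(f(X)) is easy to verify numerically: run softmax self-attention on a permuted set of agents and compare against permuting the original outputs. This is a self-contained check, not the paper's code.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Softmax self-attention with no spatial positional encoding."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(42)
N, d = 23, 8
X = rng.normal(size=(N, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]

perm = rng.permutation(N)          # an arbitrary reordering π of the agents
lhs = attention(X[perm], *W)       # f(π(X))
rhs = attention(X, *W)[perm]       # π(f(X))
print(np.allclose(lhs, rhs))       # True: permuting inputs just permutes outputs
```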

Unified Tasks via Input Masking
One model, many capabilities — controlled by what you hide

This is TranSPORTmer's elegant solution: all trajectory tasks are really "predict the masked parts" problems. Change what you mask, change the task.

🎯 Task 1: Player Forecasting

What's masked: All future frames

What's visible: Past and present frames

Goal: Predict where each player will be in 1-4 seconds

# 50 frames, predict last 20
frames 1-30: ✓ visible
frames 31-50: ✗ masked (predict these)
Example: Given player positions at t=0 to t=3s, predict positions at t=3.1s to t=5s.
🔧 Task 2: Player Imputation

What's masked: Random player positions (simulating occlusion)

What's visible: Other players + other timesteps of same player

Goal: Reconstruct the missing observations

# Player 7 occluded frames 15-25
Player 7 @ frames 1-14: ✓
Player 7 @ frames 15-25: ✗ masked
Player 7 @ frames 26-50: ✓
All other players: ✓
Example: Camera was blocked, player disappeared for 1 second. Use context to fill in their likely positions.
⚽ Task 3: Ball Inference

What's masked: Entire ball trajectory

What's visible: All player positions (both teams)

Goal: Infer where the ball is from player movements

# No ball tracking available
All 22 players: ✓ visible
Ball position: ✗ masked (infer this)
Example: Broadcast video with no ball tracking. Players are looking at position (30, 15), someone there is dribbling — ball is probably at (30, 15).
📊 Task 4: State Classification

What's masked: Nothing (full context available)

What's visible: All players + ball

Goal: Classify each moment as pass/possession/uncontrolled/out-of-play

# [CLS] token classifies each frame
Frame 15: "pass" (ball in flight)
Frame 20: "possession" (controlled)
Frame 45: "out_of_play" (ball crossed line)
Example: Ball traveling between two players — classify as "pass"; ball at player's feet — classify as "possession".
Why Multi-Task Training Helps

1. Regularization: Training on multiple objectives prevents overfitting to any single task.

2. Data efficiency: The encoder sees more varied examples, learning more robust representations.

3. Transfer: Solving imputation helps forecasting — both require understanding motion patterns.

4. Implicit knowledge: Classification teaches the model "what's happening" which helps predict "what will happen."

Experimental Results
Quantitative and qualitative findings from soccer and basketball

TranSPORTmer was evaluated against state-of-the-art task-specific models. The unified model outperforms specialists on most metrics.

Datasets Used

⚽ Soccer Dataset
  • Professional match tracking data
  • 22 players + ball at ~10-25 FPS
  • Used for all four tasks
  • Train/val/test split by matches
🏀 Basketball Dataset (NBA)
  • SportVU optical tracking
  • 10 players + ball at 25 FPS
  • Primarily forecasting evaluation
  • Smaller court = denser interactions

Key Results

Player Forecasting

Outperforms Social-STGCNN, Trajectron++, and other trajectory prediction baselines.

• Lower ADE (Average Displacement Error) at all horizons

• Lower FDE (Final Displacement Error) for endpoint accuracy

• Particularly strong at 2-4 second horizons

Joint Forecasting + Imputation

Handles real-world noisy data where some observations are missing.

• Most models can't do both simultaneously

• TranSPORTmer naturally handles partial observations

• Critical for production deployment with imperfect tracking

Ball Inference

Accurately locates ball from player movements alone.

• Mean error ~1-2 meters when ball is with a player

• Higher error during long passes (expected — ball in flight)

• Enables analysis of broadcast video without ball tracking

State Classification

CLS token accurately identifies game states.

• High accuracy for possession vs out-of-play

• Pass detection enables automatic event annotation

• State info improves trajectory predictions

Key Finding: Classification Helps Prediction

Adding the state classification task improved trajectory prediction accuracy. This wasn't obvious beforehand — you might expect adding tasks to hurt performance. But teaching the model to recognize "this is a pass being made" helps it predict "the receiving player will control the ball here."

Technical Implementation Details
Hyperparameters and training specifics from the paper
Model Architecture
Embedding dimension (d_model): 128-256
Number of attention heads: 4-8
Spatial encoder layers: 2-4
Temporal encoder layers: 2-4
Dropout: 0.1
Training Details
Optimizer: AdamW
Learning rate: 1e-4 to 1e-3
Batch size: 32-64 sequences
Sequence length: 50-100 frames
Training epochs: 100-200
Loss Function (Multi-Task)
L_total = λ_1 · L_forecast + λ_2 · L_impute + λ_3 · L_ball + λ_4 · L_classify

L_forecast: MSE between predicted and actual future positions

L_impute: MSE between reconstructed and actual masked positions

L_ball: MSE for inferred ball position

L_classify: Cross-entropy for state classification

λ weights are tuned to balance task contributions — typically around 1.0 for position tasks, 0.5-1.0 for classification.
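A minimal sketch of how the weighted sum might be computed, with masked MSE and cross-entropy written out explicitly; the tensors, masks, and λ values are placeholders, not the paper's configuration.

```python
import numpy as np

def masked_mse(pred, target, hidden):
    """MSE over hidden entries only (hidden: (T, N), 1 = was masked)."""
    diff = (pred - target) ** 2 * hidden[..., None]
    return diff.sum() / max(hidden.sum(), 1)

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits, computed stably."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
T, N = 50, 23
pred = rng.normal(size=(T, N, 2))
target = rng.normal(size=(T, N, 2))
hidden = np.zeros((T, N)); hidden[30:, :] = 1      # forecasting-style mask
logits = rng.normal(size=(T, 4))                   # per-frame CLS logits
labels = rng.integers(0, 4, size=T)                # 4 game states

lam_pos, lam_cls = 1.0, 0.5                        # illustrative λ weights
L_total = lam_pos * masked_mse(pred, target, hidden) \
        + lam_cls * cross_entropy(logits, labels)
print(L_total > 0)  # True
```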

Improvements, Limitations & Future Directions
Where TranSPORTmer could go next

TranSPORTmer establishes a strong foundation, but there are clear opportunities for improvement. Here's a detailed analysis of limitations and potential solutions:

Current Limitations

1. Point Predictions Only

TranSPORTmer outputs a single (x, y) prediction per agent per timestep. But player trajectories are inherently multi-modal — a striker might run behind OR come short.

Impact: The model averages between possibilities, producing paths that may not match any realistic trajectory. Evaluation metrics (ADE/FDE) hide this issue because they only measure distance to the actual path.
2. No Physics Constraints

The model can predict physically impossible movements — a player accelerating at 15 m/s² or sustaining 12 m/s for 5 seconds (faster than Usain Bolt).

Impact: Predictions may look reasonable on average but contain unrealistic segments that coaches/analysts would immediately reject.
3. No Explicit Event Modeling

The state classification (pass/possession) is discrete, but there's no explicit modeling of when events occur or who is involved.

Impact: Can't answer: "When will the pass happen?" or "Who will receive it?" — questions that are critical for tactical analysis.
4. Computational Cost

Self-attention over 23 agents × T timesteps = O(23² × T + 23 × T²) complexity. For long sequences or real-time inference, this becomes expensive.

Impact: May not achieve the <40ms inference needed for real-time (25 FPS) analysis on typical hardware.

Proposed Improvements

1. Probabilistic / Multi-Modal Outputs

Instead of predicting a single (x, y), predict a distribution over possible futures.

Option A: Gaussian Mixture Model (GMM)
Output: K × (μ_x, μ_y, σ_x, σ_y, ρ, π)

Predict K Gaussian components. Each represents a possible trajectory mode. π gives the probability of each mode.

Option B: Conditional VAE (CVAE)
z ~ N(μ, σ) → Decoder(z) → trajectory

Sample latent code z from learned distribution. Different z values give different plausible trajectories.

Example: For a striker with the ball, output 3 modes: (1) run behind defense (40% probability), (2) come short (35% probability), (3) drift wide (25% probability).
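To make Option A concrete, here is how a K = 3 mixture head's output could be interpreted, mirroring the striker example above; every number is invented for illustration.

```python
import numpy as np

K = 3
pi = np.array([0.40, 0.35, 0.25])        # mode probabilities (sum to 1)
mu = np.array([[45.0, 10.0],             # mode 1: run behind the defense
               [32.0, 18.0],             # mode 2: come short
               [38.0, 30.0]])            # mode 3: drift wide
sigma = np.full((K, 2), 1.5)             # per-mode std devs (assume ρ = 0)

# A single "best guess" is the mean of the most probable mode...
best = mu[np.argmax(pi)]
print(best)  # [45. 10.]

# ...but sampling preserves multi-modality instead of averaging modes:
rng = np.random.default_rng(7)
mode = rng.choice(K, p=pi)
sample = rng.normal(mu[mode], sigma[mode])
```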
2. Physics-Informed Loss Functions

Add soft constraints that penalize unrealistic movements without breaking differentiability.

L_physics = λ_speed · max(0, |v| - v_max)² + λ_acc · max(0, |a| - a_max)²

Speed penalty: Penalize velocities exceeding 9.5 m/s (elite sprint speed)

Acceleration penalty: Penalize accelerations exceeding 5 m/s² (human biomechanical limit)

Jerk penalty: Optionally penalize sudden acceleration changes for smoother trajectories

Benefit: Predictions remain in the space of human-achievable movements. Analysts can trust that "player X could actually do this."
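The penalty above can be sketched with velocities and accelerations recovered from predicted positions via finite differences. The limits follow the text (9.5 m/s, 5 m/s²); `dt` and the toy trajectories are illustrative.

```python
import numpy as np

def physics_loss(positions, dt=0.1, v_max=9.5, a_max=5.0,
                 lam_speed=1.0, lam_acc=1.0):
    """Soft penalty on superhuman speed/acceleration in a (T, 2) trajectory."""
    v = np.diff(positions, axis=0) / dt            # (T-1, 2) velocities
    a = np.diff(v, axis=0) / dt                    # (T-2, 2) accelerations
    speed = np.linalg.norm(v, axis=-1)
    acc = np.linalg.norm(a, axis=-1)
    return (lam_speed * np.maximum(0, speed - v_max) ** 2).sum() \
         + (lam_acc * np.maximum(0, acc - a_max) ** 2).sum()

# A jogging player (2 m/s) incurs no penalty; a "teleporting" one does.
t = np.arange(20)[:, None] * 0.1
ok = np.hstack([2.0 * t, np.zeros_like(t)])        # straight line at 2 m/s
bad = ok.copy(); bad[10:, 0] += 5.0                # instantaneous 5 m jump
print(physics_loss(ok) == 0.0, physics_loss(bad) > 0.0)  # True True
```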
3. Explicit Event Prediction Head

Add a dedicated output for predicting discrete events alongside trajectories.

Event Type Prediction

At each frame, predict: P(pass), P(shot), P(dribble), P(nothing)

Event Participant Prediction

If pass: which player sends, which receives? (pairwise attention scores)

Example output: "At frame 45, P(pass)=0.87. Most likely passer: Player 10. Most likely receiver: Player 7 (65%) or Player 11 (30%)."
4. Hierarchical Temporal Modeling

Process time at multiple scales: frame-level (0.1s), phase-level (1-2s), and possession-level (5-15s).

Level 1: 10 FPS individual frames (micro-movements)
Level 2: 1 second windows pooled (tactical adjustments)
Level 3: Possession segments (strategic patterns)

Benefit: Long-range dependencies (e.g., "this is a counter-attack that started 10 seconds ago") are captured without O(T²) attention over all frames.

5. Efficient Attention Variants

Replace full attention with more efficient variants for real-time inference.

Linear Attention

O(N) instead of O(N²) via kernel approximation

Sparse Attention

Only attend to nearby players + ball

Flash Attention

Memory-efficient exact attention

6. Cross-Sport Pre-Training

Build a "foundation model" for multi-agent sports by pre-training on multiple sports simultaneously.

Insight: Basic motion patterns (acceleration, direction changes, spacing) are shared across soccer, basketball, hockey, American football.

Approach: Pre-train on combined data with sport-specific tokens, then fine-tune on target sport.

Benefit: More data → better representations. Teams with limited data can leverage pre-trained model.

Analogy: Like how GPT pre-trains on all text then fine-tunes for specific tasks, a sports foundation model could pre-train on all tracking data then fine-tune for soccer-specific analysis.
7. Counterfactual Trajectory Generation

Beyond prediction: generate "what if" scenarios for tactical analysis.

Question: "What if the striker had run behind instead of coming short?"

Approach: Condition the model on hypothetical striker movement, predict how defenders would respond.

Application: Post-match analysis, training sessions, tactical planning.

8. Integration with Video Understanding

Combine trajectory modeling with visual features from broadcast video.

Current limitation: TranSPORTmer uses only positional data.

Opportunity: Visual features could provide: body orientation, gaze direction, ball possession confidence, player identification.

Architecture: Add a visual encoder (CNN/ViT) that produces per-player visual embeddings, concatenate with positional embeddings.

How Does TranSPORTmer Compare to Our SkillCorner STGNN?
Connecting research to our implementation

If you've read our SkillCorner STGNN Implementation, you might wonder how it relates to TranSPORTmer. Here's a comparison:

| Aspect | SkillCorner STGNN | TranSPORTmer |
| --- | --- | --- |
| Spatial Encoding | TransformerConv (GNN with edge features) | Set Attention Blocks (pure attention) |
| Temporal Encoding | Transformer Encoder (causal) | Temporal Transformer (causal or bidirectional) |
| Tasks | Trajectory forecasting only | Forecasting + Imputation + Ball + Classification |
| Edge Features | Yes (distance, team, relative velocity) | Implicit via attention (learned) |
| Physics Constraints | Yes (speed/velocity loss terms) | No (pure data-driven) |
| Focus | Single runner trajectory | All agents simultaneously |
Potential Hybrid Approach

The best of both worlds: use TranSPORTmer's multi-task unified architecture with the SkillCorner STGNN's explicit edge features (distance, same_team) and physics-informed loss. This would combine elegant unified learning with domain knowledge about football structure.

Resources & Further Reading
Foundational Papers

"Attention Is All You Need" (Vaswani et al., 2017) — The original Transformer

"Set Transformer" (Lee et al., 2019) — Set attention for unordered inputs

"Social-STGCNN" (Mohamed et al., 2020) — Trajectory prediction with social graphs

"Trajectron++" (Salzmann et al., 2020) — Probabilistic trajectory forecasting

Key Takeaways
Unified architecture handles forecasting, imputation, ball inference, and state classification
Set Attention Blocks ensure permutation equivariance — player order doesn't matter
Input masking controls which task the model performs with the same weights
CLS token classifies game states (pass, possession, uncontrolled, out-of-play)
Outperforms specialized task-specific models on soccer and basketball
Multi-task training provides regularization and enables knowledge transfer
Practical applications: works with missing data, infers ball from players
Future work: multi-modal outputs, physics constraints, efficient attention, foundation models