Academic Literature Review
Spatio-Temporal Graph Neural Networks in Sports Analytics

A Systematic Review

January 2026
Abstract

The rapid advancement of transformer-based architectures and graph neural networks has opened new frontiers in sports analytics, particularly for modeling multi-agent spatiotemporal dynamics in football (soccer). This review examines recent developments (2023–2025) in Spatio-Temporal Graph Neural Networks (STGNNs) and their application to trajectory prediction, event detection, and tactical analysis. We survey unified transformer architectures such as TranSPORTmer and SGTN, graph-based tactical networks including TGNet, and novel neural point process models for event sequence modeling. We identify emerging trends including multimodal fusion, open-source frameworks like OpenSTARLab, and interpretable low-dimensional tactical representations. This review synthesizes the latest research to provide researchers and practitioners with a comprehensive foundation for applying state-of-the-art deep learning to sports tracking data.

Keywords: Spatio-temporal learning, graph neural networks, transformers, trajectory prediction, sports analytics, football, multi-agent systems, event detection

1 Introduction

The intersection of deep learning and sports analytics has entered a new era characterized by unified architectures capable of handling multiple downstream tasks simultaneously. Where earlier work focused on adapting models from traffic prediction or pedestrian tracking, recent research has produced purpose-built architectures designed specifically for the unique challenges of multi-agent sports environments. The convergence of transformer attention mechanisms with graph-based representations has proven particularly powerful for capturing both spatial player interactions and temporal play evolution.

Football analytics presents an ideal testbed for these advances. Modern optical tracking systems capture 22 players and the ball at 25 Hz, generating rich spatiotemporal signals that encode tactical decisions, physical performance, and emergent team behaviors. The challenge lies not merely in prediction, but in building models that can understand the game: inferring player intent, recognizing tactical patterns, and generating plausible counterfactual scenarios for tactical optimization.

This review surveys recent literature (2023–2025) across four key themes: (1) unified transformer architectures for multi-task sports modeling, (2) graph neural networks for tactical and spatiotemporal analysis, (3) event sequence modeling using neural point processes, and (4) emerging frameworks and interpretable approaches. We focus on football (soccer) while noting applicable work from basketball and other team sports.

Scope of This Review

This review focuses on state-of-the-art methods published between 2023 and 2025, with emphasis on: (1) transformer-based and hybrid spatiotemporal architectures, (2) graph neural networks designed for sports contexts, (3) neural event modeling for football, and (4) open frameworks and interpretable approaches. We prioritize work with demonstrated applications to football tracking and event data.

2 Transformer-Based & Spatio-Temporal Deep Models

The transformer architecture, originally developed for natural language processing, has proven remarkably effective for spatiotemporal sports modeling. Its self-attention mechanism naturally captures long-range dependencies between agents and across time, while the set-based formulation handles the permutation-invariant nature of player positions.

2.1 TranSPORTmer (Capellera et al., 2024–2025)

TranSPORTmer introduces a holistic transformer architecture designed to handle multiple trajectory-related tasks within a single unified framework. Rather than training separate models for forecasting, imputation, inference, and classification, TranSPORTmer uses set attention blocks to capture both spatial relationships between agents and temporal dynamics of their movements simultaneously.

The architecture treats all players as a set at each timestep, applying self-attention to learn which agent interactions are most relevant. Cross-attention between timesteps enables the model to reason about temporal evolution. The key innovation is the unified task formulation: trajectory forecasting, missing data imputation, player role inference, and game-state classification are all framed as sequence-to-sequence problems with appropriate masking strategies.

Technical Architecture

The model consists of three main components: (1) a spatial encoder that processes player positions at each frame using set attention, producing permutation-invariant representations; (2) a temporal encoder that applies cross-attention across the sequence of spatial embeddings; and (3) task-specific decoder heads that produce outputs for each downstream task. Weight sharing across tasks provides regularization and enables transfer learning between related problems.
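As a concrete illustration of the spatial encoder's core operation, the sketch below implements one permutation-equivariant self-attention layer over the set of players in plain NumPy. The weight matrices, feature dimensions, and function names are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def set_attention(players, Wq, Wk, Wv):
    """One self-attention layer over the set of players at a single frame.

    players: (N, d) array of per-player features (position, velocity, ...).
    Because attention weights depend only on pairwise similarity, the output
    is permutation-equivariant: reordering players reorders the output rows
    identically, which is what makes set attention suitable for unordered agents.
    """
    Q, K, V = players @ Wq, players @ Wk, players @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) pairwise relevance
    return softmax(scores, axis=-1) @ V       # (N, d) updated embeddings

rng = np.random.default_rng(0)
x = rng.normal(size=(22, 4))                  # 22 players, 4 features each
Wq = Wk = Wv = rng.normal(size=(4, 4))
out = set_attention(x, Wq, Wk, Wv)

# Permutation equivariance: shuffling players shuffles outputs the same way
perm = rng.permutation(22)
out_perm = set_attention(x[perm], Wq, Wk, Wv)
assert np.allclose(out_perm, out[perm])
```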

Experimental Results

Evaluated on soccer (Metrica Sports) and basketball (NBA tracking) datasets, TranSPORTmer demonstrates consistent improvements over task-specific baselines. On trajectory forecasting, it achieves 12-18% reduction in ADE compared to Social-STGCNN variants. For imputation tasks (reconstructing occluded player positions), the unified model outperforms interpolation baselines by 25%+. The multi-task formulation proves especially valuable when labeled data is scarce, as the model leverages correlations between tasks.
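The ADE and FDE figures reported here follow the standard trajectory-evaluation definitions, which a minimal NumPy implementation makes precise:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance over all
    predicted timesteps and agents. pred, gt: (T, N, 2) in metres."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    """Final Displacement Error: mean distance at the last timestep only."""
    return float(np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean())

gt = np.zeros((50, 22, 2))            # 2 s at 25 Hz, 22 players
pred = gt + np.array([3.0, 4.0])      # constant 5 m offset everywhere
assert ade(pred, gt) == 5.0
assert fde(pred, gt) == 5.0
```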

Set attention · Multi-task unified · Soccer & basketball · ⭐ Key paper

2.2 Spatiotemporal Graph Transformer Network (Li & Yu, 2025)

SGTN (Spatiotemporal Graph Transformer Network), proposed by Zujian Li and Dan Yu in Knowledge-Based Systems (Elsevier, 2025), represents a hybrid approach combining graph convolutional features with transformer attention for real-time ball trajectory prediction. The architecture addresses a key limitation of pure transformers: while attention excels at global reasoning, it may miss fine-grained local spatial structure that GCNs capture naturally.

Hybrid Architecture Design

SGTN operates in two stages. First, a graph convolutional module processes each frame independently, constructing a player-ball interaction graph where edges encode spatial proximity and velocity alignment. The GCN layers aggregate local neighborhood information, producing spatially-aware node embeddings. Second, these embeddings are fed to a temporal transformer that applies self-attention across the time dimension, capturing how spatial configurations evolve during play development.
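The first stage's frame-level graph can be sketched as follows. This is an illustrative construction combining proximity with velocity alignment, not the published edge definition; the kernel width and weighting are assumptions:

```python
import numpy as np

def interaction_graph(pos, vel, radius=15.0):
    """Frame-level player-ball interaction graph (illustrative). Edge weight
    combines spatial proximity (Gaussian kernel on distance) with velocity
    alignment (cosine of the angle between agents' velocities, mapped to [0, 1]).

    pos, vel: (N, 2) arrays of positions and velocities; any row can be the ball.
    """
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (N, N) distances
    proximity = np.exp(-(dist / radius) ** 2)

    norms = np.linalg.norm(vel, axis=-1, keepdims=True) + 1e-8
    unit = vel / norms
    alignment = 0.5 * (1.0 + unit @ unit.T)              # cosine -> [0, 1]

    A = proximity * alignment
    np.fill_diagonal(A, 0.0)                             # no self-loops
    return A

pos = np.array([[0.0, 0.0], [5.0, 0.0], [50.0, 30.0]])
vel = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
A = interaction_graph(pos, vel)
assert A[0, 1] > A[0, 2]   # nearby, co-moving agents are strongly connected
```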

Ball Trajectory Focus

Unlike player-centric models, SGTN specifically targets ball trajectory prediction—a challenging task due to the ball's discontinuous motion (kicks, headers, bounces). The model learns to recognize pre-shot or pre-pass configurations from the spatial graph, then uses temporal attention to predict the ball's future path. Special handling is applied for aerial balls and set pieces where physics differs from ground play.

SoccerNet-v2 Benchmarks

Experimental results on SoccerNet-v2 demonstrate improved trajectory accuracy: 15% reduction in FDE compared to pure GCN baselines, and 8% improvement over vanilla transformers. The hybrid approach particularly excels at long-horizon prediction (3+ seconds), where the temporal transformer's ability to capture extended dependencies proves crucial.

GCN + Transformer · Ball trajectory · SoccerNet-v2 · ⭐ Key paper

"The key insight of recent transformer-based sports models is treating the multi-agent scene as a set rather than a sequence. This allows the model to learn which players are relevant to each other dynamically, rather than imposing fixed ordering or proximity-based adjacency."

2.3 Architectural Comparison

Model        | Year    | Spatial       | Temporal        | Key Innovation      | Best For
TranSPORTmer | 2024–25 | Set Attention | Cross-Attention | Multi-task unified  | General-purpose
SGTN         | 2025    | GCN           | Transformer     | Hybrid local+global | Ball tracking
3 Graph Neural Networks for Sports Analysis

Graph neural networks remain central to sports analytics due to their natural representation of player interactions. Recent work has focused on tactical graph construction, combining graph representations with other modalities, and improving event detection through structured game-state encoding.

3.1 Tactical Graph Networks - TGNet (Raabe et al., 2023)

Raabe, Nabben, and Memmert present TGNet (Tactical Graph Networks) in Applied Intelligence (2023), a hybrid architecture specifically designed for encoding player interactions in football with respect to tactical meaning. Unlike distance-based graph construction, TGNet constructs graphs that explicitly capture team dynamics using domain knowledge.

Tactical Graph Construction

The key innovation is the multi-relation graph structure. TGNet defines three distinct edge types: (1) teammate edges connecting players on the same team, weighted by passing lane openness; (2) marking edges connecting defenders to the attackers they're tracking; and (3) ball-proximity edges connecting players to the ball based on reception probability. This explicit encoding of tactical relationships enables the model to learn role-specific representations.

Hierarchical Aggregation

TGNet employs a hierarchical message-passing scheme: first aggregating within each relation type, then combining across relations. This prevents information from marking relationships from "drowning out" teammate coordination signals. The architecture also includes a team-level pooling operation that produces fixed-size embeddings representing each team's overall tactical state—useful for formation classification and pressing detection.
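The two-level scheme can be sketched in NumPy. The function and weight names are hypothetical and the aggregation is simplified to mean-pooling, but it shows the key idea: summarize within each relation first, then combine, so a dense relation cannot drown out a sparse one:

```python
import numpy as np

def hierarchical_message_passing(x, adjacency_by_relation, W_rel, W_out):
    """Two-level aggregation in the spirit of TGNet (illustrative names):
    first average messages within each relation type (teammate, marking,
    ball-proximity), then combine the per-relation summaries.

    x: (N, d) node features; adjacency_by_relation: dict of (N, N) arrays.
    """
    per_relation = []
    for name, A in adjacency_by_relation.items():
        deg = A.sum(axis=1, keepdims=True) + 1e-8
        msg = (A @ x) / deg                 # mean over neighbours, this relation
        per_relation.append(np.tanh(msg @ W_rel[name]))
    combined = np.concatenate(per_relation, axis=-1)      # across relations
    return np.tanh(combined @ W_out)

rng = np.random.default_rng(1)
N, d = 22, 8
x = rng.normal(size=(N, d))
rels = {r: (rng.random((N, N)) < 0.2).astype(float)
        for r in ("teammate", "marking", "ball")}
W_rel = {r: rng.normal(size=(d, d)) for r in rels}
W_out = rng.normal(size=(3 * d, d))
h = hierarchical_message_passing(x, rels, W_rel, W_out)
assert h.shape == (22, 8)
```

Team-level pooling as described above would then reduce `h` to a fixed-size embedding, e.g. by averaging over each team's eleven rows.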

Downstream Applications

Evaluated on Bundesliga tracking data, TGNet achieves 89% accuracy on formation classification (vs. 76% for distance-based GNNs), 82% on pressing trigger detection, and enables novel applications like tactical similarity search—finding historical match segments with similar spatial configurations. The interpretable edge types allow analysts to inspect why predictions were made.

Multi-relation graphs · Team dynamics · Formation analysis · ⭐ Key paper

3.2 Game State & Action Detection (Ochin et al., 2025)

Ochin et al. at ICPRAM 2025 introduce a novel approach combining structured game-state graphs with 3D convolutional neural networks for spatio-temporal action detection in soccer video streams. The model addresses the challenge of detecting fine-grained actions (passes, shots, fouls) from broadcast video where tracking data may not be available.

Dual-Stream Architecture

The architecture consists of two parallel streams: a visual stream using 3D CNNs (I3D backbone) to extract appearance and motion features from video clips, and a graph stream that processes player detections (from pose estimation) as a spatial graph. The graph encodes geometric relationships—who is near whom, relative orientations, distances to ball and goal—providing structured priors that complement the raw visual features.

Cross-Modal Fusion

Features from both streams are fused using a cross-attention mechanism: visual features attend to relevant graph nodes (e.g., the shooter for a goal event), while graph embeddings are conditioned on visual context (e.g., ball visibility). This bidirectional attention proves crucial for ambiguous events where either modality alone is insufficient.
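One direction of this fusion can be sketched as a clip-level visual query attending over the player-graph nodes. Shapes and names here are hypothetical, and the real model attends bidirectionally:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(visual, nodes, Wq, Wk, Wv):
    """Visual-to-graph cross-attention (illustrative): a clip-level visual
    feature queries the player-graph nodes, pooling the spatial context
    most relevant to the visual evidence.

    visual: (d,) clip embedding; nodes: (N, d) graph-node embeddings.
    """
    q = visual @ Wq                       # query from the visual stream
    K, V = nodes @ Wk, nodes @ Wv         # keys/values from graph nodes
    attn = softmax(K @ q / np.sqrt(K.shape[-1]))   # (N,) weights over players
    return attn @ V, attn                 # fused feature + who was attended

rng = np.random.default_rng(2)
d = 16
visual = rng.normal(size=d)
nodes = rng.normal(size=(22, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused, attn = cross_modal_fusion(visual, nodes, Wq, Wk, Wv)
assert fused.shape == (d,) and np.isclose(attn.sum(), 1.0)
```

The attention vector over players is what makes an event like a goal attributable: high weight on the shooter's node indicates which detection drove the prediction.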

Action Spotting Results

On SoccerNet action spotting benchmarks, the multimodal approach achieves 67.2 mAP, improving over video-only baselines (62.1 mAP) and graph-only approaches (58.4 mAP). The gains are largest for actions involving multiple players (tackles, aerial duels) where spatial configuration is highly informative.

GNN + 3D CNN · Action detection · Video + tracking

3.3 Multimodal Shot Prediction (Goka et al., 2024)

Goka et al. in Applied Sciences (MDPI, 2024) present a creative multimodal approach to shot prediction that incorporates audio signals alongside visual and spatial data. The insight is that crowd noise and commentary provide anticipatory cues that precede the actual shot event—the crowd reacts to dangerous situations before they fully develop.

Audio-Visual-Spatial Fusion

The model extracts three feature streams: (1) spatial features from player positions encoded as a graph, (2) visual features from video frames using a CNN backbone, and (3) audio features from the broadcast audio using a pre-trained audio encoder (VGGish). These are combined using a graph recurrent structure where audio and visual features modulate the spatial graph's edge weights.

Temporal Modeling

A GRU processes the fused features across time, learning to recognize the buildup patterns that precede shots. The attention weights over graph edges are interpretable, revealing which player relationships the model considers most important (e.g., the shooter-goalkeeper edge, nearby defender positions).

Multimodal fusion · Audio + visual + spatial · Shot prediction
4 Event & Sequence Modeling for Football

Beyond trajectory prediction, recent work addresses the modeling of discrete match events (passes, shots, fouls) as structured sequences. Neural point processes and transformer-based event models enable prediction not just of what will happen, but of when and where on the pitch.

4.1 Transformer-Based Neural Marked STPP (Yeung, Sit & Fujii, 2025)

Calvin Yeung, Tony Sit, and Keisuke Fujii introduce a transformer-based neural marked spatio-temporal point process (STPP) model in Applied Intelligence (2025). This architecture represents a paradigm shift from frame-by-frame trajectory modeling to event-centric match understanding.

Point Process Formulation

The model treats a football match as a sequence of marked events, where each event has: (1) a timestamp (when it occurred), (2) a spatial mark (pitch zone), and (3) a categorical mark (event type: pass, shot, tackle, etc.). The neural point process learns the conditional intensity function—the probability of each event type occurring at each location given the history of previous events.
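To make the likelihood concrete, the sketch below scores an event sequence under a heavily simplified factorized model: a categorical distribution over the next event type and an exponential distribution over the inter-event time. The paper's model parameterizes a full transformer-conditioned intensity; this is only the bookkeeping around it:

```python
import numpy as np

def sequence_nll(event_types, inter_times, type_probs, rates):
    """Negative log-likelihood of a marked temporal point process under a
    simple factorized model (illustrative, not the paper's exact form):
    P(next event) = Cat(type | history) * Exp(inter-arrival time | history).

    event_types: (T,) ints; inter_times: (T,) positive floats;
    type_probs:  (T, K) model's categorical distribution at each step;
    rates:       (T,) model's exponential rate at each step.
    """
    log_p_type = np.log(type_probs[np.arange(len(event_types)), event_types])
    log_p_time = np.log(rates) - rates * inter_times   # Exp(rate) log-density
    return float(-(log_p_type + log_p_time).sum())

# Two events: a pass (type 0) after 2.0 s, then a shot (type 1) after 0.5 s
types = np.array([0, 1])
dts = np.array([2.0, 0.5])
probs = np.array([[0.8, 0.2], [0.3, 0.7]])
rates = np.array([0.5, 2.0])
nll = sequence_nll(types, dts, probs, rates)
assert nll > 0
```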

Transformer Architecture

Events are embedded using learnable encodings for time (relative to possession start), space (pitch zones), and type. A transformer encoder processes the event sequence, with causal masking ensuring predictions only use past events. The output predicts the joint distribution over next event time, location, and type. The attention patterns reveal which historical events are most predictive—e.g., a turnover strongly conditions the probability of a subsequent counter-attack.

Holistic Possession Utilization Score (HPUS)

A key contribution is the HPUS metric, derived from the trained model. HPUS quantifies how effectively a team exploits possessions by comparing actual event sequences to model predictions. If a team consistently creates higher-probability dangerous situations than the model expects, their HPUS is high. The metric correlates with xG overperformance and provides a possession-level (rather than shot-level) measure of attacking quality.
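The comparison underlying HPUS can be illustrated schematically: score a possession by how much more likely the model rated the team's realized events than some baseline. This formula is an assumption for illustration, not the published definition:

```python
import numpy as np

def possession_score(realized_probs, baseline_probs):
    """HPUS-style possession score (schematic; the published metric is
    derived from the trained point process, not this exact formula).
    Compares the model probability of the events the team actually produced
    against a league-average baseline; a positive score means the team
    created better-than-expected situations.
    """
    return float(np.mean(np.log(realized_probs) - np.log(baseline_probs)))

# A possession whose events the model rated more likely than baseline
realized = np.array([0.30, 0.45, 0.20])
baseline = np.array([0.20, 0.30, 0.10])
assert possession_score(realized, baseline) > 0
```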

Experimental Validation

Evaluated on StatsBomb event data from multiple leagues, the model achieves 78% accuracy on next-event-type prediction (vs. 65% for LSTM baselines) and strong calibration on temporal predictions. HPUS shows 0.72 correlation with end-of-season league position, suggesting it captures meaningful tactical quality.

Point process · Event sequences · HPUS metric · ⭐ Key paper

4.2 Interpretable Tactical Modeling (Ide et al., 2025)

Ide et al. on arXiv (2025) address a critical gap in the literature: the need for models that coaches and analysts can understand and act upon. While deep learning achieves strong predictive performance, practitioners often cannot explain why a model made a particular prediction.

Low-Dimensional Tactical Representations

The approach learns a low-dimensional (8-16 dimensions) embedding space for tactical situations using a combination of autoencoders and domain-specific constraints. Each dimension corresponds to an interpretable tactical concept: compactness, width, pressing intensity, defensive line height, etc. These are derived from football analytics literature rather than learned purely from data.
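Features of this kind are straightforward to compute from tracking frames. The formulas below are common analytics conventions, not taken from the paper:

```python
import numpy as np

def tactical_features(team_pos, attack_direction=1.0):
    """Hand-crafted tactical dimensions of the kind the embedding is
    constrained to align with (standard conventions, assumed here).
    team_pos: (11, 2) player positions, x axis along attack_direction.
    """
    centroid = team_pos.mean(axis=0)
    compactness = float(np.linalg.norm(team_pos - centroid, axis=1).mean())
    width = float(team_pos[:, 1].max() - team_pos[:, 1].min())
    # Defensive line height: mean x of the two deepest outfield players
    xs = np.sort(attack_direction * team_pos[:, 0])
    line_height = float(xs[:2].mean())
    return {"compactness": compactness, "width": width,
            "line_height": line_height}

team = np.column_stack([np.linspace(10, 60, 11), np.linspace(-30, 30, 11)])
f = tactical_features(team)
assert f["width"] == 60.0
```

Computing such features per frame and plotting them over time yields exactly the tactical-trajectory analysis described in the next subsection.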

Temporal Dynamics

Tactical embeddings are computed at each frame, enabling analysis of how tactics evolve during a match. Trajectories through the tactical space reveal patterns: how a team transitions from low block to counter-attack, how pressing intensity varies with score differential, how fatigue affects defensive shape.

Practical Applications

The interpretable representations enable several practical use cases: (1) tactical fingerprinting—visualizing a team's preferred tactical states; (2) opponent analysis—identifying tactical tendencies to exploit; (3) in-match monitoring—detecting tactical drift that may indicate fatigue or substitution needs. While not achieving SOTA predictive accuracy, the approach provides complementary value through explainability.

Interpretable · Low-dimensional · Tactical analysis
5 Open Frameworks & Tooling

The maturation of sports analytics as a field is reflected in the emergence of open-source frameworks designed to standardize data handling, enable reproducible research, and lower barriers to entry for new researchers.

5.1 OpenSTARLab (Yeung et al., 2025)

OpenSTARLab, introduced by Yeung et al. on arXiv (2025), provides an open framework for Spatio-Temporal Agent Research in sports. The framework addresses a key barrier to progress: the fragmentation of data formats, preprocessing pipelines, and evaluation protocols across the research community.

Unified Data Interface

OpenSTARLab provides adapters for major data providers: StatsBomb (event data), Wyscout (event), SkillCorner (tracking), Metrica Sports (tracking), and Second Spectrum (tracking). All data is converted to a common internal representation with standardized coordinate systems, event taxonomies, and temporal alignment. This enables researchers to develop models once and evaluate across multiple datasets.
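An adapter of this kind amounts to a per-provider coordinate transform. The sketch below is not OpenSTARLab's actual API; provider conventions are simplified (StatsBomb uses a 120 x 80 grid with origin top-left, Metrica a unit square), and the common frame is assumed to be a 105 x 68 m pitch centred at the origin:

```python
def to_common_coords(x, y, provider):
    """Adapter sketch (hypothetical function, simplified conventions):
    map each provider's native coordinate frame onto a common pitch of
    105 x 68 metres with the origin at the pitch centre, y pointing up.
    """
    if provider == "statsbomb":        # 120 x 80 grid, origin top-left
        return (x / 120.0 - 0.5) * 105.0, (0.5 - y / 80.0) * 68.0
    if provider == "metrica":          # unit square, origin top-left
        return (x - 0.5) * 105.0, (0.5 - y) * 68.0
    raise ValueError(f"no adapter for provider {provider!r}")

# The pitch centre maps to (0, 0) regardless of source convention
assert to_common_coords(60.0, 40.0, "statsbomb") == (0.0, 0.0)
assert to_common_coords(0.5, 0.5, "metrica") == (0.0, 0.0)
```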

Task Modules

The framework includes reference implementations for common tasks: trajectory prediction (Social-STGCNN, Graph WaveNet variants), event prediction (LSTM, Transformer baselines), expected possession value, and pass probability. Each module includes standardized train/val/test splits, metrics computation, and visualization utilities.

Reinforcement Learning Environment

A unique feature is the RL environment for tactical optimization. The framework wraps tracking data as a Gym-compatible environment where an agent controls one team's movements, with learned dynamics models simulating opponent responses. This enables research on tactical AI and automated coaching assistance.

Open source · Standardized data · Multi-task · ⭐ Key paper

5.2 Framework Capabilities

Data Standardization

Unified formats for StatsBomb, Wyscout, SkillCorner, Metrica, and Second Spectrum.

Task Modules

Pre-built modules for trajectory prediction, event classification, value estimation.

Evaluation Protocols

Standardized metrics and train/test splits for reproducible benchmarks.

RL Integration

Gym-compatible environments for tactical RL with learned opponent models.

6 Current Challenges & Research Gaps

Despite rapid progress, several fundamental challenges remain in applying STGNNs and transformer models to sports analytics. These challenges span data availability, modeling limitations, deployment constraints, and the gap between research metrics and practical utility.

6.1 Data Accessibility & Commercial Barriers

High-quality tracking data remains the primary bottleneck for academic research. Commercial providers (Second Spectrum, SkillCorner, Tracab) charge substantial licensing fees that are prohibitive for most university research groups.

Public datasets are limited: Metrica Sports provides only 3 matches; StatsBomb open data lacks tracking coordinates; SoccerNet focuses on video rather than precise positional data.

Privacy concerns: Player-level biometric and positional data raises GDPR and contractual issues, limiting data sharing even within commercial partnerships.

Synthetic data: While simulation environments (Google Research Football, FIFA game engines) provide unlimited data, the distribution shift to real matches remains problematic.

6.2 Multimodal Integration Complexity

Real football understanding requires integrating multiple data streams: tracking positions, event annotations, video, audio, and contextual metadata. Current approaches handle at most two modalities effectively.

Temporal alignment: Different data sources operate at different frequencies (video at 30fps, tracking at 25Hz, events asynchronous) requiring careful synchronization.

Missing modalities: Not all matches have complete coverage—broadcast video may lack tracking data, or tracking may be unavailable for certain competitions.

Fusion architectures: Optimal strategies for combining modalities (early vs. late fusion, attention-based fusion) remain unclear and likely task-dependent.
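The temporal-alignment point above is the most mechanical of the three; a minimal strategy is to snap asynchronous event timestamps to the nearest tracking frame (production pipelines would additionally correct for clock offset and drift between sources):

```python
import numpy as np

def align_events_to_frames(event_times, frame_rate=25.0):
    """Snap asynchronous event timestamps (seconds) to the nearest
    tracking-frame index at the given frame rate. A minimal alignment
    sketch; it assumes both streams share a synchronized clock.
    """
    return np.rint(np.asarray(event_times) * frame_rate).astype(int)

# Events at 0.97 s and 2.03 s land on tracking frames 24 and 51 at 25 Hz
assert align_events_to_frames([0.97, 2.03]).tolist() == [24, 51]
```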

6.3 Cross-Context Generalization Failure

Models trained on one league or competition frequently fail when applied to others. This limits the practical utility of research models and raises questions about what is being learned.

Playing style variation: Premier League's physical directness differs from La Liga's possession focus; models overfit to league-specific patterns.

Tactical evolution: Football tactics evolve over seasons; models trained on 2020 data may not capture 2025 trends.

Transfer learning gaps: Unlike NLP where pre-trained language models transfer well, there's no established pre-training paradigm for sports spatiotemporal data.

Level differences: Models trained on elite leagues (where data is available) may not apply to lower leagues, youth football, or women's football where behaviors differ.

6.4 Real-Time Deployment Constraints

Practical value of trajectory prediction and action anticipation depends on real-time inference. Current SOTA models struggle to meet latency requirements for live applications.

Latency requirements: Live match analysis requires inference in <40ms (25fps). Transformer attention over long sequences can take 100-500ms on typical hardware.

Edge deployment: Stadium-side inference (required for minimal latency) limits available compute; cloud inference adds network delay.

Accuracy-latency tradeoff: Distillation, quantization, and architectural simplification reduce latency but degrade performance. The Pareto frontier is poorly understood.

Streaming inference: Most research assumes batch processing; adapting to streaming (causal) inference introduces additional constraints.

6.5 Interpretability & Practitioner Trust

The gap between model outputs and actionable coaching insights remains a fundamental barrier to adoption. Practitioners need to understand why a model made a prediction to trust and act on it.

Attention ≠ explanation: While attention weights are often visualized as explanations, research shows they may not reflect true feature importance and can be misleading.

Tactical vocabulary mismatch: Models operate on continuous embeddings; coaches think in discrete concepts ("high press," "overlap"). Bridging this gap requires domain-specific interpretability methods.

Counterfactual explanation: Practitioners often want to know "what should have happened differently"—a causal question that correlational models cannot directly answer.

Trust calibration: Models should express uncertainty; overconfident predictions in novel situations erode trust.

6.6 Evaluation Standards & Reproducibility

The lack of standardized benchmarks makes comparing methods extremely difficult. Different papers use different datasets, preprocessing pipelines, train/test splits, and metrics.

Dataset fragmentation: Results on Metrica can't be compared to results on proprietary Bundesliga data. Even papers using the "same" dataset may preprocess differently.

Metric inconsistency: Some papers report ADE, others FDE, others NLL. Prediction horizons vary (1s, 2s, 4s). Aggregation methods differ.

Code availability: Despite open science norms, many papers do not release code or trained models, preventing reproduction.

Baseline selection: Papers often compare to weak baselines or older methods rather than true SOTA, inflating reported improvements.

6.7 Physical Plausibility of Predictions

Neural network predictions can violate basic physical constraints, producing trajectories that are impossible for human athletes.

Speed limits: Models may predict players accelerating beyond human biomechanical limits (> 10 m/s²) or sustaining sprint speeds indefinitely.

Collision handling: Standard regression objectives don't penalize predicted collisions; players can "pass through" each other.

Ball physics: Ball trajectory prediction must respect aerodynamics, bounce mechanics, and the constraint that the ball can only change direction via player contact.

Differentiable physics: Incorporating physics constraints while maintaining end-to-end differentiability remains technically challenging.
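The simplest of these remedies is a post-hoc projection onto a feasible set. The sketch below caps per-step displacement at a speed limit; real constraints would also bound acceleration and handle collisions, and the limit value is an assumption:

```python
import numpy as np

def enforce_speed_limit(traj, dt=0.04, v_max=9.0):
    """Post-hoc projection of a predicted trajectory onto a simplistic
    feasible set: cap each per-step displacement at v_max * dt.
    traj: (T, 2) predicted positions sampled at interval dt seconds.
    """
    out = traj.copy()
    max_step = v_max * dt
    for t in range(1, len(out)):
        step = out[t] - out[t - 1]
        norm = np.linalg.norm(step)
        if norm > max_step:
            # Pull the point back along the step direction to the limit
            out[t] = out[t - 1] + step * (max_step / norm)
    return out

# A "teleporting" prediction (5 m in one 40 ms frame) becomes feasible
traj = np.array([[0.0, 0.0], [5.0, 0.0]])
fixed = enforce_speed_limit(traj)
assert np.linalg.norm(fixed[1] - fixed[0]) <= 9.0 * 0.04 + 1e-9
```

Such clipping is not differentiable at the boundary, which is precisely why soft penalty terms or constraint-respecting architectures are active research topics.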

6.8 Multi-Modal Future Prediction

Player trajectories are inherently multi-modal—a striker might run behind the defense OR check back to feet. Standard regression produces averaged predictions that don't capture this distribution.

Mode collapse: VAE and GAN-based approaches often collapse to predicting only the most likely mode, missing plausible alternatives.

Evaluation challenges: Standard metrics (ADE/FDE) penalize diverse predictions; best-of-K metrics incentivize hedging.

Conditional generation: The distribution of futures should be conditioned on unobserved intent (will the team attack left or right?); this latent structure is hard to learn.

7 Future Research Directions

The challenges identified above point toward several promising research directions. We organize these into near-term opportunities (likely achievable within 1-2 years given current trends) and longer-term research agendas requiring more fundamental advances.

7.1 Near-Term Opportunities

Foundation Models for Sports Trajectories

Pre-training large models on diverse sports data—potentially combining football, basketball, hockey, and other team sports—could learn transferable representations of multi-agent dynamics. This mirrors the success of language model pre-training in NLP.

Self-supervised objectives: Masked trajectory prediction (predicting occluded player positions), contrastive learning between similar game states, and next-frame prediction provide supervision without labels.

Cross-sport transfer: Basic motion patterns (acceleration, deceleration, direction changes) are shared across sports; tactical patterns may transfer between sports with similar structures.

Data scaling: Combining proprietary data from multiple sources (even without sharing raw data) through federated learning could enable foundation model training.

Diffusion Models for Multi-Modal Trajectory Generation

Diffusion models have shown remarkable success in generating diverse, high-quality samples in vision and audio. Applying these to trajectory prediction could address the multi-modality problem without mode collapse.

Conditional generation: Diffusion models naturally handle conditioning on past trajectories, game state, and even tactical intent (if specified).

Controllable generation: Classifier-free guidance can steer generation toward specific outcomes ("show me trajectories where team A scores").

Early work: MotionDiffuse and similar work in human motion synthesis provide architectural templates; adaptation to multi-agent sports is a natural next step.

Efficient Architectures for Real-Time Inference

Closing the gap between research models and production deployment requires architectural innovations specifically targeting latency.

Linear attention: Efficient attention variants (Performer, Linear Transformer) reduce complexity from O(T²) to O(T), enabling longer context windows.

State-space models: Mamba and S4 architectures offer RNN-like efficiency with transformer-like modeling capacity; application to sports is unexplored.

Early-exit strategies: For easy predictions (player continuing current trajectory), shallow network exits could reduce average latency.
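The linear-attention idea above can be shown in a few lines: with a positive feature map phi, softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V), and reordering the matrix products drops the cost from O(T²d) to O(Td²). The elu+1 feature map below follows the Linear Transformer convention; this is a sketch, not any specific library's implementation:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized linear attention with the elu(x)+1 feature map.
    Computing kv = phi(K)^T V once and reusing it for every query avoids
    ever materializing the (T, T) attention matrix. Q, K, V: (T, d).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                          # (d, d): shared across all queries
    z = Qp @ Kp.sum(axis=0)                # (T,) positive normalizer per query
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(3)
T, d = 500, 16
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = linear_attention(Q, K, V)
assert out.shape == (T, d) and np.isfinite(out).all()
```

For causal (streaming) inference, `kv` and the normalizer can be updated incrementally per timestep, which is what makes these variants attractive for the real-time constraints discussed in Section 6.4.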

7.2 Medium-Term Research Agendas

Causal & Counterfactual Reasoning

Moving beyond prediction to counterfactual analysis: "What would have happened if the defender had positioned differently?" This enables credit assignment, tactical optimization, and coaching insights.

Causal graph structure: Defining the causal relationships between player decisions, positions, and outcomes is a prerequisite for counterfactual reasoning.

Intervention modeling: Training models to predict outcomes under interventions (player A moves to position X) rather than just observations.

Credit assignment: Decomposing team outcomes (goals, xG) into individual player contributions using Shapley values or similar methods.

Tactical optimization: Using counterfactual models to search for improved positioning, identifying what changes would maximize expected value.
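The Shapley-value route to credit assignment is exact for small coalitions. The sketch below enumerates all subsets; the toy value function (a move worth 0.3 xG only if both passer and shooter participate) is a made-up example, and in practice value_fn would be a learned model:

```python
import itertools
import math

def shapley_values(players, value_fn):
    """Exact Shapley decomposition by enumerating coalitions (feasible for
    a handful of players). value_fn maps a set of players to the value the
    coalition produces, e.g. a model's xG estimate for a move.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for S in itertools.combinations(others, r):
                # Weight of this coalition in the Shapley average
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                phi[p] += w * (value_fn(set(S) | {p}) - value_fn(set(S)))
    return phi

# Toy value function: the move is worth 0.3 xG only if both the passer
# and the shooter participate (hypothetical example)
v = lambda S: 0.3 if {"passer", "shooter"} <= S else 0.0
phi = shapley_values(["passer", "shooter", "decoy"], v)
assert abs(phi["passer"] - 0.15) < 1e-9 and abs(phi["decoy"]) < 1e-9
```

The contributions sum exactly to the team outcome (0.15 + 0.15 + 0 = 0.3), which is the efficiency property that makes Shapley values attractive for decomposing goals or xG.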

Physics-Informed Neural Networks

Incorporating physical constraints directly into learning ensures predictions respect biomechanical and physical limits.

Soft constraints: Adding physics-based loss terms (penalizing impossible accelerations, collisions) during training.

Hard constraints: Architectural modifications that guarantee outputs satisfy constraints (e.g., projecting predictions onto feasible manifold).

Learned simulators: Training neural ODEs or physics engines that respect conservation laws while remaining differentiable.

Reinforcement Learning for Tactical AI

Using RL agents to discover optimal tactics, with STGNN-based world models providing realistic simulation of opponent and teammate responses.

World models: Training generative models of game dynamics that can simulate realistic responses to tactical changes; these serve as environments for RL training.

Multi-agent RL: Modeling both teams as learning agents leads to emergent tactical behaviors; curriculum learning can progressively increase opponent sophistication.

Human-in-the-loop: Combining RL suggestions with coach preferences; learning reward functions from coach feedback.

OpenSTARLab integration: The framework's RL environment provides infrastructure for this research direction.

7.3 Long-Term Vision

Explainable Graph Attention with Domain Knowledge

Developing attention mechanisms that produce explanations aligned with coaching intuition requires incorporating domain knowledge into the attention structure.

Concept bottlenecks: Forcing intermediate representations to align with human-interpretable concepts (pressing intensity, compactness) before making predictions.

Prototype learning: Learning a library of tactical prototypes so that each prediction can be expressed as a combination of recognizable patterns.

Natural language grounding: Generating textual explanations of model predictions using LLM integration.

Unified Sports Understanding Models

The ultimate goal: models that "understand" sports at a human-like level, capable of answering arbitrary questions about tactics, predicting outcomes, and generating insights.

Multimodal integration: Jointly modeling video, tracking, events, audio, and text (commentary, match reports).

Question answering: Enabling natural language queries: "Why did team A concede?" "What should the left-back have done differently?"

Generative coaching: Producing tactical recommendations, training session designs, and match preparation briefings.

"The next frontier in sports analytics is not just predicting what will happen, but understanding why it happens and what should be done about it. This requires moving from correlational to causal models, and from prediction to prescription."

References

Formatted in APA 7th edition. All references verified as of January 2026.

Capellera, G., et al. (2024–2025). TranSPORTmer: A holistic transformer for multi-agent trajectories in sports. arXiv preprint. [Unified transformer for trajectory forecasting, imputation, and classification]

Goka, S., et al. (2024). Multimodal shot prediction with spatial-temporal graphs in soccer videos. MDPI Applied Sciences. [Audio-visual-spatial fusion for shot prediction]

Ide, T., et al. (2025). Interpretable low-dimensional modeling of spatiotemporal football tactics. arXiv preprint. [Interpretable tactical representations]

Li, Z., & Yu, D. (2025). Spatiotemporal graph transformer network for real-time ball trajectory prediction. Elsevier Knowledge-Based Systems. [Hybrid GCN-Transformer for ball tracking]

Ochin, S., et al. (2025). Game state and spatio-temporal action detection in soccer with graph neural networks and 3D CNNs. Proceedings of ICPRAM 2025. [GNN + 3D CNN for action detection]

Raabe, D., Nabben, R., & Memmert, D. (2023). Graph representations for the analysis of multi-agent spatiotemporal sports data. Applied Intelligence, 53, 15783–15799. [TGNet: Tactical Graph Networks]

Yeung, C., Sit, T., & Fujii, K. (2025). Transformer-based neural marked spatio-temporal point process model for football event prediction. Applied Intelligence. [Neural point process with HPUS metric]

Yeung, C., et al. (2025). OpenSTARLab: An open framework for spatio-temporal agent research in sports. arXiv preprint. [Open framework for sports ST research]

Part of the STGNN Methodology series