In team sports, continuous player motion and discrete events (who has the ball, who passed to whom) are not independent: they are two views of the same underlying tactical decision. Almost every existing trajectory model treats them in isolation: trajectories first, events bolted on as a post-hoc heuristic (or vice versa).
That separation is exactly what produces the unrealistic samples you see in prior work: passes that don't reach a teammate, possessors who are nowhere near the ball, and ball trajectories that ignore who is supposedly on it.
JointDiff's premise: denoise both modalities at once from a single network, so the continuous head and the discrete head are forced to agree at every step of the reverse process.
The paper diagnoses three blind spots in the current trajectory-generation literature:
Continuous diffusion models (U2Diff, MoFlow, LED) and discrete event models live in different codebases. The result: nice trajectories with implausible possession dynamics, or vice versa.
min ADE/FDE, inherited from pedestrian forecasting, only checks whether one sample passes near the ground truth. It rewards modes that ignore scene-level coherence between teammates.
Existing controllable diffusion (Trace & Pace, MotionDiffuser) targets individual-agent waypoints. Nobody had a clean way to say "Player 1 starts with the ball and passes to Player 3" for a 23-agent scene.
A joint continuous-discrete diffusion formulation: Gaussian DDPM on trajectories + multinomial diffusion on possessor events, with one shared backbone, two heads, and a controllability module (CrossGuid) that supports both weak possessor guidance (WPG) and natural-language text guidance.
A scene is a tuple X = (Y, E), where Y ∈ ℝ^(T×N×2) holds the 2D trajectories of the ball + 22 players, and E ∈ {0,1}^(T×N) is a one-hot row at every timestep saying who currently possesses the ball (or the ball itself if nobody does).
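As a minimal sketch of this representation, the snippet below builds a toy scene with plain Python lists. The convention that agent index 0 is the ball, the helper name `one_hot_possession`, and the placeholder coordinates are assumptions made for illustration, not details from the paper.

```python
def one_hot_possession(possessors, n_agents):
    """Build E in {0,1}^(T x N) from one possessor index per timestep."""
    rows = []
    for p in possessors:
        if not 0 <= p < n_agents:
            raise ValueError(f"possessor index {p} out of range")
        row = [0] * n_agents
        row[p] = 1
        rows.append(row)
    return rows

BALL = 0      # assumed convention: agent 0 is the ball
T, N = 4, 23  # 4 timesteps, ball + 22 players

# Y[t][n] is the (x, y) position of agent n at time t (placeholder values).
Y = [[(0.0, 0.0) for _ in range(N)] for _ in range(T)]

# Nobody has the ball at t=0, player 7 holds it for two steps, then player 3.
E = one_hot_possession([BALL, 7, 7, 3], N)

X = (Y, E)  # the joint scene
```

Using the ball's own index for "nobody in possession" keeps every row of E a valid one-hot, so the discrete channel never needs a separate "no event" class.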
Trajectories are corrupted with Gaussian noise toward N(0, I), then denoised by a regression head ε_θ. Standard DDPM + DDIM sampling for speed.
- S = 50 continuous diffusion steps
- Quadratic β-schedule, β₁ = 1e-4 → β_S = 0.5
- DDIM with skip ζ = 5 → 11 effective inference steps
- Simplified MSE objective on predicted noise
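The settings above can be sketched in a few lines. Two things here are assumptions rather than facts from the text: the "quadratic" schedule is taken to be the common form that is linear in √β and then squared, and the DDIM sub-sequence (every fifth step plus the final step, 0-indexed) is just one indexing that produces the 11 steps quoted above.

```python
import math
import random

S = 50         # continuous diffusion steps (from the text)
BETA_1 = 1e-4  # first beta (from the text)
BETA_S = 0.5   # last beta (from the text)
SKIP = 5       # DDIM skip (from the text)

def quadratic_betas(beta_1, beta_s, steps):
    """Assumed 'quad' schedule: interpolate sqrt(beta) linearly, then square."""
    lo, hi = math.sqrt(beta_1), math.sqrt(beta_s)
    return [(lo + (hi - lo) * i / (steps - 1)) ** 2 for i in range(steps)]

def cumulative_alphas(betas):
    """alpha_bar_s = prod_{i<=s} (1 - beta_i)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_noise(y0, alpha_bar, rng):
    """Sample y_s ~ q(y_s | y_0) and return it with the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in y0]
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * e for v, e in zip(y0, eps)], eps

def ddim_steps(steps, skip):
    """Assumed sub-sequence: every skip-th step plus the last one, descending."""
    return sorted(set(range(0, steps, skip)) | {steps - 1}, reverse=True)

betas = quadratic_betas(BETA_1, BETA_S, S)
alpha_bars = cumulative_alphas(betas)
schedule = ddim_steps(S, SKIP)
noisy, eps = forward_noise([1.0, -2.0], alpha_bars[10], random.Random(0))
```

The regression head would be trained to recover `eps` from `noisy`, which is the simplified MSE objective in the last bullet.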
Possession one-hots are corrupted toward a uniform categorical and denoised by a classification head that outputs Ê₀. The formulation is multinomial (Hoogeboom et al., 2021), not absorbing-state; see the consistency results below.
- S_d = 10 discrete steps, aligned with continuous via s_d = ⌈s · S_d / S⌉
- Variational KL loss L^E_vb
- Stochastic categorical sampler at inference
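A small sketch of the discrete side follows. It uses the closed-form marginal of multinomial diffusion, q(e_s | e_0) = Cat(ᾱ_s · e_0 + (1 − ᾱ_s) / N), from Hoogeboom et al. (2021). The step alignment is assumed to round up, so that continuous steps 1..50 map onto discrete steps 1..10; the function names and the value of `alpha_bar` are illustrative.

```python
import random

def align_step(s, S=50, S_d=10):
    """Map continuous step s in 1..S to a discrete step in 1..S_d (ceiling assumed)."""
    return -(-s * S_d // S)  # integer ceiling of s * S_d / S

def multinomial_marginal(e0, alpha_bar):
    """q(e_s | e_0): keep the true class with weight alpha_bar, else go uniform."""
    n = len(e0)
    return [alpha_bar * v + (1.0 - alpha_bar) / n for v in e0]

def sample_categorical(probs, rng):
    """Draw one class and return it as a one-hot list (stochastic sampler)."""
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == idx else 0 for i in range(len(probs))]

e0 = [0, 0, 1, 0]                                # agent 2 has the ball
probs = multinomial_marginal(e0, alpha_bar=0.6)  # 0.7 on agent 2, 0.1 elsewhere
e_s = sample_categorical(probs, random.Random(0))
fully_noised = multinomial_marginal(e0, alpha_bar=0.0)  # uniform over agents
```

Because every class keeps non-zero probability at every step, a possessor sampled at one step can still be replaced at the next, which is the property the multinomial formulation is chosen for.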
Because the forward processes are independent across modalities, the variational bound factorises: L_joint = L^Y_simple + λ · L^E_vb. The authors find λ = 0.1 is the sweet spot: high enough that events influence the representation, low enough that trajectory quality doesn't degrade.
Critically, the reverse network is conditioned on the full noisy state X_s = (Y_s, E_s), so each head sees what the other head is currently producing. This is what couples trajectories and events at every denoising step.
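The factorised objective can be illustrated with toy numbers. In this sketch the trajectory term is a plain MSE on predicted noise and the event term is a KL divergence between two categorical distributions standing in for the true and predicted posteriors over the previous discrete state. The inputs are made up, and the per-term reductions (means, sums) are assumptions.

```python
import math

LAMBDA = 0.1  # event-loss weight reported in the text

def mse(pred, target):
    """Simplified DDPM objective: mean squared error on the predicted noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def categorical_kl(q, p, tiny=1e-12):
    """KL(q || p) between two categorical distributions over agents."""
    return sum(
        qi * math.log((qi + tiny) / (pi + tiny))
        for qi, pi in zip(q, p)
        if qi > 0
    )

def joint_loss(eps_pred, eps_true, post_true, post_pred, lam=LAMBDA):
    """L_joint = L^Y_simple + lambda * L^E_vb, returned with both terms."""
    l_y = mse(eps_pred, eps_true)
    l_e = categorical_kl(post_true, post_pred)
    return l_y + lam * l_e, l_y, l_e

total, l_y, l_e = joint_loss(
    eps_pred=[0.1, -0.2], eps_true=[0.0, 0.0],
    post_true=[0.7, 0.2, 0.1], post_pred=[0.5, 0.3, 0.2],
)
perfect, _, _ = joint_loss([0.3], [0.3], [0.6, 0.4], [0.6, 0.4])
```

In a real training loop both terms would be computed from the two heads of the same forward pass on X_s, so a single backward pass updates the shared backbone with gradients from both modalities.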
JointDiff inherits the Social-Temporal Block from the same authors' U2Diff (CVPR 2025): a per-agent Temporal Mamba that captures individual dynamics, followed by a Social Transformer that mixes information across agents at each timestep. Two such blocks make up the denoiser.
Inputs: noisy X_s, observed context X_co, binary mask M, denoising step s, and (optionally) guidance G, all concatenated into a [T, N, 7] tensor.
2 × Residual Denoising Block, each = projection + Social-Temporal Block. Hidden size 256, 8 attention heads, 1024-dim FFN. CrossGuid sits between the Mamba and the Social Transformer.
Regression head: predicted Gaussian noise ε_θ for trajectories.
Classification head: softmax over agents, giving original-event probabilities Ê₀.
CrossGuid is a multi-head-attention block that injects an external guidance signal G into the intermediate scene representation H ∈ ℝ^(T×N×256). It comes in two flavours, trained with classifier-free guidance (25% conditioning dropout) so the same checkpoint can run both controllable and unconditional generation.
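The conditioning dropout is simple to sketch: during training the guidance is replaced by an empty signal with probability 0.25, so the network also learns the unconditional case. The helper name and the use of `None` for "no guidance" are assumptions for illustration.

```python
import random

P_DROP = 0.25  # conditioning-dropout rate reported in the text

def maybe_drop_guidance(guidance, rng, p_drop=P_DROP):
    """Return None (train unconditionally) with probability p_drop."""
    return None if rng.random() < p_drop else guidance

rng = random.Random(0)
batch = [maybe_drop_guidance([1, 3, 5], rng) for _ in range(10_000)]
drop_rate = sum(g is None for g in batch) / len(batch)
```

At inference the same checkpoint can then be called either with a guidance signal or with the empty one.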
For weak possessor guidance (WPG), the user supplies an ordered list of intended possessors, e.g. [1, 3, 5]. No timing, no positions, just "these players touch the ball, in this order".
- K, V = learnable agent embeddings of the possessor list
- Q = the ball's row of H
- Update is applied only to the ball channel
- Players still get an additive agent embedding for social reasoning
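The attention pattern in the bullets above can be sketched for a single timestep. This is a deliberately reduced version: one head instead of several, no learned Q/K/V projections, random vectors standing in for the learnable agent embeddings, a residual addition as the update rule, and no encoding of list order. All of those are simplifying assumptions; only the roles of Q, K, V and the ball-only update come from the text.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def wpg_update(H_t, possessors, agent_emb, ball=0):
    """Attend from the ball row of H_t (N x d) over the possessor embeddings."""
    d = len(H_t[ball])
    keys = [agent_emb[p] for p in possessors]  # K and V share the embeddings
    query = H_t[ball]                          # Q is the ball's row of H
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    context = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
    out = [row[:] for row in H_t]              # player rows are copied untouched
    out[ball] = [h + c for h, c in zip(query, context)]
    return out, weights

rng = random.Random(0)
N, d = 6, 8  # toy sizes; the text uses a 256-dim hidden state
agent_emb = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
H_t = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
H_new, attn = wpg_update(H_t, [1, 3, 5], agent_emb)
```

Restricting the update to the ball row means the guidance reaches the players only indirectly, through the Social Transformer that follows.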
Text guidance takes free-form prompts like "Player 1 starts with the ball and passes to Player 3", encoded by a frozen T5-Base (768-dim).
- K, V = projected T5 token embeddings
- Q = each agent's row + agent embedding
- MHA runs per agent against the shared text context
- Update is applied to all agents
WPG only constrains the set and order of possessors, not their timing or pitch coordinates. That's exactly the granularity an analyst wants: "simulate a build-up through 6→8→10" without dictating exactly when each touch happens. It's a much more forgiving control surface than waypoint-based guidance.
NBA: 32.5k train / 12k test scenes. T = 30 timesteps @ 5 fps. N = 11 (ball + 10 players). Splits from Mao et al. (LED).
NFL: 10,762 train / 2,624 test. T = 50 @ 10 fps. N = 23. Splits from Sports-Traj. Text comes from public play metadata.
Bundesliga: 2,093 train (× 180° aug.) / 524 test from 7 matches. T = 40 @ 6.25 fps. N = 23. Text generated via a Stage-1/2/3 LLM-refinement pipeline.
Possession events are extracted from raw tracking with a single, data-driven 1.5 m threshold, chosen by minimising the average change in ball direction while the ball is unpossessed. The same threshold works across all three sports, which is itself a small but elegant result.
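A minimal sketch of threshold-based possession labelling is shown below. The 1.5 m value comes from the text; the nearest-player rule, the use of index 0 for the ball, and the function name are assumptions, since the exact assignment rule is not spelled out here.

```python
import math

THRESHOLD_M = 1.5  # distance threshold from the text
BALL = 0           # assumed convention: agent 0 is the ball

def extract_possessors(frames, threshold=THRESHOLD_M):
    """frames: T frames, each a list of N (x, y) positions in metres."""
    labels = []
    for frame in frames:
        bx, by = frame[BALL]
        best, best_dist = BALL, float("inf")
        for idx in range(1, len(frame)):
            px, py = frame[idx]
            dist = math.hypot(px - bx, py - by)
            if dist < best_dist:
                best, best_dist = idx, dist
        # The closest player owns the ball only if within the threshold.
        labels.append(best if best_dist <= threshold else BALL)
    return labels

frames = [
    [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0)],  # player 1 is 0.5 m from the ball
    [(5.0, 0.0), (0.5, 0.0), (10.0, 0.0)],  # ball in flight, nobody within 1.5 m
    [(9.0, 0.0), (0.5, 0.0), (10.0, 0.0)],  # player 2 is 1.0 m from the ball
]
possessors = extract_possessors(frames)
```

On this toy pass the labels read player 1, ball, player 2, which is the per-timestep sequence the discrete channel E encodes.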
All evaluations report scene-level SADE / SFDE (Casas et al., 2020), i.e. the average and final displacement errors averaged across all agents in the scene, computed as min/avg over K = 20 generated samples, plus possession-event accuracy.
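The difference from per-agent minADE is easiest to see in code. In the sketch below each sample gets a single error for the whole scene, and the minimum is taken over whole samples, so a model cannot pick the best sample separately for each agent. The data layout and function names are assumptions, and the toy example uses K = 2 rather than 20.

```python
import math

def scene_errors(pred, gt):
    """Return (SADE, SFDE) for one sample; pred and gt are [T][N] lists of (x, y)."""
    T, N = len(gt), len(gt[0])
    dist = [
        [math.hypot(pred[t][n][0] - gt[t][n][0], pred[t][n][1] - gt[t][n][1])
         for n in range(N)]
        for t in range(T)
    ]
    sade = sum(sum(row) for row in dist) / (T * N)  # mean over time and agents
    sfde = sum(dist[-1]) / N                        # mean over agents, last step
    return sade, sfde

def min_avg(samples, gt):
    """Aggregate scene-level errors over the K generated samples."""
    errs = [scene_errors(s, gt) for s in samples]
    sades = [e[0] for e in errs]
    sfdes = [e[1] for e in errs]
    return {
        "min_SADE": min(sades), "avg_SADE": sum(sades) / len(sades),
        "min_SFDE": min(sfdes), "avg_SFDE": sum(sfdes) / len(sfdes),
    }

gt = [[(0.0, 0.0), (0.0, 0.0)] for _ in range(2)]       # T = 2, N = 2
exact = [[(0.0, 0.0), (0.0, 0.0)] for _ in range(2)]    # perfect sample
shifted = [[(3.0, 4.0), (3.0, 4.0)] for _ in range(2)]  # every point 5 m off
metrics = min_avg([exact, shifted], gt)
```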
Beats GroupNet, AutoBots, LED, MART, MoFlow and U2Diff on avg SADE/SFDE across all three datasets, and stays competitive on min against non-IID samplers, which usually have an unfair advantage there.
Best SADE on NFL (0.84 / 1.03), Bundesliga (0.91 / 1.18), and NBA (0.57 / 0.78), beating U2Diff, Sports-Traj and the deterministic TranSPORTmer.
On NBA, JointDiff is preferred over MoFlow (80%), U2Diff (65%), and the no-joint ablation (53%). It loses to the ground truth only 44% of the time, with 24% ties, i.e. people often can't tell its samples from real plays.
MoFlow wins min ADE/FDE on NBA, but loses the human study to JointDiff. The paper uses this as direct evidence that the long-standing minADE/FDE ranking, inherited from pedestrian forecasting, does not capture what humans actually mean by "a realistic scene". Scene-level metrics correlate with perception; individual ones don't.
The controllable-generation table compares unconditioned (w/o G), WPG, and text guidance, and ablates the joint formulation. Two consistent patterns emerge:
Quality improves in the order w/o G < WPG < Text on every dataset and every metric. Even the very loose possessor list shaves 7–13% off SADE; full text descriptions go further.
The w/o joint ablation is worse than full JointDiff on both unconditional and controllable tasks, including on possession Acc and on trajectory-only metrics. Modelling events as a side channel improves the trajectories.
An attention-entropy analysis backs this up: the Social Transformer's attention in the joint model is consistently more focused (lower entropy) than the no-joint variant, especially in the early denoising steps. Knowing who has the ball lets the model immediately route attention toward the salient interactions.
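For reference, the entropy of one attention row is just the Shannon entropy of its weights. The toy rows below are invented to show the direction of the effect: a row focused on one agent has lower entropy than a uniform one. How the paper aggregates entropy across heads, agents and timesteps is not covered by this sketch.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights."""
    return -sum(w * math.log(w) for w in weights if w > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over 4 agents
focused = [0.85, 0.05, 0.05, 0.05]  # attention concentrated on one agent

h_uniform = attention_entropy(uniform)
h_focused = attention_entropy(focused)
h_max = math.log(len(uniform))      # upper bound for 4 agents
```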
Most prior work on joint continuous-discrete diffusion (DLT for layouts, DualDiffusion for vision-language) uses absorbing-state diffusion for the discrete part: once a token is "decided" it's frozen. JointDiff argues this is wrong for temporally evolving domains and uses multinomial diffusion instead.
With absorbing-state diffusion, once a possessor token is unmasked it cannot be revised, even if subsequent denoising of the trajectories implies the player is now the wrong choice. The result is lower consistency between the generated trajectories and the predicted possessor sequence.
With multinomial diffusion, every discrete token can be re-sampled at every step, in light of the latest continuous denoising. Result: 97–99% max and 80–92% avg agreement between predicted possessors and the heuristic possessor extracted from the predicted trajectories.
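One way to compute such an agreement score is sketched below: for each generated sample, compare the possessor sequence from the discrete head with the sequence recovered heuristically from that sample's trajectories, then report the best and the mean match rate. Taking max and avg over samples is an assumption about the aggregation, and the sequences are invented.

```python
def agreement(predicted, heuristic):
    """Fraction of timesteps where the two possessor sequences match."""
    matches = sum(p == h for p, h in zip(predicted, heuristic))
    return matches / len(predicted)

def max_avg_agreement(samples):
    """samples: one (predicted, heuristic) pair of sequences per generated sample."""
    scores = [agreement(p, h) for p, h in samples]
    return max(scores), sum(scores) / len(scores)

samples = [
    ([1, 1, 0, 3], [1, 1, 0, 3]),  # 4 of 4 timesteps agree
    ([1, 1, 0, 3], [1, 0, 0, 3]),  # 3 of 4
    ([2, 2, 0, 3], [1, 0, 0, 3]),  # 2 of 4
]
best, mean = max_avg_agreement(samples)
```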
| Aspect | U2Diff | Diffoot | CausalTraj | JointDiff |
|---|---|---|---|---|
| Output | Continuous trajectories | Continuous trajectories | Continuous trajectories | Trajectories + events |
| Discrete Modelling | None (heuristic post-hoc) | None | None | Multinomial diffusion |
| Controllability | Past observation only | Past + graph | Past + causal structure | WPG + free-form text (CrossGuid) |
| Eval Suite | SADE/SFDE | ADE/FDE + direction | Coherence metrics | SADE/SFDE + Acc + human study |
| Datasets | NBA, NFL, soccer | Bundesliga only | Football only | Unified NBA + NFL + Bundesliga (with text) |
| Best Use Case | Strong general baseline | Defensive scouting | Coherent rollouts | Prompt-driven tactical "what-if" |
The discrete channel must share the same temporal grid as the trajectories: perfect for "ball possessor at time t", awkward for genuinely sparse events like shots or fouls. Extending to temporal point processes is flagged as the main next step.
≈10k NFL pairs and ≈4k Bundesliga pairs are simply not enough for robust text grounding. The appendix shows clean failure cases where the trajectories ignore parts of the prompt: a data problem, not a model problem.
Possession is extracted with a 1.5 m threshold, not annotated. Works well in practice and generalises across sports, but misses subtle cases (deflections, shielded balls, contested touches).
The public NFL Big Data Bowl event stream doesn't name the player who performed each action, so the authors have to back it out from tracking + heuristics. Some text-grounding errors trace directly to noise in this step.
- DDPM (Ho et al., 2020): Continuous Gaussian diffusion
- Multinomial Diffusion (Hoogeboom et al., 2021): Discrete diffusion toward uniform
- Absorbing-State Diffusion (Austin et al., 2021): The alternative discrete formulation
- Classifier-Free Guidance (Ho & Salimans, 2022): How CrossGuid is trained
- U2Diff (Capellera et al., 2025): The Social-Temporal Block backbone
- TranSPORTmer (Capellera et al., 2024): Same authors, deterministic predecessor
- DLT (Levi et al., 2023): Joint continuous-discrete diffusion for layouts
- Mamba (Gu & Dao, 2023): The temporal SSM module used per agent
- T5 (Raffel et al., 2020): Frozen text encoder for text guidance
The Capellera/Kognia line of work has been climbing one rung at a time: TranSPORTmer (deterministic, multi-task) → U2Diff (uncertainty-aware diffusion) → JointDiff (joint modelling of trajectories and events, with text/possessor control). It's the most complete public framework for tactical scene generation today, and pairs naturally with GenTac as the controllable counterpart focused specifically on football open-play tactics.