Probaballer - Football Analytics & Betting Insights

Research Deep Dive · ICLR 2026

JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

The first diffusion framework that jointly denoises continuous player trajectories and the synchronous discrete possession events that drive them — controllable by a list of intended possessors or a free-form text prompt.

Joint Continuous-Discrete Diffusion · CrossGuid · WPG + Text Guidance · Scene-Level SOTA · 40 min read

Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo — Kognia Sports Intelligence · EPFL · IRI CSIC-UPC (ICLR 2026)

The Modelling Problem

In team sports, continuous player motion and discrete events (who has the ball, who passed to whom) are not independent — they are two views of the same underlying tactical decision. Almost every existing trajectory model treats them in isolation: trajectories first, events bolted on as a post-hoc heuristic (or vice-versa).

That separation is exactly what produces the unrealistic samples you see in prior work: passes that don't reach a teammate, possessors who are nowhere near the ball, and ball trajectories that ignore who is supposedly on it.

JointDiff's premise: denoise both modalities at once from a single network, so the continuous head and the discrete head are forced to agree at every step of the reverse process.

Why Existing Methods Fall Short

The paper diagnoses three blind spots in the current trajectory-generation literature:

❌ Modalities in Isolation

Continuous diffusion models (U2Diff, MoFlow, LED) and discrete event models live in different codebases. The result: nice trajectories with implausible possession dynamics — or vice versa.

❌ Wrong Metrics

min ADE/FDE, inherited from pedestrian forecasting, only checks whether at least one of the generated samples passes near the ground truth for each agent taken on its own. It therefore rewards models that ignore scene-level coherence between teammates.

❌ No Semantic Control

Existing controllable diffusion (Trace & Pace, MotionDiffuser) targets individual-agent waypoints. Nobody had a clean way to say "Player 1 starts with the ball and passes to Player 3" for a 23-agent scene.

JointDiff's Solution

A joint continuous-discrete diffusion formulation: Gaussian DDPM on trajectories + multinomial diffusion on possessor events, with one shared backbone, two heads, and a controllability module (CrossGuid) that supports both weak-possessor-guidance and natural-language text guidance.

Joint Continuous-Discrete Diffusion

One reverse process, two factorised modalities

The scene at any time is a tuple X = (Y, E), where Y ∈ ℝ^(T×N×2) holds the 2D trajectories of all N agents (in football, the ball + 22 players, so N = 23) and E ∈ {0,1}^(T×N) holds one one-hot row per timestep marking who currently possesses the ball (or the ball itself if nobody does).
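As a concrete picture of the tuple, the sketch below builds a toy scene in NumPy. It assumes agent index 0 is the ball and uses made-up positions and possessor indices; neither the indexing convention nor the variable names come from the paper.

```python
import numpy as np

T, N = 4, 23   # toy horizon; N = ball + 22 players (football)
BALL = 0       # assumption: agent index 0 is the ball

rng = np.random.default_rng(0)
Y = rng.normal(size=(T, N, 2))   # continuous part: 2D position per agent per step

# Possessor index per timestep; BALL means nobody currently has possession.
possessor = np.array([3, 3, BALL, 7])
E = np.eye(N, dtype=np.int64)[possessor]   # discrete part: one-hot rows, shape (T, N)
```

Each row of E sums to one, so the "nobody has it" case is just another class rather than a missing value.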

Continuous Side: Gaussian DDPM

Trajectories are corrupted with Gaussian noise toward N(0, I), then denoised by a regression head ε_θ. Standard DDPM + DDIM sampling for speed.

• S = 50 continuous diffusion steps
• Quadratic β-schedule, β₀ = 1e-4 → β_S = 0.5
• DDIM with skip ζ = 5 → 11 effective inference steps
• Simplified MSE objective on predicted noise
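A minimal sketch of the continuous forward process with the hyperparameters listed above. The exact form of the "quadratic" schedule is an assumption here (linear in √β, then squared, as is common in diffusion codebases), and `q_sample` is a hypothetical helper name. DDIM step skipping is left out.

```python
import numpy as np

S = 50  # continuous diffusion steps

# Assumed quadratic schedule: interpolate sqrt(beta) linearly, then square.
beta = np.linspace(np.sqrt(1e-4), np.sqrt(0.5), S) ** 2
alpha_bar = np.cumprod(1.0 - beta)  # cumulative signal retention per step


def q_sample(y0, s, rng):
    """Noise clean trajectories y0 to step s (1-indexed); return (y_s, eps)."""
    eps = rng.standard_normal(y0.shape)
    a = alpha_bar[s - 1]
    return np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps, eps


rng = np.random.default_rng(0)
y0 = np.ones((30, 11, 2))         # toy NBA-sized scene: T = 30, N = 11
y_S, eps = q_sample(y0, S, rng)   # at s = S the sample is almost pure noise
```

Under this assumed schedule almost no signal survives at s = S, which is why a short 50-step chain can still end close to N(0, I).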

Discrete Side: Multinomial Diffusion

Possession one-hots are corrupted toward a uniform categorical and denoised by a classification head π_θ producing Ê₀. The discrete process is multinomial diffusion (Hoogeboom et al., 2021) rather than absorbing-state diffusion; the consistency results further down motivate that choice.

• S_d = 10 discrete steps, aligned with continuous via s_d = ⌈s · S_d / S⌉
• Variational KL loss L^E_vb
• Stochastic categorical sampler at inference
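The step alignment and the multinomial corruption can be sketched as follows. The forward marginal Cat(ᾱ·E₀ + (1 − ᾱ)/N) is the standard one from Hoogeboom et al. (2021); the discrete noise schedule is not given in the text above, so the cumulative value `alpha_bar_sd` is simply passed in by the caller, and both function names are hypothetical.

```python
import numpy as np

S, S_d = 50, 10  # continuous and discrete step counts


def discrete_step(s):
    """Map continuous step s in 1..S to s_d = ceil(s * S_d / S)."""
    return (s * S_d + S - 1) // S  # integer ceiling, avoids float rounding


def q_sample_events(E0, alpha_bar_sd, rng):
    """Draw E_s ~ Cat(alpha_bar * E0 + (1 - alpha_bar) / N) for each timestep."""
    T, N = E0.shape
    probs = alpha_bar_sd * E0 + (1.0 - alpha_bar_sd) / N
    idx = np.array([rng.choice(N, p=p) for p in probs])
    return np.eye(N, dtype=E0.dtype)[idx]


rng = np.random.default_rng(0)
E0 = np.eye(23, dtype=np.int64)[[3, 3, 0, 7]]   # toy possessor sequence
E_clean = q_sample_events(E0, 1.0, rng)          # alpha_bar = 1: no corruption
E_noisy = q_sample_events(E0, 0.0, rng)          # alpha_bar = 0: uniform draw
```

With S = 50 and S_d = 10, five consecutive continuous steps share one discrete step, so the event channel moves on a coarser noise grid than the trajectories.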

The Joint Loss

Because the forward processes are independent across modalities, the variational bound factorises: L_joint = L^Y_simple + λ · L^E_vb. The authors find λ = 0.1 is the sweet spot — high enough that events influence the representation, low enough that trajectory quality doesn't degrade.

Critically, the reverse network is conditioned on the full noisy state X_s = (Y_s, E_s), so each head sees what the other head is currently producing; this is what couples trajectories and events at every denoising step.
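The joint objective can be sketched in NumPy as below. This is an illustration under simplifying assumptions, not the authors' training code: it covers only the KL terms of the variational bound for steps where ᾱ_{s−1} < 1, it takes the per-step schedule values `alpha_s` and `alpha_bar_prev` as given, and all function names are hypothetical. The posterior formula is the standard multinomial-diffusion one.

```python
import numpy as np

LAMBDA = 0.1  # event-loss weight reported above


def posterior(E_s, E0_probs, alpha_s, alpha_bar_prev):
    """q(E_{s-1} | E_s, E_0) for multinomial diffusion, one row per timestep.

    Assumes alpha_s < 1 and alpha_bar_prev < 1 so every entry stays positive.
    """
    N = E_s.shape[-1]
    from_noisy = alpha_s * E_s + (1.0 - alpha_s) / N
    from_clean = alpha_bar_prev * E0_probs + (1.0 - alpha_bar_prev) / N
    theta = from_noisy * from_clean
    return theta / theta.sum(axis=-1, keepdims=True)


def joint_loss(eps, eps_hat, E0, E0_hat, E_s, alpha_s, alpha_bar_prev):
    """MSE on predicted noise + LAMBDA * KL(true posterior || predicted posterior)."""
    loss_y = np.mean((eps - eps_hat) ** 2)
    q = posterior(E_s, E0, alpha_s, alpha_bar_prev)
    p = posterior(E_s, E0_hat, alpha_s, alpha_bar_prev)
    kl = np.sum(q * (np.log(q) - np.log(p)), axis=-1)   # one KL per timestep
    return loss_y + LAMBDA * np.mean(kl)
```

Both terms vanish when the network predicts the noise and the clean events exactly, and the weight λ only rescales the event term relative to the trajectory term.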

Architecture: U2Diff Backbone + CrossGuid

Mamba over time, Transformer over agents, attention over guidance

JointDiff inherits the Social-Temporal Block from the same authors' U2Diff (CVPR 2025): a per-agent Temporal Mamba that captures individual dynamics, followed by a Social Transformer that mixes information across agents at each timestep. Two such blocks make up the denoiser.

Inputs

Noisy X_s, observed context X_co, binary mask M, denoising step s, and (optionally) guidance G. All concatenated into a [T, N, 7] tensor.

Backbone

2 × Residual Denoising Block — each = projection + Social-Temporal Block. Hidden size 256, 8 attention heads, 1024-dim FFN. CrossGuid sits between the Mamba and the Social Transformer.

Two Heads

Regression head → predicted Gaussian noise ε_θ for trajectories.
Classification head → softmax over agents, giving original-event probabilities Ê₀.
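A rough sketch of the two output heads on top of the backbone features. It assumes each head is a single linear map and that the classification head emits one score per agent, normalised across agents at every timestep; the weights are random stand-ins and the names are hypothetical.

```python
import numpy as np

T, N, D = 30, 11, 256
rng = np.random.default_rng(0)
H = rng.standard_normal((T, N, D))            # backbone output features
W_reg = rng.standard_normal((D, 2)) * 0.01    # hypothetical regression weights
w_cls = rng.standard_normal(D) * 0.01         # hypothetical classification weights

eps_hat = H @ W_reg                           # (T, N, 2) predicted Gaussian noise

logits = H @ w_cls                            # (T, N): one score per agent
logits -= logits.max(axis=1, keepdims=True)   # stabilise the softmax
E0_hat = np.exp(logits)
E0_hat /= E0_hat.sum(axis=1, keepdims=True)   # softmax over the agent axis
```

Normalising over the agent axis means each timestep yields one distribution over "who has the ball", which is the shape the multinomial posterior expects.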

CrossGuid: Two Ways to Steer the Scene

One module, two MHA configurations, classifier-free guidance for control

CrossGuid is a multi-head-attention block that injects an external guidance signal G into the intermediate scene representation H ∈ ℝ^(T×N×256). It comes in two flavours, trained with classifier-free guidance (25% conditioning dropout) so the same checkpoint can run both controllable and unconditional generation.
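The conditioning dropout used for classifier-free guidance can be sketched like this. The 25% rate is from the text above; the null token, the batch layout and the function name are assumptions made for illustration.

```python
import numpy as np

P_DROP = 0.25  # conditioning dropout rate reported above


def drop_guidance(G, null_token, rng):
    """Swap guidance for a null token on roughly 25% of the batch (training only)."""
    drop = rng.random(G.shape[0]) < P_DROP
    out = G.copy()
    out[drop] = null_token
    return out, drop


rng = np.random.default_rng(0)
G = rng.standard_normal((10_000, 16))   # toy batch of guidance embeddings
null_token = np.zeros(16)               # hypothetical "no guidance" embedding
G_train, drop = drop_guidance(G, null_token, rng)
```

Because the network sees the null token during training, feeding it at inference gives unconditional generation from the same weights.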

Weak-Possessor-Guidance (WPG)

The user supplies an ordered list of intended possessors, e.g. [1, 3, 5]. No timing, no positions — just "these players touch the ball, in this order".

• K, V = learnable agent embeddings of the possessor list
• Q = the ball's row of H
• Update is applied only to the ball channel
• Players still get an additive agent embedding for social reasoning

Text-Guidance

Free-form prompts like "Player 1 starts with the ball and passes to Player 3", encoded by a frozen T5-Base (768-dim).

• K, V = projected T5 token embeddings
• Q = each agent's row + agent embedding
• MHA runs per agent against the shared text context
• Update is applied to all agents
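The two configurations differ only in who queries and what is attended to, which the following single-head sketch tries to make visible. It leaves out learned Q/K/V projections and multiple heads, it does not encode the order of the possessor list (plain attention is permutation-invariant over keys), and it uses random arrays in place of real T5-Base outputs. All names are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attend(Q, KV):
    """Single-head attention without learned projections (a simplification)."""
    scores = Q @ KV.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ KV


def wpg_update(H, agent_emb, possessors, ball=0):
    """WPG: only the ball row queries the possessor embeddings."""
    out = H.copy()
    players = np.arange(H.shape[1]) != ball
    out[:, players] += agent_emb[players]   # additive agent embedding for players
    out[:, ball] += cross_attend(H[:, ball], agent_emb[possessors])
    return out


def text_update(H, agent_emb, text_tokens, W_proj):
    """Text guidance: every agent queries the projected token embeddings."""
    Q = H + agent_emb[None]                 # (T, N, d)
    KV = text_tokens @ W_proj               # (L, 768) -> (L, d)
    return H + cross_attend(Q, KV)


rng = np.random.default_rng(0)
T, N, d = 40, 23, 256
H = rng.standard_normal((T, N, d))
agent_emb = rng.standard_normal((N, d)) * 0.1
H_wpg = wpg_update(H, agent_emb, possessors=[1, 3, 5])
tokens = rng.standard_normal((12, 768))     # stand-in for frozen T5-Base output
H_txt = text_update(H, agent_emb, tokens, rng.standard_normal((768, d)) * 0.02)
```

In the WPG path the guidance reaches the players only indirectly, through the ball row and the later Social Transformer, whereas the text path touches every agent directly.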

Why "Weak"?

WPG only constrains the set and order of possessors, not their timing or pitch coordinates. That's exactly the granularity an analyst wants: "simulate a build-up through 6→8→10" without dictating exactly when each touch happens. It's a much more forgiving control surface than waypoint-based guidance.

Unified Benchmark: NBA + NFL + Bundesliga, with Text

Three datasets, one dataloader, paired natural-language descriptions

🏀 NBA (SportVU)

32.5k train / 12k test scenes. T = 30 timesteps @ 5 fps. N = 11 (ball + 10 players). Splits from Mao et al. (LED).

🏈 NFL Big Data Bowl

10,762 train / 2,624 test. T = 50 @ 10 fps. N = 23. Splits from Sports-Traj. Text comes from public play metadata.

⚽ Bundesliga (IDSSE)

2,093 train (× 180° aug.) / 524 test from 7 matches. T = 40 @ 6.25 fps. N = 23. Text generated via a Stage-1/2/3 LLM-refinement pipeline.

Possession events are extracted from raw tracking with a single, data-driven 1.5 m threshold, chosen by minimising the average change in ball direction while the ball is uncovered (no player within the threshold). The same threshold works across all three sports, which is itself a small but elegant result.
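A minimal sketch of threshold-based possession labelling. The 1.5 m value is from the text; the rule itself (nearest player within the threshold takes possession, otherwise the ball "possesses itself") and the ball-at-index-0 convention are assumptions, and the threshold search is not reproduced.

```python
import numpy as np

THRESHOLD_M = 1.5   # distance threshold reported above
BALL = 0            # assumption: agent index 0 is the ball


def extract_possessor(Y):
    """Y: (T, N, 2) positions in metres. Returns (T,) possessor indices."""
    dist = np.linalg.norm(Y[:, 1:] - Y[:, :1], axis=-1)   # player-to-ball, (T, N-1)
    nearest = dist.argmin(axis=1)
    within = dist[np.arange(len(Y)), nearest] <= THRESHOLD_M
    return np.where(within, nearest + 1, BALL)            # +1 skips the ball index


# Toy scene: a player 1 m from the ball at t = 0, nobody closer than 3 m at t = 1.
Y = np.array([[[0, 0], [1, 0], [5, 0]],
              [[0, 0], [3, 0], [4, 0]]], dtype=float)
possessor = extract_possessor(Y)          # array([1, 0])
E = np.eye(Y.shape[1])[possessor]         # one-hot event rows
```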

All evaluations report scene-level SADE / SFDE (Casas et al., 2020): for each generated sample, the average (SADE) or final-step (SFDE) displacement error is averaged across all agents in the scene, and the min or avg is then taken over K = 20 samples. Possession-event accuracy is reported alongside.
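The difference between scene-level and per-agent metrics is easy to see in code. The sketch below computes min/avg SADE and SFDE over K samples and, for contrast, a per-agent min ADE; function names and the toy data are illustrative only.

```python
import numpy as np


def scene_metrics(pred, gt):
    """pred: (K, T, N, 2) samples, gt: (T, N, 2). Scene-level min/avg errors."""
    err = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T, N)
    sade = err.mean(axis=(1, 2))         # per sample: mean over time and agents
    sfde = err[:, -1].mean(axis=1)       # per sample: final step, mean over agents
    return {"min_SADE": sade.min(), "avg_SADE": sade.mean(),
            "min_SFDE": sfde.min(), "avg_SFDE": sfde.mean()}


def per_agent_min_ade(pred, gt):
    """Best sample chosen independently for every agent, then averaged."""
    err = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T, N)
    return err.mean(axis=1).min(axis=0).mean()


# Two samples, two agents: each sample is perfect for one agent and 2 m off for the other.
gt = np.zeros((3, 2, 2))
pred = np.zeros((2, 3, 2, 2))
pred[0, :, 1, 0] = 2.0
pred[1, :, 0, 0] = 2.0
scores = scene_metrics(pred, gt)
agent_score = per_agent_min_ade(pred, gt)
```

Here the per-agent metric reports zero error even though no single sample gets the whole scene right, while min SADE stays at 1 m because it has to commit to one sample for all agents.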

Headline Results

Future Generation (SOTA)

Beats GroupNet, AutoBots, LED, MART, MoFlow and U2Diff on avg SADE/SFDE across all three datasets — and stays competitive on min against non-IID samplers, which usually have an unfair advantage there.

Imputation

Best SADE on NFL (0.84 / 1.03), Bundesliga (0.91 / 1.18), and NBA (0.57 / 0.78) — beating U2Diff, Sports-Traj and the deterministic TranSPORTmer.

Human Evaluation

On NBA, JointDiff is preferred over MoFlow (80%), U2Diff (65%), and the no-joint ablation (53%). It loses to the ground truth only 44% of the time, with 24% ties — i.e. people often can't tell its samples from real plays.

A Useful Aside on min ADE/FDE

MoFlow wins min ADE/FDE on NBA but loses the human study to JointDiff. The paper uses this as direct evidence that the long-standing min ADE/FDE ranking, inherited from pedestrian forecasting, does not capture what humans actually mean by "a realistic scene". Scene-level metrics correlate with perception; individual ones don't.

Controllability: WPG and Text Both Help

The controllable-generation table compares unconditioned (w/o G), WPG, and text guidance, and ablates the joint formulation. Two consistent patterns emerge:

1. More info → better samples

Sample quality improves from w/o G to WPG to Text on every dataset and every metric. Even the very loose possessor list shaves 7–13% off SADE; full text descriptions go further.

2. Joint training helps everywhere

The w/o joint ablation is worse than full JointDiff on both unconditional and controllable tasks — including on possession Acc and on trajectory-only metrics. Modelling events as a side channel improves the trajectories.

An attention-entropy analysis backs this up: the Social Transformer's attention in the joint model is consistently more focused (lower entropy) than the no-joint variant, especially in the early denoising steps. Knowing who has the ball lets the model immediately route attention toward the salient interactions.

Multinomial vs. Absorbing-State Discrete Diffusion

Most prior work on joint continuous-discrete diffusion (DLT for layouts, DualDiffusion for vision-language) uses absorbing-state diffusion for the discrete part — once a token is "decided" it's frozen. JointDiff argues this is wrong for temporally evolving domains and uses multinomial diffusion instead.

❌ Absorbing State

Once a possessor token is unmasked it cannot be revised, even if subsequent denoising of the trajectories implies that player is now the wrong choice. The result is lower consistency between the generated trajectories and the predicted possessor sequence.

✅ Multinomial

Every discrete token can be re-sampled at every step, in light of the latest continuous denoising. Result: 97–99% max and 80–92% avg agreement between predicted possessors and the heuristic possessor extracted from the predicted trajectories.

JointDiff vs. U2Diff vs. Diffoot vs. CausalTraj

Aspect	U2Diff	Diffoot	CausalTraj	JointDiff
Output	Continuous trajectories	Continuous trajectories	Continuous trajectories	Trajectories + events
Discrete Modelling	None (heuristic post-hoc)	None	None	Multinomial diffusion
Controllability	Past observation only	Past + graph	Past + causal structure	WPG + free-form text (CrossGuid)
Eval Suite	SADE/SFDE	ADE/FDE + direction	Coherence metrics	SADE/SFDE + Acc + human study
Datasets	NBA, NFL, soccer	Bundesliga only	Football only	Unified NBA + NFL + Bundesliga (with text)
Best Use Case	Strong general baseline	Defensive scouting	Coherent rollouts	Prompt-driven tactical "what-if"

Limitations & Open Questions

1. Dense Events Only

The discrete channel must share the same temporal grid as the trajectories — perfect for "ball possessor at time t", awkward for genuinely sparse events like shots or fouls. Extending to temporal point processes is flagged as the main next step.

2. Text Failures on Small Data

≈10k NFL pairs and ≈4k Bundesliga pairs are simply not enough for robust text grounding. The appendix shows clear failure cases where the trajectories ignore parts of the prompt, which the authors attribute to data rather than to the model.

3. Heuristic Possessor Labels

Possession is extracted with a 1.5 m threshold, not annotated. Works well in practice and generalises across sports, but misses subtle cases (deflections, shielded balls, contested touches).

4. NFL Actor Inference

The public NFL Big Data Bowl event stream doesn't name the player who performed each action, so the authors have to back it out from tracking + heuristics. Some text-grounding errors trace directly to noise in this step.

Resources & Further Reading

Read the JointDiff Paper

arXiv 2509.22522 · ICLR 2026

U2Diff (the backbone)

Capellera et al., CVPR 2025