⚽💷📊
Video Analysis Pipeline
How we turn raw broadcast footage into pitch coordinates — detection, tracking, calibration, and projection explained step by step.
Computer Vision · Object Detection · 15 min read
What Does This Pipeline Do?

You upload a clip of a football match. You get back a JSON with every player and ball tracked across every frame, with their positions mapped onto a standard 105×68m pitch. That's it.

Sounds simple, but it involves four distinct stages — each solving a different problem. The pipeline runs on Modal.com (serverless GPU infrastructure) so you don't need a local GPU.

1. Detection: find every player and ball in each frame
2. Tracking: link detections across frames into consistent IDs
3. Calibration: figure out where the camera is pointing
4. Projection: convert pixel positions to real-world pitch metres

Stage 1 — Detection (YOLOv8x)

Every frame is fed through YOLOv8x, Ultralytics' largest COCO-pretrained object detector — 68.2 million parameters, trained on COCO's ~330K images. We filter its 80 COCO classes down to two:

COCO class 0: "person" → player

Every human on the pitch — players, goalkeepers, referees. COCO doesn't distinguish between them, so everyone gets the label "player."

COCO class 32: "sports ball" → ball

The ball. Notoriously hard to detect — it's often just 10–20 pixels across, moves fast, and gets occluded by players constantly.

Why COCO and not a soccer-specific model?

We tried several soccer-specific models from HuggingFace (uisikdag/football_players_rf, etc.). They all had fewer than 50 downloads, were trained on tiny datasets, and performed worse than COCO on real broadcast footage. YOLOv8x with 330K training images simply detects people and balls more reliably than a model trained on 500 hand-labeled screenshots.

The output for each frame is a set of bounding boxes: [x1, y1, x2, y2, confidence, class_id] for every detected object.
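The filtering step can be sketched as a pure function over YOLO-style output rows. This is a sketch, not the pipeline's actual code; the 0.25 confidence threshold is an assumption (the source doesn't state one):

```python
# Map COCO class IDs to the pipeline's two labels.
COCO_PERSON, COCO_SPORTS_BALL = 0, 32
CLASS_NAMES = {COCO_PERSON: "player", COCO_SPORTS_BALL: "ball"}

def filter_detections(rows, conf_threshold=0.25):
    """Keep only person / sports-ball rows above the confidence threshold,
    relabelled with the pipeline's class names.

    rows: iterable of [x1, y1, x2, y2, confidence, class_id], the row
    layout YOLO's results.boxes.data produces.
    """
    kept = []
    for x1, y1, x2, y2, conf, cls in rows:
        cls = int(cls)
        if cls in CLASS_NAMES and conf >= conf_threshold:
            kept.append((x1, y1, x2, y2, conf, CLASS_NAMES[cls]))
    return kept
```

With the real model, the rows would come from something like `YOLO("yolov8x.pt")(frame)[0].boxes.data.tolist()`; Ultralytics can also do the class filtering itself via the `classes=[0, 32]` argument to the predict call.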

Stage 2 — Tracking (ByteTrack)

Detection tells you "there are 14 people in this frame." It doesn't tell you which person is which between frames. ByteTrack solves this — it's a multi-object tracker that assigns consistent IDs across the entire video.

ByteTrack works by matching detections between consecutive frames using motion prediction: a Kalman filter predicts where each existing track's box should be, and detections are matched to those predictions by bounding-box overlap (IoU) — no appearance features needed. The clever part: it uses both high-confidence and low-confidence detections. Most trackers throw away low-confidence detections, but ByteTrack keeps them as backup candidates — if a player gets partially occluded and the detector is only 30% sure, ByteTrack can still link them to their track.

We use it via the supervision library from Roboflow, which wraps ByteTrack in a clean Python API. After this stage, every detection has a track_id — player #7 stays player #7 across frames.
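The core idea — linking boxes into persistent IDs by overlap — can be illustrated with a toy greedy IoU matcher. This is a sketch of the concept only, not ByteTrack or the supervision API; ByteTrack adds Kalman motion prediction and a second matching pass over low-confidence boxes:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

class GreedyTracker:
    """Toy frame-to-frame ID linker: greedily match each detection to the
    best-overlapping box from the previous frame, else start a new track."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}   # track_id -> last seen box
        self.next_id = 0

    def update(self, boxes):
        """Return a track_id for each box, in order."""
        assigned = {}
        free = dict(self.tracks)  # tracks not yet claimed this frame
        for i, box in enumerate(boxes):
            best_id, best_iou = None, self.iou_threshold
            for tid, prev in free.items():
                score = iou(box, prev)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:            # no good match: new identity
                best_id = self.next_id
                self.next_id += 1
            free.pop(best_id, None)
            assigned[i] = best_id
        self.tracks = {assigned[i]: boxes[i] for i in assigned}
        return [assigned[i] for i in range(len(boxes))]
```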

Stage 3 — Calibration (Keypoints + ORB Propagation)

This is the hardest stage and the one most likely to fail. The goal: figure out how to map pixel coordinates in the video frame to real-world metres on the pitch.

The mathematical tool for this is a homography — a 3×3 matrix that transforms points in one 2D plane (the video frame) to another 2D plane (the pitch, viewed from above). To compute a homography, you need at least 4 point correspondences: "this pixel in the image corresponds to this metre-position on the pitch."

Where do the point correspondences come from?

We use a YOLO pose model (Adit-jain/Soccana_Keypoint from HuggingFace) that detects 29 pitch landmarks: corner flags, penalty box corners, centre circle intersections, etc. Each detected keypoint has a known real-world position on a FIFA-standard 105×68m pitch. If the model detects ≥4 keypoints with confidence above 0.3, we compute a homography using cv2.findHomography with RANSAC.

The camera moves. Now what?

Broadcast cameras pan to follow play. A homography computed from frame 0 becomes wrong by frame 50 because the camera has moved. We handle this with a three-tier fallback system:

Tier 1: Absolute Calibration (every 3 frames)

Run the keypoint model, detect pitch landmarks, compute a fresh homography. This is the gold standard — direct point correspondences give the most accurate mapping.

Tier 2: ORB Feature Propagation

When the keypoint model fails (not enough landmarks visible), we use ORB feature matching between the current frame and the last successfully calibrated frame. This tracks how the camera has panned and rotates/translates the old homography accordingly. Purely computer vision — no ML model required.

Tier 3: Reuse Last Homography

If both above fail (rare), reuse the last known homography. This gets increasingly wrong as the camera moves, so it's a last resort.

Stage 4 — Projection

Once we have a homography H for the current frame, projecting a player's pixel position to pitch coordinates is a matrix multiply followed by a perspective divide (the third homogeneous coordinate normalises the result):

x, y, w = H @ [pixel_x, pixel_y, 1]
pitch_coords = (x / w, y / w)

We use the bottom-centre of each bounding box (the player's feet) as the pixel position — this gives the most accurate ground-plane projection.
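Both pieces — the homogeneous projection and the foot-point choice — fit in a few lines (a minimal sketch, not the pipeline's actual code):

```python
import numpy as np

def project_point(H, pixel_x, pixel_y):
    """Apply homography H to a pixel, including the homogeneous divide."""
    x, y, w = H @ np.array([pixel_x, pixel_y, 1.0])
    return x / w, y / w

def foot_point(box):
    """Bottom-centre of an [x1, y1, x2, y2] box — the player's feet."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, y2
```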

Two post-processing steps clean up the output:

  • Clamping — positions that project outside the pitch boundaries (± a few metres tolerance) are discarded as likely calibration errors.
  • EMA smoothing — an exponential moving average per track smooths out frame-to-frame jitter. The ball gets less smoothing (α=0.8) than players (α=0.5) because the ball moves faster and we don't want to lag behind it.
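Both post-processing steps are simple enough to sketch. The ±3 m tolerance and the per-track state layout are assumptions for illustration:

```python
PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0
TOLERANCE = 3.0  # metres of slack outside the lines (assumed value)

def clamp(pos):
    """Return pos if it lies on or near the pitch, else None (discard)."""
    x, y = pos
    on_pitch = (-TOLERANCE <= x <= PITCH_LENGTH + TOLERANCE
                and -TOLERANCE <= y <= PITCH_WIDTH + TOLERANCE)
    return pos if on_pitch else None

class EMASmoother:
    """Per-track exponential moving average. A higher alpha trusts the new
    observation more, so the ball (alpha=0.8) lags less than players (0.5)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.state = {}  # track_id -> smoothed (x, y)

    def update(self, track_id, pos):
        if track_id not in self.state:
            self.state[track_id] = pos
        else:
            px, py = self.state[track_id]
            x, y = pos
            a = self.alpha
            self.state[track_id] = (a * x + (1 - a) * px,
                                    a * y + (1 - a) * py)
        return self.state[track_id]
```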
Known Limitations

This is a hobby project, not a commercial tracking system. Here's where it falls short:

No team identification

COCO's "person" class doesn't tell you which team someone is on. Everyone is labelled "player." Professional systems use jersey colour segmentation or re-identification networks to separate teams.

No goalkeeper/referee distinction

Same issue — COCO doesn't know what a referee looks like. A soccer-specific model would add these labels but we couldn't find one that detected reliably enough.

Calibration model is amateur

The Soccana keypoint model has ~0 downloads on HuggingFace. It works on some camera angles but fails on others. The ORB propagation helps fill gaps, but it accumulates drift over time. This is the weakest link in the pipeline.

Ball detection is inconsistent

Even YOLOv8x struggles with the ball — it's tiny, fast, and frequently occluded. Professional systems use multi-frame temporal context and dedicated ball detectors.

Camera cuts break everything

Broadcast footage has replay cuts, close-ups, and multi-angle switches. Each cut requires a fresh calibration and resets all track IDs. We don't detect cuts automatically.

How the Industry Does It Better

Professional tracking data companies produce centimetre-accurate position data at 25 fps for every player, referee, and the ball. They do it very differently from us. Here's who they are and what they do:

Hawk-Eye (Sony) — Premier League, La Liga, MLS

Hawk-Eye doesn't use broadcast footage at all. They install dedicated multi-camera rigs in stadiums — typically 8–12 synchronised high-resolution cameras pointed at the pitch from fixed positions. Each camera is precisely calibrated during installation using surveyed ground-truth points.

With multiple calibrated cameras, they use triangulation — the same player is visible from 3+ cameras, and the intersection of view rays gives the 3D position within centimetres. No homography estimation needed per-frame because the cameras don't move.

Second Spectrum (Genius Sports) — NBA, MLS, Serie A

Similar multi-camera approach to Hawk-Eye but with more emphasis on machine learning post-processing. Their tracking feeds into proprietary models that compute real-time expected possession value (EPV), pass probability fields, and tactical metrics.

They were acquired by Genius Sports for $200M, giving you a sense of the commercial value of accurate tracking data.

SkillCorner — Broadcast Tracking at Scale

SkillCorner is the closest to what we're doing — they extract tracking data from broadcast video only, no dedicated cameras. They cover 90+ competitions worldwide because they only need the TV feed.

The difference: they have a team of ML engineers, proprietary models trained on millions of labelled frames, temporal models that handle camera cuts, and jersey number recognition. Their calibration uses learned camera parameter regression rather than simple keypoint matching.

SkillCorner tracking data is what powers the STGNN predictions on this site (see the SkillCorner article).

Stats Perform (Opta) — Event Data + Tracking

Stats Perform traditionally provided event data (passes, shots, tackles logged by human operators), not tracking data. They've since added optical tracking via their acquisition of ChyronHego.

Their setup is similar to Hawk-Eye — fixed stadium cameras with pre-calibrated intrinsics/extrinsics. They also produce "derived" tracking from broadcast footage for leagues where they can't install cameras.

SoccerNet — Academic Benchmark

Not a company but the academic standard for soccer video analysis. SoccerNet provides large-scale annotated datasets and runs annual challenges for action spotting, player tracking, camera calibration, and game state reconstruction.

Their calibration challenge has produced the best open-source approaches for broadcast camera estimation. Unfortunately, the winning models aren't packaged as easy-to-use libraries — they're research code with complex dependencies. Integrating SoccerNet calibration would be the single biggest improvement to our pipeline.

Our Pipeline vs. The Pros — Side by Side
Component | Our Pipeline | Professional Systems
Cameras | Single broadcast feed | 8–12 dedicated stadium cameras (Hawk-Eye) or broadcast + proprietary ML (SkillCorner)
Calibration | Amateur keypoint model + ORB propagation | Pre-surveyed camera positions (Hawk-Eye) or learned camera parameter regression (SkillCorner)
Detection | COCO YOLOv8x (general-purpose) | Custom models trained on millions of labelled sports frames
Team ID | None | Jersey colour segmentation, number recognition, re-ID networks
Ball tracking | Frame-by-frame YOLO detection | Temporal models, trajectory prediction, multi-camera triangulation
Accuracy | ~5–15m position error (estimated) | <10cm (Hawk-Eye), ~1m (SkillCorner broadcast)
Camera cuts | Breaks everything | Detected automatically, tracks bridged across cuts
Cost | ~$0.02 per video (Modal GPU seconds) | $50K–$500K per season per competition
Technical Stack
Infrastructure
  • Modal.com — serverless GPU (T4)
  • FastAPI — endpoint wrapped by Modal
  • Next.js API route — frontend proxy
Models & Libraries
  • YOLOv8x — detection (Ultralytics, COCO)
  • ByteTrack — tracking (via supervision)
  • Soccana_Keypoint — pitch keypoints (HuggingFace)
  • OpenCV ORB — homography propagation
What would I change if starting over?

The biggest improvement would be replacing the Soccana keypoint model with a proper camera calibration system — either SoccerNet's calibration approach (which regresses camera parameters directly) or training a custom keypoint detector on the SoccerNet keypoint dataset. The detection and tracking stages are already solid; calibration is where the quality ceiling is.