You upload a clip of a football match. You get back a JSON with every player and ball tracked across every frame, with their positions mapped onto a standard 105×68m pitch. That's it.
Sounds simple, but it involves four distinct stages — each solving a different problem. The pipeline runs on Modal.com (serverless GPU infrastructure) so you don't need a local GPU.
1. Find every player and ball in each frame
2. Link detections across frames into consistent IDs
3. Figure out where the camera is pointing
4. Convert pixel positions to real-world pitch metres
Every frame is fed through YOLOv8x, Ultralytics' largest COCO-pretrained object detector. It has 68.2 million parameters, is trained on 330K images, and is used by millions of people. We filter its 80 COCO classes down to two:
- **People.** Every human on the pitch: players, goalkeepers, referees. COCO doesn't distinguish between them, so everyone gets the label "player."
- **The ball.** Notoriously hard to detect: it's often just 10–20 pixels across, moves fast, and gets occluded by players constantly.
We tried several soccer-specific models from HuggingFace (uisikdag/football_players_rf, etc.). They all had fewer than 50 downloads, were trained on tiny datasets, and performed worse than COCO on real broadcast footage. YOLOv8x with 330K training images simply detects people and balls more reliably than a model trained on 500 hand-labeled screenshots.
The output for each frame is a set of bounding boxes: [x1, y1, x2, y2, confidence, class_id] for every detected object.
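The class filtering is simple enough to sketch. This is our own illustrative helper (the name `filter_detections` and the array layout are ours, not an Ultralytics API); with `ultralytics` installed, the model call in the comment would produce the raw boxes:

```python
import numpy as np

PERSON, SPORTS_BALL = 0, 32  # COCO class ids for "person" and "sports ball"

def filter_detections(xyxy, conf, cls):
    """Keep only person/ball rows; returns an (N, 6) array
    of [x1, y1, x2, y2, confidence, class_id]."""
    xyxy, conf, cls = map(np.asarray, (xyxy, conf, cls))
    mask = np.isin(cls, (PERSON, SPORTS_BALL))
    return np.column_stack([xyxy[mask], conf[mask], cls[mask]])

# With ultralytics installed, the per-frame call would look roughly like:
#   from ultralytics import YOLO
#   model = YOLO("yolov8x.pt")
#   result = model(frame, classes=[PERSON, SPORTS_BALL], verbose=False)[0]
```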
Detection tells you "there are 14 people in this frame." It doesn't tell you which person is which between frames. ByteTrack solves this — it's a multi-object tracker that assigns consistent IDs across the entire video.
ByteTrack works by predicting each track's motion with a Kalman filter and matching detections to tracks by bounding-box overlap (IoU); unlike many trackers, it uses no appearance features at all. The clever part: it uses both high-confidence and low-confidence detections. Most trackers throw away low-confidence detections, but ByteTrack keeps them as backup candidates for a second matching pass. If a player gets partially occluded and the detector is only 30% sure, ByteTrack can still link them to their track.
We use it via the supervision library from Roboflow, which wraps ByteTrack in a clean Python API. After this stage, every detection has a track_id — player #7 stays player #7 across frames.
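The two-pass idea can be sketched without any tracking library. This is a deliberately simplified toy of ours (greedy IoU matching, no Kalman prediction or Hungarian assignment, so not the real ByteTrack), just to show why low-confidence detections are worth keeping:

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, all [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def associate(tracks, dets, confs, high=0.6, min_iou=0.3):
    """Two-pass matching: high-confidence detections first,
    low-confidence ones as backup for still-unmatched tracks."""
    unmatched = set(range(len(tracks)))
    matches = {}
    for pass_mask in (confs >= high, confs < high):  # pass 1, then pass 2
        for d in np.flatnonzero(pass_mask):
            if not unmatched:
                break
            cand = np.array(sorted(unmatched))
            overlaps = iou(dets[d], tracks[cand])
            best = int(np.argmax(overlaps))
            if overlaps[best] >= min_iou:
                matches[int(cand[best])] = int(d)
                unmatched.discard(int(cand[best]))
    return matches
```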
This is the hardest stage and the one most likely to fail. The goal: figure out how to map pixel coordinates in the video frame to real-world metres on the pitch.
The mathematical tool for this is a homography — a 3×3 matrix that transforms points in one 2D plane (the video frame) to another 2D plane (the pitch, viewed from above). To compute a homography, you need at least 4 point correspondences: "this pixel in the image corresponds to this metre-position on the pitch."
We use a YOLO pose model (Adit-jain/Soccana_Keypoint from HuggingFace) that detects 29 pitch landmarks: corner flags, penalty box corners, centre circle intersections, etc. Each detected keypoint has a known real-world position on a FIFA-standard 105×68m pitch. If the model detects ≥4 keypoints with confidence above 0.3, we compute a homography using cv2.findHomography with RANSAC.
The camera moves. Now what?
Broadcast cameras pan to follow play. A homography computed from frame 0 becomes wrong by frame 50 because the camera has moved. We handle this with a three-tier fallback system:
1. **Fresh keypoint calibration.** Run the keypoint model, detect pitch landmarks, and compute a fresh homography. This is the gold standard: direct point correspondences give the most accurate mapping.
2. **ORB propagation.** When the keypoint model fails (not enough landmarks visible), we use ORB feature matching between the current frame and the last successfully calibrated frame. This estimates how the camera has panned since then, and we compose that frame-to-frame transform with the old homography. Purely classical computer vision, no ML model required.
3. **Reuse the last homography.** If both of the above fail (rare), we reuse the last known homography unchanged. This gets increasingly wrong as the camera moves, so it's a last resort.
Once we have a homography H for the current frame, projecting a player's pixel position to pitch coordinates is one matrix multiply:
```python
p = H @ [pixel_x, pixel_y, 1]                  # homogeneous coordinates
pitch_x, pitch_y = p[0] / p[2], p[1] / p[2]    # perspective divide
```

The division by the third, homogeneous coordinate is easy to forget, and it matters whenever the camera isn't directly overhead. We use the bottom-centre of each bounding box (the player's feet) as the pixel position, since the homography only holds for points on the ground plane.
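In batch form (helper names are ours), with the perspective divide made explicit:

```python
import numpy as np

def project_points(H, pts):
    """Apply homography H to (N, 2) pixel points; returns (N, 2) pitch metres."""
    pts = np.asarray(pts, float)
    homog = np.column_stack([pts, np.ones(len(pts))])  # to homogeneous coords
    out = homog @ H.T
    return out[:, :2] / out[:, 2:3]                    # perspective divide

def feet_position(box):
    """Bottom-centre of an [x1, y1, x2, y2] bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)
```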
Two post-processing steps clean up the output:
- Clamping — positions that project outside the pitch boundaries (± a few metres tolerance) are discarded as likely calibration errors.
- EMA smoothing — an exponential moving average per track smooths out frame-to-frame jitter. The ball gets less smoothing (α=0.8) than players (α=0.5) because the ball moves faster and we don't want to lag behind it.
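Both steps fit in a few lines. A minimal sketch, assuming a 3 m tolerance for "± a few metres" (the exact value, class shape, and names are illustrative):

```python
PITCH_W, PITCH_H, TOL = 105.0, 68.0, 3.0  # metres; TOL is an assumed tolerance

def clamp(x, y):
    """Discard positions projecting well outside the pitch (likely calibration errors)."""
    if -TOL <= x <= PITCH_W + TOL and -TOL <= y <= PITCH_H + TOL:
        return (x, y)
    return None

class EMA:
    """Per-track exponential moving average; a higher alpha follows new samples faster."""
    def __init__(self, alpha):
        self.alpha, self.state = alpha, None

    def update(self, pos):
        if self.state is None:
            self.state = pos
        else:
            a = self.alpha
            self.state = (a * pos[0] + (1 - a) * self.state[0],
                          a * pos[1] + (1 - a) * self.state[1])
        return self.state

player_smoother = EMA(alpha=0.5)  # heavier smoothing for players
ball_smoother = EMA(alpha=0.8)    # lighter smoothing so the ball isn't lagged
```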
This is a hobby project, not a commercial tracking system. Here's where it falls short:
- **No team identification.** COCO's "person" class doesn't tell you which team someone is on. Everyone is labelled "player." Professional systems use jersey colour segmentation or re-identification networks to separate teams.
- **No referee labels.** Same issue: COCO doesn't know what a referee looks like. A soccer-specific model would add these labels, but we couldn't find one that detected reliably enough.
- **Fragile calibration.** The Soccana keypoint model has ~0 downloads on HuggingFace. It works on some camera angles but fails on others. The ORB propagation helps fill gaps, but it accumulates drift over time. This is the weakest link in the pipeline.
- **Ball detection.** Even YOLOv8x struggles with the ball: it's tiny, fast, and frequently occluded. Professional systems use multi-frame temporal context and dedicated ball detectors.
- **Camera cuts.** Broadcast footage has replay cuts, close-ups, and multi-angle switches. Each cut requires a fresh calibration and resets all track IDs. We don't detect cuts automatically.
Professional tracking data companies produce centimetre-accurate position data at 25 fps for every player, referee, and the ball. They do it very differently from us. Here's who they are and what they do:
Hawk-Eye doesn't use broadcast footage at all. They install dedicated multi-camera rigs in stadiums — typically 8–12 synchronised high-resolution cameras pointed at the pitch from fixed positions. Each camera is precisely calibrated during installation using surveyed ground-truth points.
With multiple calibrated cameras, they use triangulation — the same player is visible from 3+ cameras, and the intersection of view rays gives the 3D position within centimetres. No homography estimation needed per-frame because the cameras don't move.
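Our pipeline doesn't do any of this, but the ray-intersection idea is compact enough to sketch: a least-squares estimate of the 3D point closest to all camera view rays (helper name and setup are illustrative, not Hawk-Eye's actual method):

```python
import numpy as np

def triangulate(origins, directions):
    """Least-squares 3D point closest to a set of camera rays.

    origins:    (N, 3) camera centres
    directions: (N, 3) view rays toward the player (need not be unit length)
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projects onto the plane perpendicular to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)
```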
Second Spectrum takes a similar multi-camera approach to Hawk-Eye, with more emphasis on machine learning post-processing. Their tracking feeds into proprietary models that compute real-time expected possession value (EPV), pass probability fields, and tactical metrics.
They were acquired by Genius Sports for $200M, giving you a sense of the commercial value of accurate tracking data.
SkillCorner is the closest to what we're doing — they extract tracking data from broadcast video only, no dedicated cameras. They cover 90+ competitions worldwide because they only need the TV feed.
The difference: they have a team of ML engineers, proprietary models trained on millions of labelled frames, temporal models that handle camera cuts, and jersey number recognition. Their calibration uses learned camera parameter regression rather than simple keypoint matching.
SkillCorner tracking data is what powers the STGNN predictions on this site (see the SkillCorner article).
Stats Perform traditionally provided event data (passes, shots, tackles logged by human operators), not tracking data. They've since added optical tracking via their acquisition of ChyronHego.
Their setup is similar to Hawk-Eye — fixed stadium cameras with pre-calibrated intrinsics/extrinsics. They also produce "derived" tracking from broadcast footage for leagues where they can't install cameras.
Not a company but the academic standard for soccer video analysis. SoccerNet provides large-scale annotated datasets and runs annual challenges for action spotting, player tracking, camera calibration, and game state reconstruction.
Their calibration challenge has produced the best open-source approaches for broadcast camera estimation. Unfortunately, the winning models aren't packaged as easy-to-use libraries — they're research code with complex dependencies. Integrating SoccerNet calibration would be the single biggest improvement to our pipeline.
| Component | Our Pipeline | Professional Systems |
|---|---|---|
| Cameras | Single broadcast feed | 8–12 dedicated stadium cameras (Hawk-Eye) or broadcast + proprietary ML (SkillCorner) |
| Calibration | Amateur keypoint model + ORB propagation | Pre-surveyed camera positions (Hawk-Eye) or learned camera parameter regression (SkillCorner) |
| Detection | COCO YOLOv8x (general-purpose) | Custom models trained on millions of labelled sports frames |
| Team ID | None | Jersey colour segmentation, number recognition, re-ID networks |
| Ball tracking | Frame-by-frame YOLO detection | Temporal models, trajectory prediction, multi-camera triangulation |
| Accuracy | ~5–15m position error (estimated) | <10cm (Hawk-Eye), ~1m (SkillCorner broadcast) |
| Camera cuts | Breaks everything | Detected automatically, tracks bridged across cuts |
| Cost | ~$0.02 per video (Modal GPU seconds) | $50K–$500K per season per competition |
- Modal.com — serverless GPU (T4)
- FastAPI — endpoint wrapped by Modal
- Next.js API route — frontend proxy
- YOLOv8x — detection (Ultralytics, COCO)
- ByteTrack — tracking (via supervision)
- Soccana_Keypoint — pitch keypoints (HuggingFace)
- OpenCV ORB — homography propagation
The biggest improvement would be replacing the Soccana keypoint model with a proper camera calibration system — either SoccerNet's calibration approach (which regresses camera parameters directly) or training a custom keypoint detector on the SoccerNet keypoint dataset. The detection and tracking stages are already solid; calibration is where the quality ceiling is.