Building an AI That Watches Rugby
There's a gap in rugby data.
We've got the big moments covered: the tries, conversions, and cards. Structured event feeds do a good job of telling you what happened.
But they're not so good at telling you why.
At Gainline, we build our entire app around context. We want to give rugby fans a second-screen experience that feels alive, a commentary that goes deeper than the scoreline. We already pull in weather data, team stats, and player profiles. We enrich it with AI-generated summaries.
But we're limited by the data we get.
We don't know why the ref blew the whistle. We can't tell if a prop is quietly dominating the scrum. We miss what the ref said to the captain. And that's a problem, because these moments matter when you're trying to tell the full story of a match.
So we asked ourselves a simple question:
What if we could watch the game and generate the data ourselves?
That led me down a really fun rabbit hole.
In this post, I'll show you how I built a prototype system that watches a rugby game using AI. We'll look at how we extracted the score and game clock from the broadcaster's UI, how we used Whisper to transcribe referee and commentary audio, and what we learned about running these kinds of experiments cheaply and effectively.
It's scrappy, but it works.
Context is Everything
Gainline is a clean, well-designed experience that gives fans what they need. We pull together data from a range of providers (live scores, player stats, team histories) and try to tell a richer story about what's happening on the pitch.
Most of it works well. If you want to know who scored the last try, who the fly-half is, or who's made the most carries, we've got you covered.
But rugby is messy.
A lot happens between structured events. Penalties go unexplained. Players work relentlessly in ways that never show up in the stats. Props spend 80 minutes blowing their lungs out, maybe earning a mention if they score.
And we canβt see any of it.
That's frustrating, because it limits our AI-generated summaries. If all we know is that a penalty occurred, we can't say why. We can't spot a breakdown nightmare or a dominant scrum.
The best rugby minds don't just watch the ball; they read the whole game.
That's what we want Gainline to do.
The Idea
What if we could watch the game ourselves?
Not literally. We can't hire analysts to watch every match and enter data manually.
But AI? That just might work.
The plan was simple.
Take a video of a rugby game. Slice it into screenshots, one every five seconds. Feed those frames into OpenAI's vision model and ask it what's going on.
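For the slicing step, here's a minimal sketch in Python, calling ffmpeg via subprocess. The video path, output directory, and frame naming are placeholders rather than the exact setup from the prototype:

import subprocess
from pathlib import Path

VIDEO = "match.mp4"          # placeholder path to the broadcast recording
OUT_DIR = Path("frames")
OUT_DIR.mkdir(exist_ok=True)

# fps=1/5 asks ffmpeg for one frame every five seconds.
subprocess.run(
    [
        "ffmpeg", "-i", VIDEO,
        "-vf", "fps=1/5",
        str(OUT_DIR / "frame_%06d.png"),
    ],
    check=True,
)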
Can We Read the Score?
We started with a lazy approach: what can we detect easily? Let's begin with the basics.
What's the score? What does the game clock say?
But I was also curious: what can the model really see?
Here's the prompt I used (built through a separate refinement process, another post for another day!):
You are an AI that analyzes screenshots of rugby matches. Your task is to visually interpret the image and extract structured game information, including the score, time, team names, and match phase (e.g., first half, second half, full time). Return the data in a clear, structured format suitable for programmatic use (e.g., JSON). Focus on identifying all elements that summarize the current state of the match.
The result:
{
  "home_team": "Bath Rugby",
  "away_team": "Harlequins",
  "home_team_abbreviation": "BTH",
  "away_team_abbreviation": "HAR",
  "score": {
    "BTH": 0,
    "HAR": 0
  },
  "time_elapsed": "00:36",
  "match_phase": "first_half",
  "competition": "Gallagher Premiership",
  "current_play": "ruck",
  "bath_team_kit": "dark blue with light blue accents",
  "harlequins_team_kit": "white with green shorts and multicolor accents"
}
It worked. Remarkably well.
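For reference, here is roughly how each frame gets sent to the model. It's a minimal sketch against the OpenAI Chat Completions API; the model name and file paths are assumptions, and the prompt is abbreviated:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "You are an AI that analyzes screenshots of rugby matches..."  # full prompt above

def analyse_frame(path: str) -> str:
    # Images go to the vision model as base64-encoded data URLs.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable model will do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(analyse_frame("frames/frame_000015.png"))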
But vision models price their API calls based on context size: the number of tokens an image turns into. Sending a full-resolution screenshot every five seconds gets expensive fast.
So the next challenge became: how do we do this cheaper?
Reducing Context Size
Letβs zoom in on the essentials. What if we only want the score and elapsed time?
If we crop the image down to just the scoreboard, we can dramatically reduce the size and the cost.
I first asked the model to return the pixel coordinates of the scoreboard.
It didn't work.
I couldn't get a reliable bounding box.
I'm not exactly sure why. I tried several approaches. I thought maybe the image was being resized internally, so I switched to asking for percentages instead of pixel values, but the results were still off.
Then I realised: I didn't need a bounding box.
The scoreboard always appears in one of the corners. Cropping to that corner gave me a 75% reduction in image size.
I updated the prompt. It worked perfectly. Cheap, reliable, and it didn't require complex image processing.
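The crop itself is trivial. A minimal sketch with Pillow, assuming the scoreboard sits in the top-left of this broadcast and that keeping the top-left quarter of the frame is enough (hence the 75% reduction):

from PIL import Image

def crop_scoreboard(path: str, out_path: str) -> None:
    # Assumption: the overlay lives in the top-left corner, so the
    # top-left quarter of the frame is all the model needs to see.
    img = Image.open(path)
    w, h = img.size
    img.crop((0, 0, w // 2, h // 2)).save(out_path)

crop_scoreboard("frames/frame_000015.png", "frames/frame_000015_crop.png")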
Isn't It Just a Diff?
Do we really need a language model to find the scoreboard?
Broadcasts usually place the scoreboard in a consistent location, often the top-left or top-right. Could we just diff two frames (one with the scoreboard, one without) to detect the UI?
In theory, yes.
The static background would cancel out, leaving only the overlay.
Here's the command:
magick compare -highlight-color Red -lowlight-color Black \
-compose src frame_000015.png frame_000016.png diff.png
It's rough, but you can see it working. We clearly identify a corner. We can crop it or add padding and target only the pixels that change.
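Turning that diff into a crop region is straightforward too. A rough sketch with Pillow, assuming diff.png is the output of the compare command above (changed pixels highlighted, static background black); in practice some noise filtering would likely be needed:

from PIL import Image

# Changed pixels are red, static background is black, so the bounding box
# of the non-black pixels is (roughly) the scoreboard overlay.
diff = Image.open("diff.png").convert("L")
bbox = diff.getbbox()  # (left, upper, right, lower), or None if no change

if bbox:
    pad = 10  # a little breathing room around the overlay
    left, upper, right, lower = bbox
    frame = Image.open("frame_000015.png")
    w, h = frame.size
    frame.crop((
        max(left - pad, 0), max(upper - pad, 0),
        min(right + pad, w), min(lower + pad, h),
    )).save("scoreboard.png")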
This whole project is about finding the simplest, most reliable way to generate rugby data.
And if that means using less AI β even better.
Do We Need an LLM at All?
We started with large language models because they were the easiest tool to prototype with. I could send an image to OpenAI's vision model, describe what I wanted, and get useful results.
But I started wondering: do we even need an LLM here?
We're just trying to extract text from a predictable area: the scoreboard.
So I tried tesseract, an open-source OCR tool, to get the score and clock.
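The attempt looked roughly like this, using the pytesseract bindings on the cropped scoreboard image (the file name and pre-processing here are placeholders):

import pytesseract
from PIL import Image

# Run OCR over the cropped scoreboard. In practice, grayscale conversion,
# upscaling, and thresholding would probably help with blurry frames.
img = Image.open("scoreboard.png").convert("L")
text = pytesseract.image_to_string(img)
print(text)  # ideally something like "BTH 0  HAR 0  00:36"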
It kind of worked. But not well enough.
The problem was quality. Blurry frames, low resolution, and complex overlays made OCR tricky. When it worked, it worked well. But when it failed, it didn't extract anything useful.
Maybe it would do better with higher-quality streams or some pre-processing, but in my test setup it wasn't reliable.
So for now, the LLM stays.
Bonus: Listening to the Game
Once I had the score and clock, I turned to the audio.
Rugby broadcasts are full of context:
- The referee mic explains decisions.
- The commentary adds subjective analysis.
- The crowd adds atmosphere.
I used OpenAI Whisper to transcribe the audio. It worked brilliantly, giving me timestamped commentary I could use to enrich the structured data.
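A minimal sketch using the open-source openai-whisper package; the audio file name and model size are assumptions, and the audio would first need to be pulled out of the broadcast (e.g. with ffmpeg):

import whisper

# Transcribe the broadcast audio; each segment comes back with start/end
# timestamps that can be lined up against the game clock from the frames.
model = whisper.load_model("base")
result = model.transcribe("match_audio.mp3")

for segment in result["segments"]:
    print(f"[{segment['start']:7.1f}s] {segment['text'].strip()}")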
Now I could highlight a prop's incredible shift, or capture events that don't show up in a stat feed, like missed penalties, scuffles, or Freddie Burns celebrating too early.
I can't wait to integrate this properly.
Instead of just showing the facts, we can start telling the story.
What's Next?
This is a prototype. It's not production-ready. But it shows what's possible.
Scaling this will be an infrastructure challenge:
- Should we spin up VMs to watch live streams?
- Do we run distributed workers that pull frames and audio?
- How do we handle different broadcasters, formats, and languages?
Then there are the legal and ethical questions.
We're not trying to replace broadcasters or journalists. But if AI can watch a game and summarise it in real time, is that just automated journalism?
It's a question we'll have to answer.
This has been one of the most fun experiments I've worked on in a while.
AI is moving beyond structured data and customer support chatbots. These models are rapidly becoming more capable. As a developer, it's my job to stay close to that evolution: to know what's possible, what's not, and where the limits lie.
For rugby β and for sport more broadly β I think the opportunity is huge.
We can do more with less. Unlock better insights. Tell richer stories. And have way more fun.