There's a World Cup every four years, and every four years I build a model to predict the winner with whatever tool I happen to be using at the time. Here's this year's.
Eight years ago it was Python. Machine learning was riding high, I was writing a book about it, and a small regression model said Brazil was going to win. The model was simple enough to explain: learn team strengths from past results, turn strength differences into match outcomes, simulate the tournament a large number of times.
Four years ago I was running Neptyne, a programmable spreadsheet startup, so the World Cup model became a spreadsheet. Same general idea, but now the Monte Carlo simulation lived inside a sheet. This was partly because spreadsheets are a good place to poke at assumptions, and partly because when you run a spreadsheet company every problem looks suspiciously tabular.
These days all tech is AI, so you might expect this year's Football Predictions model to be powered by an LLM. It is, but the LLM is not doing the prediction. The prediction runs on classic machinery: ELO ratings, expected goals, a Poisson-ish score model, host advantage and a tournament simulator. You could hand-code this, and if I had, the code would have looked better and the design worse.
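To make that classic machinery concrete, here is a minimal sketch of one match, not the code behind the actual model. It assumes the usual Elo expected-score curve, splits a fixed goal budget by that share, and draws each side's goals from a Poisson distribution. The function names, the 1.35 goals per team baseline and the host bonus in Elo points are all assumptions made up for illustration.

```python
import math
import random

def expected_goals(elo_a, elo_b, host_bonus=0.0, base_goals=1.35, scale=400.0):
    """Map an Elo gap to expected goals per side (illustrative constants)."""
    diff = elo_a + host_bonus - elo_b
    # Standard Elo expected score for side A, a number between 0 and 1.
    share = 1.0 / (1.0 + 10 ** (-diff / scale))
    total = 2 * base_goals
    return total * share, total * (1.0 - share)

def poisson(lam, rng):
    """Draw a Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_match(elo_a, elo_b, rng, host_bonus=0.0):
    """Return one simulated scoreline (goals_a, goals_b)."""
    lam_a, lam_b = expected_goals(elo_a, elo_b, host_bonus)
    return poisson(lam_a, rng), poisson(lam_b, rng)

# Hypothetical ratings; side A is the host and gets a 50-point bonus.
rng = random.Random(2026)
print(simulate_match(1950, 1750, rng, host_bonus=50))
```

A tournament simulator is then mostly this function called in a loop over the groups and the bracket, repeated many times to count how often each team lifts the trophy.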
The real superpower LLMs bring to this sort of project is data work. Building a tournament definition from what FIFA publishes, or from the relevant Wikipedia article, used to be annoying enough that I would only do it for the current tournament. Groups, brackets, hosts, match schedules, tiebreakers, old country codes, weird tournament formats: none of it is hard, exactly. It is just the kind of work that makes you suddenly remember an important email.
Goose does it in a few minutes if you prompt it correctly. More importantly, it keeps going. Once it can turn one tournament into YAML, it can turn every World Cup back to 1930 and every Euro back to 1960 into YAML. Then it can write the replay test that checks whether loading the actual scores produces the actual champion.
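A replay test of that kind can be sketched as follows. The dict stands in for what a parsed YAML file might look like; the schema (the `winner:` references, the `penalties` field) is hypothetical and not the format the real project uses. The scores are the Euro 1992 knockout matches.

```python
# Hypothetical schema: roughly what a parsed YAML tournament file might contain.
EURO_1992_KNOCKOUT = {
    "name": "Euro 1992 (knockout stage only)",
    "matches": [
        {"id": "SF1", "home": "Sweden", "away": "Germany", "score": [2, 3]},
        {"id": "SF2", "home": "Netherlands", "away": "Denmark",
         "score": [2, 2], "penalties": [4, 5]},
        {"id": "F", "home": "winner:SF2", "away": "winner:SF1", "score": [2, 0]},
    ],
    "champion": "Denmark",
}

def resolve(slot, winners):
    """Turn 'winner:SF1' into a team name; plain team names pass through."""
    if slot.startswith("winner:"):
        return winners[slot.split(":", 1)[1]]
    return slot

def replay(tournament):
    """Feed the recorded scores through the bracket and return the champion."""
    winners = {}
    for match in tournament["matches"]:
        home = resolve(match["home"], winners)
        away = resolve(match["away"], winners)
        home_goals, away_goals = match["score"]
        if home_goals == away_goals:
            # A drawn knockout match is decided by the recorded shoot-out.
            home_goals, away_goals = match["penalties"]
        winners[match["id"]] = home if home_goals > away_goals else away
    # By convention in this sketch, the last listed match is the final.
    return winners[tournament["matches"][-1]["id"]]

def test_replay_produces_actual_champion():
    assert replay(EURO_1992_KNOCKOUT) == EURO_1992_KNOCKOUT["champion"]
```

The point of the test is less the football than the data: if a bracket reference or a score is wrong in the file, the replayed champion stops matching the recorded one.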
Once the old tournaments are in the same format as the new one, backtesting becomes just another button. Run the model before each tournament, compare it with what actually happened, and see which assumptions survive.
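One way to score such a backtest, sketched here under my own assumptions rather than taken from the tool, is the average log loss on the eventual champion: how much probability did the pre-tournament forecast give to the team that actually won? The function name and the numbers below are made up for illustration.

```python
import math

def champion_log_loss(forecasts, champions, floor=1e-6):
    """Average negative log probability assigned to the actual champions.

    forecasts: {tournament: {team: pre-tournament title probability}}
    champions: {tournament: team that actually won}
    Lower is better; confident wrong picks are punished hardest.
    """
    losses = []
    for tournament, probs in forecasts.items():
        # Floor the probability so an unlisted champion does not give log(0).
        p = max(probs.get(champions[tournament], 0.0), floor)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Made-up numbers for two imaginary tournaments, for illustration only.
champions = {"T1": "Favorite", "T2": "Outsider"}
overconfident = {
    "T1": {"Favorite": 0.60, "Outsider": 0.02},
    "T2": {"Favorite": 0.60, "Outsider": 0.02},
}
hedged = {
    "T1": {"Favorite": 0.35, "Outsider": 0.08},
    "T2": {"Favorite": 0.35, "Outsider": 0.08},
}
print(champion_log_loss(overconfident, champions))
print(champion_log_loss(hedged, champions))
```

With these toy inputs the hedged forecast scores better, because the one upset costs the overconfident forecast more than its correct pick earns back.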
One result is that these models tend to be too sure about the winner, while doing pretty well at predicting the last four. Euro 1992, where Denmark beat Germany in the final, is the canonical example. Making the ELO difference matter less in the final games of the tournament increases the accuracy of the model.
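Damping the rating gap in later rounds can be as simple as multiplying it by a per-stage weight before it goes into the win probability. This is a sketch of the idea only; the stage names and weights are invented, and the real model may do it differently.

```python
# Illustrative weights, not tuned values: 1.0 means the full Elo gap counts.
ROUND_WEIGHT = {
    "group": 1.0,
    "round_of_16": 1.0,
    "quarterfinal": 0.9,
    "semifinal": 0.8,
    "final": 0.7,
}

def win_probability(elo_a, elo_b, stage, scale=400.0):
    """Elo win probability for side A, with the gap shrunk in later rounds."""
    diff = (elo_a - elo_b) * ROUND_WEIGHT[stage]
    return 1.0 / (1.0 + 10 ** (-diff / scale))

# Hypothetical ratings: the same 200-point favorite in a group game and a final.
print(win_probability(2000, 1800, "group"))
print(win_probability(2000, 1800, "final"))
```

The favorite is still the favorite in the final, just by less, which leaves more room for a Denmark.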
This is where Gary Lineker's old line starts to wobble: "Football is a simple game; 22 men chase a ball for 90 minutes and at the end, the Germans always win." Funny, and for a while emotionally true. But between 1972 and 1992, the model expects Germany and West Germany to win even more trophies than they did.
The new version has knobs for ELO strength, host advantage, recent form and later-round randomness. You can edit team ratings, pin scores, load historical results and backtest your settings against the past. You can argue with the model and then make the old tournaments check your work.
And if you find a setting that convinces you, Kalshi is ready to take your money.