Who Will Win the World Cup? A Quantitative Approach.

2022 FIFA World Cup betting odds and what they tell us about each team’s likelihood of success.
Tags: python, altair, visualisation, football

Published: November 19, 2022

This year’s World Cup seems a bit like the dish that nobody ordered. Inconvenient ethical questions aside, the World Cup is traditionally one of the events that attract the highest volume of sports betting. To put things into perspective, on the seemingly bland question “Will Ecuador reach the Quarter Final?”, more than 5,084 euros have been wagered on one online betting platform alone. Across all of the games of the FIFA World Cup, more than 10 million euros’ worth of bets have been placed on the same platform. As a data scientist, it’s exciting to observe how all of that punting, in aggregate, produces a fairly comprehensive collective valuation of each team’s chances of success.

The big question is of course: “Who will win the World Cup?” Crunching the numbers, one can work out the probabilities implied by the odds offered for each team. As of 9 am today, the implied probabilities were as follows:

Code
import altair as alt
import pandas as pd

STAGE_ORDER = ['win', 'final', 'semi', 'quarter', 'qualify']
STAGE_ORDER_MAP = {v: k for k, v in enumerate(STAGE_ORDER[::-1])}

STAGE_NAME_MAP = {
    "win": "Win the Tournament",
    "final": "Reach the Finals",
    "semi": "Reach the Semi-Finals",
    "quarter": "Reach the Quarter-Finals",
    "qualify": "Reach the Round of 16"
}

last = pd.read_csv('teams.20221120.csv')

chart = last.copy().loc[:, ['country', 'p', 'stage']]
chart['stage_order'] = chart.stage.apply(lambda b: STAGE_ORDER_MAP[b])
chart['stage_nice'] = chart.stage.apply(lambda b: STAGE_NAME_MAP[b])
chart['y'] = 0
chart = chart.sort_values('stage_order')

country_order = list(chart[chart.stage == 'win'].sort_values('p', ascending=False)['country'])

alt.Chart(chart[chart.stage == 'win']).mark_bar(size=12).encode(
    x=alt.X('country:O', axis=alt.Axis(title=""), sort=country_order),
    y=alt.Y('p:Q', axis=alt.Axis(title="Probability", format="%")),
    y2='y',
    color=alt.Color(
        field='stage_nice', type='ordinal', sort=[STAGE_NAME_MAP[b] for b in STAGE_ORDER][::-1],
        legend=None, scale=alt.Scale(scheme='greens')
    ),
    tooltip = [
        alt.Tooltip('country', title="Country"), 
        alt.Tooltip('p', title="Probability to Win", format=".2%")
    ]
).properties(
    width={"step": 16}
)

To see how these probabilities are derived, consider Brazil. As of today, bookmakers are offering a bet on Brazil winning the tournament with a payout of 4.4 times the wager, while a bet against Brazil winning the tournament is offered at odds of 4.5. To come up with a probability, we first take the midpoint of those two numbers, which is 4.45. We then invert that number, which gives an implied probability of 1/4.45, or about 22.47%, for Brazil to win the World Cup.
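The midpoint-and-invert calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `implied_probability` and its parameter names are made up for this example and do not come from the code behind the charts.

```python
def implied_probability(back_odds: float, lay_odds: float) -> float:
    """Implied probability from the midpoint of two decimal odds.

    back_odds: payout multiple offered for betting on the outcome.
    lay_odds:  odds offered for betting against the outcome.
    Both names are hypothetical labels for the two quotes in the text.
    """
    if back_odds <= 1 or lay_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    midpoint = (back_odds + lay_odds) / 2
    # Inverting decimal odds gives the probability they imply.
    return 1 / midpoint


brazil = implied_probability(4.4, 4.5)
print(f"{brazil:.4%}")  # 22.4719%
```

Taking the midpoint is a simple way to settle on a single number when two quotes exist for the same outcome; other conventions are possible and would give slightly different probabilities.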

As mentioned, there are a lot more events that people can and do bet on. From those, it is possible to determine the odds of any team reaching the round of 16, quarter-finals, semi-finals and finals. All of these odds for each team are shown in the chart below:

Code
alt.Chart(chart).mark_bar(size=12).encode(
    x=alt.X('country:O', axis=alt.Axis(title=""), sort=country_order),
    y=alt.Y('p:Q', axis=alt.Axis(title="Probability", format="%")),
    y2='y',
    color=alt.Color(
        field='stage_nice', type='ordinal', sort=[STAGE_NAME_MAP[b] for b in STAGE_ORDER][::-1],
        legend=alt.Legend(title="Position"), scale=alt.Scale(scheme='greens')
    ),
    tooltip = [
        alt.Tooltip('country', title="Country"),
        alt.Tooltip('stage_nice', title="Position"),
        alt.Tooltip('p', title="Probability", format=".2%")
    ]
).properties(
    width={"step": 16}
)
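
Because the stages are nested (a team cannot win the tournament without reaching the final, and so on), each team’s probabilities should never decrease when moving from “win” towards “qualify”. Below is a minimal sketch of such a sanity check, assuming a table with the same `country`, `stage` and `p` columns used above. The function name `inconsistent_teams` is hypothetical, and the sample numbers are placeholders for two fictional teams, not real odds.

```python
import pandas as pd

# Hardest outcome first, easiest last (same order as in the charts above).
STAGE_ORDER = ['win', 'final', 'semi', 'quarter', 'qualify']


def inconsistent_teams(df: pd.DataFrame) -> list:
    """Return countries whose probability drops for an easier stage."""
    wide = df.pivot(index='country', columns='stage', values='p')[STAGE_ORDER]
    # Left to right the stages get easier, so each step must be >= 0.
    steps = wide.diff(axis=1).iloc[:, 1:]
    ok = (steps >= 0).all(axis=1)
    return sorted(ok[~ok].index)


# Placeholder data for two fictional teams (NOT real betting odds).
sample = pd.DataFrame({
    'country': ['Team A'] * 5 + ['Team B'] * 5,
    'stage': STAGE_ORDER * 2,
    'p': [0.05, 0.10, 0.20, 0.35, 0.60,   # consistent
          0.05, 0.04, 0.20, 0.35, 0.60],  # 'final' below 'win'
})

print(inconsistent_teams(sample))  # ['Team B']
```

Small violations can occur in real data because each market is priced separately, so a failed check points to noise in the quotes rather than to a genuine opportunity.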

One thing that stands out here is that the ratios between these probabilities are not the same for every team. Take Serbia and Ecuador, for example: Serbia has a 1.0% implied chance of winning the tournament, while Ecuador has a 0.4% chance. Although Serbia is roughly 2.5 times as likely to win the tournament, its chance of qualifying for the round of 16 (47.4%) is barely higher than Ecuador’s (46.9%). This is because Serbia is in a particularly strong group, while Ecuador is in a relatively weak one. Below are the combined probabilities of winning the tournament for each group:

Code
groups = last[last.stage == 'qualify'].rename(columns = {'bet': 'group'})[['country', 'group']]
groups = last.merge(groups, left_on='country', right_on='country')

win = groups[groups.bet == 'win']
# Sum only the probability column; summing the text columns is not meaningful.
by_group_chart = win.groupby('group', as_index=False)['p'].sum()

# Rank groups by combined probability; the rank becomes the row position (y).
y = by_group_chart.sort_values('p', ascending=False).reset_index(drop=True).reset_index().loc[:, ['group', 'index']]
y.columns = ['group', 'y']

# The position of each country within its group becomes the column position (x).
x = win.assign(x=win.groupby('group').cumcount())[['country', 'x']]

square = win.merge(x, left_on='country', right_on='country')
square = square.merge(y, left_on='group', right_on='group')

labels = alt.Chart(square).mark_text(baseline="middle").encode(
    x=alt.X('x:O', axis=alt.Axis(title='', ticks=False, labels=False, grid=False, domain=False)),
    y=alt.Y('y:O', axis=alt.Axis(title='', ticks=False, labels=False, grid=False, domain=False)),
    text='country',
).properties(
    width=250,
    height=200
)

bars = by_group_chart.copy()
bars.group = bars.group.apply(lambda s: "Group " + s.upper())

bars = alt.Chart(bars).mark_bar().encode(
    x=alt.X('p:Q', axis=alt.Axis(title="Combined Probability to Win", format="%")),
    y=alt.Y('group:N', axis=alt.Axis(title="",), sort='-x'),
    color=alt.Color(field='foo', legend=None, scale=alt.Scale(scheme='greens')),
    tooltip = [
        alt.Tooltip('group', title="Group"), 
        alt.Tooltip('p', title="Combined Probability to Win", format=".2%")
    ]
).properties(
    width=400,
    height=200
)

(labels | bars).configure_axisY(
    labelFontSize=11
).configure_view(
    stroke = "white"
)
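
The Serbia and Ecuador comparison above can also be expressed as a conditional probability. Since a team has to reach the round of 16 in order to win, the chance of winning given that the team survives the group stage is P(win) / P(qualify). The sketch below uses only the four figures quoted in the text; the function name `conditional_win_probability` is made up for this illustration.

```python
def conditional_win_probability(p_win: float, p_qualify: float) -> float:
    """P(win | reach round of 16) = P(win) / P(reach round of 16)."""
    if not 0 < p_qualify <= 1:
        raise ValueError("p_qualify must be in (0, 1]")
    if not 0 <= p_win <= p_qualify:
        raise ValueError("p_win must be between 0 and p_qualify")
    return p_win / p_qualify


# (P(win), P(qualify)) as quoted in the text above.
teams = {"Serbia": (0.010, 0.474), "Ecuador": (0.004, 0.469)}

for name, (p_win, p_qualify) in teams.items():
    print(name, f"{conditional_win_probability(p_win, p_qualify):.2%}")
# Serbia 2.11%
# Ecuador 0.85%
```

Read this way, the market rates Serbia as the considerably stronger side once the group stage is out of the way, even though both teams have a similar chance of getting that far.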

A final question is who will emerge as the tournament’s top goalscorer. Here, Tottenham striker Harry Kane seems to be the consensus favourite, having already won the Golden Boot at the 2018 FIFA World Cup, followed by Kylian Mbappé and Lionel Messi:

Code
goalscorer = pd.read_csv('goalscorer.20221120.csv').sort_values('p', ascending=False)

alt.Chart(goalscorer.iloc[:40]).mark_bar(size=12).encode(
    x=alt.X('country:O', axis=alt.Axis(title="Player"), sort='-y'),
    y=alt.Y('p:Q', axis=alt.Axis(title="Probability", format="%")),
    color=alt.Color(
        field='stage_nice', type='ordinal', sort=[STAGE_NAME_MAP[b] for b in STAGE_ORDER][::-1],
        legend=None, scale=alt.Scale(scheme='greens')
    ),
    tooltip = [
        alt.Tooltip('country', title="Player"), 
        alt.Tooltip('p', title="Probability", format=".2%")
    ]
).properties(
    width={"step": 16}
)

The World Cup kicks off tomorrow with the classic crowd-pleaser Qatar v Ecuador. We will see whether the host nation can defy its 9.8% implied probability of winning. Maybe the Ecuadorians will instead capitalise on their 68.3% implied probability of winning the game, or at least achieve the draw, which has a 21.9% probability. Time will tell! Until then, you can follow this blog for more quantitative analysis of the World Cup and updates on how these probabilities shift as events unfold.