This document provides a summary of a notebook that aims to predict the results of the 2020 European Soccer Championships using historical soccer match results data. The notebook loads and explores the data, defines functions to predict individual game results based on a Poisson goal scoring model and past match outcomes, and outlines how these predictions will be used to simulate the tournament to determine an overall winner. The 2020 event was postponed due to the COVID-19 pandemic.
Predictions European Championships 2020
March 29, 2020
1 EC2020 Soccer predictions
The goal of this notebook is to predict the outcome of the 2020 European Championship. The
predictions are based on the results of international soccer games from 1872 to 2020. Since the
main goal of this analysis is to get familiar with the Python language, a simplistic algorithm
is used:
- Look at the (weighted) average number of goals both teams scored in previous encounters
- Model the number of goals these teams score with a Poisson model
- Determine the winner of the game based on the number of goals both teams scored
- Give the teams points based on the result of the game in the group stage, or determine which
  team moves on to the next stage
- Define a function that simulates the entire tournament game by game
- Perform a Monte Carlo simulation
The assumptions that have been made will become clear as we go through the code.
Note: The 2020 European Championship has been postponed due to the spread of the COVID-19 virus.
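The per-game part of the algorithm outlined above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `simulate_game` and `group_points`, the seed and the two average goal counts are made up for this sketch and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=2020)  # seed chosen only for reproducibility

def simulate_game(mean_goals_a, mean_goals_b):
    """Draw one score line from two independent Poisson distributions."""
    goals_a = int(rng.poisson(mean_goals_a))
    goals_b = int(rng.poisson(mean_goals_b))
    if goals_a > goals_b:
        outcome = "team A"
    elif goals_a < goals_b:
        outcome = "team B"
    else:
        outcome = "tie"
    return outcome, goals_a, goals_b

def group_points(outcome):
    """Points for (team A, team B): 3 for a win, 1 each for a tie."""
    return {"team A": (3, 0), "team B": (0, 3), "tie": (1, 1)}[outcome]

# Illustrative averages, not taken from the dataset
outcome, goals_a, goals_b = simulate_game(1.7, 1.2)
print(outcome, goals_a, goals_b, group_points(outcome))
```

Repeating such draws many times and counting the outcomes is what the Monte Carlo step amounts to.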
1.1 Setting up the environment
In order to manage this project in the most efficient way, the following set-up will be used:
- A GitHub repository is used to update the code
- A Python script is used to make sure that the code is self-contained
- This Jupyter notebook is created to document what has been done so far
Next, the necessary packages will be loaded in the environment:
[1]: import random
     from random import shuffle

     import numpy as np
     import pandas as pd
     import matplotlib.pyplot as plt
Now, the data is loaded into the environment:
[2]: data = pd.read_csv("data/results.csv")
1.2 Exploring the dataset
A first step in the exploration of the dataset is to look at the characteristics of the data. This can be
easily done using the describe function from the pandas package:
[3]: print(data.describe())

         home_score    away_score
count  41586.000000  41586.000000
mean       1.745756      1.187587
std        1.753780      1.405323
min        0.000000      0.000000
25%        1.000000      0.000000
50%        1.000000      1.000000
75%        2.000000      2.000000
max       31.000000     21.000000
The description of the data shows a couple of interesting things:
- The mean number of goals scored by the home team is 1.745756, while the mean number of goals
  scored by the away team equals 1.187587.
- The maximum number of goals scored by a home team is 31; the maximum number of goals scored
  by an away team equals 21. These data points should be explored further.
To explore the games with a large number of goals, we sort the data by home goals and by away
goals (descending) and then look at the first 5 observations of each sorted dataframe:
[4]: columns = ['date', 'home_team', 'away_team', 'home_score', 'away_score']
     data.sort_values("home_score", ascending = False)[columns].head(5)

[4]:              date  home_team       away_team  home_score  away_score
     23796  2001-04-11  Australia  American Samoa          31           0
     7902   1971-09-13     Tahiti    Cook Islands          30           0
     10949  1979-08-30       Fiji        Kiribati          24           0
     23793  2001-04-09  Australia           Tonga          22           0
     28838  2006-11-24      Sápmi          Monaco          21           1

[5]: data.sort_values("away_score", ascending = False)[columns].head(5)

[5]:              date       home_team         away_team  home_score  away_score
     27373  2005-03-11            Guam       North Korea           0          21
     25694  2003-06-30            Sark     Isle of Wight           0          20
     36020  2014-06-01          Darfur           Padania           0          20
     14720  1987-12-15  American Samoa  Papua New Guinea           0          20
     36025  2014-06-02          Darfur     South Ossetia           0          19
The extreme values for the number of goals scored appear to be valid observations. For example,
the game between Guam and North Korea played on 11 March 2005 did indeed finish with a goal
difference of 21 goals:
img/GuamNK.png
The number of goals that were scored by home and by away teams can be compared by plotting
two histograms.
[6]: plt.subplot(1, 2, 1)
     plt.hist(data.loc[:,'home_score'].values)
     plt.subplot(1, 2, 2)
     plt.hist(data.loc[:,'away_score'].values)
     plt.show()
Both plots look very similar, indicating that both away and home teams are expected to score only
a couple of goals. Large numbers of goals are very rare. This matches our everyday expectations:
a team rarely scores more than 5 goals in one game.
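As a rough cross-check of how rare such scores should be, one can compute the probability of more than 5 goals under a Poisson distribution whose mean equals the averages reported by describe. The helper name `poisson_tail` is made up for this sketch, and treating goals as Poisson distributed is the modelling assumption this notebook relies on, not a property of the data.

```python
import math

def poisson_tail(mean_goals, threshold):
    """P(X > threshold) for X ~ Poisson(mean_goals)."""
    cdf = sum(math.exp(-mean_goals) * mean_goals**k / math.factorial(k)
              for k in range(threshold + 1))
    return 1.0 - cdf

# Means taken from the describe output above
p_home = poisson_tail(1.745756, 5)
p_away = poisson_tail(1.187587, 5)
print(f"P(home > 5) = {p_home:.4f}, P(away > 5) = {p_away:.4f}")
```

Both probabilities are well below 2%, which is in line with the histograms.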
1.3 Predicting the European Championship 2020
The outcome of the European Championship will be predicted by assigning a probability of winning
the tournament to every team and picking the team with the highest probability.
1.3.1 Predicting individual game results
In order to predict the outcome of the European Championship, it is necessary to predict the
individual results of the different games. In this part, we predict the result of a game
between two countries by looking at past results of games between these countries. It is thus
necessary to look at how much data we have for the different pairings of countries.
First we have to filter the data to only include countries that are participating in the
European Championship:
[7]: # I will assume that Hungary, Slovakia, Scotland and Georgia passed the play-offs
     countryList = ["England", "Italy", "Switzerland", "Turkey", "Wales", "Belgium",
                    "Denmark", "Finland", "Russia", "Austria", "Netherlands", "Ukraine",
                    "Croatia", "Czech Republic", "Poland", "Spain", "Sweden", "France",
                    "Germany", "Portugal", "Hungary", "Slovakia", "Scotland", "Georgia"]
     # .copy() so that the columns added later do not write into a view of data
     ECdata = data.loc[data['home_team'].isin(countryList)
                       & data['away_team'].isin(countryList)].copy()
     del data
Next we look at the combinations between the different countries and try to identify how many
games they have played against each other:
[8]: print(ECdata.groupby(['home_team', 'away_team']).size())

home_team  away_team
Austria    Belgium            6
           Croatia            4
           Czech Republic     3
           Denmark            5
           England           12
                             ..
Wales      Spain              3
           Sweden             3
           Switzerland        3
           Turkey             3
           Ukraine            1
Length: 536, dtype: int64
Exploring this output shows that the grouping is not yet what we need. At the moment, a game
between e.g. Austria and Belgium is not treated as the same pairing as a game between Belgium
and Austria. For the purpose of our analysis, the order of the teams should not matter. We can
resolve this by going through the observations and putting the country that comes first
alphabetically in the first team column (team1).
[9]: # Put the alphabetically first country of every game in the team1 column
     homeFirst = ECdata['home_team'] < ECdata['away_team']
     ECdata['team1'] = np.where(homeFirst, ECdata['home_team'], ECdata['away_team'])
     ECdata['team1_score'] = np.where(homeFirst, ECdata['home_score'], ECdata['away_score'])
     ECdata['team2'] = np.where(homeFirst, ECdata['away_team'], ECdata['home_team'])
     ECdata['team2_score'] = np.where(homeFirst, ECdata['away_score'], ECdata['home_score'])
     ECdata = ECdata.loc[:, ['date', 'team1', 'team1_score', 'team2', 'team2_score',
                             'neutral']]
     print(ECdata.groupby(['team1', 'team2']).size())

team1        team2
Austria      Belgium           14
             Croatia            5
             Czech Republic     5
             Denmark            9
             England           18
                               ..
Switzerland  Ukraine            2
             Wales              7
Turkey       Ukraine            9
             Wales              6
Ukraine      Wales              3
Length: 271, dtype: int64
Now the pairings are correctly matched. There are 271 distinct pairings of two European teams
for which we have data.
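The reordering step can be checked on a tiny example. The four results below are invented for illustration, and the variable names `games`, `home_first`, `ordered` and `pair_counts` are specific to this sketch. Note that the comparison uses the full team names, so pairs that share a first letter (such as Spain and Sweden) are ordered correctly as well.

```python
import numpy as np
import pandas as pd

# Made-up results for illustration only
games = pd.DataFrame({
    "home_team": ["Belgium", "Austria", "Sweden", "Spain"],
    "away_team": ["Austria", "Belgium", "Spain", "Sweden"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 1],
})

# True where the home team comes first alphabetically (full-string comparison)
home_first = games["home_team"] < games["away_team"]
ordered = pd.DataFrame({
    "team1": np.where(home_first, games["home_team"], games["away_team"]),
    "team1_score": np.where(home_first, games["home_score"], games["away_score"]),
    "team2": np.where(home_first, games["away_team"], games["home_team"]),
    "team2_score": np.where(home_first, games["away_score"], games["home_score"]),
})

# Both orderings of a pairing now end up in the same group
pair_counts = ordered.groupby(["team1", "team2"]).size()
print(pair_counts)
```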
The simplest possible prediction
Let’s start by defining a simple game prediction function that predicts the outcome of a soccer
game. This function uses the following algorithm:
- It looks at all the past games that have been played between the two teams
- Based on the number of goals scored, it assigns a winner to each game
- It aggregates the data and looks at which team has won most of the games. If both teams have
  won the same number of games, or there is no information available, it predicts a tie.
Let’s start by adding the indicator columns team1_won and team2_won to the dataframe:
[10]: ECdata['team1_won'] = np.where(ECdata['team1_score'] > ECdata['team2_score'], 1, 0)
      ECdata['team2_won'] = np.where(ECdata['team1_score'] < ECdata['team2_score'], 1, 0)
The next step is aggregating the results:
[11]: # Only sum the win indicators; the date column cannot be summed meaningfully
      ag = ECdata.groupby(['team1', 'team2'], as_index=False)[['team1_won', 'team2_won']].sum()
      ag = ag[['team1', 'team1_won', 'team2', 'team2_won']]
      print(ag)

           team1  team1_won           team2  team2_won
0        Austria          9         Belgium          2
1        Austria          0         Croatia          5
2        Austria          2  Czech Republic          2
3        Austria          4         Denmark          4
4        Austria          4         England         10
..           ...        ...             ...        ...
266  Switzerland          0         Ukraine          0
267  Switzerland          5           Wales          2
268       Turkey          4         Ukraine          2
269       Turkey          2           Wales          3
270      Ukraine          1           Wales          0

[271 rows x 4 columns]
Now that the results are aggregated, it is trivial to assign a winner to every game:
[12]: ag['winner'] = np.where(ag['team1_won'] > ag['team2_won'], ag['team1'], "tie")
      ag['winner'] = np.where(ag['team1_won'] < ag['team2_won'], ag['team2'], ag['winner'])
      ag = ag[['team1', 'team2', 'winner']]
      ag.head(5)

[12]:      team1           team2   winner
      0  Austria         Belgium  Austria
      1  Austria         Croatia  Croatia
      2  Austria  Czech Republic      tie
      3  Austria         Denmark      tie
      4  Austria         England  England
What follows next is the function that will be used to determine which team won the game. Note
that this function needs as input a table that shows which team wins against which team:
[13]: def matchWinner(wintable, team1, team2):
          '''
          This function determines the outcome of the football game
          between team1 and team2 based on the provided win table.

          Parameters
          ----------
          wintable : DataFrame
              a table that contains the team names and the result of the game
          team1 : String
              The first team name
          team2 : String
              The second team name

          Returns
          -------
          output : String
              The name of the winner of the game or "tie"
          '''
          # The win table stores every pairing in alphabetical order, so compare
          # the full names (not only the first letter) before the lookup
          if team1 > team2:
              team1, team2 = team2, team1
          match = wintable[(wintable['team1'] == team1) & (wintable['team2'] == team2)]
          # No previous encounter in the table: predict a tie
          if match.empty:
              return "tie"
          return match['winner'].iloc[0]

      print(matchWinner(ag, "Belgium", "Austria"))

Austria
This function only allows us to determine the winner of the game if there is data about a previous
encounter between the teams available. If there is no encounter between the teams in the dataset,
the algorithm assumes that the game will result in a tie. A limitation of this function is that it is
not possible to model the actual outcome of the game, it is only possible to determine the winner.
It is thus impossible to calculate goal differences and other relevant information.
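As an aside, the lookup itself does not need to depend on the order of the two team names. A minimal sketch, assuming a win table with the same three columns as `ag`: the names `lookup` and `match_winner_fast` are hypothetical, and the two rows are copied from the output shown earlier.

```python
import pandas as pd

# Small win table in the same layout as `ag` (rows copied from the output above)
wintable = pd.DataFrame({
    "team1": ["Austria", "Austria"],
    "team2": ["Belgium", "Croatia"],
    "winner": ["Austria", "Croatia"],
})

# frozenset keys make the lookup independent of the order of the two teams
lookup = {frozenset((row.team1, row.team2)): row.winner
          for row in wintable.itertuples(index=False)}

def match_winner_fast(team_a, team_b):
    """Return the stored winner, or "tie" when the pairing is unknown."""
    return lookup.get(frozenset((team_a, team_b)), "tie")

print(match_winner_fast("Belgium", "Austria"))
print(match_winner_fast("Austria", "Wales"))
```

Building the dictionary once also avoids scanning the table for every game, which matters when the function is called many times in a simulation.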
A more complex function
The previous function is very basic and has a couple of limitations:
- It cannot determine the final score of a game, only which team has won the game.
- The model always has exactly the same outcome as long as the data does not change. Hence, it
  is not possible to assign probabilities to which team wins the game.
Both of these problems can be solved by switching to a Poisson model. The Poisson model estimates
the number of goals that each team will score based on the average number of goals that the
teams have scored in previous encounters. Since the model is stochastic, we can perform a Monte
Carlo simulation to assign probabilities to which team will win which game.
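For reference, the model described above can be written down as follows. The symbols are notation introduced here for illustration and do not appear in the notebook: \lambda_1 and \lambda_2 are the average numbers of goals the two teams scored in previous encounters, X_1 and X_2 are the simulated goals. Treating the two scores as independent draws is an assumption of this simple model.

```latex
X_i \sim \mathrm{Poisson}(\lambda_i), \qquad
P(X_i = k) = \frac{\lambda_i^{k}\, e^{-\lambda_i}}{k!}, \quad k = 0, 1, 2, \dots, \qquad
\mathbb{E}[X_i] = \operatorname{Var}(X_i) = \lambda_i, \quad i \in \{1, 2\}
```

Team 1 wins a simulated game when X_1 > X_2, team 2 wins when X_1 < X_2, and the game is a tie when X_1 = X_2.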
The Poisson model is used to construct the win table that is one of the inputs for the
matchWinner function:
[14]: def poissonWintable(data):
          '''
          This function simulates the outcome of the football games
          between teams and constructs a wintable based on the data.

          Parameters
          ----------
          data : DataFrame
              a table that contains the results of previous games

          Returns
          -------
          output : DataFrame
              A DataFrame that contains the team names, the winner and the
              number of goals scored by both teams.
          '''
          # Collect the mean number of goals the teams scored in previous encounters
          scores = ['team1_score', 'team2_score']
          stoch = data.groupby(['team1', 'team2'], as_index=False)[scores].mean()
          # Simulate the number of goals using a Poisson model (one draw per pairing)
          stoch['team1_score'] = np.random.poisson(stoch['team1_score'])
          stoch['team2_score'] = np.random.poisson(stoch['team2_score'])
          # Determine which team has won the game
          stoch['winner'] = np.where(stoch['team1_score'] > stoch['team2_score'],
                                     stoch['team1'], "tie")
          stoch['winner'] = np.where(stoch['team1_score'] < stoch['team2_score'],
                                     stoch['team2'], stoch['winner'])
          # Selecting the relevant output
          stoch = stoch[['team1', 'team2', 'winner', 'team1_score', 'team2_score']]
          return stoch

      # Example result
      # Note: this result changes every time the code is rerun
      poissonWintable(ECdata).head(5)

[14]:      team1           team2   winner  team1_score  team2_score
      0  Austria         Belgium  Austria            2            1
      1  Austria         Croatia      tie            0            0
      2  Austria  Czech Republic  Austria            3            2
      3  Austria         Denmark  Austria            2            0
      4  Austria         England  England            2            5
Since it is now possible to determine the number of goals that both teams score, we can modify
the matchWinner function to also return the score of the game and not only the winner:
[15]: def matchWinner(wintable, team1, team2):
          '''
          This function determines the outcome of the football game
          between team1 and team2 based on the provided win table.

          Parameters
          ----------
          wintable : DataFrame
              a table that contains the team names and the result of the game
          team1 : String
              The first team name
          team2 : String
              The second team name

          Returns
          -------
          output : List
              A list with the winner of the game as well as the goals scored
              by team1 and team2 (in the order the teams were passed)
          '''
          # The win table stores every pairing in alphabetical order, so compare
          # the full names (not only the first letter) and remember the swap
          swapped = team1 > team2
          if swapped:
              team1, team2 = team2, team1
          # Default when the teams never met: a 1-1 tie
          output = ["tie", 1, 1]
          match = wintable[(wintable['team1'] == team1) & (wintable['team2'] == team2)]
          if not match.empty:
              row = match.iloc[0]
              output = [row['winner'], row['team1_score'], row['team2_score']]
          if swapped:
              # Report the goals in the order the caller passed the teams
              output[1], output[2] = output[2], output[1]
          return output

      matchWinner(poissonWintable(ECdata), "Austria", "Belgium")

[15]: ['Belgium', 1, 3]
The code above also shows that we assume a tie (1-1) if the two teams have never met before.
This is not a realistic assumption.
As mentioned earlier, our model is stochastic and the results change every time we rerun the code.
In order to get an idea of the probability that each team will win the game, we can perform a
Monte Carlo simulation:
[16]: def monteCarloGame(data, simulations, team1, team2):
          # Collect the winners in a list (DataFrame.append was removed in pandas 2.0)
          winners = []
          for i in range(simulations):
              winners.append(matchWinner(poissonWintable(data), team1, team2)[0])
          output = pd.Series(winners, name = 'Win Probability')
          output = output.value_counts()*100/simulations
          return output

      monteCarloGame(ECdata, 200, "Austria", "Belgium")

[16]: Austria    75.0
      Belgium    15.0
      tie        10.0
      Name: Win Probability, dtype: float64
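For a single pairing, the Monte Carlo estimate can be cross-checked against the probabilities implied directly by two independent Poisson distributions. This is a sketch under assumptions: the function names and the truncation at 25 goals are choices made here, and the two means are illustrative values, not the Austria and Belgium averages.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcome_probabilities(lam1, lam2, max_goals=25):
    """Win/tie/loss probabilities for team 1, truncated at max_goals per team."""
    p_win = p_tie = p_loss = 0.0
    for g1 in range(max_goals + 1):
        for g2 in range(max_goals + 1):
            p = poisson_pmf(g1, lam1) * poisson_pmf(g2, lam2)
            if g1 > g2:
                p_win += p
            elif g1 == g2:
                p_tie += p
            else:
                p_loss += p
    return p_win, p_tie, p_loss

# Illustrative mean goals for team 1 and team 2
p_win, p_tie, p_loss = outcome_probabilities(2.0, 1.0)
print(f"win {p_win:.3f}, tie {p_tie:.3f}, loss {p_loss:.3f}")
```

With enough simulations, the percentages from the Monte Carlo function should approach these values for the same means.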
The outcome based on the Poisson model is much more informative than the outcome of the previous
model. However, there are still some limitations to this model:
- The model uses all previous data and attaches the same importance to old and recent games.
  This is unrealistic since teams change throughout the years and results from a long time ago
  are hardly relevant for predicting current games. Nonetheless, using the results of old games
  can be valuable since they show which countries have a rich history in football. These
  countries often perform better at big tournaments.
- A more technical limitation: the Poisson model implies that the mean number of goals scored is
  equal to the variance. Looking at the results of the describe function at the beginning of the
  document, this assumption seems to be violated.
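The second limitation can be made concrete with the figures from the describe output: squaring the reported standard deviations gives the variances, which can then be compared with the means. The dictionary layout and variable names are specific to this sketch.

```python
# Means and standard deviations copied from the describe output
summary = {
    "home_score": {"mean": 1.745756, "std": 1.753780},
    "away_score": {"mean": 1.187587, "std": 1.405323},
}

dispersion = {}
for column, stats in summary.items():
    variance = stats["std"] ** 2
    # A Poisson variable has variance / mean equal to 1
    dispersion[column] = variance / stats["mean"]
    print(f"{column}: mean = {stats['mean']:.3f}, variance = {variance:.3f}, "
          f"ratio = {dispersion[column]:.2f}")
```

Both ratios come out well above 1, so over the whole dataset the goals are more spread out than a single Poisson distribution would predict. Part of this spread may come from mixing teams of very different strength, so it does not necessarily carry over to a single pairing.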
The weighted Poisson model
The first limitation of the model can be addressed by weighting the observations in the dataset,
i.e. less weight is attached to observations from a long time ago and more weight is attached to
recent observations. This can be coded as follows:
[17]: def weightedPoissonWintable(data):
          '''
          This function simulates the outcome of the football games
          between teams and constructs a wintable based on the data.

          Parameters
          ----------
          data : DataFrame
              a table that contains the results of previous games

          Returns
          -------
          output : DataFrame
              A DataFrame that contains the team names, the winner and the
              number of goals scored by both teams.
          '''
          # Keep the 2000 most recent games (the data is sorted by date),
          # giving the oldest observations a weight of zero
          data = data.tail(2000).reset_index(drop=True)
          # The weight grows linearly with recency; the offset gives the first
          # observation (index = 0) some weight (can be tweaked)
          data['weight'] = (data.index + 0.1)/max(data.index)
          data['team1_score'] = data['weight'] * data['team1_score']
          data['team2_score'] = data['weight'] * data['team2_score']
          columns = ['weight', 'team1_score', 'team2_score']
          data = data.groupby(['team1', 'team2'], as_index=False)[columns].mean()
          # Weighted mean = mean(weight * score) / mean(weight)
          data['team1_score'] = data['team1_score'] / data['weight']
          data['team2_score'] = data['team2_score'] / data['weight']
          # Simulate the number of goals using a Poisson model
          data['team1_score'] = np.random.poisson(data['team1_score'])
          data['team2_score'] = np.random.poisson(data['team2_score'])
          data['winner'] = np.where(data['team1_score'] > data['team2_score'],
                                    data['team1'], "tie")
          data['winner'] = np.where(data['team1_score'] < data['team2_score'],
                                    data['team2'], data['winner'])
          data = data[['team1', 'team2', 'winner', 'team1_score', 'team2_score']]
          return data

      weightedPoissonWintable(ECdata).head(5)

[17]:      team1           team2          winner  team1_score  team2_score
      0  Austria         Belgium             tie            2            2
      1  Austria         Croatia         Croatia            0            1
      2  Austria  Czech Republic  Czech Republic            1            3
      3  Austria         Denmark         Denmark            0            2
      4  Austria         England         England            0            4
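The weighting boils down to a weighted mean of the goals per pairing. The sketch below, with four invented scores for one team, shows that dividing the mean of the weighted goals by the mean weight gives the same number as NumPy's `np.average` with weights. The variable names are specific to this example.

```python
import numpy as np

goals = np.array([3, 2, 0, 1])           # oldest game first (made-up values)
index = np.arange(len(goals))
weights = (index + 0.1) / index.max()    # same weighting scheme as above

# Two-step computation as in the function above
two_step = (weights * goals).mean() / weights.mean()
# Direct weighted mean
direct = np.average(goals, weights=weights)
plain = goals.mean()

print(two_step, direct, plain)
```

Because this made-up team scored less in its recent games, the weighted mean is lower than the plain mean.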
The monte carlo function can be easily updated to also include the case where we use the weighted
poisson model:
[18]: def monteCarloGame(data, simulations, team1, team2, weighted = True):
          '''
          Assigns probabilities to each possible outcome of a soccer game between
          two teams

          Parameters
          ----------
          data : DataFrame
              a table that contains the results of previous games
          simulations: int
              the number of simulations that will be used to determine the probability
              the higher the number, the more stable the simulation
          team1: String
              the first team
          team2: String
              the second team
          weighted: Boolean
              True: uses the weighted Poisson model (default)
              False: uses the normal Poisson model

          Returns
          -------
          output : Series
              A Series that contains the possible outcomes of the game and their
              probabilities (in percent)
          '''
          # Collect the winners in a list (DataFrame.append was removed in pandas 2.0)
          winners = []
          for i in range(simulations):
              if weighted:
                  wintable = weightedPoissonWintable(data)
              else:
                  wintable = poissonWintable(data)
              winners.append(matchWinner(wintable, team1, team2)[0])
          output = pd.Series(winners, name = 'Win Probability')
          output = output.value_counts()*100/simulations
          return output

      print(monteCarloGame(ECdata, 200, "Austria", "Belgium"))

Belgium    54.5
Austria    28.0
tie        17.5
Name: Win Probability, dtype: float64
The probability of Belgium winning has increased quite a bit compared to the non-weighted Poisson
model. This is because Austria performed well against Belgium in the past, but recently Belgium
has been dominating the games against Austria. The weighted Poisson model gives more importance
to these recent games.
Possible alternative models
There are various other models that can be used to predict the outcome of an individual soccer
game. In order to solve the problem with the mean and the variance, one can for example use a
negative binomial model. More complex models can also be used, and there exist various
supervised learning techniques that are able to predict soccer games decently. As mentioned
before, the goal of this project is to learn data manipulation and modeling in Python. Hence,
the final and most complex model used in this project is the weighted Poisson model. In the
remainder of this document I will only report the results based on this weighted Poisson model.
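To give an idea of what the negative binomial alternative could look like, the sketch below draws goal counts with a chosen mean and a variance larger than the mean. It is not used anywhere in this notebook. The function name `negative_binomial_goals` and the seed are made up, and the mean and standard deviation are the home-team figures from the describe output.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

def negative_binomial_goals(mean, variance, size):
    """Draw goal counts with the given mean and variance (variance > mean)."""
    if variance <= mean:
        raise ValueError("the negative binomial needs variance > mean")
    # Convert (mean, variance) to NumPy's (n, p) parametrisation
    p = mean / variance
    n = mean**2 / (variance - mean)
    return rng.negative_binomial(n, p, size=size)

# Home-team mean and standard deviation from the describe output
draws = negative_binomial_goals(1.745756, 1.753780**2, size=200_000)
print(draws.mean(), draws.var())
```

Unlike the Poisson model, this distribution has a second parameter, so the variance can be matched separately from the mean.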
1.3.2 Predicting the outcome of the group stage
In the previous section we defined a method that allows us to predict the outcome of one soccer
game based on a weighted Poisson model. The next step in the project is to predict the outcome
of the group stage of the European Championship. We start by defining which teams belong to
which group:
[19]: # Team names must match the spelling in the dataset exactly: a misspelled
      # name has no match history, so all of its games fall back to a 1-1 tie
      groupA = ['Italy', 'Switzerland', 'Turkey', 'Wales']
      groupB = ['Belgium', 'Denmark', 'Finland', 'Russia']
      groupC = ['Austria', 'Netherlands', 'Georgia', 'Ukraine']
      groupD = ['Croatia', 'Czech Republic', 'England', 'Scotland']
      groupE = ['Slovakia', 'Poland', 'Spain', 'Sweden']
      groupF = ['France', 'Germany', 'Hungary', 'Portugal']
      groups = [groupA, groupB, groupC, groupD, groupE, groupF]
In order to simulate the results of the group stage, we need a system that assigns points to the
teams based on the results of their games. We model the group stage as a series of updates to
the league table based on the result of each game:
[20]: def updateRanking(wintable, team1, team2, groupTable):
          '''
          This function updates the groupTable to include the result of the game
          between team1 and team2 based on the wintable.

          Parameters
          ----------
          wintable : dataframe
              a dataframe that contains the team names, the winner of a game
              between these teams and the goals scored by these teams
          team1 : string
              The name of the first team
          team2 : string
              The name of the second team
          groupTable : dataframe
              The current group ranking during the group stage of the European
              Championship

          Returns
          -------
          Does not return anything but updates the groupTable
          '''
          # Calculate the result of the game
          winner, goals1, goals2 = matchWinner(wintable, team1, team2)
          # Assign the points
          # Because the dataframe is passed by reference, the updates are visible
          # outside this function and it does not need to return anything
          for team, scored, conceded in [(team1, goals1, goals2), (team2, goals2, goals1)]:
              row = groupTable['Country'] == team
              if winner == "tie":
                  groupTable.loc[row, 'Points'] += 1
              elif winner == team:
                  groupTable.loc[row, 'Points'] += 3
              groupTable.loc[row, 'Goals_made'] += scored
              groupTable.loc[row, 'Goals_recieved'] += conceded
              groupTable.loc[row, 'Goal_difference'] += scored - conceded
The next step is creating a function that updates the ranking after every game that has been played
in one group of the European Championships. This function allows the user to pick the method
they want to use. In this case the default method is again the weighted poisson model.
[21]: def simulateGroup(data, group, method = "weightedPoisson"):
          '''
          This function simulates the group stage of the European Championship for
          only one group.

          Parameters
          ----------
          data : dataframe
              a table that contains the results of previous games
          group : list
              a list that contains the team names that are in a group
          method : string
              "weightedPoisson" (default) uses the weighted Poisson model,
              any other value uses the normal Poisson model

          Returns
          -------
          groupTable : dataframe
              contains the team names and the amount of points of these teams at the
              end of the group stage, it also includes the total amount of goals
              scored and conceded as well as the goal difference
          '''
          if method == "weightedPoisson":
              wintable = weightedPoissonWintable(data)
          else:
              wintable = poissonWintable(data)
          groupTable = [[team, 0, 0, 0, 0] for team in group]
          groupTable = pd.DataFrame(groupTable, columns = ['Country', 'Points',
              'Goals_made', 'Goals_recieved', 'Goal_difference'])
          updateRanking(wintable, group[0], group[1], groupTable)
          updateRanking(wintable, group[2], group[3], groupTable)
          updateRanking(wintable, group[0], group[2], groupTable)
          updateRanking(wintable, group[1], group[3], groupTable)
          updateRanking(wintable, group[0], group[3], groupTable)
          updateRanking(wintable, group[2], group[1], groupTable)
          groupTable = groupTable.sort_values(["Points", "Goals_made",
              "Goal_difference"], ascending = False)
          return groupTable

      simulateGroup(ECdata, groupA)

[21]:        Country  Points  Goals_made  Goals_recieved  Goal_difference
      0        Italy       7           6               3                3
      3        Wales       4           3               3                0
      1  Switserland       3           3               3                0
      2       Turkey       1           2               5               -3
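The six games simulated for a group are simply all pairings of its four teams. A minimal sketch with the standard library, using group B as an example (the variable names `fixtures` and `games_per_team` are made up here):

```python
from itertools import combinations

group = ['Belgium', 'Denmark', 'Finland', 'Russia']

# Every team plays every other team exactly once
fixtures = list(combinations(group, 2))
games_per_team = {team: sum(team in fixture for fixture in fixtures) for team in group}

print(fixtures)
print(games_per_team)
```

Generating the fixtures this way would also work for groups of a different size, although the order of the games differs from the hard-coded schedule.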
The result above can be used to determine the probability of moving on to the next round. Note
that this is not as straightforward as it seems at first, since a team can also continue to the
next round if it finishes in third position. The probability of moving on in this case depends
on the results of the other groups. The real selection is quite complex and it becomes even
more complex once teams need to be assigned to a certain game in the next round. In this project,
we make use of a simpler model that only takes the goal difference into account and assigns the
third-placed team with the highest goal difference to the first game slot, the one with the
second highest goal difference to the second game slot and so on. This impacts the final
simulation of the tournament. Again, the goal of this project is learning Python, not building
the perfect model to predict the European Championship.
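The simplified rule for the third-placed teams can be illustrated on a small table. The teams and goal differences below are invented for this sketch, and `nlargest` is just one convenient way to pick the four best rows; it is not how the notebook implements the rule.

```python
import pandas as pd

# Hypothetical third-placed teams and their goal differences, one per group
thirds = pd.DataFrame({
    "Country": ["Wales", "Finland", "Ukraine", "Scotland", "Sweden", "Hungary"],
    "Goal_difference": [0, -2, 1, -1, 2, -4],
})

# The four teams with the highest goal difference move on, best one first
best_four = thirds.nlargest(4, "Goal_difference")["Country"].tolist()
print(best_four)
```

The position in the resulting list plays the role of the game slot described above. Ties in goal difference are not handled specially in this sketch.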
Determining the probability of moving on to the next round can be done by using a monte carlo
simulation:
[22]: def monteCarloGroupStage(data, simulations, groups):
          # Collect the qualified teams in a list (DataFrame.append was removed
          # in pandas 2.0)
          qualified = []
          for j in range(simulations):
              tables = [simulateGroup(data, group) for group in groups]
              # The two best teams of every group move on
              for table in tables:
                  qualified.extend(table.loc[:, 'Country'].values[:2])
              # Defining which of the best third places move on (this is a
              # simplified version): after sorting, the third-placed team sits at
              # position 2, and the four with the highest goal difference move on
              thirds = pd.concat([table.iloc[[2]] for table in tables])
              thirds = thirds.sort_values('Goal_difference', ascending = False)
              qualified.extend(thirds.loc[:, 'Country'].values[:4])
          mc = pd.Series(qualified, name = 'Country')
          mc = mc.value_counts()*100/simulations
          return mc

      monteCarloGroupStage(ECdata, 200, groups).head(5) # only the first 5 countries are shown

[22]: Italy          98.0
      Russia         92.5
      Spain          88.0
      Netherlands    83.0
      England        78.5
      Name: Country, dtype: float64
1.3.3 Predicting the outcome of the tournament
The next step in this project is simulating the entire tournament. After having simulated the results
of the group stage, the next steps are straightforward. First, one should take into account the fact
that there cannot be a tie after the group stage. Thus, we should create a function that determines
the outcome of post group stage games. Note also that the score of the game is irrelevant:
[23]: def matchWinnerF(wintable, team1, team2):
          '''
          This function determines the outcome of the football game
          between team1 and team2 based on the provided win table.
          This function should be used for post group stage games.

          Parameters
          ----------
          wintable : DataFrame
              a table that contains the team names and the result of the game
          team1 : String
              The first team name
          team2 : String
              The second team name

          Returns
          -------
          output : String
              The winner of the game
          '''
          # The win table stores every pairing in alphabetical order, so compare
          # the full names (not only the first letter) before the lookup
          if team1 > team2:
              team1, team2 = team2, team1
          output = "tie"
          match = wintable[(wintable['team1'] == team1) & (wintable['team2'] == team2)]
          if not match.empty:
              output = match['winner'].iloc[0]
          # A knock-out game needs a winner: pick one of the teams at random
          if output == "tie":
              if random.random() > 0.5:
                  output = team2
              else:
                  output = team1
          return output
When the initial estimate of the game is equal to “tie”, the function picks the winner of the game
at random, with equal probability for both teams. It does not simulate extra time or a penalty
shoot-out. Whether or not this is realistic is debatable.
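The lookup-plus-coin-flip idea can also be sketched without pandas. The snippet below is only an
illustration under stated assumptions: `build_lookup` and `knockout_winner` are hypothetical names
that do not appear in the notebook, the win table is assumed to be a list of (team1, team2, winner)
rows, and pairs are ordered by their full names rather than by the first letter only.

```python
import random

def build_lookup(rows):
    """Index (team1, team2, winner) rows by the alphabetically sorted pair of names."""
    return {tuple(sorted((t1, t2))): winner for t1, t2, winner in rows}

def knockout_winner(lookup, team1, team2, rng=random):
    """Return the predicted winner; a predicted (or missing) tie is decided by a fair coin."""
    result = lookup.get(tuple(sorted((team1, team2))), "tie")
    if result == "tie":
        return rng.choice([team1, team2])
    return result

# Toy win table, not taken from the notebook's data
rows = [("Italy", "Spain", "Spain"), ("France", "Germany", "tie")]
lookup = build_lookup(rows)
print(knockout_winner(lookup, "Spain", "Italy"))     # decided by the table
print(knockout_winner(lookup, "Germany", "France"))  # decided by the coin flip
```

A dictionary keyed by the sorted pair makes the lookup independent of the order in which the two
teams are passed, and avoids scanning the whole table for every game.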
The next and final step of the project is creating a function that determines the winner of the
competition based on the weighted Poisson model. We define the following function:
[24]: def simulateTournament(data, groups):
          groupA = simulateGroup(data, groups[0])
          groupB = simulateGroup(data, groups[1])
          groupC = simulateGroup(data, groups[2])
          groupD = simulateGroup(data, groups[3])
          groupE = simulateGroup(data, groups[4])
          groupF = simulateGroup(data, groups[5])
          tables = [groupA, groupB, groupC, groupD, groupE, groupF]

          # Defining which of the best third places moves on (this is a simplified version):
          # only the goal difference is compared. The third-placed team is row index 2,
          # assuming the group tables are sorted best-first (as the A1/A2 lookups below imply).
          best_thirds = [g.loc[:, 'Country'].values[2] for g in tables]
          best_thirds_gd = [g.loc[:, 'Goal_difference'].values[2] for g in tables]

          # Visit the groups in random order so that equal goal differences are broken randomly
          x = [0,1,2,3,4,5]
          shuffle(x)
          qualified = []
          for _ in range(4):
              for i in x:
                  if max(best_thirds_gd) == best_thirds_gd[i]:
                      qualified.append(best_thirds[i])
                      best_thirds_gd[i] = -1000  # exclude this team from the next pass
                      break
          # A3..D3 are the four best thirds in ranked order, not the thirds of groups A-D
          A3, B3, C3, D3 = qualified

          A1 = groupA.loc[:, 'Country'].values[0]
          A2 = groupA.loc[:, 'Country'].values[1]
          B1 = groupB.loc[:, 'Country'].values[0]
          B2 = groupB.loc[:, 'Country'].values[1]
          C1 = groupC.loc[:, 'Country'].values[0]
          C2 = groupC.loc[:, 'Country'].values[1]
          D1 = groupD.loc[:, 'Country'].values[0]
          D2 = groupD.loc[:, 'Country'].values[1]
          E1 = groupE.loc[:, 'Country'].values[0]
          E2 = groupE.loc[:, 'Country'].values[1]
          F1 = groupF.loc[:, 'Country'].values[0]
          F2 = groupF.loc[:, 'Country'].values[1]

          winTable = weightedPoissonWintable(data)

          # Round of 16
          winner1 = matchWinnerF(winTable, A2, B2)
          winner2 = matchWinnerF(winTable, A1, C2)
          winner3 = matchWinnerF(winTable, C1, D3)
          winner4 = matchWinnerF(winTable, B1, A3)
          winner5 = matchWinnerF(winTable, E2, D2)
          winner6 = matchWinnerF(winTable, F1, B3)
          winner7 = matchWinnerF(winTable, D1, F2)
          winner8 = matchWinnerF(winTable, E1, C3)

          # Quarter-finals
          winnerQF1 = matchWinnerF(winTable, winner6, winner5)
          winnerQF2 = matchWinnerF(winTable, winner4, winner2)
          winnerQF3 = matchWinnerF(winTable, winner3, winner1)
          winnerQF4 = matchWinnerF(winTable, winner8, winner7)

          # Semi-finals
          winnerSF1 = matchWinnerF(winTable, winnerQF1, winnerQF2)
          winnerSF2 = matchWinnerF(winTable, winnerQF3, winnerQF4)

          # Final
          return matchWinnerF(winTable, winnerSF1, winnerSF2)

      print(simulateTournament(ECdata, groups))
France
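The knockout pairings in `simulateTournament` can also be read as one ordered list of 16 slots in
which neighbours meet and winners advance. The following is a minimal sketch of that reading, not
the notebook's implementation: `play_bracket` and `SLOT_ORDER` are hypothetical names, the slot
labels stand in for team names, and the toy `pick_winner` rule replaces `matchWinnerF`. It assumes
the number of slots is a power of two.

```python
def play_bracket(slots, pick_winner):
    """Play a single-elimination bracket: slots[0] meets slots[1], slots[2] meets slots[3], ..."""
    teams = list(slots)
    while len(teams) > 1:
        # Winners keep their relative order, so neighbours meet again in the next round
        teams = [pick_winner(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]
    return teams[0]

# Slot order that reproduces the round of 16, quarter-final and semi-final pairings above
SLOT_ORDER = ["F1", "B3", "E2", "D2", "B1", "A3", "A1", "C2",
              "C1", "D3", "A2", "B2", "E1", "C3", "D1", "F2"]

# Toy rule for illustration only: the alphabetically first label always wins
champion = play_bracket(SLOT_ORDER, lambda a, b: min(a, b))
print(champion)
```

Written this way, the bracket is data rather than code, so a different pairing scheme only
requires a different slot order.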
Since the models that are used to simulate the group stage and the final rounds of the tournament
are stochastic, the result changes every time we rerun the code. In order to assign probabilities
to the different teams winning, we perform another Monte Carlo simulation.
[25]: def montecarloTournament(data, simulations, groups):
          # Collect the winner of every simulated tournament in a list;
          # this avoids DataFrame.append, which pandas 2.0 removed
          winners = [simulateTournament(data, groups) for j in range(simulations)]
          mc = pd.Series(winners, name = 'Country')
          # Share of the simulations won by each country, in percent
          mc = mc.value_counts()*100/simulations
          return mc

      montecarloTournament(ECdata, 200, groups)
[25]: Spain 13.0
Netherlands 12.0
Germany 11.5
Italy 11.5
France 9.5
England 8.0
Portugal 6.5
Sweden 4.5
Russia 4.0
Belgium 4.0
Switserland 3.0
Croatia 2.5
Ukraine 2.0
Czech Republic 1.0
Slovakia 1.0
Scotland 1.0
Poland 1.0
Turkey 1.0
Austria 1.0
Denmark 1.0
Finland 0.5
Hungary 0.5
Name: Country, dtype: float64
The results of the simulation show that the countries with the highest probability of winning the
tournament are Spain, the Netherlands, Italy, Germany and France. Based on this analysis, the
prediction for the European Championship 2020 would be that Spain wins the competition, although
with 13% of the simulated titles Spain is only a narrow favourite.
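Because the percentages come from only 200 simulated tournaments, they carry sampling noise. A
rough way to gauge it is the binomial standard error of each share. The sketch below assumes the
simulated tournaments are independent; `mc_standard_error` is a hypothetical helper, not part of
the notebook.

```python
import math

def mc_standard_error(percent, simulations):
    """Binomial standard error, in percentage points, of a share estimated by Monte Carlo."""
    p = percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / simulations)

# Shares taken from the simulation output above
for country, pct in [("Spain", 13.0), ("Netherlands", 12.0), ("France", 9.5)]:
    se = mc_standard_error(pct, 200)
    print(f"{country}: {pct:.1f}% +/- {se:.1f} points (one standard error)")
```

With 200 runs the standard error is a little over two percentage points for the leading teams,
so the ordering among the top five should not be over-interpreted. Increasing `simulations`
shrinks the error with the square root of the number of runs.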
An advantage of estimating a probability for every possible winner is that, once the tournament
has been played, we can check how likely the actual outcome was according to the model.
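One possible way to make that check concrete is a logarithmic score: the negative log of the
probability the model gave to the eventual winner. This is a sketch under assumptions, not
something the notebook does. `log_score` and `floor_pct` are hypothetical, and the floor of 0.25%
(half of one win in 200 runs) is an arbitrary choice for teams that never won a simulation.

```python
import math

# Win percentages copied from the simulation output above (top entries only)
win_pct = {"Spain": 13.0, "Netherlands": 12.0, "Germany": 11.5, "Italy": 11.5, "France": 9.5}

def log_score(win_pct, winner, floor_pct=0.25):
    """Negative log-probability assigned to the eventual winner (lower is better)."""
    # floor_pct avoids log(0) when the winner never won a simulated tournament
    p = max(win_pct.get(winner, 0.0), floor_pct) / 100.0
    return -math.log(p)

# Reference point: a model that gives each of the 24 teams the same chance
uniform_score = -math.log(1.0 / 24.0)

for team in ["Spain", "France"]:
    print(team, round(log_score(win_pct, team), 3), "vs uniform", round(uniform_score, 3))
```

A score below the uniform reference means the model rated the actual winner higher than a blind
guess over the 24 participants would have.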
2 Conclusion
The goal of this project was to get familiar with basic data analysis and modeling in Python. This
goal has been accomplished. In order to predict the outcome of the European Championships, the
probability of every team winning the championship was estimated using a Monte Carlo simulation
of the entire tournament. Several important simplifying assumptions were made in order to make the
modeling easier. Nonetheless, the results of the model seem to be valid if one takes into account
that only historical data was used.