EC2020 Soccer predictions
March 29, 2020
1 EC2020 Soccer predictions
The goal of this notebook is to predict the outcome of the 2020 European Championship. The
predictions are based on the results of international soccer games played from 1872 to 2020.
Because the main purpose of this analysis is to get familiar with the Python language, a
deliberately simple algorithm is used:
- Look at the (weighted) average number of goals both teams scored in previous encounters
- Model the number of goals each team scores with a Poisson model
- Determine the winner of the game based on the number of goals both teams scored
- Give the teams points based on the result of the game in the group stage, or determine which team moves on to the next stage
- Define a function that simulates the entire tournament game by game
- Perform a Monte Carlo simulation
The assumptions behind each step are pointed out as the code is introduced.
Note: The 2020 European Championship has been postponed due to the spread of the COVID-19 virus.
1.1 Setting up the environment
To manage this project efficiently, the following set-up is used:
- A GitHub repository is used to update the code
- A Python script is used to make sure that the code is self-contained
- This Jupyter notebook documents what has been done so far
Next, the necessary packages are loaded into the environment:
[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# The module itself is imported so that random.random() can be called later
import random
from random import shuffle
Now, the data is loaded into the environment:
[2]: data = pd.read_csv("data/results.csv")
1.2 Exploring the dataset
A first step in the exploration of the dataset is to look at the characteristics of the data. This can be
easily done using the describe function from the pandas package:
[3]: print(data.describe())
home_score away_score
count 41586.000000 41586.000000
mean 1.745756 1.187587
std 1.753780 1.405323
min 0.000000 0.000000
25% 1.000000 0.000000
50% 1.000000 1.000000
75% 2.000000 2.000000
max 31.000000 21.000000
The description of the data shows a couple of interesting things:
- The home team scores 1.75 goals per game on average, while the away team scores 1.19.
- The maximum number of goals scored by a home team is 31, and the maximum scored by an away team is 21. These data points should be explored further.
To explore the games with a large number of goals, we sort the data by the number of home
goals and by the number of away goals (descending) and look at the first 5 observations in
each case:
[4]: dataSortedHomeGoals = data.sort_values("home_score", ascending = False)
dataSortedHomeGoals = dataSortedHomeGoals[['date', 'home_team', 'away_team', 'home_score', 'away_score']]
temp = dataSortedHomeGoals.head(5)
del dataSortedHomeGoals
temp
[4]: date home_team away_team home_score away_score
23796 2001-04-11 Australia American Samoa 31 0
7902 1971-09-13 Tahiti Cook Islands 30 0
10949 1979-08-30 Fiji Kiribati 24 0
23793 2001-04-09 Australia Tonga 22 0
28838 2006-11-24 Sápmi Monaco 21 1
[5]: dataSortedAwayGoals = data.sort_values("away_score", ascending = False)
dataSortedAwayGoals = dataSortedAwayGoals[['date', 'home_team', 'away_team', 'home_score', 'away_score']]
temp = dataSortedAwayGoals.head(5)
del dataSortedAwayGoals
temp
[5]: date home_team away_team home_score away_score
27373 2005-03-11 Guam North Korea 0 21
25694 2003-06-30 Sark Isle of Wight 0 20
36020 2014-06-01 Darfur Padania 0 20
14720 1987-12-15 American Samoa Papua New Guinea 0 20
36025 2014-06-02 Darfur South Ossetia 0 19
The extreme values for the number of goals scored appear to be valid observations.
For example, the game between Guam and North Korea played on 11 March 2005
did indeed finish with a goal difference of 21 goals:
[Image: img/GuamNK.png]
The number of goals that were scored by home and by away teams can be compared by plotting
two histograms.
[6]: plt.subplot(1, 2, 1)
plt.hist(data.loc[:,'home_score'].values)
plt.subplot(1, 2, 2)
plt.hist(data.loc[:,'away_score'].values)
plt.show()
Both plots look very similar, indicating that both home and away teams are expected to score only
a couple of goals. Large numbers of goals are very rare. This matches everyday experience:
a team rarely scores more than 5 goals in one game.
1.3 Predicting the European Championship 2020
The outcome of the European Championship will be predicted by assigning every team a probability
of winning the tournament and picking the team with the highest probability.
1.3.1 Predicting individual game results
In order to predict the outcome of the European Championship, it is necessary to predict
the individual results of the different games. In this part, we predict the result of a game
between two countries by looking at past results of games between these countries. It is thus
necessary to look at how much data we have for each pair of countries.
First we filter the data to only include countries that are participating in the European Championship:
[7]: # I will assume that Hungary, Slovakia, Scotland and Georgia passed the play-offs
countryList = ["England", "Italy", "Switzerland", "Turkey", "Wales", "Belgium",
               "Denmark", "Finland", "Russia", "Austria", "Netherlands", "Ukraine",
               "Croatia", "Czech Republic", "Poland", "Spain", "Sweden", "France",
               "Germany", "Portugal", "Hungary", "Slovakia", "Scotland", "Georgia"]
# .copy() makes ECdata an independent DataFrame so that columns can be added later
ECdata = data.loc[data['home_team'].isin(countryList) & data['away_team'].isin(countryList)].copy()
del data
Next we look at the combinations between the different countries and try to identify how many
games they have played against each other:
[8]: print(ECdata.groupby(['home_team', 'away_team']).size())
home_team away_team
Austria Belgium 6
Croatia 4
Czech Republic 3
Denmark 5
England 12
..
Wales Spain 3
Sweden 3
Switzerland 3
Turkey 3
Ukraine 1
Length: 536, dtype: int64
This output shows that the grouping is not yet suitable for our analysis: at the moment, a game
between e.g. Austria and Belgium is treated as different from a game between Belgium and Austria.
For our purposes both are the same pairing, so we need to resolve this. One way to do this is to
go through the observations and always store the country that comes first alphabetically as team1
and the other country as team2:
[9]: # True when the home team already comes first alphabetically
homeFirst = ECdata['home_team'] < ECdata['away_team']
ECdata['team1'] = np.where(homeFirst, ECdata['home_team'], ECdata['away_team'])
ECdata['team1_score'] = np.where(homeFirst, ECdata['home_score'], ECdata['away_score'])
ECdata['team2'] = np.where(homeFirst, ECdata['away_team'], ECdata['home_team'])
ECdata['team2_score'] = np.where(homeFirst, ECdata['away_score'], ECdata['home_score'])
ECdata = ECdata.loc[:, ['date', 'team1', 'team1_score', 'team2', 'team2_score', 'neutral']].copy()
print(ECdata.groupby(['team1', 'team2']).size())
team1 team2
Austria Belgium 14
Croatia 5
Czech Republic 5
Denmark 9
England 18
..
Switzerland Ukraine 2
Wales 7
Turkey Ukraine 9
Wales 6
Ukraine Wales 3
Length: 271, dtype: int64
The pairings are now matched correctly: there are 271 distinct pairings of European teams
for which we have data.
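The reordering step can be illustrated on a toy example. The sketch below uses a small, made-up DataFrame (the three games and their scores are invented purely for illustration) and shows that, once the alphabetically first country is stored as team1, Austria-Belgium and Belgium-Austria fall into the same group.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset; the scores are invented for illustration only
toy = pd.DataFrame({
    'home_team': ['Belgium', 'Austria', 'Wales'],
    'away_team': ['Austria', 'Belgium', 'Austria'],
    'home_score': [2, 1, 0],
    'away_score': [0, 1, 3],
})

# True when the home team already comes first alphabetically
home_first = toy['home_team'] < toy['away_team']
toy['team1'] = np.where(home_first, toy['home_team'], toy['away_team'])
toy['team2'] = np.where(home_first, toy['away_team'], toy['home_team'])
toy['team1_score'] = np.where(home_first, toy['home_score'], toy['away_score'])
toy['team2_score'] = np.where(home_first, toy['away_score'], toy['home_score'])

# Both orderings of Austria and Belgium now end up in the same group
pair_counts = toy.groupby(['team1', 'team2']).size()
print(pair_counts)
```

Note that the scores are swapped together with the team names, so team1_score always belongs to team1.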
The simplest possible prediction Let’s start by defining a simple game prediction function that
predicts the outcome of a soccer game. This function uses the following algorithm:
- It looks at all the past games that have been played between the two teams
- Based on the number of goals scored, it assigns a winner to each game
- It aggregates the data and looks at which team has won most of the games. If both teams have won the same number of games, or if no information is available, it predicts a tie.
Let’s start by adding the new columns team1_won and team2_won to the dataframe:
[10]: ECdata.loc[:, 'team1_won'] = np.where(ECdata.loc[:, 'team1_score'] > ECdata.loc[:, 'team2_score'], 1, 0)
ECdata.loc[:, 'team2_won'] = np.where(ECdata.loc[:, 'team1_score'] < ECdata.loc[:, 'team2_score'], 1, 0)
The next step is aggregating the results:
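[11]: # Only the win indicators are summed; the date column cannot be summed meaningfully
ag = ECdata.groupby(['team1', 'team2'], as_index=False)[['team1_won', 'team2_won']].sum()
ag = ag[['team1', 'team1_won', 'team2', 'team2_won']]
print(ag)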
team1 team1_won team2 team2_won
0 Austria 9 Belgium 2
1 Austria 0 Croatia 5
2 Austria 2 Czech Republic 2
3 Austria 4 Denmark 4
4 Austria 4 England 10
.. ... ... ... ...
266 Switzerland 0 Ukraine 0
267 Switzerland 5 Wales 2
268 Turkey 4 Ukraine 2
269 Turkey 2 Wales 3
270 Ukraine 1 Wales 0
[271 rows x 4 columns]
Now that the results are aggregated, it is trivial to assign a winner to every game:
[12]: ag.loc[:, 'winner'] = np.where(ag.loc[:, 'team1_won'] > ag.loc[:, 'team2_won'], ag.loc[:, 'team1'], "tie")
ag.loc[:, 'winner'] = np.where(ag.loc[:, 'team1_won'] < ag.loc[:, 'team2_won'], ag.loc[:, 'team2'], ag.loc[:, 'winner'])
ag = ag[['team1', 'team2', 'winner']]
ag.head(5)
[12]: team1 team2 winner
0 Austria Belgium Austria
1 Austria Croatia Croatia
2 Austria Czech Republic tie
3 Austria Denmark tie
4 Austria England England
What follows next is the function that will be used to determine which team won the game. Note
that this function needs as input a table that shows which team wins against which team:
[13]: def matchWinner(wintable, team1, team2):
    '''
    This function determines the outcome of the football game
    between team1 and team2 based on the provided win table.

    Parameters
    ----------
    wintable : DataFrame
        a table that contains the team names and the result of the game
    team1 : String
        The first team name
    team2 : String
        The second team name

    Returns
    -------
    output : String
        The name of the winner of the game or "tie"
    '''
    # The win table stores every pairing in alphabetical order, so the
    # full names are compared (not just their first letters)
    if team1 > team2:
        team1, team2 = team2, team1
    output = "tie"
    for index, row in wintable.iterrows():
        if row['team1'] == team1 and row['team2'] == team2:
            output = row['winner']
            return output
    return output

print(matchWinner(ag, "Belgium", "Austria"))
Austria
This function can only determine the winner of the game if the dataset contains a previous
encounter between the teams. If there is no such encounter, the algorithm assumes that the
game will result in a tie. A further limitation is that the function cannot model the actual
score of the game; it can only determine the winner. It is thus impossible to calculate goal
differences and other relevant information.
A more complex function The previous function is very basic and has a couple of limitations:
- It does not determine the final score of a game, only which team has won.
- The model always gives exactly the same outcome as long as the data does not change. Hence, it is not possible to assign probabilities to which team wins the game.
Both of these problems can be solved by switching to a Poisson model. The Poisson model
simulates the number of goals that each team will score based on the average number of goals that
the teams have scored in previous encounters. Since the model is stochastic, we can perform a
Monte Carlo simulation to assign probabilities to which team will win which game.
The Poisson model is used to construct the win table that is one of the inputs for the matchWinner
function:
[14]: def poissonWintable(data):
    '''
    This function simulates the outcome of a football game for every
    pair of teams and constructs a win table based on the data.

    Parameters
    ----------
    data : DataFrame
        a table that contains the results of previous games

    Returns
    -------
    output : DataFrame
        A DataFrame that contains the team names, the winner and the number
        of goals scored by both teams.
    '''
    # Collect the mean number of goals the teams scored in previous encounters
    stoch = data.groupby(['team1', 'team2'], as_index=False)[['team1_score', 'team2_score']].mean()
    # Simulate the number of goals using a Poisson model: one draw per pairing,
    # with the historical mean as the rate
    stoch['team1_score'] = np.random.poisson(stoch['team1_score'])
    stoch['team2_score'] = np.random.poisson(stoch['team2_score'])
    # Determine which team has won the game
    stoch['winner'] = np.where(stoch['team1_score'] > stoch['team2_score'], stoch['team1'], "tie")
    stoch['winner'] = np.where(stoch['team1_score'] < stoch['team2_score'], stoch['team2'], stoch['winner'])
    # Selecting the relevant output
    stoch = stoch[['team1', 'team2', 'winner', 'team1_score', 'team2_score']]
    return stoch

# Example result
# Note: this result changes every time the code is rerun
poissonWintable(ECdata).head(5)
[14]: team1 team2 winner team1_score team2_score
0 Austria Belgium Austria 2 1
1 Austria Croatia tie 0 0
2 Austria Czech Republic Austria 3 2
3 Austria Denmark Austria 2 0
4 Austria England England 2 5
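For a single pairing, the probabilities that repeated simulation approximates can also be computed directly. The following is a minimal sketch, assuming the goals of the two teams are independent Poisson variables. The rates 1.6 and 1.1 are made-up values rather than estimates from the dataset, and poissonPmf and outcomeProbabilities are hypothetical helpers that are not part of the notebook.

```python
import math

def poissonPmf(k, lam):
    # Probability of exactly k goals when the expected number of goals is lam
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcomeProbabilities(lam1, lam2, maxGoals = 15):
    # Sum the joint probabilities over all scorelines up to maxGoals goals each;
    # higher scorelines are ignored, which is negligible for small rates
    win1 = tie = win2 = 0.0
    for g1 in range(maxGoals + 1):
        for g2 in range(maxGoals + 1):
            p = poissonPmf(g1, lam1) * poissonPmf(g2, lam2)
            if g1 > g2:
                win1 += p
            elif g1 < g2:
                win2 += p
            else:
                tie += p
    return win1, tie, win2

# Made-up rates: team 1 is expected to score 1.6 goals, team 2 is expected to score 1.1
win1, tie, win2 = outcomeProbabilities(1.6, 1.1)
print(win1, tie, win2)
```

A calculation like this can serve as a sanity check on the simulated frequencies for one pairing, since the simulated shares should approach these values as the number of simulations grows.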
Since it is now possible to determine the number of goals that both teams score, we can modify
the matchWinner function to also return the score of the game and not only the winner:
[15]: def matchWinner(wintable, team1, team2):
    '''
    This function determines the outcome of the football game
    between team1 and team2 based on the provided win table.

    Parameters
    ----------
    wintable : DataFrame
        a table that contains the team names and the result of the game
    team1 : String
        The first team name
    team2 : String
        The second team name

    Returns
    -------
    output : List
        A list with the winner of the game, the goals scored by team1 and
        the goals scored by team2
    '''
    # The win table stores every pairing in alphabetical order, so the
    # full names are compared (not just their first letters)
    swapped = team1 > team2
    if swapped:
        team1, team2 = team2, team1
    # Default when the teams have never met: a 1-1 tie
    output = ["tie", 1, 1]
    for index, row in wintable.iterrows():
        if row['team1'] == team1 and row['team2'] == team2:
            output = [row['winner'], row['team1_score'], row['team2_score']]
            break
    # Report the scores in the order in which the teams were passed in
    if swapped:
        output[1], output[2] = output[2], output[1]
    return output

matchWinner(poissonWintable(ECdata), "Austria", "Belgium")
[15]: ['Belgium', 1, 3]
The code above also shows that we expect the result to be a tie (1-1) if there has never been a
previous encounter between the two teams. This is not a realistic assumption.
As mentioned earlier, our model is stochastic and the results change every time we rerun the code.
To get an idea of the probability that each team will win the game, we can perform a
Monte Carlo simulation:
[16]: def monteCarloGame(data, simulations, team1, team2):
    # Simulate the game repeatedly and store the winner of every simulation
    winners = []
    for i in range(simulations):
        winners.append(matchWinner(poissonWintable(data), team1, team2)[0])
    # Convert the counts into percentages
    output = pd.Series(winners).value_counts()*100/simulations
    return output.rename('Win Probability')

monteCarloGame(ECdata, 200, "Austria", "Belgium")
[16]: Austria 75.0
Belgium 15.0
tie 10.0
Name: Win Probability, dtype: float64
The outcome based on the Poisson model is much more informative than the outcome of
the previous model. However, there are still some limitations to this model:
- The model uses all previous data and attaches the same importance to old and recent games. This is unrealistic since teams change throughout the years and results from a long time ago are hardly relevant for predicting current games. Nonetheless, using the results from old games can be valuable since they show which countries have a rich history in football. These countries often perform better at big tournaments.
- A more technical limitation: the Poisson model implies that the mean number of goals scored is equal to the variance. Judging by the results of the describe function at the beginning of the document, this assumption seems to be violated.
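The mean-variance comparison can be made explicit with the numbers printed by the describe function. The sketch below only squares the reported standard deviations and divides by the reported means. Keep in mind that these statistics are pooled over all games in the dataset, so part of the extra variance may simply reflect differences between pairings rather than a failure of the Poisson assumption within one pairing.

```python
# Summary statistics copied from the describe() output shown earlier
stats = {
    'home_score': {'mean': 1.745756, 'std': 1.753780},
    'away_score': {'mean': 1.187587, 'std': 1.405323},
}

dispersion = {}
for column, s in stats.items():
    variance = s['std'] ** 2
    # A ratio close to 1 would be consistent with a Poisson distribution
    dispersion[column] = variance / s['mean']
    print(column, 'mean =', round(s['mean'], 2),
          'variance =', round(variance, 2),
          'ratio =', round(dispersion[column], 2))
```

In both columns the variance is clearly larger than the mean, which is the pattern referred to as overdispersion.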
The weighted Poisson model The first limitation of the model can be addressed by weighting the
observations in the dataset, i.e. less weight is attached to observations from a long time ago
and more weight is attached to recent observations. This can be coded as follows:
[17]: def weightedPoissonWintable(data):
    '''
    This function simulates the outcome of a football game for every
    pair of teams and constructs a win table based on the data.
    Recent games receive more weight than old games.

    Parameters
    ----------
    data : DataFrame
        a table that contains the results of previous games

    Returns
    -------
    output : DataFrame
        A DataFrame that contains the team names, the winner and the number
        of goals scored by both teams.
    '''
    data = data.tail(2000) # Giving the oldest observations a weight of zero
    data = data.reset_index(drop = True)
    # Weights increase linearly with recency; the 0.1 gives the first
    # observation (index = 0) some weight (can be tweaked)
    data['weight'] = (data.index + 0.1)/max(data.index)
    data['team1_score'] = data['weight'] * data['team1_score']
    data['team2_score'] = data['weight'] * data['team2_score']
    # mean(weight * score) / mean(weight) is the weighted mean score per pairing
    data = data.groupby(['team1', 'team2'], as_index=False)[['team1_score', 'team2_score', 'weight']].mean()
    data['team1_score'] = data['team1_score'] / data['weight']
    data['team2_score'] = data['team2_score'] / data['weight']
    # Simulate the number of goals using a Poisson model
    data['team1_score'] = np.random.poisson(data['team1_score'])
    data['team2_score'] = np.random.poisson(data['team2_score'])
    # Determine which team has won the game
    data['winner'] = np.where(data['team1_score'] > data['team2_score'], data['team1'], "tie")
    data['winner'] = np.where(data['team1_score'] < data['team2_score'], data['team2'], data['winner'])
    data = data[['team1', 'team2', 'winner', 'team1_score', 'team2_score']]
    return data

weightedPoissonWintable(ECdata).head(5)
[17]: team1 team2 winner team1_score team2_score
0 Austria Belgium tie 2 2
1 Austria Croatia Croatia 0 1
2 Austria Czech Republic Czech Republic 1 3
3 Austria Denmark Denmark 0 2
4 Austria England England 0 4
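The weighting scheme amounts to an ordinary weighted mean per pairing. The sketch below checks this on five invented results for one team (oldest game first). As a simplification, the weights here are based on the position within this single pairing, whereas the function above bases them on the position within the last 2000 games of the whole dataset.

```python
import numpy as np

# Hypothetical goals scored by one team in five past games, oldest first
goals = np.array([3, 2, 0, 1, 0])

# Linearly increasing weights, mirroring (index + 0.1) / max(index)
index = np.arange(len(goals))
weights = (index + 0.1) / index.max()

plainMean = goals.mean()
# mean(weight * goals) / mean(weight), the same steps as in weightedPoissonWintable
weightedMean = (weights * goals).mean() / weights.mean()

print(plainMean, weightedMean)
```

Because this invented team scored most of its goals in the older games, the weighted mean ends up well below the plain mean, which is the kind of shift that favours the team with the better recent results.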
The Monte Carlo function can easily be updated to support the weighted Poisson model as well:
[18]: def monteCarloGame(data, simulations, team1, team2, weighted = True):
    '''
    Assigns probabilities to each possible outcome of a soccer game between
    two teams

    Parameters
    ----------
    data : DataFrame
        a table that contains the results of previous games
    simulations : int
        the number of simulations that will be used to determine the probability
        the higher the number, the more stable the simulation
    team1 : String
        the first team
    team2 : String
        the second team
    weighted : Boolean
        True: uses the weighted Poisson model (default)
        False: uses the normal Poisson model

    Returns
    -------
    output : Series
        A Series that contains the possible outcomes of the game and their
        probabilities (in percent)
    '''
    winners = []
    for i in range(simulations):
        if weighted:
            wintable = weightedPoissonWintable(data)
        else:
            wintable = poissonWintable(data)
        winners.append(matchWinner(wintable, team1, team2)[0])
    output = pd.Series(winners).value_counts()*100/simulations
    return output.rename('Win Probability')

print(monteCarloGame(ECdata, 200, "Austria", "Belgium"))
Belgium 54.5
Austria 28.0
tie 17.5
Name: Win Probability, dtype: float64
It can be seen that the probability of Belgium winning has increased considerably compared to the
non-weighted Poisson model. This is because Austria performed well against Belgium in the past,
but recently Belgium has been dominating the games against Austria. The weighted Poisson
model gives more importance to these recent games.
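With only 200 simulations, the estimated probabilities carry some sampling noise. As a rough guide, and assuming the simulations are independent draws, the standard error of an estimated share follows the usual binomial formula. The helper monteCarloStandardError below is hypothetical and not part of the notebook.

```python
import math

def monteCarloStandardError(probability, simulations):
    # Standard error (in percentage points) of a probability that is
    # estimated as a share of independent simulations
    p = probability / 100
    return 100 * math.sqrt(p * (1 - p) / simulations)

# Using the 54.5% for Belgium from the output above with 200 simulations
se = monteCarloStandardError(54.5, 200)
print(round(se, 1))
```

Quadrupling the number of simulations halves this standard error, at the cost of a proportionally longer run time.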
Possible alternative models There are various other models that can be used to predict the
outcome of an individual soccer game. To address the problem with the mean and the variance,
one can for example use a negative binomial model. More complex models can also be used, and
there are various supervised learning techniques that can predict soccer games reasonably
well. As mentioned before, the goal of this project is to learn data manipulation and modeling in
Python; hence, the final and most complex model used in this project is the weighted
Poisson model. The remainder of this document only reports results based
on this weighted Poisson model.
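To give an idea of what the negative binomial alternative could look like, here is a minimal sketch that draws goal counts with a chosen mean and variance. It is not used anywhere in this notebook. The helper negativeBinomialGoals is hypothetical, the conversion to numpy's (n, p) parameters is a method-of-moments calculation, and the mean and variance are the pooled home-goal values from the describe output rather than per-pairing estimates.

```python
import numpy as np

def negativeBinomialGoals(mean, variance, size, rng):
    # Method-of-moments conversion to numpy's (n, p) parametrisation:
    # mean = n(1-p)/p and variance = n(1-p)/p**2, which requires variance > mean
    if variance <= mean:
        raise ValueError("the negative binomial needs variance > mean")
    p = mean / variance
    n = mean ** 2 / (variance - mean)
    return rng.negative_binomial(n, p, size)

rng = np.random.default_rng(2020)  # fixed seed, only to make the sketch repeatable
# Mean and variance of home goals taken from the describe() output (std squared)
goals = negativeBinomialGoals(1.745756, 1.753780 ** 2, 100000, rng)
print(goals.mean(), goals.var())
```

Unlike the Poisson draw, this keeps the variance as a separate parameter, so the simulated scores can be more spread out than their mean alone would imply.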
1.3.2 Predicting the outcome of the group stage
In the previous section we defined a method that allows us to predict the outcome of one soccer
game based on a weighted Poisson model. The next step in the project is to predict the outcome
of the group stage of the European Championship. We start by defining which teams belong in
which group:
[19]: groupA = ['Italy', 'Switzerland', 'Turkey', 'Wales']
groupB = ['Belgium', 'Denmark', 'Finland', 'Russia']
groupC = ['Austria', 'Netherlands', 'Georgia', 'Ukraine']
groupD = ['Croatia', 'Czech Republic', 'England', 'Scotland']
groupE = ['Slovakia', 'Poland', 'Spain', 'Sweden']
groupF = ['France', 'Germany', 'Hungary', 'Portugal']
groups = [groupA, groupB, groupC, groupD, groupE, groupF]
In order to simulate the results of the group stage we need a system that assigns
points to the teams based on the result of their game. We model the group stage as a
series of updates to the league table based on the results of the games:
[20]: def updateRanking(wintable, team1, team2, groupTable):
    '''
    This function updates the groupTable to include the result of the game
    between team1 and team2 based on the wintable.

    Parameters
    ----------
    wintable : dataframe
        a dataframe that contains the team names, the winner of a game
        between these teams and the goals scored by these teams
    team1 : string
        The name of the first team
    team2 : string
        The name of the second team
    groupTable : dataframe
        The current group ranking during the group stage of the European
        Championship

    Returns
    -------
    Does not return anything but updates the groupTable
    '''
    # Calculate the result of the game
    result = matchWinner(wintable, team1, team2)
    # Load the teams that are in the group table
    group = groupTable.loc[:, "Country"]
    # Assign the points
    # Because the dataframe is passed by reference, the updates to the dataframe
    # are visible outside this function and it does not need to return anything
    for i in range(4):
        if group[i] == team1:
            scored, conceded = result[1], result[2]
        elif group[i] == team2:
            scored, conceded = result[2], result[1]
        else:
            continue
        # 1 point for a tie, 3 points for a win, 0 points for a loss
        if result[0] == "tie":
            groupTable.loc[i, 'Points'] += 1
        elif result[0] == group[i]:
            groupTable.loc[i, 'Points'] += 3
        groupTable.loc[i, 'Goals_made'] += scored
        groupTable.loc[i, 'Goals_recieved'] += conceded
        groupTable.loc[i, 'Goal_difference'] += scored - conceded
The next step is creating a function that updates the ranking after every game played
in one group of the European Championship. This function allows the user to pick the method
they want to use. The default method is again the weighted Poisson model.
[21]: def simulateGroup(data, group, method = "weightedPoisson"):
    '''
    This function simulates the group stage of the European Championship for
    only one group.

    Parameters
    ----------
    data : dataframe
        a table that contains the results of previous games
    group : list
        a list that contains the names of the four teams in the group
    method : string
        "weightedPoisson": uses the weighted Poisson model (default)
        any other value: uses the normal Poisson model

    Returns
    -------
    groupTable : dataframe
        contains the team names and the number of points of these teams at the
        end of the group stage, it also includes the total number of goals
        scored and conceded as well as the goal difference
    '''
    # The default value has to match this string exactly (case-sensitive)
    if method == "weightedPoisson":
        wintable = weightedPoissonWintable(data)
    else:
        wintable = poissonWintable(data)
    groupTable = [[group[0], 0, 0, 0, 0], [group[1], 0, 0, 0, 0], [group[2], 0, 0, 0, 0], [group[3], 0, 0, 0, 0]]
    groupTable = pd.DataFrame(groupTable, columns = ['Country', 'Points', 'Goals_made', 'Goals_recieved', 'Goal_difference'])
    # Every team plays every other team in the group once
    updateRanking(wintable, group[0], group[1], groupTable)
    updateRanking(wintable, group[2], group[3], groupTable)
    updateRanking(wintable, group[0], group[2], groupTable)
    updateRanking(wintable, group[1], group[3], groupTable)
    updateRanking(wintable, group[0], group[3], groupTable)
    updateRanking(wintable, group[2], group[1], groupTable)
    groupTable = groupTable.sort_values(["Points", "Goals_made", "Goal_difference"], ascending = False)
    return groupTable

simulateGroup(ECdata, groupA)
[21]: Country Points Goals_made Goals_recieved Goal_difference
0 Italy 7 6 3 3
3 Wales 4 3 3 0
1 Switserland 3 3 3 0
2 Turkey 1 2 5 -3
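The six updateRanking calls correspond to a single round robin: every pair of teams in the group meets exactly once. As a small sketch of that schedule, the pairings can be generated with itertools.combinations from the standard library. This is only an illustration, and the order of the games differs from the order used in simulateGroup.

```python
from itertools import combinations

groupA = ['Italy', 'Switzerland', 'Turkey', 'Wales']

# Every unordered pair of teams plays exactly one game
fixtures = list(combinations(groupA, 2))
for team1, team2 in fixtures:
    print(team1, 'vs', team2)
```

A group of four teams therefore always produces six games, and each team plays three of them.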
The result above can be used to determine the probability of moving on to the next round. Note
that this is not as straightforward as it seems at first, since a team that finishes third can
also continue to the next round. Whether it does depends on the results of the other groups.
The real selection is quite complex, and it becomes even more complex once teams need to be
assigned to a certain game in the next round. In this project, we use a simpler model that only
takes the goal difference into account and assigns the third-placed team with the highest goal
difference to the first game slot, the one with the second highest goal difference to the second
game slot, and so on. This impacts the final simulation of the tournament. Again, the goal of
this project is learning Python, not building the perfect model to predict the European
Championship.
The probability of moving on to the next round can be determined with a Monte Carlo
simulation:
[22]: def monteCarloGroupStage(data, simulations, groups):
    qualified = []
    for j in range(simulations):
        tables = [simulateGroup(data, group) for group in groups]
        # The top two teams of every group move on
        for table in tables:
            qualified.extend(table['Country'].values[:2])
        # Defining which of the best third places moves on (this is a simplified
        # version). After sorting, position 2 holds the third-placed team.
        thirds = pd.DataFrame({
            'Country': [table['Country'].values[2] for table in tables],
            'Goal_difference': [table['Goal_difference'].values[2] for table in tables]})
        # A stable sort keeps the group order (A to F) for equal goal differences
        thirds = thirds.sort_values('Goal_difference', ascending = False, kind = 'stable')
        # The four best third-placed teams move on
        qualified.extend(thirds['Country'].values[:4])
    mc = pd.Series(qualified).value_counts()*100/simulations
    return mc.rename('Country')

monteCarloGroupStage(ECdata, 200, groups).head(5) # only the first 5 countries are shown
[22]: Italy 98.0
Russia 92.5
Spain 88.0
Netherlands 83.0
England 78.5
Name: Country, dtype: float64
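The selection of the best third-placed teams could be refined slightly by ranking on more than the goal difference. The sketch below ranks on points first, then goal difference, then goals scored. It is still a simplification, it does not reproduce the official tournament regulations, and all teams and numbers in the table are invented for illustration.

```python
import pandas as pd

# Hypothetical third-placed teams; all numbers are invented for illustration
thirds = pd.DataFrame({
    'Country': ['Wales', 'Finland', 'Ukraine', 'Scotland', 'Sweden', 'Hungary'],
    'Points': [4, 3, 3, 4, 2, 3],
    'Goal_difference': [0, -1, 1, 0, -2, -1],
    'Goals_made': [3, 2, 4, 4, 1, 3],
})

# Rank on points first, then on goal difference, then on goals scored
ranked = thirds.sort_values(['Points', 'Goal_difference', 'Goals_made'], ascending = False)
bestThirds = ranked['Country'].head(4).tolist()
print(bestThirds)
```

In this invented table, Scotland and Wales are level on points and goal difference, so the number of goals scored decides their order.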
1.3.3 Predicting the outcome of the tournament
The next step in this project is simulating the entire tournament. After having simulated the results
of the group stage, the next steps are straightforward. First, one should take into account the fact
that there cannot be a tie after the group stage. Thus, we should create a function that determines
the outcome of post group stage games. Note also that the score of the game is irrelevant:
[23]: def matchWinnerF(wintable, team1, team2):
    '''
    This function determines the outcome of the football game
    between team1 and team2 based on the provided win table.
    This function should be used for post group stage games.

    Parameters
    ----------
    wintable : DataFrame
        a table that contains the team names and the result of the game
    team1 : String
        The first team name
    team2 : String
        The second team name

    Returns
    -------
    output : String
        The winner of the game
    '''
    # The win table stores every pairing in alphabetical order, so the
    # full names are compared (not just their first letters)
    if team1 > team2:
        team1, team2 = team2, team1
    output = "tie"
    for index, row in wintable.iterrows():
        if row['team1'] == team1 and row['team2'] == team2:
            output = row['winner']
            break
    # A post group stage game cannot end in a tie: pick the winner at random
    if output == "tie":
        if random.random() > 0.5:
            output = team2
        else:
            output = team1
    return output
When the initial estimate of the game is a tie, the function picks the winner of the game
at random. It does not simulate extra time, a penalty shoot-out or anything similar. Whether
or not this is realistic is debatable.
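One possible refinement, which is not used in this notebook, is to simulate extra time before falling back on a coin flip. The sketch below assumes that goals in the 30 minutes of extra time follow a Poisson model with one third of the 90-minute rate and that a penalty shoot-out is a fair coin flip. The function knockoutWinner, its parameters and the rates 1.0 and 2.0 are all hypothetical.

```python
import numpy as np

def knockoutWinner(team1, team2, rate1, rate2, rng):
    # rate1 and rate2 are the expected goals per 90 minutes (assumed inputs)
    goals1, goals2 = rng.poisson(rate1), rng.poisson(rate2)
    if goals1 == goals2:
        # Extra time: 30 minutes, so one third of the 90-minute rates
        goals1 += rng.poisson(rate1 / 3)
        goals2 += rng.poisson(rate2 / 3)
    if goals1 == goals2:
        # Penalty shoot-out modelled as a fair coin flip
        return team1 if rng.random() < 0.5 else team2
    return team1 if goals1 > goals2 else team2

rng = np.random.default_rng(2020)  # fixed seed, only to make the sketch repeatable
winners = [knockoutWinner("Austria", "Belgium", 1.0, 2.0, rng) for _ in range(2000)]
belgiumShare = winners.count("Belgium") / len(winners)
print(belgiumShare)
```

Compared to a pure coin flip, this gives the team with the higher scoring rate a somewhat better chance in games that are level after 90 minutes.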
The next and final step of the project is creating a function that determines the winner of the
competition based on the weighted Poisson model. We define the following function:
[24]: def simulateTournament(data, groups):
groupA = simulateGroup(data, groups[0])
groupB = simulateGroup(data, groups[1])
groupC = simulateGroup(data, groups[2])
groupD = simulateGroup(data, groups[3])
groupE = simulateGroup(data, groups[4])
groupF = simulateGroup(data, groups[5])
# Defining which of the best third places moves on (this is a simplified␣
→version)
best_thirds = [groupA.loc[:, 'Country'].values[3], groupB.loc[:, 'Country'].
→values[3], groupC.loc[:, 'Country'].values[3],
groupD.loc[:, 'Country'].values[3], groupE.loc[:, 'Country'].
→values[3], groupF.loc[:, 'Country'].values[3]]
best_thirds_gd = [groupA.loc[:, 'Goal_difference'].values[3], groupB.loc[:,␣
→'Goal_difference'].values[3], groupC.loc[:, 'Goal_difference'].values[3],
groupD.loc[:, 'Goal_difference'].values[3], groupE.loc[:,␣
→'Goal_difference'].values[3], groupF.loc[:, 'Goal_difference'].values[3]]
x = [0,1,2,3,4,5]
shuffle(x)
for i in x:
if max(best_thirds_gd) == best_thirds_gd[i]:
A3 = best_thirds[i]
best_thirds_gd[i] = -1000
break
for i in x:
if max(best_thirds_gd) == best_thirds_gd[i]:
B3 = best_thirds[i]
best_thirds_gd[i] = -1000
break
for i in x:
if max(best_thirds_gd) == best_thirds_gd[i]:
C3 = best_thirds[i]
best_thirds_gd[i] = -1000
break
for i in x:
if max(best_thirds_gd) == best_thirds_gd[i]:
D3 = best_thirds[i]
best_thirds_gd[i] = -1000
break
A1 = groupA.loc[:, 'Country'].values[0]
A2 = groupA.loc[:, 'Country'].values[1]
B1 = groupB.loc[:, 'Country'].values[0]
B2 = groupB.loc[:, 'Country'].values[1]
C1 = groupC.loc[:, 'Country'].values[0]
C2 = groupC.loc[:, 'Country'].values[1]
D1 = groupD.loc[:, 'Country'].values[0]
D2 = groupD.loc[:, 'Country'].values[1]
E1 = groupE.loc[:, 'Country'].values[0]
E2 = groupE.loc[:, 'Country'].values[1]
F1 = groupF.loc[:, 'Country'].values[0]
19
F2 = groupF.loc[:, 'Country'].values[1]
winTable = weightedPoissonWintable(data)
winner1 = matchWinnerF(winTable, A2, B2)
winner2 = matchWinnerF(winTable, A1, C2)
winner3 = matchWinnerF(winTable, C1, D3)
winner4 = matchWinnerF(winTable, B1, A3)
winner5 = matchWinnerF(winTable, E2, D2)
winner6 = matchWinnerF(winTable, F1, B3)
winner7 = matchWinnerF(winTable, D1, F2)
winner8 = matchWinnerF(winTable, E1, C3)
winnerQF1 = matchWinnerF(winTable, winner6, winner5)
winnerQF2 = matchWinnerF(winTable, winner4, winner2)
winnerQF3 = matchWinnerF(winTable, winner3, winner1)
winnerQF4 = matchWinnerF(winTable, winner8, winner7)
winnerSF1 = matchWinnerF(winTable, winnerQF1, winnerQF2)
winnerSF2 = matchWinnerF(winTable, winnerQF3, winnerQF4)
return matchWinnerF(winTable, winnerSF1, winnerSF2)
print(simulateTournament(ECdata, groups))
France
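The selection of the best third-placed teams in `simulateTournament` is simplified: it looks at goal difference only. UEFA ranks third-placed teams on points first, then goal difference, then goals scored, followed by further criteria. A minimal sketch of such a ranking is given here, assuming a table with one row per third-placed team. The function name `best_third_places` and the column names `Points` and `Goals_scored` are assumptions for this example; only `Country` and `Goal_difference` appear in the notebook.

```python
import numpy as np
import pandas as pd

def best_third_places(third_rows, n_qualifiers=4, seed=None):
    """Return the names of the best third-placed teams.

    third_rows holds one row per group. Teams are ranked on points, then goal
    difference, then goals scored; teams that are still level are ordered by a
    random tie-break value.
    """
    rng = np.random.default_rng(seed)
    ranked = (third_rows
              .assign(_tiebreak=rng.random(len(third_rows)))
              .sort_values(['Points', 'Goal_difference', 'Goals_scored', '_tiebreak'],
                           ascending=False))
    return ranked['Country'].head(n_qualifiers).tolist()

# Made-up third-placed teams of six groups
thirds_example = pd.DataFrame({
    'Country': ['Team A', 'Team B', 'Team C', 'Team D', 'Team E', 'Team F'],
    'Points': [4, 3, 3, 1, 6, 3],
    'Goal_difference': [0, 1, 1, -3, 2, -1],
    'Goals_scored': [3, 4, 2, 1, 5, 2],
})
print(best_third_places(thirds_example, seed=0))
```

In this made-up example Team E, Team A, Team B and Team C move on: Team B and Team C are level on points and goal difference, so goals scored decides their order.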
Since the models that are used to simulate the group stage and the knockout stage of the tournament are
stochastic, the result changes every time we rerun the code. In order to assign probabilities to the
different teams winning, we perform another Monte Carlo simulation.
[25]: def montecarloTournament(data, simulations, groups):
          # Simulate the tournament repeatedly and collect the winner of every run
          # (DataFrame.append is no longer available in pandas 2.0 and later)
          winners = [simulateTournament(data, groups) for j in range(simulations)]
          # Share of the simulations won by each country, in percent
          mc = pd.Series(winners, name = 'Country').value_counts()*100/simulations
          return mc

      montecarloTournament(ECdata, 200, groups)
[25]: Spain 13.0
Netherlands 12.0
Germany 11.5
Italy 11.5
France 9.5
England 8.0
Portugal 6.5
Sweden 4.5
Russia 4.0
Belgium 4.0
Switserland 3.0
Croatia 2.5
Ukraine 2.0
Czech Republic 1.0
Slovakia 1.0
Scotland 1.0
Poland 1.0
Turkey 1.0
Austria 1.0
Denmark 1.0
Finland 0.5
Hungary 0.5
Name: Country, dtype: float64
The results of the simulation show that the countries with the highest probability of winning the
tournament are Spain (13.0%), the Netherlands (12.0%), Germany and Italy (11.5% each) and France
(9.5%). Based on this analysis, the prediction for the European Championship 2020 would be that
Spain wins the competition, although its lead over the next teams is small.
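With only 200 simulations, the simulated win percentages are themselves noisy. A rough way to quantify this, assuming that the simulated tournaments are independent, is the standard error of a proportion, sqrt(p(1 - p) / n). The helper `monte_carlo_standard_error` below is a name made up for this illustration and is not part of the original notebook.

```python
import math

def monte_carlo_standard_error(percentage, simulations):
    """Standard error, in percentage points, of a simulated win percentage."""
    p = percentage / 100
    return 100 * math.sqrt(p * (1 - p) / simulations)

# Win percentages taken from the simulation output, each based on 200 runs
for country, pct in [("Spain", 13.0), ("Netherlands", 12.0), ("France", 9.5)]:
    se = monte_carlo_standard_error(pct, 200)
    # 1.96 standard errors give an approximate 95% interval
    print(f"{country}: {pct:.1f}% +/- {1.96 * se:.1f} percentage points")
```

For Spain this gives a standard error of roughly 2.4 percentage points, so the gap between 13.0% and 12.0% is well within the simulation noise. Increasing the number of simulations would narrow these intervals.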
An advantage of modeling the probability of every possible outcome, rather than predicting a
single winner, is that after the tournament we can check how likely the actual outcome was
according to the model.
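One simple way to carry out such a check is the log score: the negative logarithm of the probability that the model assigned to the team that actually won. The sketch below is only an illustration; the function name `log_score` is an assumption, and the dictionary holds a few of the simulated percentages from the output above.

```python
import math

# A few of the simulated win probabilities (in percent) from the output above
win_probability = {"Spain": 13.0, "Netherlands": 12.0, "Germany": 11.5,
                   "Italy": 11.5, "France": 9.5}

def log_score(probabilities, actual_winner):
    """Negative log-likelihood of the actual winner; lower is better."""
    p = probabilities.get(actual_winner, 0.0) / 100
    if p == 0:
        return math.inf  # the model considered this outcome impossible
    return -math.log(p)

# Hypothetical outcome: suppose Italy wins the tournament
print(round(log_score(win_probability, "Italy"), 2))
# Baseline: a model that gives all 24 teams the same chance
print(round(math.log(24), 2))
```

A score below the baseline of ln(24), which is about 3.18, means that the model did better than giving all 24 participants an equal chance.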
2 Conclusion
The goal of this project was to get familiar with basic data analysis and modeling in Python. This
goal has been accomplished. In order to predict the outcome of the European Championships, the
probability of every team winning the championship was estimated using a Monte Carlo simulation
of the entire tournament. Several important assumptions were made in order to make the
modeling easier. Nonetheless, the results of the model seem plausible if one takes into account
that only historical data was used.
Predictions European Championships 2020

  • 1. EC2020 Soccer predictions March 29, 2020 1 EC2020 Soccer predictions The goal of this notebook is to make a prediction for the European Championships of 2020. The data that will be used in order to make these predictions are the results of international soccer games from 1872 to 2020. The main goal of this analysis is to get familiar with the python language hence a simplistic algorithm will be used for the predictions. The algorithm that will be used to predict the outcome of the European Championships is the following: - Look at the (weighted) average amount of goals both teams scored in previous encounters - Model the amount of goals these teams scored following a poisson model - Determine the winner of the game based on the amount of goals both teams scored - Give the teams points based on the result of the game in the groupstage or determine which team will move on to the next stage - Define a function that simulates the entire tournament game by game - Perform a monte carlo simulation As I go through the code, it will become clear what kind of assumptions have been made. Note: The 2020 European championship soccer has been postponed due to the spread of the covid- 19 virus. 1.1 Setting up the environment In order to manage this project in the most efficient way, the following set-up will be used: - A github repository is used to update the code - A python script is used to make sure that the code is self contained - This Jupyter notebook is created to document what has been done so far Next, the necessary packages will be loaded in the environment: [1]: import numpy as np import pandas as pd import matplotlib.pyplot as plt from random import random from random import shuffle import random Now, the data is loaded into the environment: [2]: data = pd.read_csv("data/results.csv") 1
  • 2. 1.2 Exploring the dataset A first step in the exploration of the dataset is to look at the characteristics of the data. This can be easily done using the describe function from the pandas package: [3]: print(data.describe()) home_score away_score count 41586.000000 41586.000000 mean 1.745756 1.187587 std 1.753780 1.405323 min 0.000000 0.000000 25% 1.000000 0.000000 50% 1.000000 1.000000 75% 2.000000 2.000000 max 31.000000 21.000000 The description of the data shows a couple of interesting things: - The mean amount of goals that is scored by the home team is 1.745756 while the mean amount of goals scored by the away team equals 1.187587. - The maximum amount of goals scored is 31. The maximum amount of goals scored by the away team equals 21. These datapoints should be further explored. To explore the games where there was a large amount of goals, we sort the data based on the amount of home and away goals (descending) and then look at the first 5 observations in the dataframe: [4]: dataSortedHomeGoals = data.sort_values("home_score", ascending = False) dataSortedHomeGoals = dataSortedHomeGoals[['date', 'home_team','away_team'␣ →,'home_score', 'away_score']] temp = dataSortedHomeGoals.head(5) del dataSortedHomeGoals temp [4]: date home_team away_team home_score away_score 23796 2001-04-11 Australia American Samoa 31 0 7902 1971-09-13 Tahiti Cook Islands 30 0 10949 1979-08-30 Fiji Kiribati 24 0 23793 2001-04-09 Australia Tonga 22 0 28838 2006-11-24 Sápmi Monaco 21 1 [5]: dataSortedAwayGoals = data.sort_values("away_score", ascending = False) dataSortedAwayGoals = dataSortedAwayGoals[['date','home_team','away_team'␣ →,'home_score', 'away_score']] temp = dataSortedAwayGoals.head(5) del dataSortedAwayGoals temp 2
  • 3. [5]: date home_team away_team home_score away_score 27373 2005-03-11 Guam North Korea 0 21 25694 2003-06-30 Sark Isle of Wight 0 20 36020 2014-06-01 Darfur Padania 0 20 14720 1987-12-15 American Samoa Papua New Guinea 0 20 36025 2014-06-02 Darfur South Ossetia 0 19 The extreme values for the number of goals scored appear to be valid observations. For example, the game between Guam and North Korea played on 11 March 2005 did indeed finish with a goal difference of 21 goals: [Figure: img/GuamNK.png] The number of goals scored by home and by away teams can be compared by plotting two histograms. [6]: plt.subplot(1, 2, 1) plt.hist(data.loc[:,'home_score'].values) plt.subplot(1, 2, 2) plt.hist(data.loc[:,'away_score'].values) plt.show() Both plots look very similar, indicating that both away and home teams are expected to score only
  • 4. a couple of goals. Large numbers of goals are very rare. This matches our expectations based on everyday life: it hardly ever happens that a team scores more than 5 goals in one game. 1.3 Predicting the European Championship 2020 The outcome of the European Championship will be predicted by assigning every team a probability of winning the tournament and picking the team with the highest probability. 1.3.1 Predicting individual game results In order to predict the outcome of the European Championship, it is necessary to predict the individual results of the different games. In this part, we predict the result of a game between two countries by looking at past results of games between these countries. It is thus necessary to look at how much data we have for the different games between countries. First we have to filter the data to only include countries that are participating in the European Championship: [7]: # I will assume that Hungary, Slovakia, Scotland and Georgia passed the play-offs countryList = (["Italy", "Switzerland", "Turkey", "Wales", "Belgium", "Denmark", "Finland", "Russia", "Austria", "Netherlands", "Ukraine", "Croatia", "Czech Republic", "England", "Poland", "Spain", "Sweden", "France", "Germany", "Portugal", "Hungary", "Slovakia", "Scotland", "Georgia"]) ECdata = data.loc[data['home_team'].isin(countryList) & data['away_team'].isin(countryList)].copy() del data Next we look at the combinations between the different countries and try to identify how many games they have played against each other: [8]: print(ECdata.groupby(['home_team', 'away_team']).size()) home_team away_team Austria Belgium 6 Croatia 4 Czech Republic 3 Denmark 5 England 12 .. Wales Spain 3 Sweden 3 Switzerland 3 Turkey 3 Ukraine 1 Length: 536, dtype: int64 This output shows a problem with the grouping: at the moment, a game between e.g. Austria and Belgium is counted separately from a game between Belgium and Austria.
For the purpose of our analysis the two should be treated as the same pairing, so we need to resolve this issue. One way to do this is by going
  • 5. through the observations and putting the country that comes first alphabetically in team1 and the other country in team2. [9]: ECdata.loc[:,'team1'] = np.where(ECdata.loc[:,'home_team'] < ECdata.loc[:,'away_team'], ECdata.loc[:,'home_team'], ECdata.loc[:,'away_team']) ECdata.loc[:,'team1_score'] = np.where(ECdata.loc[:,'home_team'] < ECdata.loc[:,'away_team'], ECdata.loc[:,'home_score'], ECdata.loc[:,'away_score']) ECdata.loc[:,'team2'] = np.where(ECdata.loc[:,'home_team'] > ECdata.loc[:,'away_team'], ECdata.loc[:,'home_team'], ECdata.loc[:,'away_team']) ECdata.loc[:,'team2_score'] = np.where(ECdata.loc[:,'home_team'] > ECdata.loc[:,'away_team'], ECdata.loc[:,'home_score'], ECdata.loc[:,'away_score']) ECdata = ECdata.loc[:, ['date', 'team1', 'team1_score', 'team2', 'team2_score', 'neutral']] print(ECdata.groupby(['team1', 'team2']).size()) team1 team2 Austria Belgium 14 Croatia 5 Czech Republic 5 Denmark 9 England 18 .. Switzerland Ukraine 2 Wales 7 Turkey Ukraine 9 Wales 6 Ukraine Wales 3 Length: 271, dtype: int64 Now the pairings are matched correctly: there are 271 distinct pairings of European teams for which we have data. The simplest possible prediction Let’s start by defining a simple game prediction function that predicts the outcome of a soccer game. This function uses the following algorithm: - It looks at all the past games that have been played between the two teams - Based on the number of goals scored it assigns a winner to each game - It aggregates the data and checks which team has won most of the games. If both teams have won the same number of games, or if there is no information available, it predicts a tie.
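As a worked example of the two ideas above (alphabetical ordering of each pairing, then a head-to-head majority), the sketch below runs them on a tiny made-up dataset. The four games are hypothetical and not taken from results.csv; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset (NOT from results.csv), using the notebook's column names.
games = pd.DataFrame({
    "home_team":  ["Belgium", "Austria", "Belgium", "Wales"],
    "away_team":  ["Austria", "Belgium", "Austria", "Austria"],
    "home_score": [2, 2, 0, 1],
    "away_score": [1, 1, 3, 1],
})

# Put the alphabetically first country in team1, regardless of who played at home.
home_first = games["home_team"] < games["away_team"]
games["team1"] = np.where(home_first, games["home_team"], games["away_team"])
games["team2"] = np.where(home_first, games["away_team"], games["home_team"])
games["team1_score"] = np.where(home_first, games["home_score"], games["away_score"])
games["team2_score"] = np.where(home_first, games["away_score"], games["home_score"])

# Count wins per pairing and pick the team with the most wins (tie otherwise).
games["team1_won"] = (games["team1_score"] > games["team2_score"]).astype(int)
games["team2_won"] = (games["team1_score"] < games["team2_score"]).astype(int)
wins = games.groupby(["team1", "team2"], as_index=False)[["team1_won", "team2_won"]].sum()
wins["winner"] = np.where(
    wins["team1_won"] > wins["team2_won"], wins["team1"],
    np.where(wins["team1_won"] < wins["team2_won"], wins["team2"], "tie"),
)
print(wins)
```

In this toy data Austria beats Belgium twice and loses once, so the Austria/Belgium pairing is assigned to Austria, while the single drawn game against Wales yields a tie.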
Let’s start by adding the new columns team1_won and team2_won to the dataframe: [10]: ECdata.loc[:,'team1_won'] = np.where(ECdata.loc[:, 'team1_score'] > ECdata.loc[:, 'team2_score'], 1, 0) ECdata.loc[:,'team2_won'] = np.where(ECdata.loc[:, 'team1_score'] < ECdata.loc[:, 'team2_score'], 1, 0) The next step is aggregating the results:
  • 6. [11]: ag = ECdata.groupby(['team1', 'team2'], as_index=False).agg('sum') ag = ag[['team1', 'team1_won', 'team2', 'team2_won']] print(ag) team1 team1_won team2 team2_won 0 Austria 9 Belgium 2 1 Austria 0 Croatia 5 2 Austria 2 Czech Republic 2 3 Austria 4 Denmark 4 4 Austria 4 England 10 .. ... ... ... ... 266 Switzerland 0 Ukraine 0 267 Switzerland 5 Wales 2 268 Turkey 4 Ukraine 2 269 Turkey 2 Wales 3 270 Ukraine 1 Wales 0 [271 rows x 4 columns] Now that the results are aggregated, it is trivial to assign a winner to every game: [12]: ag.loc[:, 'winner'] = np.where(ag.loc[:, 'team1_won'] > ag.loc[:, 'team2_won'],␣ →ag.loc[:, 'team1'], "tie" ) ag.loc[:, 'winner'] = np.where(ag.loc[:, 'team1_won'] < ag.loc[:, 'team2_won'],␣ →ag.loc[:, 'team2'], ag.loc[:, 'winner']) ag = ag[['team1', 'team2', 'winner']] ag.head(5) [12]: team1 team2 winner 0 Austria Belgium Austria 1 Austria Croatia Croatia 2 Austria Czech Republic tie 3 Austria Denmark tie 4 Austria England England What follows next is the function that will be used to determine which team won the game. Note that this function needs as input a table that shows which team wins against which team: [13]: def matchWinner(wintable, team1, team2): ''' This function determines the outcome of the football game between team 1 and team 2 based on the provided win table. Parameters ---------- wintable : DataFrame a table that contains the teamnames and the result og the game team1 : String 6
  • 7. The first teamname team2 : String The second teamname Returns ------- output : String The name of the winner of the game or "tie" ''' if team1 > team2: team1, team2 = team2, team1 output = "tie" for index, row in wintable.iterrows(): if(row['team1'] == team1): if(row['team2'] == team2): output = row['winner'] return output return output print(matchWinner(ag, "Belgium", "Austria")) Austria This function can only determine the winner of the game if the dataset contains at least one previous encounter between the teams. If there is no such encounter, the algorithm assumes that the game will result in a tie. A limitation of this function is that it cannot model the actual score of the game; it can only determine the winner. It is thus impossible to calculate goal differences and other relevant information. A more complex function The previous function is very basic and has a couple of limitations: - It cannot determine the final score of a game, only which team has won the game. - The model always produces exactly the same outcome as long as the data does not change. Hence, it is not possible to assign probabilities to which team wins the game. Both of these problems can be solved by switching to a Poisson model. The Poisson model estimates the number of goals that each team will score based on the average number of goals that the teams have scored in previous encounters. Since the model is stochastic, we can perform a Monte Carlo simulation to assign probabilities to which team will win which game. The Poisson model is used to construct the win table that is one of the inputs for the match winner function: [14]: def poissonWintable(data): ''' This function determines the outcome of the football game between teams and constructs a wintable based on the data.
  • 8. Parameters ---------- data : DataFrame a table that contains the results of previous games Returns ------- output : DataFrame A DataFrame that contains the team names and the winner and amount of␣ →goals scored by both teams. ''' # Collect the mean amount of goals the teams scored in previous encounters stoch = data.groupby(['team1', 'team2'], as_index=False).agg('mean') # Simulate the amount of goals using a poisson model stoch.loc[:,'team1_score'] = stoch.apply(lambda x: np.random.poisson(stoch. →loc[:, 'team1_score'], len(stoch)),axis=1)[0] stoch.loc[:,'team2_score'] = stoch.apply(lambda x: np.random.poisson(stoch. →loc[:, 'team2_score'], len(stoch)), axis=1)[0] # Determine which team has won the game stoch.loc[:, 'winner'] = np.where(stoch.loc[:, 'team1_score'] > stoch.loc[: →, 'team2_score'], stoch.loc[:, 'team1'], "tie") stoch.loc[:, 'winner'] = np.where(stoch.loc[:, 'team1_score'] < stoch.loc[:,␣ →'team2_score'], stoch.loc[:, 'team2'], stoch.loc[:, 'winner']) # Selecting the relevant output stoch = stoch[['team1', 'team2', 'winner' ,'team1_score', 'team2_score']] return stoch # Example result # Note: this result changes every time the code is rerun poissonWintable(ECdata).head(5) [14]: team1 team2 winner team1_score team2_score 0 Austria Belgium Austria 2 1 1 Austria Croatia tie 0 0 2 Austria Czech Republic Austria 3 2 3 Austria Denmark Austria 2 0 4 Austria England England 2 5 Since now, it is possible to determine the amount fo goals that both teams score, we can modify the matchWinner function to also include the result of the game and not only the winner: [15]: def matchWinner(wintable, team1, team2): ''' This function determines the iutcome of the football game 8
  • 9. between team 1 and team2 based on the provided win table. Parameters ---------- wintable : DataFrame a table that contains the teamnames and the result of the game team1 : String The first teamname team2 : String The second teamname Returns ------- output : List A list with the winner of the game, the goals scored by team1 and the goals scored by team2 ''' swapped = team1 > team2 if swapped: team1, team2 = team2, team1 output = ["tie", 1, 1] for index, row in wintable.iterrows(): if(row['team1'] == team1): if(row['team2'] == team2): output = [row['winner'], row['team1_score'], row['team2_score']] break if swapped: output[1], output[2] = output[2], output[1] return output matchWinner(poissonWintable(ECdata), "Austria", "Belgium") [15]: ['Belgium', 1, 3] The code above also shows that we predict a tie (1-1) if there has never been a previous encounter between the two teams. This is not a realistic assumption. As mentioned earlier, our model is stochastic and the results change every time we rerun the code. In order to get an idea of the probability that each team will win the game, we can perform a Monte Carlo simulation: [16]: def monteCarloGame(data, simulations, team1, team2): output = pd.DataFrame(data = [], columns = ['Win Probability']) for i in range(simulations):
  • 10. output = output.append(pd.DataFrame(data =␣ →[matchWinner(poissonWintable(data), team1, team2)[0]], columns = ['Win␣ →Probability'])) output = output['Win Probability'].value_counts()*100/simulations return output monteCarloGame(ECdata, 200, "Austria", "Belgium") [16]: Austria 75.0 Belgium 15.0 tie 10.0 Name: Win Probability, dtype: float64 The outcome based on the poisson model is much more informative compared to the outcome of the previous model. However, there are still some limitations to this model: - The model uses all previous data and attaches the same amount of importance to old and recent games. This is unrealistic since teams change throughout the years and the results from a long time ago are hardly relevant to predict the current games. Nonetheless, using the result from old games can be valuable since it shows which countries have a rich history in football. These kind of countries often perform better at big tournaments. - A more technical limitation: the poisson model implies that the mean amount of goals scored is equal to the variance. Looking at the results of the describe function at the beginning of the document. This assumption seems to be violated. The weighted poisson model The first limitation of the model can be solved by weighing the observations in the dataset i.e. less weight will be attached to observations from a long time ago and a lot of weight will be attached to recent observations. This can be coded as followed: [17]: def weightedPoissonWintable(data): ''' This function determines the outcome of the football game between teams and constructs a wintable based on the data. Parameters ---------- data : DataFrame a table that contains the results of previous games Returns ------- output : DataFrame A DataFrame that contains the team names and the winner and amount of␣ →goals scored by both teams. ''' data = data.tail(2000) # Giving the oldest observations a weight of zero data = data.reset_index() 10
  • 11. # Give the first observation (index = 0) some weight (can be tweaked) data.loc[:, 'weight'] = (data.index + 0.1)/max(data.index) data.loc[:, 'team1_score'] = data.loc[:, 'weight'] * data.loc[:,␣ →'team1_score'] data.loc[:, 'team2_score'] = data.loc[:, 'weight'] * data.loc[:,␣ →'team2_score'] data = data.groupby(['team1', 'team2'], as_index=False).agg('mean') data.loc[:, 'team1_score'] = (1 / (data.loc[:, 'weight'])) * data.loc[:,␣ →'team1_score'] data.loc[:, 'team2_score'] = (1 / (data.loc[:, 'weight'])) * data.loc[:,␣ →'team2_score'] # Simulate the amount of goals using a poisson model data.loc[:,'team1_score'] = data.apply(lambda x: np.random.poisson(data.loc[: →, 'team1_score'], len(data)),axis=1)[0] data.loc[:,'team2_score'] = data.apply(lambda x: np.random.poisson(data.loc[: →, 'team2_score'], len(data)), axis=1)[0] data.loc[:, 'winner'] = np.where(data.loc[:, 'team1_score'] > data.loc[:,␣ →'team2_score'], data.loc[:, 'team1'], "tie") data.loc[:, 'winner'] = np.where(data.loc[:, 'team1_score'] < data.loc[:,␣ →'team2_score'], data.loc[:, 'team2'], data.loc[:, 'winner']) data = data[['team1', 'team2', 'winner' ,'team1_score', 'team2_score']] return data weightedPoissonWintable(ECdata).head(5) [17]: team1 team2 winner team1_score team2_score 0 Austria Belgium tie 2 2 1 Austria Croatia Croatia 0 1 2 Austria Czech Republic Czech Republic 1 3 3 Austria Denmark Denmark 0 2 4 Austria England England 0 4 The monte carlo function can be easily updated to also include the case where we use the weighted poisson model: [18]: def monteCarloGame(data, simulations, team1, team2, weighted = True): ''' Assigns probabilities to each possible outcome of a soccer game between two teams Parameters ---------- data : DataFrame a table that contains the results of previous games simulations: int the number of simulations that will be used to determine the probability the higher the number, the more stable the simulation 11
  • 12. team1: String the first team team2: String the second team weighted: Boolean True: uses the weighted Poisson model (default) False: uses the unweighted Poisson model Returns ------- output : Series The possible outcomes of the game and their probabilities (in percent) ''' winners = [] for i in range(simulations): if weighted: wintable = weightedPoissonWintable(data) else: wintable = poissonWintable(data) winners.append(matchWinner(wintable, team1, team2)[0]) output = pd.Series(winners, name = 'Win Probability').value_counts()*100/simulations return output print(monteCarloGame(ECdata, 200, "Austria", "Belgium")) Belgium 54.5 Austria 28.0 tie 17.5 Name: Win Probability, dtype: float64 The probability of Belgium winning has increased considerably compared to the unweighted Poisson model. This is because Austria performed well against Belgium in the past, but recently Belgium has dominated the games against Austria. The weighted Poisson model gives more importance to these recent games. Possible alternative models There are various other models that can be used to predict the outcome of an individual soccer game. In order to address the mismatch between the mean and the variance, one can for example use a negative binomial model. More complex models can also be used, and there exist various supervised learning techniques that can predict soccer games reasonably well. As mentioned before, the goal of this project is to learn data manipulation and modeling in Python; hence, the final and most complex model used in this project is the weighted Poisson model. In the remainder of this document I will only report results based on this weighted Poisson model.
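To make the negative binomial alternative concrete, the sketch below moment-matches a negative binomial distribution to the home-goal mean and standard deviation reported by describe() earlier (1.745756 and 1.753780). It is only an illustration of the idea and is not used anywhere in the notebook; the helper name `negative_binomial_params` is an assumption, and the parametrisation is the one used by NumPy's `Generator.negative_binomial`.

```python
import numpy as np


def negative_binomial_params(mean, variance):
    """Moment-match NumPy's negative binomial (n, p) to a mean and a variance.

    NumPy's parametrisation has mean = n * (1 - p) / p and
    variance = n * (1 - p) / p**2, so it only fits when variance > mean.
    """
    if variance <= mean:
        raise ValueError("a negative binomial needs variance > mean")
    p = mean / variance
    n = mean ** 2 / (variance - mean)
    return n, p


# Home-goal statistics taken from the describe() output in section 1.2.
home_mean, home_std = 1.745756, 1.753780
n, p = negative_binomial_params(home_mean, home_std ** 2)

rng = np.random.default_rng(0)
sample = rng.negative_binomial(n, p, size=100_000)
print(n, p, sample.mean(), sample.var())
```

A Poisson model with the same mean would force the variance to equal 1.745756, whereas here the sample variance should land near the observed value of about 3.08. Note that these statistics pool all games, so they say little about the spread within a single pairing.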
  • 13. 1.3.2 Predicting the outcome of the group stage In the previous section we defined a method that predicts the outcome of one soccer game based on a weighted Poisson model. The next step in the project is to predict the outcome of the group stage of the European Championship. We start by defining which teams belong to which group: [19]: groupA = ['Italy', 'Switzerland', 'Turkey', 'Wales'] groupB = ['Belgium', 'Denmark', 'Finland', 'Russia'] groupC = ['Austria', 'Netherlands', 'Georgia', 'Ukraine'] groupD = ['Croatia', 'Czech Republic', 'England', 'Scotland'] groupE = ['Slovakia', 'Poland', 'Spain', 'Sweden'] groupF = ['France', 'Germany', 'Hungary', 'Portugal'] groups = [groupA, groupB, groupC, groupD, groupE, groupF] In order to simulate the results of the group stage we need a system that assigns points to the teams based on the results of their games. We model the group stage as a series of updates to the league table based on the results of the games: [20]: def updateRanking(wintable, team1, team2, groupTable): ''' This function updates the groupTable to include the result of the game between team1 and team2 based on the wintable. Parameters ---------- wintable : dataframe a dataframe that contains the team names, the winner of a game between these teams and the goals scored by these teams team1 : string The name of the first team team2 : string The name of the second team groupTable : dataframe The current group ranking during the group stage of the European Championships Returns ------- Does not return anything but updates the groupTable ''' # Calculate the result of the game result = matchWinner(wintable, team1, team2) # Load the teams that are in the team table as an array group = groupTable.loc[:, "Country"] # Assign the points # Because the dataframe is modified in place, the updates are visible # outside this function and it does not need to return anything
  • 14. for i in range(4): if group[i] == team1 and result[0] == "tie": groupTable.loc[i, 'Points'] += 1 groupTable.loc[i, 'Goals_made'] += result[1] groupTable.loc[i, 'Goals_recieved'] += result[2] groupTable.loc[i, 'Goal_difference'] += result[1] - result[2] elif group[i] == team1 and result[0] == team1: groupTable.loc[i, 'Points'] += 3 groupTable.loc[i, 'Goals_made'] += result[1] groupTable.loc[i, 'Goals_recieved'] += result[2] groupTable.loc[i, 'Goal_difference'] += result[1] - result[2] elif group[i] == team1 and result[0] == team2: groupTable.loc[i, 'Goals_made'] += result[1] groupTable.loc[i, 'Goals_recieved'] += result[2] groupTable.loc[i, 'Goal_difference'] += result[1] - result[2] if group[i] == team2 and result[0] == "tie": groupTable.loc[i, 'Points'] += 1 groupTable.loc[i, 'Goals_made'] += result[2] groupTable.loc[i, 'Goals_recieved'] += result[1] groupTable.loc[i, 'Goal_difference'] += result[2] - result[1] elif group[i] == team2 and result[0] == team2: groupTable.loc[i, 'Points'] += 3 groupTable.loc[i, 'Goals_made'] += result[2] groupTable.loc[i, 'Goals_recieved'] += result[1] groupTable.loc[i, 'Goal_difference'] += result[2] - result[1] elif group[i] == team2 and result[0] == team1: groupTable.loc[i, 'Goals_made'] += result[2] groupTable.loc[i, 'Goals_recieved'] += result[1] groupTable.loc[i, 'Goal_difference'] += result[2] - result[1] The next step is creating a function that updates the ranking after every game that has been played in one group of the European Championships. This function allows the user to pick the method they want to use. In this case the default method is again the weighted poisson model. [21]: def simulateGroup(data, group, method = "weightedpoisson"): ''' This function simulates the group stage of the European Championchip for only one group. 
Parameters ---------- wintable : dataframe a dataframe that contains the team names, the winner of a game between these teams and the goals scored by these teams group : list a list that contains the team names that are in a group 14
  • 15. Returns ------- groupTable : dataframe contains the team names and the number of points of these teams at the end of the group stage, it also includes the total number of goals scored and conceded as well as the goal difference ''' if method.lower() == "weightedpoisson": wintable = weightedPoissonWintable(data) else: wintable = poissonWintable(data) groupTable = [[group[0], 0, 0, 0, 0], [group[1], 0, 0, 0, 0], [group[2], 0, 0, 0, 0], [group[3], 0, 0, 0, 0]] groupTable = pd.DataFrame(groupTable, columns = ['Country', 'Points', 'Goals_made', 'Goals_recieved', 'Goal_difference']) updateRanking(wintable, group[0], group[1], groupTable) updateRanking(wintable, group[2], group[3], groupTable) updateRanking(wintable, group[0], group[2], groupTable) updateRanking(wintable, group[1], group[3], groupTable) updateRanking(wintable, group[0], group[3], groupTable) updateRanking(wintable, group[2], group[1], groupTable) groupTable = groupTable.sort_values(["Points", "Goals_made", "Goal_difference"], ascending = False) return groupTable simulateGroup(ECdata, groupA) [21]: Country Points Goals_made Goals_recieved Goal_difference 0 Italy 7 6 3 3 3 Wales 4 3 3 0 1 Switserland 3 3 3 0 2 Turkey 1 2 5 -3 The result above can be used to determine the probability of moving on to the next round. Note that this is not as straightforward as it seems, since a team that finishes third can also advance to the next round. Whether it does depends on the results in the other groups. The real selection rule is quite complex, and it becomes even more complex once teams need to be assigned to a certain game in the next round. In this project, we use a simpler model that only takes the goal difference into account: it assigns the third-placed team with the highest goal difference to the first game slot, the one with the second highest goal difference to the second game slot, and so on.
This impacts the final simulation of the tournament. Again, the goal of this project is learning Python, not building the perfect model to predict the European Championship. The probability of moving on to the next round can be determined with a Monte Carlo simulation:
  • 16. [22]: def monteCarloGroupStage(data, simulations, groups): mc = pd.DataFrame(data = [], columns = ['Country']) for j in range(simulations): groupA = simulateGroup(data, groups[0]) groupB = simulateGroup(data, groups[1]) groupC = simulateGroup(data, groups[2]) groupD = simulateGroup(data, groups[3]) groupE = simulateGroup(data, groups[4]) groupF = simulateGroup(data, groups[5]) # Defining which of the best third places moves on (this is a simplified version) best_thirds = [groupA.loc[:, 'Country'].values[2], groupB.loc[:, 'Country'].values[2], groupC.loc[:, 'Country'].values[2], groupD.loc[:, 'Country'].values[2], groupE.loc[:, 'Country'].values[2], groupF.loc[:, 'Country'].values[2]] best_thirds_gd = [groupA.loc[:, 'Goal_difference'].values[2], groupB.loc[:, 'Goal_difference'].values[2], groupC.loc[:, 'Goal_difference'].values[2], groupD.loc[:, 'Goal_difference'].values[2], groupE.loc[:, 'Goal_difference'].values[2], groupF.loc[:, 'Goal_difference'].values[2]] for i in range(6): if max(best_thirds_gd) == best_thirds_gd[i]: A3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in range(6): if max(best_thirds_gd) == best_thirds_gd[i]: B3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in range(6): if max(best_thirds_gd) == best_thirds_gd[i]: C3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in range(6): if max(best_thirds_gd) == best_thirds_gd[i]: D3 = best_thirds[i] best_thirds_gd[i] = -1000 break mc = mc.append(pd.DataFrame(data = [groupA.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupA.loc[:, 'Country'].values[1]], columns = ['Country']))
  • 17. mc = mc.append(pd.DataFrame(data = [groupB.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupB.loc[:, 'Country'].values[1]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupC.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupC.loc[:, 'Country'].values[1]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupD.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupD.loc[:, 'Country'].values[1]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupE.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupE.loc[:, 'Country'].values[1]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupF.loc[:, 'Country'].values[0]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [groupF.loc[:, 'Country'].values[1]], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [A3], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [B3], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [C3], columns = ['Country'])) mc = mc.append(pd.DataFrame(data = [D3], columns = ['Country'])) mc = pd.DataFrame(data = mc, columns = ['Country']) mc = mc['Country'].value_counts()*100/simulations return mc monteCarloGroupStage(ECdata, 200, groups).head(5) #only the first 5 countries are shown [22]: Italy 98.0 Russia 92.5 Spain 88.0 Netherlands 83.0 England 78.5 Name: Country, dtype: float64 1.3.3 Predicting the outcome of the tournament The next step in this project is simulating the entire tournament. Once the group stage has been simulated, the remaining steps are straightforward. First, we have to take into account that a game cannot end in a tie after the group stage. Thus, we need a function that determines the outcome of post group stage games.
Note also that the score of the game is irrelevant here:
  • 18. [23]: def matchWinnerF(wintable, team1, team2): ''' This function determines the outcome of the football game between team 1 and team2 based on the provided win table. This function should be used for post group stage games. Parameters ---------- wintable : DataFrame a table that contains the teamnames and the result of the game team1 : String The first teamname team2 : String The second teamname Returns ------- output : String The winner of the game ''' if team1 > team2: team1, team2 = team2, team1 output = "tie" for index, row in wintable.iterrows(): if(row['team1'] == team1): if(row['team2'] == team2): output = row['winner'] if output == "tie": if random.random() > 0.5: output = team2 else: output = team1 return output When the initial estimate of the game is a tie, the function picks the winner of the game at random. It does not simulate extra time or anything similar. Whether or not this is realistic is debatable. The next and final step of the project is creating a function that determines the winner of the competition based on the weighted Poisson model. We define the following function: [24]: def simulateTournament(data, groups): groupA = simulateGroup(data, groups[0]) groupB = simulateGroup(data, groups[1])
  • 19. groupC = simulateGroup(data, groups[2]) groupD = simulateGroup(data, groups[3]) groupE = simulateGroup(data, groups[4]) groupF = simulateGroup(data, groups[5]) # Defining which of the best third places moves on (this is a simplified version) best_thirds = [groupA.loc[:, 'Country'].values[2], groupB.loc[:, 'Country'].values[2], groupC.loc[:, 'Country'].values[2], groupD.loc[:, 'Country'].values[2], groupE.loc[:, 'Country'].values[2], groupF.loc[:, 'Country'].values[2]] best_thirds_gd = [groupA.loc[:, 'Goal_difference'].values[2], groupB.loc[:, 'Goal_difference'].values[2], groupC.loc[:, 'Goal_difference'].values[2], groupD.loc[:, 'Goal_difference'].values[2], groupE.loc[:, 'Goal_difference'].values[2], groupF.loc[:, 'Goal_difference'].values[2]] x = [0,1,2,3,4,5] shuffle(x) for i in x: if max(best_thirds_gd) == best_thirds_gd[i]: A3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in x: if max(best_thirds_gd) == best_thirds_gd[i]: B3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in x: if max(best_thirds_gd) == best_thirds_gd[i]: C3 = best_thirds[i] best_thirds_gd[i] = -1000 break for i in x: if max(best_thirds_gd) == best_thirds_gd[i]: D3 = best_thirds[i] best_thirds_gd[i] = -1000 break A1 = groupA.loc[:, 'Country'].values[0] A2 = groupA.loc[:, 'Country'].values[1] B1 = groupB.loc[:, 'Country'].values[0] B2 = groupB.loc[:, 'Country'].values[1] C1 = groupC.loc[:, 'Country'].values[0] C2 = groupC.loc[:, 'Country'].values[1] D1 = groupD.loc[:, 'Country'].values[0] D2 = groupD.loc[:, 'Country'].values[1] E1 = groupE.loc[:, 'Country'].values[0] E2 = groupE.loc[:, 'Country'].values[1] F1 = groupF.loc[:, 'Country'].values[0]
  • 20. F2 = groupF.loc[:, 'Country'].values[1] winTable = weightedPoissonWintable(data) winner1 = matchWinnerF(winTable, A2, B2) winner2 = matchWinnerF(winTable, A1, C2) winner3 = matchWinnerF(winTable, C1, D3) winner4 = matchWinnerF(winTable, B1, A3) winner5 = matchWinnerF(winTable, E2, D2) winner6 = matchWinnerF(winTable, F1, B3) winner7 = matchWinnerF(winTable, D1, F2) winner8 = matchWinnerF(winTable, E1, C3) winnerQF1 = matchWinnerF(winTable, winner6, winner5) winnerQF2 = matchWinnerF(winTable, winner4, winner2) winnerQF3 = matchWinnerF(winTable, winner3, winner1) winnerQF4 = matchWinnerF(winTable, winner8, winner7) winnerSF1 = matchWinnerF(winTable, winnerQF1, winnerQF2) winnerSF2 = matchWinnerF(winTable, winnerQF3, winnerQF4) return matchWinnerF(winTable, winnerSF1, winnerSF2) print(simulateTournament(ECdata, groups)) France Since the models that are used to simulate the group stage and the finals of the tournament are stochastic, the result changes everytime we rerun the code. In order to assign probabilities to the different teams winning we perform another monte carlo simulation. [25]: def montecarloTournament(data, simulations, groups): mc = pd.DataFrame(data = [], columns = ['Country']) for j in range(simulations): mc = mc.append(pd.DataFrame(data = [simulateTournament(data, groups)],␣ →columns = ['Country'])) mc = mc['Country'].value_counts()*100/simulations return mc montecarloTournament(ECdata, 200, groups) [25]: Spain 13.0 Netherlands 12.0 Germany 11.5 Italy 11.5 France 9.5 England 8.0 Portugal 6.5 Sweden 4.5 Russia 4.0 Belgium 4.0 Switserland 3.0 Croatia 2.5 20
  • 21. Ukraine 2.0 Czech Republic 1.0 Slovakia 1.0 Scotland 1.0 Poland 1.0 Turkey 1.0 Austria 1.0 Denmark 1.0 Finland 0.5 Hungary 0.5 Name: Country, dtype: float64 The results of the simulation show that the countries with the highest probability of winning the tournament are Spain, the Netherlands, Germany, Italy and France. Based on this analysis, the prediction for the European Championship 2020 is that Spain wins the competition. An advantage of modeling the probability of every possible end state is that, after the tournament, we can check how likely the actual outcome was according to the model. 2 Conclusion The goal of this project was to get familiar with basic data analysis and modeling in Python. This goal has been accomplished. In order to predict the outcome of the European Championship, the probability of every team winning the championship was calculated using a Monte Carlo simulation of the entire tournament. Several important assumptions were made in order to make the modeling easier. Nonetheless, the results of the model seem plausible if one takes into account that only historical data was used.
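The Monte Carlo step used throughout this notebook boils down to running a stochastic simulation many times and counting the outcomes. The sketch below shows that pattern with a plain `collections.Counter`, which also avoids `DataFrame.append` (removed in pandas 2.0). The helper `monte_carlo` and the stand-in `toy_tournament` are hypothetical; in the notebook the role of `toy_tournament` would be played by a call such as simulateTournament(ECdata, groups), and the weights here are made up.

```python
import random
from collections import Counter


def monte_carlo(simulate_once, simulations):
    """Call simulate_once() repeatedly and return each outcome's share in percent."""
    counts = Counter(simulate_once() for _ in range(simulations))
    return {outcome: 100 * count / simulations
            for outcome, count in counts.most_common()}


# Hypothetical stand-in for one tournament simulation; the weights are invented.
_rng = random.Random(2020)

def toy_tournament():
    return _rng.choices(["Spain", "Netherlands", "Germany"], weights=[5, 3, 2])[0]


probabilities = monte_carlo(toy_tournament, 1000)
print(probabilities)
```

The percentages always sum to 100, and increasing the number of simulations makes the estimates more stable from run to run.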