Our Data Scientist Maxime presented the results of his predictive model for the games of the 2015 Rugby World Cup.
You can also read his blog post here: http://www.dataiku.com/blog/2015/09/18/Sports_analytics_World_Cup.html
2. Introduction
Why? Because of this analysis of the last Rugby World Cup:
http://andrewyuan.github.io/EDAV-project.html
Outline:
• Getting the data
• Exploring the data
• Building the preliminary model
• Discussing limits and possible improvements of the model
3. Collecting Data
• Web scraping in Python (BeautifulSoup + urllib2)
• Thank you to rugbydata.com:
http://www.rugbydata.com/italy/romania/gamesplayed/
• Easy to parse!
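A minimal sketch of the parsing step. The slides use BeautifulSoup + urllib2; this version swaps in the standard library's html.parser so it runs without dependencies and without hitting the network. The table layout, the team names and the scores in SAMPLE_HTML are made up for illustration: the real markup of rugbydata.com is not shown in the slides.

```python
from html.parser import HTMLParser

# Hypothetical markup and invented scores, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td>2011-09-10</td><td>Team A</td><td>20</td><td>Team B</td><td>10</td></tr>
  <tr><td>2013-06-15</td><td>Team B</td><td>15</td><td>Team A</td><td>18</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_games(html):
    """Turn table rows into dicts, assuming the five-column layout above."""
    parser = RowParser()
    parser.feed(html)
    games = []
    for date, team1, points1, team2, points2 in parser.rows:
        games.append({
            "date": date,
            "team1": team1,
            "points1": int(points1),
            "team2": team2,
            "points2": int(points2),
        })
    return games


scraped_games = parse_games(SAMPLE_HTML)
```

With a real page, the HTML string would come from an HTTP request to a URL such as the one above, and the column order would have to be adapted to the actual table.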
7. Average number of points per game
Japan, Argentina and Namibia have the most points per game, while the
Six Nations teams have the fewest…
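One way such an average could be computed, as a sketch. It assumes "points per game" means the points a team scores itself; the slides do not say whether conceded points are included. The tuple layout and the toy games are hypothetical.

```python
from collections import defaultdict


def average_points(games):
    """Average points scored per game by each team.

    `games` is a list of (team1, points1, team2, points2) tuples,
    a layout assumed here for illustration.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for team1, points1, team2, points2 in games:
        totals[team1] += points1
        counts[team1] += 1
        totals[team2] += points2
        counts[team2] += 1
    return {team: totals[team] / counts[team] for team in totals}


# Toy data with invented scores.
toy_games = [
    ("A", 20, "B", 10),
    ("A", 30, "C", 15),
    ("B", 12, "C", 12),
]
averages = average_points(toy_games)
```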
8. Graph of games played
South Africa and Japan have never played each other!
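Missing edges in the graph of games played can be listed with a few lines of code. This is a sketch on hypothetical toy data, where each game is reduced to the pair of teams involved.

```python
from itertools import combinations


def never_played(teams, pairings):
    """Return the pairs of teams with no game between them.

    `pairings` is a list of (team1, team2) tuples; order does not matter.
    """
    played = {frozenset(pair) for pair in pairings}
    return [
        pair
        for pair in combinations(sorted(teams), 2)
        if frozenset(pair) not in played
    ]


# Toy data: A and C have never met.
toy_teams = {"A", "B", "C"}
toy_pairings = [("A", "B"), ("B", "C"), ("B", "A")]
missing = never_played(toy_teams, toy_pairings)
```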
9. Predicting the outcome of a game
• Outcome of a game: 0 if team 1 loses, 1 if team 1 wins.
• Choosing the features:
  - History of games (weighted or not)
  - History of points (weighted or not)
  - Head-to-head history (weighted or not)
  - Home game or not
  - Winning streaks
• Particular precautions: no features like the number of games played.
• Choice of algorithm: Random Forest
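A sketch of how some of the features listed above might be built for one matchup. The Game record, the exponential time weighting with a half-life in years, and all feature names are assumptions made for illustration; the slides do not describe the actual weighting. Rates are used rather than raw counts, in line with the precaution about the number of games played.

```python
from dataclasses import dataclass


@dataclass
class Game:  # hypothetical record layout
    year: int
    home: str
    away: str
    home_points: int
    away_points: int


def won(team, game):
    """True if `team` scored more points than its opponent in `game`."""
    if team == game.home:
        return game.home_points > game.away_points
    return game.away_points > game.home_points


def weighted_win_rate(team, games, now, half_life=4.0):
    """Win rate where a game `half_life` years old counts half as much."""
    num = den = 0.0
    for g in games:
        if team not in (g.home, g.away):
            continue
        weight = 0.5 ** ((now - g.year) / half_life)
        num += weight * won(team, g)
        den += weight
    return num / den if den else 0.5  # neutral value when there is no history


def win_streak(team, games):
    """Number of consecutive wins in the team's most recent games."""
    streak = 0
    for g in sorted(games, key=lambda g: g.year, reverse=True):
        if team not in (g.home, g.away):
            continue
        if not won(team, g):
            break
        streak += 1
    return streak


def build_features(team1, team2, games, now, team1_home):
    """Feature dict for one matchup; the target would be 1 if team 1 wins."""
    head_to_head = [g for g in games if {g.home, g.away} == {team1, team2}]
    return {
        "win_rate_1": weighted_win_rate(team1, games, now),
        "win_rate_2": weighted_win_rate(team2, games, now),
        "h2h_win_rate_1": weighted_win_rate(team1, head_to_head, now),
        "home_1": int(team1_home),
        "streak_1": win_streak(team1, games),
        "streak_2": win_streak(team2, games),
    }


# Toy history with invented scores.
history = [
    Game(2011, "A", "B", 20, 10),
    Game(2013, "B", "A", 15, 18),
    Game(2015, "A", "C", 9, 12),
]
features = build_features("A", "B", history, now=2015, team1_home=True)
```

One row of such features per historical game, together with the 0/1 outcome, could then be fed to a Random Forest classifier (for example scikit-learn's RandomForestClassifier, which is not used in this sketch).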
12. Results (good and not so good)
Assessing your predictions:
  - Common sense
  - Bookmakers
Comparison with the four games played so far:
France vs Italy: 0.881 (bookmaker: 0.909)
England vs Fiji: 0.880 (bookmaker: 0.933)
New Zealand vs Argentina: 0.943 (bookmaker: 0.980)
South Africa vs Japan: 0.496 (bookmaker: 0.964)
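The gap between the model and the bookmakers can be made explicit with a short computation on the four pairs of probabilities above. This is only a sketch; the variable names are illustrative.

```python
# (model probability, bookmaker probability) that the first team wins,
# copied from the comparison above.
predictions = {
    "France vs Italy": (0.881, 0.909),
    "England vs Fiji": (0.880, 0.933),
    "New Zealand vs Argentina": (0.943, 0.980),
    "South Africa vs Japan": (0.496, 0.964),
}

# Absolute difference between the two probabilities for each game.
gaps = {
    match: round(abs(bookmaker - model), 3)
    for match, (model, bookmaker) in predictions.items()
}

# The game where the model disagrees most with the bookmakers.
largest_gap = max(gaps, key=gaps.get)

for match, gap in gaps.items():
    print(f"{match}: gap {gap:.3f}")
```

The model stays within about 0.06 of the bookmakers on three games, and the disagreement is concentrated on South Africa vs Japan, the pair with no previous direct game.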
13. Limits and possible improvements
• Predictions aren’t very good when there are very few direct
games between the two teams (Namibia and Japan for
example)
• Adding the global rankings
• No long-term simulation possible
• Doesn’t take bonuses/penalties into account
• Adding new features (teams in common…)
• Taking into account the players that compose the team: