Major League Soccer Analytics with Python
Chris Armstrong, Dan Derringer, Jude Ken-Kwofie,
Hemanth Mahadevaiah and Sujana Veeraganti
Stevens Institute of Technology
Abstract– Unlike European soccer leagues and popular American sports, relatively little work has been done on Major League Soccer (MLS) player and team performance analytics. With MLS growing in popularity and only a small community of individuals conducting MLS analytics, we decided to apply the web analytics concepts taught in the Business Intelligence & Analytics class (BIA 660) to help determine player ratings and compensation. To this end we used the Python programming language and related modules to: 1) crawl the web, 2) scrape relevant data, 3) compile the captured data into a data set, 4) determine player ratings and simple statistics, and 5) create attractive plots showcasing the data relationships.

Index Terms–Major League Soccer, Python, Visualization, Web Scraping.

PROJECT GOAL
The primary goal of the project was to use the BIA 660 web analytics lessons on the Python programming language and related modules to analyze and visualize MLS specific data. The following Python modules were used in this work:
Web – mechanize, urllib2, BeautifulSoup, PyPDF2
Regular Expression – re
System & I/O – sys, StringIO, csv, print, json
Data Analysis – R, Pandas, Numpy, Scipy
Data Visualization – Tkinter, Matplotlib

The following sections describe our Python data scraping, compilation, analysis and visualization efforts.

DATA SCRAPING & COMPILATION
The Python script has gone through several iterations. The original plan was to extract four tables of players' all-time stats and six PDF files with salary data for players in 2007-2012. The idea was to merge these ten lists to create one master list; however, not all players in the all-time stats tables collected a salary in 2012, and not all of those that collected salaries in 2012 also collected a salary in 2007. This issue drastically reduced the number of records to analyze in the master list. Therefore, it was decided that only the salary data from 2012 would be used.

The next issue faced was how to merge the tables. The first idea was to use a for loop in Python to match the players' names and produce a master table with all their all-time stats and salaries from 2012. Although it was successful, it was less than ideal, as it took over 45 minutes to merge these five tables. The next idea was to write a script in R to merge the tables, since R is designed as a statistical tool and can better manipulate tables. This plan reduced the processing time to less than a minute, and we added the ability for Python to run the R script automatically after the data scraping was complete. However, this wasn't as clean as we would have liked. The final solution was to use the Pandas module for Python. The Pandas module gave us the ability to manipulate data the way we needed, without having to go outside of Python.

The key Python scripts used in our work are as follows:
MLS_Statistical_Application.py – Includes a full scraping function plus an interactive plotting feature developed in Tkinter. The Tkinter function imports a comma-separated value (csv) file and allows the user to plot results by selecting column names as the x- and y-axis.

DATA ANALYSIS
Initially, our analysis focused on determining 1) the best XI MLS players of all time and 2) whether a reasonable correlation exists between player compensation and performance, i.e., goals, assists, and shots. However, due to the lack of publicly available data on player passing efficiency, we found it challenging to build relationships between salary and performance and to determine the best players. Ultimately, we decided to analyze player compensation versus player goals, assists and shots, as well as to calculate simple statistics based on player minutes, goals, assists, shots, shots on goal, game winning goals and game winning assists. From a data set of 251 MLS players we determined for the year 2012:
The average MLS player earns $200,262.58.
The lowest paid player, Jeb Brovsky, earns $33,750.
The highest paid player, Thierry Henry, earns $5,000,000.
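
The Pandas-based merge described under DATA SCRAPING & COMPILATION, together with the kind of salary summary listed under DATA ANALYSIS, might look roughly like the sketch below. It is not the project's actual script: the column names (Player, Goals, Assists, Salary), the helper names build_master_table and salary_summary, and every number in the two small tables are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped tables. Player names, column
# names and all values are invented for illustration only.
stats = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Goals": [10, 2, 7, 0],
    "Assists": [5, 1, 3, 4],
})
salaries_2012 = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player E"],
    "Salary": [400000, 50000, 150000, 90000],
})


def build_master_table(stats_table, salary_table):
    """Join the two tables on player name.

    An inner join keeps only players that appear in both tables, which
    mirrors the drop in record count described in the text.
    """
    return stats_table.merge(salary_table, on="Player", how="inner")


def salary_summary(master, threshold=100000):
    """Return the mean salary, lowest/highest paid players, and the
    percentage of players earning at least `threshold`."""
    salary = master["Salary"]
    return {
        "mean": salary.mean(),
        "lowest_paid": master.loc[salary.idxmin(), "Player"],
        "highest_paid": master.loc[salary.idxmax(), "Player"],
        "share_at_or_above": (salary >= threshold).mean() * 100,
    }


master = build_master_table(stats, salaries_2012)
summary = salary_summary(master)
print(master)
print(summary)
```

Because the join is a single vectorized call rather than a name-by-name loop, this style of merge avoids the nested matching that made the first for-loop approach slow.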
November 13, 2012, Hoboken, NJ
Major League Soccer Analytics with Python
Out of the 251 players, 55.77% make salaries greater than or equal to $100,000. Additional statistics are presented below.

We also found, with the data on hand, that in MLS there is little to no correlation between players' salaries and their goals, assists and shots (shown in Figure 1). Player compensation seems to be based more on popularity than on the ability to produce goals, assists and shots. There is a solid relationship between a player's Google search hit rate and salary.

The lack of correlation between salary and performance is an interesting result, since in other leagues the highest paid players are usually the best at scoring and assisting. As mentioned earlier, an adequate data set on player passing may provide better insight into the relationship between salary and performance.

DATA VISUALIZATION
The visual representations of the statistics were generated with R, Matplotlib and Pandas. Scatter plots and histograms were developed to show:
Player compensation versus player goals, assists and shots (scatter plots)
Player minutes, goals, assists, shots, shots on goal, game winning goals and game winning assists (histograms)
The following section presents a few of the generated visuals.

FIGURES, TABLES AND EQUATIONS
TABLE 1 - PLAYER PAY BY POSITION
Results
The statistics above show the average, median, lowest and highest salary by position. Also included in the table are the five highest paid players at each position. As anticipated, forwards are paid the highest salaries of the four positions. Goalkeepers are the lowest wage earners on average.

FIGURE 1 GOALS AND ASSISTS VERSUS SALARY
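
The salary-versus-performance relationship discussed in the DATA ANALYSIS text can be checked numerically with a Pearson correlation coefficient. The sketch below assumes Numpy is used for the calculation; the helper name pearson_r and all sample values are hypothetical and do not come from the project's data set, so the printed coefficients say nothing about MLS itself.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the coefficient between x and y.
    return float(np.corrcoef(x, y)[0, 1])


# Made-up sample: six players' salaries and season totals.
salary = [45000, 60000, 120000, 250000, 1000000, 5000000]
performance = {
    "Goals": [5, 1, 8, 2, 6, 4],
    "Assists": [3, 4, 1, 6, 2, 5],
    "Shots": [30, 12, 41, 18, 35, 27],
}

# One coefficient per performance column, each measured against salary.
correlations = {name: pearson_r(salary, values)
                for name, values in performance.items()}

for name, r in correlations.items():
    print(f"Salary vs {name}: r = {r:.3f}")
```

Values near 0 indicate little linear relationship, while values near +1 or -1 indicate a strong one; this is the sense in which the text reports "little to no correlation."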
FIGURE 3 FORWARD, DEFENDER AND MIDFIELDER GOALS VERSUS SALARY
Results
There is little correlation between goals or assists and a high salary.
Owners can get similar goal/assist production from someone making < $200K as from someone making > $400K to $1.2M. This suggests that higher paid players have the same impact on goals or assists as lower paid players, which is interesting.
The data show that the players have similar skill sets. It takes special players to score goals or give assists.

FIGURE 2 - 3D PLOT OF FORWARDS GOALS, ASSISTS AND MINUTES
Figure 2 shows a 3D rendering of player assists, minutes and game winning assists. In general, the plots show little correlation between the fields. However, for defenders there is a strong correlation between the fields, suggesting that assists by defenders lead to wins.

FIGURE 3 - HISTOGRAMS OF PLAYER MINUTES, GOALS, ASSISTS, SHOTS, SHOTS ON GOAL, GAME WINNING GOALS, GAME WINNING ASSISTS AND SALARY
Results
The plot shows exploratory data analysis of attributes such as Minutes, Goals, Shots, Assists, Shots on Goal, Game Winning Goals, Game Winning Assists and Salary, summarizing their main characteristics in an easy-to-understand form.
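
Histograms of this kind can be produced with Matplotlib in a few lines. The following is only a sketch under stated assumptions: the function name plot_histograms, the bin count, and the sample values are hypothetical, and the off-screen "Agg" backend is chosen here simply so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display is required
import matplotlib.pyplot as plt


def plot_histograms(columns, bins=5):
    """Draw one histogram per attribute, side by side, and return the figure.

    `columns` maps an attribute name to a list of per-player values.
    """
    fig, axes = plt.subplots(1, len(columns),
                             figsize=(4 * len(columns), 3), squeeze=False)
    for ax, (name, values) in zip(axes[0], columns.items()):
        ax.hist(values, bins=bins)
        ax.set_title(name)
        ax.set_xlabel(name)
        ax.set_ylabel("Players")
    fig.tight_layout()
    return fig


# Made-up values for six players; not taken from the MLS data set.
sample = {
    "Minutes": [2700, 1800, 950, 2400, 300, 2100],
    "Goals": [5, 1, 8, 2, 0, 4],
    "Salary": [45000, 60000, 120000, 250000, 44000, 1000000],
}

fig = plot_histograms(sample)
```

Calling fig.savefig("histograms.png") afterwards would write the panels to an image file.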
CONCLUSION
Unlike European soccer leagues and popular American sports, relatively little work has been done on Major League Soccer (MLS) player and team performance analytics. With MLS growing in popularity and only a small community of individuals conducting MLS analytics, we decided to apply the web analytics concepts taught in the Business Intelligence & Analytics class (BIA 660) to help determine player ratings and compensation.
The primary goal of the project was to use BIA 660 web
analytics lessons on the Python programming language and
related modules to analyze and visualize MLS specific data.
ACKNOWLEDGMENT
We acknowledge the mentoring of Professor Winter Mason.
AUTHOR INFORMATION
Chris Armstrong, chris.r.armstrong@gmail.com
Dan Derringer, dderringer311@gmail.com
Jude Ken-Kwofie, jkenkwof@stevens.edu
Hemanth Mahadevaiah, hemanth.m1@gmail.com
Sujana Veeraganti, sujanaveeraganti@gmail.com
Stevens Institute of Technology, Business Intelligence & Analytics Graduate Students