Using R for Scraping Data
             Ryan Elmore
    National Renewable Energy Lab

        rtelmore@gmail.com
          Twitter: rtelmore

            June 13, 2012
              useR! 2012
A Baseball Challenge

Question: Has the minimum number of pitches
per (full) inning (6 pitches) ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.


http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
The Boxscore




This column seems useful!
Dissecting the URL
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
• COL: the team. Just step through all of the teams: COL, BOS, etc.
• 20110401: YearMonthDay
• 0: Game ID
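
The URL pieces can be glued back together with paste. This is a minimal sketch: the helper name build_boxscore_url and its arguments are hypothetical, and the base URL is simply the prefix of the example URL on the slide.

```r
## Hypothetical helper: assemble one boxscore URL from its parts.
## base.url is taken from the example URL on the slide.
build_boxscore_url <- function(team, date, game.id = 0,
                               base.url = "http://www.baseball-reference.com/boxes") {
  ## Team code followed by YearMonthDay, e.g. COL20110401
  date.url <- paste(team, format(as.Date(date), "%Y%m%d"), sep = "")
  ## Append the Game ID and the file extension
  paste(paste(base.url, team, date.url, sep = "/"), game.id, ".shtml", sep = "")
}

build_boxscore_url("COL", "2011-04-01", 0)
## "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
```

Calling it with the team, date, and Game ID from the slide reproduces the example URL exactly.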
How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
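
The slide does not say how the 78K figure was reached. One reading that reproduces it, assuming days 1 to 31 were tried in every month and two Game IDs (0 and 1) per date, is sketched below; all variable names are illustrative.

```r
n.teams  <- 30
n.years  <- length(2005:2010)   # 6 seasons
n.months <- length(4:10)        # April through October
n.days   <- 31                  # assumption: days 1 to 31 tried in every month
n.ids    <- length(0:1)         # assumption: two Game IDs per date

n.urls <- n.teams * n.years * n.months * n.days * n.ids
n.urls                          # 78120

run.seconds <- (3 * 60 + 25) * 60   # 3 hrs 25 min = 12300 seconds
run.seconds / n.urls                # roughly 0.157 seconds per URL
```

Counting only real calendar dates (214 days from Apr 1 through Oct 31) would give 30 * 6 * 214 * 2 = 77,040 URLs, which is why the 31-day reading seems the closer match to "a little more than 78K".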
R Code
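library(XML)  ## provides readHTMLTable

## teams, years, months, days, and base.url are defined beforehand
for (team in teams){
  for (year in years){
    ## Progress message: one line per team and year
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months){
      for (day in days){
        ## Assumed form of date.url (not shown on the slide):
        ## team code followed by YearMonthDay, e.g. COL20110401
        date.url <- paste(team, year, sprintf("%02d", as.integer(month)),
                          sprintf("%02d", as.integer(day)), sep = "")
        for (i in 0:1){
          ## i is the Game ID appended to the date
          full.url <- paste(paste(base.url, team, date.url,
             sep="/"), i, ".shtml", sep="")
          table.stats <- readHTMLTable(full.url)
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}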
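
Many of the team/date/Game ID combinations will not correspond to a real boxscore, so the read step can fail. The slide does not show how failures were handled; one possible approach, sketched here with a hypothetical wrapper safe.read, is to catch the error and move on. The stand-in fake.reader only exists so the example runs offline.

```r
## Hypothetical wrapper: return NULL instead of stopping when a page
## cannot be read. `reader` defaults to XML::readHTMLTable, but any
## function that takes a URL can be swapped in.
safe.read <- function(url, reader = XML::readHTMLTable) {
  tryCatch(reader(url), error = function(e) NULL)
}

## Offline stand-in for a real reader (placeholder data, not a scrape):
fake.reader <- function(url) {
  if (grepl("COL201104010", url, fixed = TRUE)) {
    list(example = data.frame(x = 1))
  } else {
    stop("page not found")
  }
}

hit  <- safe.read("http://www.baseball-reference.com/boxes/COL/COL201104010.shtml",
                  fake.reader)
miss <- safe.read("http://www.baseball-reference.com/boxes/COL/COL201104311.shtml",
                  fake.reader)
is.null(hit)   # FALSE
is.null(miss)  # TRUE
```

Inside the loop, a NULL result would simply be skipped with next before any processing.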
Tools

•   base: paste, strsplit, unlist, lapply
•   XML: readHTMLTable, htmlTreeParse,
    getNodeSet, xmlValue, xmlSApply
•   httr, stringr, and other Hadley things
•   useful, but not necessary: regex, xpath,
    XML, etc.
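
As a small illustration of the base tools listed above, the sketch below uses strsplit, unlist, and lapply to take boxscore URLs apart again. The function name parse.url is hypothetical, the second URL is made up to follow the same pattern, and the code assumes three-letter team codes.

```r
urls <- c("http://www.baseball-reference.com/boxes/COL/COL201104010.shtml",
          "http://www.baseball-reference.com/boxes/BOS/BOS201104011.shtml")

## Hypothetical helper: recover team, date, and Game ID from one URL.
parse.url <- function(url) {
  parts <- unlist(strsplit(url, "/", fixed = TRUE))
  ## Last path component without the extension, e.g. COL201104010
  file <- sub(".shtml", "", parts[length(parts)], fixed = TRUE)
  c(team    = substr(file, 1, 3),    # assumes three-letter team codes
    date    = substr(file, 4, 11),   # YearMonthDay
    game.id = substr(file, 12, 12))
}

parsed <- lapply(urls, parse.url)
do.call(rbind, parsed)   # one row per URL: team, date, game.id
```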
Conclusions/Discussion

• There is a lot of data available on the web!
• You can access this data from a browser;
  however, you can access A LOT more data
  if you let your computer do the work.
• R and its libraries provide a great platform
  for scraping data and data mining.
• Download data and see where you go.
Was That Minimum Attained?

• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• The skills are transferable to other
  websites with data.
