Big Data and Automated Content Analysis
Week 6 – Monday
»Web scraping«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
12 March 2018
Today
1 A first intro to parsing webpages
2 OK, but this surely can be done more elegantly? Yes!
3 Scaling up
4 Next steps
Let’s have a look at a webpage with comments and try to understand the underlying structure.
Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
A first plan
1 Download the page
• Possibly taking measures to deal with cookie walls, being blocked, etc.
2 Remove all line breaks (\n, but maybe also \n\r or \r) and TABs (\t): We want one long string
3 Try to isolate the comments
• Do we see any pattern in the source code? ⇒ last week: if we can see a pattern, we can describe it with a regular expression
import requests
import re

# download the page (via a cookie-wall bypass)
response = requests.get('http://kudtkoekiewet.nl/setcookie.php?t=http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html')

# we want one long string: replace line breaks and TABs by spaces
tekst = response.text.replace("\n", " ").replace("\t", " ")

comments = re.findall(r'<div class="cmt-content">(.*?)</div>', tekst)
print("There are", len(comments), "comments")
print("These are the first two:")
print(comments[:2])
Some remarks
The regexp
• .*? instead of .* means lazy matching. As .* matches everything, the part where the regexp should stop would not be analyzed (greedy matching) – we would get the whole rest of the document (or the line, but we removed all line breaks). See the minimal example below.
• The parentheses in (.*?) make sure that the function only returns what’s between them and not the surrounding stuff (like <div> and </div>)
Optimization
• Parse usernames, date, time, . . .
• Replace <p> tags
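A minimal, self-contained illustration of lazy vs. greedy matching; the toy HTML string is made up:

import re

html = '<div class="cmt-content">first</div> <div class="cmt-content">second</div>'

# greedy: .* runs on to the LAST </div> and swallows everything in between
print(re.findall(r'<div class="cmt-content">(.*)</div>', html))
# ['first</div> <div class="cmt-content">second']

# lazy: .*? stops at the FIRST </div>, so each comment is matched separately
print(re.findall(r'<div class="cmt-content">(.*?)</div>', html))
# ['first', 'second']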
Further reading
Doing this with other sites?
• It’s basically puzzle-solving with regular expressions.
• Look at the source code of the website to see how well-structured it is.
OK, but this surely can be done more elegantly? Yes!
Scraping
Geenstijl example
• Worked well (and we could do it with the knowledge we already had)
• But we can also use existing parsers (that can interpret the structure of the HTML page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It uses the module lxml.
What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a minute what this is)
But you can also create your XPATH yourself
There are multiple different XPATHs to address a specific element. Some things to play around with (see the sketch below):
• // means ‘arbitrary depth’ (= may be nested in many higher levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are all paragraphs)
• If you want to refer to a specific attribute of an HTML tag, you can use @. For example, every *[@id="reviews-container"] would grab a tag like <div id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code (via ‘inspect elements’) of the web page to think of other possible XPATHs!
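A minimal sketch to experiment with these building blocks; the HTML fragment is made up for illustration:

from lxml import html

# a tiny, hypothetical fragment with the structure described above
snippet = html.fromstring('<div id="reviews-container" class="user-content"><p>first</p><p>second</p></div>')

print(snippet.xpath('//p/text()'))     # all paragraphs at arbitrary depth: ['first', 'second']
print(snippet.xpath('//p[2]/text()'))  # the second paragraph: ['second']
print(snippet.xpath('//*[@id="reviews-container"]/p/text()'))  # via the attribute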
Let’s test it!
https://www.kieskeurig.nl/smartphone/product/3518003-samsung-galaxy-a5-2017-zwart/reviews
Let’s scrape them!
from lxml import html
import requests

# download the page and parse the HTML into a tree
htmlsource = requests.get("https://www.kieskeurig.nl/smartphone/product"
    "/3518003-samsung-galaxy-a5-2017-zwart/reviews").text
tree = html.fromstring(htmlsource)

# all text within elements of the class that holds the review text
reviews = tree.xpath('//*[@class="reviews-single__text"]/text()')

# remove empty reviews
reviews = [r.strip() for r in reviews if r.strip() != ""]

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
for i, review in enumerate(reviews):
    print("Review", i, ":", review[:60])
The output – perfect!
61 reviews scraped. Showing the first 60 characters of each:
Review 0 : Wat een fenomenale smartphone is de nieuwe Samsung Galaxy A5
Review 1 : Bij het openen van de verpakking valt direct op dat dit toes
Review 2 : Tijd om deze badboy eens aan de tand te voelen dus direct he
Review 3 : Na een volle week dit toestel gebruikt te hebben kan ik het
Review 4 : ik ben verliefd op de Samsung Galaxy A5 2017!
Scaling up
But this was on one page only, right?
Next step: Repeat for each relevant page.
Possibility 1: Based on URL schemes
If the URL of one review page is https://www.hostelworld.com/hosteldetails.php/ClinkNOORD/Amsterdam/93919/reviews?page=2 . . . then the next one is probably?
⇒ you can construct a list of all possible URLs:

baseurl = 'https://www.hostelworld.com/hosteldetails.php/ClinkNOORD/Amsterdam/93919/reviews?page='

allurls = [baseurl + str(i + 1) for i in range(20)]
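Continuing from this list, a minimal sketch of the scraping loop; the XPATH with class "review" is an assumption and needs to be adapted after inspecting the actual page:

from lxml import html
import requests

allreviews = []
for url in allurls:
    tree = html.fromstring(requests.get(url).text)
    # hypothetical XPATH -- check the real page structure first
    reviews = tree.xpath('//*[@class="review"]/text()')
    allreviews.extend(r.strip() for r in reviews if r.strip() != "")

print(len(allreviews), "reviews collected from", len(allurls), "pages")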
Possibility 2: Based on XPATHs
Use XPATH to get the URL of the next page (i.e., to get the link that you would click to get to the next reviews), as sketched below.
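A minimal sketch of this approach, assuming hypothetical XPATHs for the reviews and the ‘next page’ link (both must be adapted to the real page):

from lxml import html
import requests

url = "https://www.example.com/reviews"  # placeholder start page
allreviews = []
while url:
    tree = html.fromstring(requests.get(url).text)
    allreviews.extend(tree.xpath('//*[@class="review"]/text()'))
    # hypothetical XPATH for the href of the 'next page' link
    nextpage = tree.xpath('//a[@class="pagination-next"]/@href')
    url = nextpage[0] if nextpage else None  # stop when there is no next link

print(len(allreviews), "reviews collected")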
Recap
General idea
1 Identify each element by its XPATH (look it up in your browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a module like lxml)
4 Do something with the list (preprocess, analyze, save)
5 Repeat
Alternatives: scrapy, beautifulsoup (see the sketch below), regular expressions, . . .
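For comparison, the kieskeurig example with beautifulsoup instead of lxml – a sketch that assumes the same class name as above:

import requests
from bs4 import BeautifulSoup

htmlsource = requests.get("https://www.kieskeurig.nl/smartphone/product"
    "/3518003-samsung-galaxy-a5-2017-zwart/reviews").text
soup = BeautifulSoup(htmlsource, "html.parser")

# select by class, as in the XPATH version, and drop empty strings
reviews = [e.get_text().strip() for e in soup.find_all(class_="reviews-single__text")]
reviews = [r for r in reviews if r != ""]
print(len(reviews), "reviews scraped")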
Last remarks
There is often more than one way to specify an XPATH
1 Sometimes, you might want to use a different suggestion to be able to generalize better (e.g., using the attributes rather than the tags)
2 In that case, it makes sense to look deeper into the structure of the HTML code, for example with “Inspect Element”, and use that information to play around with in the XPATH Checker
Next steps
From now on. . .
. . . focus on individual projects!
Wednesday
• Write a scraper for a website of your choice!
• Prepare a bit, so that you know where to start
• Suggestions: Review texts and grades from http://iens.nl, articles from a news site of your choice, . . .