The data

The script

Your turn

Questions?

Hands-on Workshop
Big (Twitter) Data
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam

30 January 2014
10.45
#bigdata

Damian Trilling
In this session (2/4):
1 The data

Recording tweets with yourTwapperkeeper
CSV-files
Other ways to collect tweets
Not that different: Facebook posts
2 The script

Pseudo-code
Python code
The output
3 Your turn
4 Questions?

The data:
Recording tweets with yourTwapperkeeper
http://datacollection.followthenews-uva.cloudlet.sara.nl

yourTwapperkeeper

Storage
Continuously calls the Twitter API and saves all
tweets containing specific hashtags to a
MySQL database.
You tell it once which data to collect – and
wait some months.

yourTwapperkeeper

Retrieving the data
You could access the MySQL database directly,
but yourTwapperkeeper has a nice interface
that allows you to export the data to a format
we can use for the analysis.

The data:
CSV-files

CSV-files

The format of our choice
• All programs can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No limits regarding the size
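As a quick illustration (not part of the workshop script), reading such a file takes only a few lines with Python's standard-library csv module; the two-row sample data here is made up:

```python
import csv
import io

# A tiny two-column CSV, kept in memory for illustration
# (a real file object from open() would work the same way):
rawdata = 'text,from_user\nHello #bigdata,henklbr\n'

reader = csv.reader(io.StringIO(rawdata))
rows = list(reader)
print(rows[0])  # the header row: ['text', 'from_user']
print(rows[1])  # the first data row: ['Hello #bigdata', 'henklbr']
```

The workshop script itself uses the unicsv helper module (shown later) because the Python 2 csv module does not handle Unicode out of the box.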

text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935

The data:
Other ways to collect tweets

Other ways to collect tweets
Again, we want a CSV file . . .
• If you want tweets per person: www.allmytweets.net
• Up to six days backwards: www.scraperwiki.com
• Buy it from a commercial vendor
• TCAT (from the guys at DMI/mediastudies)
• For specific purposes, write your own Python script to access the Twitter API
(if you want to, I can show you more about this tomorrow)

The data:
Not that different: Facebook posts

Not that different: Facebook posts
Have a look at netvizz
• Gephi files for network analysis
• . . . and a tab-separated file (essentially the same as CSV) with the content

Not that different: Facebook posts
Have a look at netvizz
• Gephi files for network analysis
• . . . and a tab-separated file (essentially the same as CSV) with the content

An alternative: Facepager
• Tool to query different APIs (among others, Twitter and Facebook) and to store the result in a CSV table
• http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html

The script:
Pseudo-code

Our task: Identify all tweets that include a reference to Poland
Let’s start with some pseudo-code!
open csv-table
for each line:
    append column 1 to a list of tweets
    append column 3 to a list of corresponding users
    look for searchstring in column 1
    append search result to a list of results
save lists to a new csv-file

The script:
Python code

#!/usr/bin/python
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"
user_list=[]
tweet_list=[]
search_list=[]
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
print "Opening "+inputfilename
reader=CsvUnicodeReader(open(inputfilename,"r"))
for row in reader:
    tweet_list.append(row[0])
    user_list.append(row[2])
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
#!/usr/bin/python
# We start with importing some modules:
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re

# Let us define two variables that contain
# the names of the files we want to use
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"

# We create some empty lists that we will use later on.
# A list can contain several variables
# and is denoted by square brackets.
user_list=[]
tweet_list=[]
search_list=[]

# What do we want to look for?
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

# Enough preparation, let the program begin!
# We tell the user what is going on...
print "Opening "+inputfilename

# ... and call the module that reads the input file.
reader=CsvUnicodeReader(open(inputfilename,"r"))
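To see what the compiled pattern actually matches, here is a small sketch outside the script; the sample tweet is invented:

```python
import re

# The same pattern as in the script: Dutch and Polish
# spellings of "Poland" and "Warsaw", capitalised or not
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

tweet = 'De klimaattop in Warschau gaat over Polen'
matches = searchstring1.findall(tweet)
print(matches)  # → ['Warschau', 'Polen']
```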

# Now we read the file line by line.
# The indented block is repeated for each row
# (thus, each tweet)
for row in reader:
    # append data from the current row to our lists.
    # Note that we start counting with 0.
    tweet_list.append(row[0])
    user_list.append(row[2])

    # Let us count how often our searchstring is used
    # in this tweet
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
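The explicit counting loop is written out for didactic reasons; len() would give the same number in one step. A sketch with an invented tweet:

```python
import re

searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
tweet = 'Polen, Polen, Warschau!'
matches1 = searchstring1.findall(tweet)

# The loop from the script ...
matchcount1 = 0
for word in matches1:
    matchcount1 = matchcount1 + 1

# ... yields the same count as len():
print(matchcount1, len(matches1))  # → 3 3
```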

# Time to put all the data in one container
# and save it:
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
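What zip() contributes here can be shown in miniature with invented values; each resulting tuple is one row of the output table:

```python
tweet_list = ['first tweet', 'second tweet']
user_list = ['henklbr', 'Europarl_NL']
search_list = [0, 1]

# zip() combines the lists element-wise into row tuples:
rows = list(zip(tweet_list, user_list, search_list))
print(rows)
# → [('first tweet', 'henklbr', 0), ('second tweet', 'Europarl_NL', 1)]
```

(In the Python 2 used on the slides, zip() returns such a list directly; in Python 3 it returns an iterator, hence the list() call here.)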

The script:
myoutput.csv

tweet,user,how often is Poland mentioned?
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
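Once myoutput.csv exists, a quick sanity check is easy, for instance summing the count column. A sketch with made-up rows kept in memory standing in for the file:

```python
import csv
import io

# Stand-in for (the first rows of) myoutput.csv:
output = ('tweet,user,how often is Poland mentioned?\n'
          'some tweet without Poland mentions,henklbr,0\n'
          'some tweet mentioning Warschau,Europarl_NL,1\n')

reader = csv.reader(io.StringIO(output))
next(reader)  # skip the header row
total = sum(int(row[2]) for row in reader)
print(total)  # → 1
```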

Try it yourself!
We’ll help you get started. Please go to
http://beehub.nl/bigdata-cw/workshop and download the
files. Save the Python files unicsv.py and
myfirstscript.py, as well as the dataset
mytweets.csv, in a new folder called workshop on your
H-drive.
When you are done, start Python (GUI) from the
Windows Start Menu.

Recap
1 The data

Recording tweets with yourTwapperkeeper
CSV-files
Other ways to collect tweets
Not that different: Facebook posts
2 The script

Pseudo-code
Python code
The output
3 Your turn
4 Questions?

This afternoon

Your own script

Questions or comments?

Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net