Stats on social networks / Twitter
Survey sampling
Extending the sampling design
Results and future work
Using sampling methods to estimate rare stats on
Twitter’s graph
Antoine Rebecq
INSEE - Université Paris X
12/14/15
Antoine Rebecq Sampling the Twitter graph
Outline
1 Stats on social networks / Twitter
Motivation
Towards design-based estimation
2 Survey sampling
Estimates
Sampling design
3 Extending the sampling design
Snowball sampling
Adaptive sampling
4 Results and future work
Results
Sample size
Future work
Section 1
Stats on social networks / Twitter
Subsection 1
Motivation
Big data begets big graph
Twitter in 2013
Image from [2]
Studies - Twitter
A wide range of studies have used Twitter data (computer science, sociology, psychology, etc.)
Data on Twitter can be collected via:
The REST API (limited number of queries, but queries can be on anything)
The Streaming API (only 1% of tweets matching some criteria)
The Firehose (unlimited access, but expensive)
The Twitter graph
The Twitter graph ([7]):
Is directed
Has a heavy-tailed degree distribution
Has small path lengths
Subsection 2
Towards design-based estimation
Towards design-based estimation
Model-based estimation :
Scale-free networks, Barabási-Albert ([1])
Small-world networks, Watts-Strogatz ([13])
Little has been written on design-based statistical inference for networks (Kolaczyk 2009, [6])
We apply survey sampling methods used by official statistics institutes to make design-based inference about “big graphs”
Example: “Star Wars, The Force Awakens”
Let's write:
$y_k$ = number of tweets @starwars by user $k$ on 10/29/15 between 7:48 PM and 10:48 PM EST
$z_k = \mathbb{1}\{y_k \geq 1\}$
Goal: estimate $N_C = T(Z)$, the number of users who tweeted at least once
Additionally, we write: $n_C = \sum_{k \in s} z_k$
Section 2
Survey sampling
Subsection 1
Estimates
Horvitz-Thompson estimator
Population $U$: vertices of the Twitter graph.
Assign each $k \in U$ an inclusion probability $P(k \in s) = \pi_k$
Classic unbiased estimator for totals and means: Horvitz-Thompson
$$\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$$
$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
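As an illustration of the Horvitz-Thompson formulas, here is a minimal Python sketch. The function names `ht_total` and `ht_mean`, the sample values and the population size are hypothetical and only serve to show the weighting by the inverse inclusion probabilities.

```python
from typing import Sequence


def ht_total(y: Sequence[float], pi: Sequence[float]) -> float:
    """Horvitz-Thompson estimate of a total: sum of y_k / pi_k over the sample."""
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    if any(not 0.0 < p <= 1.0 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(yk / pk for yk, pk in zip(y, pi))


def ht_mean(y: Sequence[float], pi: Sequence[float], N: int) -> float:
    """Horvitz-Thompson estimate of a mean for a population of known size N."""
    return ht_total(y, pi) / N


# Hypothetical sample: three users with their tweet counts and inclusion probabilities.
y_sample = [2, 0, 1]
pi_sample = [0.5, 0.25, 0.1]
total_hat = ht_total(y_sample, pi_sample)       # 2/0.5 + 0/0.25 + 1/0.1
mean_hat = ht_mean(y_sample, pi_sample, N=100)  # same total divided by N
```

A user sampled with a small probability stands for many unsampled users, which is why its value receives a large weight.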
The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities:
$\pi_k = P(k \in s)$
$\pi_{kl} = P(k, l \in s)$
$$V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$$
Calibrated estimator
Deville and Särndal, 1992 ([3]): a modification of the Horvitz-Thompson estimator that takes auxiliary information into account. For example:
T(Y ) = Number of tweets @StarWars
N = Number of users in scope
Structure of number of followers
Number of verified users
. . .
Very similar to empirical likelihood methods ([9]).
Subsection 2
Sampling design
Sampling frame
Each Twitter user is assigned a unique id. When a new user is created, it is assigned an id greater than the most recently assigned one.
However, not all ids match an existing user (≈ $3.1 \cdot 10^9$ ids as of October 2015), which means our frame over-covers the population. Over-coverage can be corrected by using either a Horvitz-Thompson or a Hájek estimator (see [10]).
Sampling design : Bernoulli
Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample.
Bernoulli sampling: $\forall k, \pi_k = p$
The sample size is not fixed by this design. We set the expected sample size to 20000.
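The Bernoulli design can be sketched in a few lines of Python. This is only an illustration: the frame size `N` below is a hypothetical, much smaller stand-in for the real id frame, and `bernoulli_sample` is an illustrative name.

```python
import random


def bernoulli_sample(frame_ids, p, rng):
    """Keep each id independently with probability p (Bernoulli design)."""
    return [k for k in frame_ids if rng.random() < p]


rng = random.Random(2015)   # fixed seed so the sketch is reproducible
N = 1_000_000               # hypothetical frame size (the real frame is far larger)
expected_n = 20_000         # expected sample size used in the slides
p = expected_n / N          # common inclusion probability

sample = bernoulli_sample(range(N), p, rng)
# len(sample) is random: it follows a Binomial(N, p) distribution centred on expected_n.
```

Because every id is drawn independently, the realized sample size fluctuates around its expectation, which is why only the expected size can be set in advance.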
Sampling design : Stratified Bernoulli
We write $U = U_1 \cup U_2$ (the subsets indexed by $h = 1, 2$ are called “strata”) and draw two independent Bernoulli samples, one in $U_1$ and one in $U_2$.
Here:
$U_1$ = followers of the official @starwars account
$U_2$ = rest of Twitter users
Sampling design : Neyman allocation
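The variance of the Horvitz-Thompson estimator is minimized for the following allocation (Neyman, [8]):
$$n_h = n \, \frac{N_h \sqrt{S_h^2}}{\sum_{h'} N_{h'} \sqrt{S_{h'}^2}}$$
where $n$ is the total sample size, $N_h$ the size of stratum $h$ and $S_h^2$ the variance of the variable of interest within stratum $h$.
Given the expected values, we set:
$n_1 = 9700$
$n_2 = 10300$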
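A minimal sketch of the Neyman allocation, in its usual form where each stratum receives a share of the sample proportional to its size times its standard deviation. The stratum sizes and standard deviations below are hypothetical; they are not the values behind the 9700 / 10300 split.

```python
def neyman_allocation(n, N_h, S_h):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N * S for N, S in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical inputs: a small stratum with high dispersion, a large one with low dispersion.
N_h = [100_000, 900_000]   # stratum sizes
S_h = [3.0, 1.0]           # within-stratum standard deviations
n_h = neyman_allocation(20_000, N_h, S_h)
```

In this example the small stratum holds 10% of the population but gets a quarter of the sample, because its dispersion is three times larger.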
Sampling design : Stratified Bernoulli
Estimators for the two “simple” designs:
$$\hat{N}_{C1} = \frac{n_C}{p}$$
$$\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$$
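The two estimators translate directly into code. The sketch below uses hypothetical counts, and it assumes that the stratum counts of sampled tweeters are available separately for each stratum.

```python
def estimate_bernoulli(n_C, p):
    """First estimator: number of sampled tweeters divided by the common rate p."""
    return n_C / p


def estimate_stratified(n_C1, n_C2, n1, n2, N1, N):
    """Second estimator: per-stratum counts of tweeters, each weighted by N_h / n_h."""
    return N1 / n1 * n_C1 + (N - N1) / n2 * n_C2


# Hypothetical values, chosen only to make the arithmetic easy to follow.
nc1_hat = estimate_bernoulli(n_C=10, p=0.001)
nc2_hat = estimate_stratified(n_C1=30, n_C2=5, n1=100, n2=1000, N1=1000, N=101000)
# nc2_hat = (1000 / 100) * 30 + (100000 / 1000) * 5
```

Each sampled tweeter counts for as many users as the sampling weight of its stratum, so a tweeter found in the lightly sampled stratum weighs more.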
Variance estimators
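$$\hat{V}(\hat{T}(Y))_1 = \sum_{k \in s} \frac{(1 - p) \, y_k^2}{p^2}$$
For a 0/1 variable such as $z_k$, $z_k^2 = z_k$, so the summand reduces to $(1 - p) z_k / p^2$.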
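A small sketch of the variance estimator under Bernoulli sampling, together with the estimated coefficient of variation. It uses the squared value of each observation, which is what the Horvitz-Thompson variance formula gives when units are drawn independently (for a 0/1 variable the squared and unsquared forms coincide). The sample values are hypothetical.

```python
import math


def bernoulli_variance_estimate(y_sample, p):
    """Estimated variance of the HT total under Bernoulli sampling with rate p."""
    return sum((1 - p) * yk ** 2 / p ** 2 for yk in y_sample)


def coefficient_of_variation(estimate, variance):
    """Estimated CV: standard error divided by the point estimate."""
    return math.sqrt(variance) / estimate


# Hypothetical sample of tweet counts drawn with p = 0.5.
y_sample = [1, 0, 2, 1]
p = 0.5
total_hat = sum(yk / p for yk in y_sample)
var_hat = bernoulli_variance_estimate(y_sample, p)
cv_hat = coefficient_of_variation(total_hat, var_hat)
```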
Section 3
Extending the sampling design
Snowball sampling
From now on, our sampling designs include extensions:
$s = s_0 \cup s_{ext}$
$s_0$ is still selected using stratified Bernoulli sampling, but with an expected sample size of 1000, so that the expected size of $s$ is roughly 20000.
Subsection 1
Snowball sampling
Snowball sampling
[Figures: the population $U$; the initial sample $s_0$; the one-stage snowball extension $s = A(s_0)$]
Formally, we write:
$B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$
$A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$
$s = A(s_0) = \bigcup_{i \in s_0} A_i$
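$$\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$$
where:
$$\bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$$
with $q_{S_h}$ the probability that a unit of stratum $U_h$ is not selected in $s_0$.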
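The snowball estimator can be checked on a toy example. The sketch below assumes that an edge from i to j means that j is added when i is in the initial sample, so that a vertex ends up in the final sample as soon as one member of its set B belongs to the initial sample. The graph, the strata, the rates and the indicator values are all hypothetical. The function `exact_expectation` enumerates every possible initial sample to show that the estimator is unbiased on this example.

```python
from itertools import product

# Hypothetical toy graph: successors[i] lists the vertices reached by an edge from i.
successors = {0: [1, 2], 1: [2], 2: [], 3: [2]}
stratum = {0: 1, 1: 2, 2: 2, 3: 1}   # stratum label of each vertex
p = {1: 0.5, 2: 0.2}                  # Bernoulli rate used in each stratum for s0
z = {0: 0, 1: 1, 2: 1, 3: 0}          # 1 if the user tweeted at least once


def A(i):
    """Vertex i together with the vertices reached from i."""
    return {i} | set(successors[i])


def B(i):
    """Vertex i together with the vertices from which i is reached."""
    return {i} | {j for j in successors if i in successors[j]}


def prob_none_selected(vertices):
    """Probability that no vertex of the set belongs to the initial sample s0."""
    prob = 1.0
    for k in vertices:
        prob *= 1.0 - p[stratum[k]]
    return prob


def snowball_estimate(s0):
    """Estimate N_C from the one-stage snowball sample built on s0."""
    s = set().union(*(A(i) for i in s0))
    return sum(z[i] / (1.0 - prob_none_selected(B(i))) for i in s)


def exact_expectation():
    """Average the estimator over every possible s0, weighted by its probability."""
    vertices = sorted(successors)
    expectation = 0.0
    for flags in product([0, 1], repeat=len(vertices)):
        s0 = [v for v, f in zip(vertices, flags) if f]
        prob = 1.0
        for v, f in zip(vertices, flags):
            prob *= p[stratum[v]] if f else 1.0 - p[stratum[v]]
        expectation += prob * snowball_estimate(s0)
    return expectation


true_N_C = sum(z.values())
expected_value = exact_expectation()
```

On this toy graph the expectation of the estimator equals the true count, up to floating-point error.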
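$$\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\pi_{ij}} \gamma_{ij}$$
where $\pi_{ij} = 1 - \bar{\pi}(B_i) - \bar{\pi}(B_j) + \bar{\pi}(B_i \cup B_j)$ is the probability that both $i$ and $j$ belong to $s$, and:
$$\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i) \bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$$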
Subsection 2
Adaptive sampling
Adaptive sampling
Adaptive sampling (Thompson, [11]):
Used in official statistics to estimate the number of drug users or HIV-positive people
Sampling design often compared to the video game “Minesweeper”
Image from [12]
Once a unit bearing the characteristic of interest (i.e. a user who
tweeted about the Star Wars trailer) is found, all its network (i.e.
its friends and friends of friends, etc. who have tweeted about Star
Wars) is included in the sample.
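The expansion step described above can be sketched as a breadth-first search that only follows users bearing the characteristic of interest. This is a simplified illustration: the friendship lists and indicators are hypothetical, and the sketch leaves out the neighbouring users without the characteristic, which an actual adaptive design would also observe.

```python
from collections import deque


def expand_network(seed, neighbours, tweeted):
    """Collect every user linked to `seed` through a chain of users who all tweeted."""
    if not tweeted[seed]:
        return {seed}   # a seed without the characteristic is kept alone
    network = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for friend in neighbours[user]:
            if tweeted[friend] and friend not in network:
                network.add(friend)
                queue.append(friend)
    return network


# Hypothetical friendship lists and tweet indicators.
neighbours = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"], "e": []}
tweeted = {"a": True, "b": True, "c": False, "d": True, "e": True}
network_of_a = expand_network("a", neighbours, tweeted)
```

Starting from "a", the search reaches "b" and then "d", but stops at "c", who did not tweet.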
Estimator:
$$\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$$
where:
$K$ = number of networks
$y^*_k$ = total of $Y$ in network $k$
$n^*_{Ck}$ = number of people with $y_i \geq 1$ in network $k$
$J_k = \mathbb{1}\{k \in C\}$
$\pi_{gk}$ = probability that the initial sample intersects network $k$
When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the stratified adaptive case:
$$\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$$
where $n_0 = \#s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the edges of $C$ (the sampled units at distance 1 from $C$).
Section 4
Results and future work
Subsection 1
Results
Results
Design       n        n_scope   n_0    est. N_C   est. CV   est. Deff
Bernoulli    20013    3946      -      354121     0.231     1.04
Stratified   20094    9832      -      316889     0.097     0.68
1-snowball   159957   73570     1000   331097     0.031     0.60
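To make the estimated CVs easier to read, they can be turned into approximate intervals. The sketch below assumes a normal approximation with the usual 1.96 multiplier, which is an assumption added here and not something stated in the slides. It only reuses the point estimates and CVs of the table.

```python
# Point estimates and estimated CVs reported in the results table.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}


def normal_interval(estimate, cv, z=1.96):
    """Approximate 95% interval: estimate +/- z * (cv * estimate)."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width


intervals = {design: normal_interval(est, cv) for design, (est, cv) in results.items()}
```

Under this approximation the interval narrows sharply from the Bernoulli design to the stratified one, and again to the 1-snowball design.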
Mean number of tweets @StarWars per user: 1.18 ± 0.07
This suggests that bots are not responsible for this very large number of tweets (see [5], [4])!
Subsection 2
Sample size
Snowball sampling - sample size
Expected sample size ≈ 20000.
Actual sample size : > 150000 !
Adaptive sampling
With our test subject (tweets @AmericanIdol), the average network size was no greater than a few units (≈ 10000 tweets in the scope).
With Star Wars (≈ 300000 tweets in the scope, with far fewer tweets per person), we couldn't get to the end of every network!
Subsection 3
Future work
Future work
Control sample size
Estimates and calibration on graph totals (centrality,
clustering coefficients, path length, etc.)
Conclusion
Thank you !
http://nc233.com/cmstatistics2015
@nc233
References
[1] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[2] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
[3] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[4] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsletter, (Spring):4:1–4:9, April 2015.
[5] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
[6] Eric D. Kolaczyk. Statistical Analysis of Network Data. Springer, 2009.
[7] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network? The structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[8] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.
[9] Art B. Owen. Empirical Likelihood. CRC Press, 2010.
[10] Olivier Sautory. Les enjeux méthodologiques liés à l'usage de bases de sondage imparfaites.
[11] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
[12] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.
[13] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.

Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 

Sampling the Twitter graph

  • 1. Using sampling methods to estimate rare stats on Twitter’s graph. Antoine Rebecq, INSEE - Université Paris X, 12/14/15.
  • 2. Outline. 1: Stats on social networks / Twitter (Motivation; Towards design-based estimation). 2: Survey sampling (Estimates; Sampling design). 3: Extending the sampling design (Snowball sampling; Adaptive sampling). 4: Results and future work (Results; Sample size; Future work).
  • 3. Section 1: Stats on social networks / Twitter
  • 4. Subsection 1: Motivation
  • 5. Big data begets big graph. Twitter in 2013. Image from [2].
  • 6. Studies - Twitter. A large range of studies have used Twitter data (Computer Science, Sociology, Psychology, etc.). Data on Twitter can be collected via: the REST API (limited number of queries; queries can be on anything); the Streaming API (only 1% of tweets matching some criteria); the Firehose (unlimited access, expensive).
  • 7. The Twitter graph ([7]): it is directed (follow relationships are not necessarily reciprocal), and its degree distribution is heavy-tailed.
  • 8. The Twitter graph: it has small path lengths.
  • 9. Subsection 2: Towards design-based estimation
  • 10. Towards design-based estimation. Model-based estimation: scale-free networks, Barabási-Albert ([1]); small-world networks, Watts-Strogatz ([13]).
  • 11. Towards design-based estimation. Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [6]). We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”.
  • 12. Example: Star Wars: The Force Awakens.
  • 13. Example: “Star Wars, The Force Awakens”. Let’s write: $y_k$ = number of tweets @starwars by user $k$ on 10/29/15 between 7:48 and 10:48 PM EST; $z_k = \mathbb{1}\{y_k \geq 1\}$. Goal: estimate $N_C = T(Z)$. Additionally, we write $n_C = \sum_{k \in s} z_k$.
  • 14. Section 2: Survey sampling
  • 15. Subsection 1: Estimates
  • 16. Horvitz-Thompson estimator. Population $U$: vertices of the Twitter graph. Assign to every $k \in U$ an inclusion probability $\pi_k = P(k \in s)$.
  • 17. Horvitz-Thompson estimator. Classic unbiased estimator for totals and means (Horvitz-Thompson): $\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$ and $\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$.
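The estimator on this slide can be sketched in a few lines of Python. This is only an illustration on a made-up toy population, not the author's code: the function names, the population size and the 5% prevalence are all assumptions introduced here.

```python
import random


def poisson_sample(pi, rng):
    """Keep unit k with probability pi[k], independently for each unit."""
    return [k for k, p in enumerate(pi) if rng.random() < p]


def horvitz_thompson_total(y, pi, sample):
    """Horvitz-Thompson estimate of the population total of y."""
    return sum(y[k] / pi[k] for k in sample)


# Toy population (hypothetical numbers, not Twitter data).
rng = random.Random(0)
N = 10_000
y = [1 if rng.random() < 0.05 else 0 for _ in range(N)]
pi = [0.2] * N  # equal inclusion probabilities

sample = poisson_sample(pi, rng)
total_hat = horvitz_thompson_total(y, pi, sample)  # estimates sum(y)
mean_hat = total_hat / N                           # estimates sum(y) / N
```

Each sampled unit is weighted by the inverse of its inclusion probability, which is what makes the estimator unbiased whatever the values of $\pi_k > 0$.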
  • 18. Horvitz-Thompson estimator. The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities $\pi_k = P(k \in s)$ and $\pi_{kl} = P(k, l \in s)$: $V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$.
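As a worked special case (a standard result, stated here as a sketch rather than taken from the slides): under Poisson sampling the units are drawn independently, so $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$. Only the diagonal terms of the double sum survive:

```latex
\begin{align*}
V\bigl(\hat{T}(Y)_{HT}\bigr)
  &= \sum_{k \in U} (\pi_k - \pi_k^2)\, \frac{y_k^2}{\pi_k^2}
   = \sum_{k \in U} \frac{1 - \pi_k}{\pi_k}\, y_k^2 , \\
\hat{V}\bigl(\hat{T}(Y)_{HT}\bigr)
  &= \sum_{k \in s} \frac{1 - \pi_k}{\pi_k^2}\, y_k^2
  \qquad \text{(unbiased, since each } k \in U \text{ is in } s \text{ with probability } \pi_k\text{)} .
\end{align*}
```

With equal probabilities $\pi_k = p$ (Bernoulli sampling), the estimator becomes $\sum_{k \in s} (1 - p)\, y_k^2 / p^2$.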
  • 19. Calibrated estimator (Deville-Särndal, 1992, [3]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. For example: $T(Y)$ = number of tweets @StarWars; $N$ = number of users in scope; structure of the number of followers; number of verified users; ... Very similar to empirical likelihood methods ([9]).
  • 20. Subsection 2: Sampling design
  • 21. Sampling frame. Each Twitter user is assigned a unique id. When a new user is created, it is assigned an id greater than all previous ids. However, not all ids match an existing user ($\approx 3.1 \cdot 10^9$ ids as of October 2015), which means our frame over-covers the population. Over-coverage can be handled by using either a Horvitz-Thompson or a Hájek estimator (see [10]).
  • 22. Sampling design: Bernoulli. Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample. Bernoulli sampling: $\forall k,\ \pi_k = p$. The sample size of this design is not fixed; we set the expected sample size to 20000.
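Running one Bernoulli experiment per id is impractical on a frame of about $3.1 \cdot 10^9$ ids. One common shortcut, shown below as a sketch, is to draw the gaps between consecutive selected ids from a geometric distribution, which produces the same design. The slides do not say how the sample was actually drawn, so this technique, the function name and the constants `MAX_ID` and `P` are assumptions made for illustration.

```python
import math
import random


def bernoulli_sample_ids(max_id, p, rng):
    """Bernoulli sample of the ids 1..max_id with rate p (0 < p < 1).

    Instead of one draw per id, jump from one selected id to the next:
    the gap between consecutive selected ids is geometric with parameter p.
    """
    ids = []
    k = 0
    log_q = math.log1p(-p)           # log(1 - p), accurate for tiny p
    while True:
        u = 1.0 - rng.random()       # uniform on (0, 1]
        k += int(math.log(u) / log_q) + 1
        if k > max_id:
            return ids
        ids.append(k)


MAX_ID = 3_100_000_000               # approximate frame size quoted in the talk
P = 20_000 / MAX_ID                  # rate giving an expected sample size of 20000
sample_ids = bernoulli_sample_ids(MAX_ID, P, random.Random(2015))
```

The realized sample size `len(sample_ids)` is random, with expectation 20000. Ids that do not match an existing account would then be discarded, which is where the frame's over-coverage comes in.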
  • 23. Sampling design: Stratified Bernoulli. We write $U = U_1 \cup U_2$ with $U_1 \cap U_2 = \emptyset$ (the parts $h = 1, 2$ being called “strata”) and draw two independent Bernoulli samples in $U_1$ and $U_2$. Here: $U_1$ = followers of the official @starwars account; $U_2$ = rest of Twitter users.
  • 24. Sampling design: Neyman allocation. The variance of the Horvitz-Thompson estimator is minimized for (Neyman, [8]): $n_h = n \, \frac{N_h S_h}{\sum_{h'} N_{h'} S_{h'}}$, where $n$ is the total sample size, $N_h$ the size of stratum $h$ and $S_h^2$ the variance of the variable of interest in stratum $h$. Given the expected values, we set $n_1 = 9700$ and $n_2 = 10300$.
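A minimal sketch of the allocation rule, using the textbook form of Neyman allocation ($n_h$ proportional to $N_h S_h$). The function name and the numbers in the usage line are hypothetical, and the stratum sizes and standard deviations used in the talk are not given in the slides.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical example: two equally sized strata, the second three times more variable.
allocation = neyman_allocation(100, [50, 50], [1.0, 3.0])  # [25.0, 75.0]
```

A small but highly variable stratum can therefore receive a large share of the sample, which is consistent with the followers of @starwars getting almost half of the 20000 units.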
  • 25. Sampling design: Stratified Bernoulli. Estimators for the two “simple” designs: $\hat{N}_{C1} = \frac{n_C}{p}$ and $\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$.
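The two formulas translate directly into code. The sketch below uses hypothetical function names and made-up inputs; `n_c1` and `n_c2` stand for the number of sampled users with $z_k = 1$ in each stratum.

```python
def n_c_bernoulli(n_c, p):
    """Estimate of N_C under Bernoulli sampling with rate p."""
    return n_c / p


def n_c_stratified(N, N1, n1, n2, n_c1, n_c2):
    """Estimate of N_C under stratified sampling with two strata.

    Each stratum count is scaled by (stratum size) / (stratum sample size).
    """
    return N1 / n1 * n_c1 + (N - N1) / n2 * n_c2


# Made-up numbers, for illustration only.
estimate_1 = n_c_bernoulli(5, 0.5)                      # 10.0
estimate_2 = n_c_stratified(1000, 200, 20, 80, 10, 4)   # 100.0 + 40.0 = 140.0
```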
  • 26. Variance estimators. For the Bernoulli design: $\hat{V}(\hat{T}(Y))_1 = \sum_{k \in s} \frac{(1 - p)\, y_k^2}{p^2}$ (for the indicator $z_k$, $z_k^2 = z_k$, so the square can be dropped).
  • 27. Section 3: Extending the sampling design
  • 28. Extensions. From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$. $s_0$ is still selected using stratified Bernoulli sampling, but with an expected sample size of 1000, so that the expected sample size of $s$ is roughly 20000.
  • 29. Subsection 1: Snowball sampling
  • 30. Snowball sampling. Population $U$ (figure).
  • 31. Snowball sampling. Initial sample $s_0$ (figure).
  • 32. Snowball sampling. One-stage snowball extension $s = A(s_0)$ (figure).
  • 33. Snowball sampling. Formally, we write: $B_i = \{i\} \cup \{j \in V,\ E_{ji} \neq \emptyset\}$, $A_i = \{i\} \cup \{j \in V,\ E_{ij} \neq \emptyset\}$, and $s = A(s_0)$.
  • 34. Snowball sampling. $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$, where $\bar{\pi}(B_i)$ is the probability that no unit of $B_i$ is drawn in the initial sample: $\bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$.
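The sketch below shows how this estimator could be computed once, for each unit of the extended sample, the number of members of $B_i$ in each stratum is known. It assumes that $q_{S_1}$ and $q_{S_2}$ are the per-unit probabilities of not being drawn in the initial sample in strata $U_1$ and $U_2$ (the slides do not define them explicitly). Function names and data are hypothetical.

```python
def exclusion_prob(b1, b2, q1, q2):
    """Probability that none of the b1 + b2 members of B_i is in the initial sample.

    b1, b2: number of members of B_i in strata U1 and U2.
    q1, q2: per-unit non-selection probabilities in U1 and U2 (assumed meaning).
    """
    return q1 ** b1 * q2 ** b2


def snowball_estimate(units, q1, q2):
    """Sum of z_i / (1 - exclusion probability of B_i) over the extended sample.

    units: iterable of (z_i, b1, b2) tuples, one per sampled unit.
    """
    return sum(z / (1.0 - exclusion_prob(b1, b2, q1, q2)) for z, b1, b2 in units)


# Made-up extended sample: (z_i, #(B_i in U1), #(B_i in U2)).
toy_units = [(1, 1, 1), (0, 3, 0), (1, 2, 0)]
toy_estimate = snowball_estimate(toy_units, q1=0.5, q2=0.5)  # 1/0.75 + 0 + 1/0.75
```

A unit with a large $B_i$ is almost certain to be reached, so its weight $1 / (1 - \bar{\pi}(B_i))$ is close to 1, while a unit with a small $B_i$ carries a larger weight.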
  • 35. Snowball sampling. $\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\bar{\pi}(B_i \cup B_j)} \gamma_{ij}$, where $\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i)\, \bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$.
  • 36. Subsection 2: Adaptive sampling
  • 37. Adaptive sampling (Thompson, [11]). Used in official statistics to estimate the number of drug users or HIV-positive people. The sampling design is often compared to the video game “Minesweeper”.
  • 38. Adaptive sampling. Image from [12].
  • 39. Adaptive sampling. Once a unit bearing the characteristic of interest (i.e. a user who tweeted about the Star Wars trailer) is found, its whole network (i.e. its friends, friends of friends, etc. who have tweeted about Star Wars) is included in the sample.
  • 40. Adaptive sampling. Estimator: $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$, where: $K$ = number of networks; $y^*_k$ = total of $Y$ in network $k$; $n^*_{Ck}$ = number of people with at least one tweet ($y \geq 1$) in network $k$; $J_k = \mathbb{1}\{k \in C\}$; $\pi_{gk}$ = probability that the initial sample intersects network $k$.
  • 41. Adaptive sampling. When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design: $\hat{N}_{C5} = n_0 + \sum_{k=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$, where $n_0 = \# s_0$ and $s_0 = \cup_r \{k \in s,\ \delta(k, C) = 1\}$ is the union of the sides of $C$.
  • 42. Section 4: Results and future work
  • 43. Subsection 1: Results
  • 44. Results ("-" marks a cell left empty on the slide):
        Design       n        n_scope   n_0    N̂_C      ĈV      D̂eff
        Bernoulli    20013    3946      -      354121   0.231   1.04
        Stratified   20094    9832      -      316889   0.097   0.68
        1-snowball   159957   73570     1000   331097   0.031   0.60
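To make the estimated coefficients of variation easier to read, one can turn each row into an approximate interval. The sketch below assumes a normal approximation and a 95% level, neither of which is stated on the slide. Only the estimates and CVs are taken from the table; the function name is hypothetical.

```python
def approx_interval(estimate, cv, z=1.96):
    """Approximate interval estimate * (1 +/- z * cv), assuming normality."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width


# (estimate of N_C, estimated CV) for each design, copied from the results table.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}

intervals = {design: approx_interval(n_c, cv) for design, (n_c, cv) in results.items()}
for design, (low, high) in intervals.items():
    print(f"{design}: {low:.0f} to {high:.0f}")
```

Under these assumptions the three intervals overlap, and the snowball design gives by far the narrowest one, at the cost of a much larger realized sample.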
  • 45. Results. Mean number of tweets @StarWars per user: 1.18 ± 0.07. This suggests that bots are not responsible for this very large number of tweets (see [5], [4])!
  • 46. Subsection 2: Sample size
  • 47. Snowball sampling - sample size. Expected sample size ≈ 20000. Actual sample size: > 150000!
  • 48. Adaptive sampling. With our test subject (tweets @AmericanIdol), the average network size was no greater than a few units (≈ 10000 tweets in scope). With Star Wars (≈ 300000 tweets in scope, with far fewer tweets per person), we couldn’t get to the end of every network!
  • 49. Subsection 3: Future work
  • 50. Future work. Control sample size. Estimates and calibration on graph totals (centrality, clustering coefficients, path length, etc.).
  • 51. Conclusion. Thank you! http://nc233.com/cmstatistics2015 @nc233
  • 52-55. References
    [1] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
    [2] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
    [3] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
    [4] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsletter, (Spring):4:1–4:9, April 2015.
    [5] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
    [6] Eric D. Kolaczyk. Statistical Analysis of Network Data. Springer, 2009.
    [7] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network? The structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
    [8] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.
    [9] Art B. Owen. Empirical Likelihood. CRC Press, 2010.
    [10] Olivier Sautory. Les enjeux méthodologiques liés à l’usage de bases de sondage imparfaites.
    [11] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
    [12] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.
    [13] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.