Statistics and networks
Survey sampling
Extending the sampling design
Application to Twitter data
Sampling graphs efficiently: model assisted designs
and application to Twitter data
Antoine Rebecq
Université Paris X - INSEE
3/23/17
Antoine Rebecq Sampling designs for graphs
1 Statistics and networks
Graphs and stats
Methods - algorithms - models
2 Survey sampling
Estimates
Use of auxiliary information
3 Extending the sampling design
Snowball sampling
Adaptive sampling
4 Application to Twitter data
The problem
Results
Model-assisted sampling
Section 1
Statistics and networks
Subsection 1
Graphs and stats
Graphs
A graph G is a set of vertices V together with a set of edges E: G = (V, E)
Directed graphs
Statistics of interest - graphs
Size
Degree
Centrality
Clustering
Communities
. . .
Degree
$d_v$ = number of edges incident to vertex v
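To make the definition concrete, here is a minimal Python sketch that computes degrees from an edge list. The toy edge list and the helper name `degrees` are illustrative assumptions, not part of the talk; for a directed graph such as Twitter one would count in-degree and out-degree separately.

```python
from collections import Counter

def degrees(edges):
    """Return {vertex: degree} for an undirected graph given as (u, v) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

# Toy graph (assumed): a triangle a-b-c plus a pendant vertex d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_degrees = degrees(toy_edges)
```

The degrees always sum to twice the number of edges, which is a quick sanity check on any edge list.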
Degree / scale-free property
Path lengths
Centrality
Measure of “importance” of a node.
Examples: Google PageRank, betweenness centrality (number of times a node acts as a bridge along the shortest path between two other nodes)
Betweenness centrality
Clustering
Global clustering coefficient:
$$C = \frac{3 \cdot \text{number of triangles}}{\text{number of connected triplets}}$$
Local clustering coefficient of a vertex = how close its neighbours
are to being a clique (complete graph).
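Both coefficients can be computed directly from adjacency sets. The sketch below is a plain-Python illustration for an undirected graph; the helper names and the toy graph are assumptions made for the example. It uses the fact that each triangle closes three connected triplets, one centred on each of its vertices.

```python
from itertools import combinations

def adjacency(edges):
    """Build {vertex: set of neighbours} for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def closed_pairs(adj, v):
    """Number of pairs of neighbours of v that are themselves connected."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

def local_clustering(adj, v):
    """Fraction of neighbour pairs of v that are connected (0 if degree < 2)."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    return closed_pairs(adj, v) / (k * (k - 1) / 2)

def global_clustering(adj):
    """3 * triangles / connected triplets, counted vertex by vertex."""
    closed = sum(closed_pairs(adj, v) for v in adj)            # = 3 * triangles
    triplets = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triplets if triplets else 0.0

# Toy graph (assumed): triangle a-b-c plus a pendant vertex d attached to c.
adj = adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

On this toy graph there is one triangle and five connected triplets, so the global coefficient is 3/5, while vertex c has local coefficient 1/3.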
Local clustering coefficient
The rise of “big graphs”
Example: the Graph500 benchmark (http://www.graph500.org). Data sets go up to a 1.1 PB adjacency list (the size of the human connectome).
Subsection 2
Methods - algorithms - models
Methods for graph statistics
Algorithms (computer science, “big data”)
Model-based estimation
Sampling (“Design-based estimation”)
Computer science methods
Efficient algorithms (speed / memory).
Sometimes require sampling.
Model-based estimation
Famous graph models:
Erdős-Rényi
Price / Barabási-Albert (heavy-tailed degree distribution)
Watts-Strogatz / “small-world” (short path lengths)
Stochastic block models (communities)
Images from [8]
Model-based estimation: Erdős-Rényi (“random graphs”)
Model-based estimation: Barabási-Albert (“preferential attachment”)
Model-based estimation : Watts-Strogatz (“small world”)
Model-based estimation : Stochastic Block Models
Sampling / Design-based estimation
Sampling: select a few vertices/edges and compute estimators from the sample data. Very little has been written about design-based statistical inference on networks (Kolaczyk, 2009, [5]).
We apply survey sampling methods used by official statistical institutes to make design-based inference about “big graphs”.
Section 2
Survey sampling
Subsection 1
Estimates
Horvitz-Thompson estimator
Population U (here vertices of the graph).
Assign all k ∈ U an inclusion probability P(k ∈ s) = πk
Classic unbiased estimator for totals and means: Horvitz-Thompson
$$\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$$
$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
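As a small check of the unbiasedness claim, the sketch below computes the exact expectation of the Horvitz-Thompson total over every possible sample of a four-unit population. It assumes units are drawn independently with probability $\pi_k$ (one possible design among many); the values of `y` and `pi` are made up for the example.

```python
from itertools import product

def ht_total(y, pi, sample):
    """Horvitz-Thompson estimate of the total of y from the sampled indices."""
    return sum(y[k] / pi[k] for k in sample)

def expected_ht_independent_draws(y, pi):
    """Exact expectation of the HT total over all 2^N samples,
    assuming each unit k is included independently with probability pi[k]."""
    expectation = 0.0
    for flags in product([0, 1], repeat=len(y)):
        prob = 1.0
        for k, included in enumerate(flags):
            prob *= pi[k] if included else 1.0 - pi[k]
        sample = [k for k, included in enumerate(flags) if included]
        expectation += prob * ht_total(y, pi, sample)
    return expectation

# Hypothetical population values and inclusion probabilities.
y = [3.0, 0.0, 5.0, 2.0]
pi = [0.5, 0.2, 0.9, 0.4]
expectation = expected_ht_independent_draws(y, pi)
```

The expectation matches the true total `sum(y)` up to floating-point error, whatever strictly positive inclusion probabilities are chosen.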
The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities:
$$\pi_k = P(k \in s), \qquad \pi_{kl} = P(k, l \in s)$$
$$V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$$
Bernoulli sampling
Poisson sampling: for each k ∈ U, run an independent Bernoulli trial with success probability $\pi_k$ to decide whether unit k is included in the sample.
Bernoulli sampling: the special case $\pi_k = p$ for all k.
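Under Poisson sampling the draws are independent, so $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$; the double sum in the Horvitz-Thompson variance then collapses to $\sum_k \frac{1 - \pi_k}{\pi_k} y_k^2$. The Python sketch below draws a Poisson sample and checks that collapse numerically; the function names and the toy values are assumptions for illustration.

```python
import random

def poisson_sample(pi, rng):
    """Include each unit k independently with probability pi[k]."""
    return [k for k, p in enumerate(pi) if rng.random() < p]

def ht_variance(y, pi, pi2):
    """General HT variance: sum_k sum_l (pi_kl - pi_k pi_l) y_k/pi_k * y_l/pi_l."""
    n = len(y)
    return sum(
        (pi2(k, l) - pi[k] * pi[l]) * (y[k] / pi[k]) * (y[l] / pi[l])
        for k in range(n)
        for l in range(n)
    )

def poisson_pi2(pi):
    """Second-order inclusion probabilities under Poisson sampling."""
    return lambda k, l: pi[k] if k == l else pi[k] * pi[l]

def poisson_variance_closed_form(y, pi):
    """Diagonal-only form: sum_k (1 - pi_k) / pi_k * y_k^2."""
    return sum((1.0 - p) / p * v * v for v, p in zip(y, pi))

# Hypothetical population values and inclusion probabilities.
y = [3.0, 0.0, 5.0, 2.0]
pi = [0.5, 0.2, 0.9, 0.4]
sample = poisson_sample(pi, random.Random(0))
general = ht_variance(y, pi, poisson_pi2(pi))
closed = poisson_variance_closed_form(y, pi)
```

Bernoulli sampling is obtained by passing a constant list such as `[p] * N` for `pi`.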
Subsection 2
Use of auxiliary information
Auxiliary information
If $\pi_k \propto y_k$ (with a fixed-size design), then $V(\hat{T}(Y)_{HT}) = 0$
In practice $y_k$ is unknown, so use an auxiliary variable X that is well correlated with Y.
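The zero-variance statement can be checked by hand for a fixed-size design: if $\pi_k = n y_k / T(Y)$, every sampled unit contributes $y_k / \pi_k = T(Y)/n$, so any sample of size n returns exactly $T(Y)$. The short Python sketch below verifies this on a made-up population; it only evaluates the estimator on every size-n subset and does not construct a design achieving these probabilities.

```python
from itertools import combinations

# Hypothetical population and fixed sample size.
y = [1.0, 2.0, 3.0, 4.0]
n = 2
total = sum(y)

# Inclusion probabilities proportional to y, scaled to sum to n.
pi = [n * v / total for v in y]

# Horvitz-Thompson estimate for every possible sample of size n.
estimates = [
    sum(y[k] / pi[k] for k in sample)
    for sample in combinations(range(len(y)), n)
]
```

With a random-size design such as Poisson sampling the estimate still varies through the sample size, which is why the fixed-size condition matters.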
Stratified sampling
We write: $U = U_1 \cup U_2 \cup \ldots \cup U_H$ (disjoint strata) and draw independent samples in each $U_h$.
Strata should be formed so that the within-stratum dispersion of $y_k$ is as low as possible.
Stratified sampling : Neyman allocation
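Given a set of strata and a total sample size n, the variance is minimized for:
$$n_h = n \, \frac{N_h S_h}{\sum_{h'=1}^{H} N_{h'} S_{h'}}$$
where $N_h$ is the size of stratum h and $S_h^2$ is the within-stratum variance of y.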
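A minimal Python sketch of the allocation rule, with $S_h$ taken as the within-stratum standard deviation of y. The function name and the example strata are assumptions; rounding to integers and capping $n_h$ at $N_h$ are left out.

```python
def neyman_allocation(n, sizes, std_devs):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical example: a small, very dispersed stratum and a large, homogeneous one.
allocation = neyman_allocation(100, sizes=[1000, 9000], std_devs=[3.0, 1.0])
```

Here the first stratum holds 10% of the population but receives 25% of the sample, because its dispersion is three times larger.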
Calibrated estimator
Deville and Särndal, 1992 ([2]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account.
Very similar to empirical likelihood methods ([7]).
Computing variances for calibrated estimators is easy.
Section 3
Extending the sampling design
Official statistics
Measuring “hidden populations”
Community structure
When trying to estimate the size of a community ($\hat{N}_C$), use the edges as auxiliary information.
Snowball sampling
From now on, our sampling designs will include extensions:
$$s = s_0 \cup s_{ext}$$
Subsection 1
Snowball sampling
Population U
Initial sample s0
One stage snowball extension s = A(s0)
Formally, we write:
$$B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$$
$$A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$$
$$s = A(s_0) = \bigcup_{i \in s_0} A_i$$
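The sets can be built in one pass over a directed edge list: $A_i$ collects i and the vertices it points to, $B_i$ collects i and the vertices pointing to it. The Python sketch below is an illustration on a made-up graph; the helper names are assumptions. It also checks the property the estimators rely on: vertex i ends up in $s = A(s_0)$ exactly when $B_i$ meets $s_0$.

```python
def successor_and_predecessor_sets(vertices, edges):
    """Return (A, B) with A[i] = {i} + successors of i, B[i] = {i} + predecessors of i."""
    A = {v: {v} for v in vertices}
    B = {v: {v} for v in vertices}
    for i, j in edges:          # directed edge i -> j
        A[i].add(j)
        B[j].add(i)
    return A, B

def snowball(s0, A):
    """One-stage snowball extension: union of A[i] over the initial sample."""
    s = set()
    for i in s0:
        s |= A[i]
    return s

# Toy directed graph (assumed).
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (4, 3), (5, 1)]
A, B = successor_and_predecessor_sets(vertices, edges)
s = snowball({1}, A)
```

Starting from the single vertex 1, the extension adds its two successors, while vertices 4 and 5 stay out because nothing in the initial sample points to them.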
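$$\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$$
where, for an initial sample $s_0$ drawn with independent inclusions:
$$\bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0))$$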
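The estimator weights each sampled community member by the inverse of its probability of being reached, which is one minus the probability that the initial sample misses all of $B_i$. The Python sketch below assumes a Bernoulli(p) initial sample, so that probability is $(1-p)^{\#B_i}$, and verifies unbiasedness by enumerating every initial sample of a four-vertex toy graph. Graph, community indicator and names are made up for the example.

```python
from itertools import product

def exclusion_prob(B_i, p):
    """P(no vertex of B_i is in the initial sample) under Bernoulli(p) sampling."""
    return (1.0 - p) ** len(B_i)

def snowball_estimate(s0, A, B, z, p):
    """Sum over the extended sample s = A(s0) of z_i / (1 - exclusion_prob(B_i))."""
    s = set().union(*(A[i] for i in s0))
    return sum(z[i] / (1.0 - exclusion_prob(B[i], p)) for i in s)

# Toy directed graph (assumed): edges 0->1, 0->2, 3->2.
A = {0: {0, 1, 2}, 1: {1}, 2: {2}, 3: {2, 3}}   # i plus its successors
B = {0: {0}, 1: {0, 1}, 2: {0, 2, 3}, 3: {3}}   # i plus its predecessors
z = {0: 1, 1: 0, 2: 1, 3: 1}                    # community indicator, N_C = 3
p = 0.3

# Exact expectation over all 2^4 possible initial samples.
expected = 0.0
for flags in product([0, 1], repeat=4):
    prob = 1.0
    for included in flags:
        prob *= p if included else 1.0 - p
    s0 = {k for k, included in enumerate(flags) if included}
    expected += prob * snowball_estimate(s0, A, B, z, p)
```

The expectation equals the true community size, here 3, up to floating-point error.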
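$$\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\bar{\pi}(B_i \cup B_j)} \gamma_{ij}$$
where:
$$\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i)\bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$$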
Subsection 2
Adaptive sampling
Adaptive sampling (Thompson, [9])
Used in official statistics to estimate the number of drug users or HIV-positive people
The sampling design is often compared to the video game “Minesweeper”
Image from [10]
Once a unit bearing the characteristic of interest is found, its entire network is included in the sample.
Estimator:
$$\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$$
where:
K = number of networks
$y^*_k$ = total of Y in network k
$n^*_{Ck}$ = number of people with $y_k \geq 1$ in network k
$J_k = 1\{k \in C\}$
$\pi_{gk}$ = probability that the initial sample intersects network k
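A rough Python sketch of this estimator, under two assumptions that are not stated on the slide: the initial sample is Bernoulli(p), so a network of size $n_{gk}$ is intersected with probability $1 - (1-p)^{n_{gk}}$, and $J_k$ is read as the indicator that the initial sample hits network k, as the estimator is usually written in Thompson's adaptive cluster sampling. Network contents and names are made up.

```python
def network_hit_prob(size, p):
    """Probability that a Bernoulli(p) initial sample intersects a network of this size."""
    return 1.0 - (1.0 - p) ** size

def adaptive_estimate(networks, s0, p):
    """Sum of n_C / pi_g over the networks intersected by the initial sample s0.

    networks: list of (members, n_c) pairs, where members is a set of unit ids
    and n_c is the number of units of interest inside that network.
    """
    total = 0.0
    for members, n_c in networks:
        if members & s0:                       # J_k = 1: the network was hit
            total += n_c / network_hit_prob(len(members), p)
    return total

# Hypothetical networks: one of size 3 and one of size 2, all members of interest.
networks = [({1, 2, 3}, 3), ({7, 8}, 2)]
s0 = {2, 5}      # only the first network is intersected
p = 0.1
estimate = adaptive_estimate(networks, s0, p)
```

Small networks are rarely hit, so when they are, their contribution is inflated by a large weight; this is one source of the instability of adaptive designs.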
When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design:
$$\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1-p)^{n_r}}$$
where $n_0 = \# s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the edge units of C (units at distance 1 from C).
Adaptive sampling - Variance
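$$\hat{V}(\hat{N}_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left( \frac{\pi_{gkk'}}{\pi_{gk} \pi_{gk'}} - 1 \right)$$
where, for $k \neq k'$ (and $\pi_{gkk} = \pi_{gk}$):
$$\pi_{gkk'} = 1 - (1 - \pi_{gk}) - (1 - \pi_{gk'}) + (1-p)^{n_{gk} + n_{gk'}}$$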
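The joint probability that the initial sample intersects two distinct networks follows from inclusion-exclusion: one minus the probability of missing the first, minus the probability of missing the second, plus the probability of missing both. The Python sketch below assumes a Bernoulli(p) initial sample and disjoint networks; sizes and names are made up. In that setting the joint probability factorizes into the product of the marginals, so the cross terms of the variance estimator vanish.

```python
def hit_prob(size, p):
    """P(a Bernoulli(p) initial sample intersects a network of this size)."""
    return 1.0 - (1.0 - p) ** size

def joint_hit_prob(size_k, size_l, p):
    """Inclusion-exclusion: 1 - P(miss k) - P(miss l) + P(miss both)."""
    miss_k = (1.0 - p) ** size_k
    miss_l = (1.0 - p) ** size_l
    miss_both = (1.0 - p) ** (size_k + size_l)
    return 1.0 - miss_k - miss_l + miss_both

# Hypothetical network sizes and sampling rate.
p = 0.1
joint = joint_hit_prob(3, 2, p)
product_of_marginals = hit_prob(3, p) * hit_prob(2, p)
```

With an initial sample drawn without replacement at fixed size, the factorization no longer holds and the cross terms have to be kept.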
Variance estimation for the Rao-Blackwellized estimator can be done by selecting m samples:
$$\hat{V}(\hat{N}_{C5}) = \hat{V}(\hat{N}_{C4}) - \frac{1}{m-1} \sum_{i=1}^{m} (\hat{N}_{C5i} - \hat{N}_{C4})^2$$
Section 4
Application to Twitter data
Subsection 1
The problem
The Twitter graph
Twitter in 2013
Image from [1]
The Twitter API
Twitter data is accessed through an API (application programming interface), which limits the number of calls per hour.
Example : Star Wars : The Force Awakens
How many (real) users are behind the tweets about the new Star Wars movie?
Example : “Star Wars, The Force Awakens”
Let's write:
$y_k$ = number of tweets @starwars by user k between 7:48 PM and 10:48 PM EST on 10/29/15
$z_k = 1\{y_k \geq 1\}$
Goal: estimate $N_C = T(Z)$
Additionally, we write: $n_C = \sum_{k \in s} z_k$
The Twitter graph
The Twitter graph ([6]) :
Is directed
Degree distribution is heavy-tailed
Has small path lengths
Sampling designs
1. Bernoulli sample
2. Stratified Bernoulli
3. Snowball over the stratified Bernoulli
4. Adaptive over the stratified Bernoulli
5. (Rao-Blackwell version of the adaptive estimator)
Stratification
U1 = Followers of official @starwars account
U2 = Rest of Twitter users
Stratification : Neyman allocation
Given some preliminary exploratory data, we get (for n = 20000):
$n_1 = 9700$
$n_2 = 10300$
Sample size - extension
Size of s0 : 1000 (so that total sample size, with extensions, would
be about n = 20000).
Calibration variables
N = Number of users in scope
Structure of number of followers
Number of verified users
. . .
Estimators
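$$\hat{N}_{C1} = \frac{n_C}{p}$$
$$\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$$
$$\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$$
$$\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$$
$$\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1-p)^{n_r}}$$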
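The first two estimators are simple expansion estimators: each sampled community member stands for 1/p users under Bernoulli sampling, or for $N_h / n_h$ users of its stratum under stratification. A minimal Python sketch follows; all numbers are invented for illustration and are not the Twitter figures.

```python
def bernoulli_estimate(n_c, p):
    """N_C1: community members found in the sample, divided by the sampling rate."""
    return n_c / p

def stratified_estimate(N, N1, n1, n2, n_c1, n_c2):
    """N_C2: expand the count found in each stratum by N_h / n_h and add up."""
    return N1 / n1 * n_c1 + (N - N1) / n2 * n_c2

# Hypothetical figures: 1,000,000 users, 100,000 of them in stratum 1.
n_c1_hat = bernoulli_estimate(n_c=40, p=0.002)
n_c2_hat = stratified_estimate(N=1_000_000, N1=100_000, n1=500, n2=1500,
                               n_c1=50, n_c2=15)
```

In the stratified case the 50 members found among followers each count for 200 users and the 15 found elsewhere each count for 600.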
Exclusion probabilities
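$$\bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$$
where $q_{S_h}$ is the probability that a unit of stratum $U_h$ is not selected in the initial sample.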
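With two strata, the exclusion probability of $B_i$ only depends on how many of its members fall in each stratum. The Python sketch below assumes $q_{S_h}$ is the per-unit probability of not being drawn in stratum h under independent draws; the sets and rates are made up.

```python
def stratified_exclusion_prob(B_i, U1, q1, q2):
    """P(no member of B_i is in the initial sample) with two strata.

    Members of B_i inside U1 are each missed with probability q1,
    all other members with probability q2.
    """
    in_u1 = len(B_i & U1)
    in_u2 = len(B_i) - in_u1
    return q1 ** in_u1 * q2 ** in_u2

# Hypothetical example: B_i has two members in stratum 1 and one in stratum 2.
B_i = {1, 2, 3}
U1 = {1, 2, 9}
pibar = stratified_exclusion_prob(B_i, U1, q1=0.9, q2=0.99)
inclusion_prob = 1.0 - pibar
```

The inclusion probability used as the weight denominator in the snowball estimator is then one minus this value.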
Subsection 2
Results
Results
Design      n       n_scope  n_0   N_C (est.)  CV (est.)  Deff (est.)
Bernoulli   20013   3946     -     354121      0.231      1.04
Stratified  20094   9832     -     316889      0.097      0.68
1-snowball  159957  73570    1000  331097      0.031      0.60
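To read the coefficients of variation on the scale of the estimates, one can convert them into approximate intervals. The Python sketch below assumes a normal approximation with a 1.96 multiplier, which is an assumption of this example and not something stated in the talk; the estimates and CVs are copied from the table.

```python
# (estimated N_C, estimated CV) per design, taken from the results table.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}

def normal_interval(estimate, cv, z=1.96):
    """Approximate interval estimate +/- z * cv * estimate (normal approximation)."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width

intervals = {design: normal_interval(est, cv) for design, (est, cv) in results.items()}
widths = {design: high - low for design, (low, high) in intervals.items()}
```

The interval width shrinks from the Bernoulli design to the stratified one and again to the snowball extension, in line with the decreasing CV column.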
Mean number of tweets @StarWars per user : 1.18 ± 0.07
This suggests that bots are not responsible for the very large number of tweets (see [3], [4])!
Adaptive sampling did not converge.
Subsection 3
Model-assisted sampling
Auxiliary information for the Barabási-Albert model:
Degree Centrality Local clustering Mean path Max path
Degree ++ - - - -
Centrality - - - -
Local clustering + +
Mean path ++
Max path
Future work
Combine all these (optimal allocations, etc.)
Asymptotics
Conclusion
Thank you !
http://nc233.com/madstat2017
@nc233
[1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
[2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[3] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.
[4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
[5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.
[6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[7] Art B. Owen. Empirical likelihood. CRC Press, 2010.
[8] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.
[9] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
[10] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.

t-Test Project Instructions and Rubric Project Overvi.docx
 
Why Data Science is a Science
Why Data Science is a ScienceWhy Data Science is a Science
Why Data Science is a Science
 
A First Step Towards Stream Reasoning at FIS 2008
A First Step Towards Stream Reasoning at FIS 2008A First Step Towards Stream Reasoning at FIS 2008
A First Step Towards Stream Reasoning at FIS 2008
 
Fast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphsFast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphs
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 
data summarization.pptx
data summarization.pptxdata summarization.pptx
data summarization.pptx
 
Educ 190_Data Analysis and Collection Tools
Educ 190_Data Analysis and Collection ToolsEduc 190_Data Analysis and Collection Tools
Educ 190_Data Analysis and Collection Tools
 
Self-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipesSelf-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipes
 
Data analytics in computer networking
Data analytics in computer networkingData analytics in computer networking
Data analytics in computer networking
 

Dernier

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 

Dernier (20)

GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 

Sampling graphs efficiently - MAD Stat (TSE)

  • 1. Sampling graphs efficiently: model assisted designs and application to Twitter data. Antoine Rebecq, Université Paris X - INSEE, 3/23/17
  • 2. Outline: 1 Statistics and networks (Graphs and stats; Methods - algorithms - models). 2 Survey sampling (Estimates; Use of auxiliary information). 3 Extending the sampling design (Snowball sampling; Adaptive sampling). 4 Application to Twitter data (The problem; Results; Model-assisted sampling).
  • 3. Section 1: Statistics and networks
  • 4. Subsection 1: Graphs and stats
  • 5. Graphs. Graph $G$, set of vertices and edges: $G = (V, E)$
  • 6. Directed graphs
  • 7. Statistics of interest - graphs: Size, Degree, Centrality, Clustering, Communities, ...
  • 8. Degree. $d_v$ = number of edges incident upon vertex $v$
  • 9. Degree / scale-free property
  • 10. Path lengths
  • 11. Centrality. Measure of "importance" of a node. Examples: Google PageRank, betweenness centrality (number of times a node acts as a bridge along the shortest path between two other nodes)
  • 12. Betweenness centrality
  • 13. Clustering. Global clustering coefficient = $\frac{3 \cdot \text{number of triangles}}{\text{number of connected triplets}}$. Local clustering coefficient of a vertex = how close its neighbours are to being a clique (complete graph).
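The global clustering coefficient above can be sketched in a few lines. This is only an illustration: the function name and the adjacency-set representation are assumptions, not something taken from the slides. Each triangle closes three connected triplets (one per centre vertex), which is where the factor 3 comes from.

```python
from itertools import combinations

def global_clustering(adj):
    """adj: dict mapping each vertex to the set of its neighbours (undirected graph)."""
    closed = 0    # connected triplets whose two end vertices are also linked
    triplets = 0  # connected triplets, counted once per centre vertex
    for v, neigh in adj.items():
        for a, b in combinations(sorted(neigh), 2):
            triplets += 1
            if b in adj[a]:
                closed += 1  # a triangle is counted 3 times, once per centre
    return closed / triplets if triplets else 0.0

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficient = global_clustering(adj)
```

On the toy graph there are 5 connected triplets, 3 of which are closed.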
  • 14. Local clustering coefficient
  • 15. The rise of "big graphs"
  • 16. The rise of "big graphs". Example: the Graph500 benchmark (http://www.graph500.org). Size of data sets up to 1.1 PB adjacency list (human connectome size)
  • 17. Subsection 2: Methods - algorithms - models
  • 18. Methods for graph statistics: Algorithms (computer science, "big data"); Model-based estimation; Sampling ("Design-based estimation")
  • 19. Methods for graph statistics
  • 20. Computer science methods. Efficient algorithms (speed / memory).
  • 21. Computer science methods. Efficient algorithms (speed / memory). Sometimes require sampling.
  • 22. Model-based estimation. Famous graph models: Erdős-Rényi; Price / Barabási-Albert (heavy-tailed degree distribution); Watts-Strogatz / "small-world" (short path lengths); Stochastic block models (communities). Images from [8]
  • 23. Model-based estimation: Erdős-Rényi ("random graphs")
  • 24. Model-based estimation: Barabási-Albert ("preferential attachment")
  • 25. Model-based estimation: Watts-Strogatz ("small world")
  • 26. Model-based estimation: Stochastic Block Models
  • 27. Sampling / Design-based estimation. Sampling: select a few vertices/edges and compute estimators from the sample data. Very little has been written about design-based statistical inference on networks (Kolaczyk 2009, [5]). We try survey sampling methods used in official statistics institutes to make design-based inference about "big graphs".
  • 28. Section 2: Survey sampling
  • 29. Subsection 1: Estimates
  • 30. Horvitz-Thompson estimator. Population $U$ (here, the vertices of the graph). Assign each $k \in U$ an inclusion probability $P(k \in s) = \pi_k$
  • 31. Horvitz-Thompson estimator. Classic unbiased estimator for totals and means (Horvitz-Thompson): $\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$ and $\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$
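A minimal sketch of the two Horvitz-Thompson estimators, assuming the sampled values and their inclusion probabilities are available as parallel lists (the function names are illustrative, not from the talk):

```python
def horvitz_thompson_total(y, pi):
    """Estimate the population total from sampled values y and their inclusion probabilities pi."""
    return sum(yk / pk for yk, pk in zip(y, pi))

def horvitz_thompson_mean(y, pi, N):
    """Estimate the population mean; N is the (known) population size."""
    return horvitz_thompson_total(y, pi) / N

# Hypothetical sample of three units.
total = horvitz_thompson_total([2, 0, 5], [0.5, 0.25, 0.5])
```

Each sampled unit is weighted by $1/\pi_k$, so units that were unlikely to be drawn count for more.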
  • 32. Horvitz-Thompson estimator. The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities $\pi_k = P(k \in s)$ and $\pi_{kl} = P(k, l \in s)$: $V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$
  • 33. Bernoulli sampling. Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample. Bernoulli sampling: $\forall k, \pi_k = p$
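Poisson sampling can be sketched as one independent draw per unit. Because the draws are independent, $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$, so the Horvitz-Thompson variance reduces to $\sum_{k \in U} \frac{1 - \pi_k}{\pi_k} y_k^2$. The function names below are illustrative assumptions.

```python
import random

def poisson_sample(pi, rng):
    """Return the indices of the units selected by independent Bernoulli(pi[k]) draws."""
    return [k for k, p in enumerate(pi) if rng.random() < p]

def ht_variance_poisson(y, pi):
    """Variance of the Horvitz-Thompson total under Poisson sampling (y over the whole population)."""
    return sum((1 - p) / p * yk ** 2 for yk, p in zip(y, pi))

# Bernoulli sampling is the special case where every pi[k] equals the same p.
rng = random.Random(2017)
sample = poisson_sample([0.1] * 50, rng)
```

Note that the sample size is random under both designs.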
  • 34. Subsection 2: Use of auxiliary information
  • 35. Auxiliary information. If $\pi_k \propto y_k$ (with a fixed-size design), then $V(\hat{T}(Y)_{HT}) = 0$. In practice, use an auxiliary variable $X$ that is well correlated with $Y$.
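The zero-variance statement can be checked in one line, under the assumptions (not spelled out on the slide) that the design has a fixed sample size $n$ and that all $y_k > 0$. Every sample then gives exactly the true total:

```latex
\pi_k = \frac{n\, y_k}{T(Y)}
\;\Longrightarrow\;
\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}
               = \sum_{k \in s} \frac{T(Y)}{n}
               = T(Y)
```

Since $y_k$ is unknown before the survey, the same idea is applied with $\pi_k \propto x_k$ for an auxiliary variable $X$.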
  • 36. Stratified sampling. We write $U = U_1 \cup U_2 \cup \dots \cup U_H$ (disjoint strata) and draw independent samples in each $U_h$. Strata should be formed so that the within-stratum dispersion of $y_k$ is as low as possible.
  • 37. Stratified sampling: Neyman allocation. Given a set of strata and a sample size $n$, the variance is minimized for $n_h = n \, \frac{N_h S_h}{\sum_{h'=1}^{H} N_{h'} S_{h'}}$, where $S_h^2$ is the variance of $y$ within stratum $U_h$.
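As a small illustration, Neyman allocation shares the total sample size between strata in proportion to $N_h S_h$. The sketch below assumes the stratum sizes and standard deviations are known (in practice they come from preliminary data); the function name is hypothetical and the result is not rounded to integers.

```python
def neyman_allocation(n, N_h, S_h):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N * S for N, S in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

# A small but very dispersed stratum receives as much sample as a large, homogeneous one.
allocation = neyman_allocation(100, [1000, 3000], [3.0, 1.0])
```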
  • 38. Calibrated estimator. Deville-Särndal, 1992 ([2]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. Very similar to empirical likelihood methods ([7]). Computing variances for calibrated estimators is easy.
  • 39. Section 3: Extending the sampling design
  • 40. Official statistics. Measuring "hidden populations"
  • 41. Community structure. When trying to measure the size of a community ($\hat{N}_C$), edges are used as auxiliary variables.
  • 42. Snowball sampling. From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$
  • 43. Subsection 1: Snowball sampling
  • 44. Snowball sampling. Population $U$
  • 45. Snowball sampling. Initial sample $s_0$
  • 46. Snowball sampling. One-stage snowball extension $s = A(s_0)$
  • 47. Snowball sampling. Formally, we write: $B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$, $A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$, and $s = A(s_0)$
  • 48. Snowball sampling. $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$, where $\bar{\pi}(B_i) = P(B_i \subset \bar{s}) = \prod_{k \in B_i} (1 - P(k \in s))$
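The snowball estimator can be sketched as follows. The sketch assumes that the initial sample is drawn with independent inclusions (Poisson or Bernoulli), so that the exclusion probability of $B_i$ is a plain product, and it reads $P(k \in s)$ in the product as the inclusion probability in the initial sample. The names `p0`, `B` and `z` are illustrative containers, not part of the talk.

```python
from math import prod

def exclusion_probability(B_i, p0):
    """Probability that no unit of B_i is drawn in the initial sample (independent draws)."""
    return prod(1.0 - p0[k] for k in B_i)

def snowball_estimate(sample, z, B, p0):
    """Sum z_i / (1 - exclusion probability of B_i) over the units of the extended sample."""
    return sum(z[i] / (1.0 - exclusion_probability(B[i], p0)) for i in sample if z[i])

# Hypothetical 3-node example: B[i] holds i plus the nodes with an edge pointing to i.
p0 = {0: 0.5, 1: 0.5, 2: 0.5}
B = {0: {0, 1}, 1: {1}, 2: {0, 1, 2}}
z = {0: 1, 1: 0, 2: 1}
estimate = snowball_estimate([0, 1, 2], z, B, p0)
```

A node with many incoming edges is very likely to be reached by the extension, so its weight is close to 1.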
  • 49. Snowball sampling. $\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\bar{\pi}(B_i \cup B_j)} \gamma_{ij}$, where $\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i) \bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$
  • 50. Subsection 2: Adaptive sampling
  • 51. Adaptive sampling (Thompson, [9]). Used in official statistics to measure the number of drug users or HIV-positive people. The sampling design is often compared to the video game "Minesweeper".
  • 52. Adaptive sampling. Image from [10]
  • 53. Adaptive sampling. Once a unit bearing the characteristic of interest is found, its whole network is included in the sample.
  • 54. Adaptive sampling. Estimator: $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$, where $K$ = number of networks, $y^*_k$ = total of $Y$ in network $k$, $n^*_{Ck}$ = number of people with $y_k \geq 1$ in network $k$, $J_k = 1\{k \in C\}$, and $\pi_{gk}$ = probability that the initial sample intersects network $k$
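A rough sketch of this estimator, assuming the initial sample is a Bernoulli sample of rate $p$, so that a network of $n_{gk}$ units is intersected with probability $1 - (1 - p)^{n_{gk}}$. The tuple layout and function names are assumptions made for illustration.

```python
def intersection_probability(size, p):
    """Probability that a Bernoulli(p) initial sample hits a network of `size` units."""
    return 1.0 - (1.0 - p) ** size

def adaptive_estimate(networks, p):
    """networks: list of (n_star_C, size, J) tuples for the networks reached by the sample."""
    return sum(n_c * J / intersection_probability(size, p) for n_c, size, J in networks)

# Two hypothetical networks; the second is outside C (J = 0) and contributes nothing.
estimate = adaptive_estimate([(3, 2, 1), (5, 1, 0)], p=0.5)
```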
  • 55. Adaptive sampling. When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the adaptive stratified design: $\hat{N}_{C5} = n_0 + \sum_{k=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$, where $n_0 = \# s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the sides of $C$.
  • 56. Adaptive sampling - Variance. $\hat{V}(\hat{N}_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left( \frac{\pi_{gkk'}}{\pi_{gk} \pi_{gk'}} - 1 \right)$, where the joint intersection probability is $\pi_{gkk'} = 1 - (1 - \pi_{gk}) - (1 - \pi_{gk'}) + (1 - p)^{n_{gk} + n_{gk'}}$
  • 57. Adaptive sampling - Variance. Variance estimation for the Rao-Blackwellized estimator can be done by selecting $m$ samples: $\hat{V}(\hat{N}_{C5}) = \hat{V}(\hat{N}_{C4}) - \frac{1}{m - 1} \sum_{i=1}^{m} (\hat{N}_{C5i} - \hat{N}_{C4})^2$
  • 58. Section 4: Application to Twitter data
  • 59. Subsection 1: The problem
  • 60. The Twitter graph. Twitter in 2013. Image from [1]
  • 61. The Twitter API. Access to the Twitter data goes through an API (application programming interface), which limits the number of calls per hour.
  • 62. Example: Star Wars: The Force Awakens. How many (real) users are behind the tweets talking about the new Star Wars movie?
  • 63. Example: "Star Wars, The Force Awakens". Let's write: $y_k$ = number of tweets @starwars by user $k$ on 10/29/15 between 7:48 and 10:48 PM EST, and $z_k = 1\{y_k \geq 1\}$. Goal: estimate $N_C = T(Z)$. Additionally, we write $n_C = \sum_{k \in s} z_k$
  • 64. The Twitter graph ([6]): is directed; its degree distribution is heavy-tailed
  • 65. The Twitter graph: has small path lengths
  • 66. Sampling designs: 1 Bernoulli sample; 2 Stratified Bernoulli; 3 Snowball over the stratified Bernoulli; 4 Adaptive over the stratified Bernoulli; 5 (Rao-Blackwell of the adaptive estimator)
  • 67. Stratification. $U_1$ = followers of the official @starwars account; $U_2$ = rest of Twitter users
  • 68. Stratification: Neyman allocation. Given some preliminary exploratory data, we get (for $n = 20000$): $n_1 = 9700$, $n_2 = 10300$
  • 69. Sample size - extension. Size of $s_0$: 1000 (so that the total sample size, with extensions, would be about $n = 20000$).
  • 70. Calibration variables: $N$ = number of users in scope; structure of number of followers; number of verified users; ...
  • 71. Estimators. $\hat{N}_{C1} = \frac{n_C}{p}$; $\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$; $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$; $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$; $\hat{N}_{C5} = n_0 + \sum_{k=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$
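The first two estimators are simple expansion estimators and can be sketched directly. The function and argument names are illustrative, and the numbers in the usage lines are made up, not taken from the Twitter study.

```python
def bernoulli_estimate(n_C, p):
    """N_C1: number of sampled users in C, expanded by the Bernoulli sampling rate p."""
    return n_C / p

def stratified_estimate(N, N1, n1, n2, n_C1, n_C2):
    """N_C2: per-stratum counts expanded by N_h / n_h, with N2 = N - N1."""
    return N1 / n1 * n_C1 + (N - N1) / n2 * n_C2

# Hypothetical figures for illustration only.
nc1 = bernoulli_estimate(50, 0.01)
nc2 = stratified_estimate(N=1000, N1=200, n1=20, n2=40, n_C1=10, n_C2=4)
```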
  • 72. Exclusion probabilities. $\bar{\pi}(B_i) = P(B_i \subset \bar{s}) = \prod_{k \in B_i} (1 - P(k \in s)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$
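With two strata the product collapses to two powers. The sketch below assumes that $q_{S_h} = 1 - p_h$, where $p_h$ is the Bernoulli sampling rate in stratum $U_h$; the slide does not define $q_{S_h}$, so this reading and the function name are assumptions.

```python
def stratified_exclusion(B_i, U1, p1, p2):
    """Exclusion probability of B_i when U1 is sampled at rate p1 and the rest at rate p2."""
    in_U1 = len(B_i & U1)       # units of B_i that are in stratum U1
    in_U2 = len(B_i) - in_U1    # remaining units are in stratum U2
    return (1.0 - p1) ** in_U1 * (1.0 - p2) ** in_U2

# Hypothetical example: two units of B_i in U1, one in U2.
q = stratified_exclusion({1, 2, 3}, U1={1, 2}, p1=0.5, p2=0.25)
```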
  • 73. Subsection 2: Results
  • 74. Results
      Design       n        n_scope   n_0    $\hat{N}_C$   $\hat{CV}$   $\hat{Deff}$
      Bernoulli    20013    3946      -      354121        0.231        1.04
      Stratified   20094    9832      -      316889        0.097        0.68
      1-snowball   159957   73570     1000   331097        0.031        0.60
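To read the estimated coefficients of variation in the table, one can turn them into approximate intervals. The sketch below assumes a normal approximation with a 95% quantile of 1.96, which is an assumption added here and not a claim made in the talk.

```python
def normal_interval(estimate, cv, z=1.96):
    """Approximate interval estimate +/- z * cv * estimate (normal approximation assumed)."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width

# Figures from the 1-snowball and Bernoulli rows of the results table.
snowball_low, snowball_high = normal_interval(331097, 0.031)
bernoulli_low, bernoulli_high = normal_interval(354121, 0.231)
```

Under this approximation the snowball interval is far narrower than the Bernoulli one, which is what the drop in $\hat{CV}$ from 0.231 to 0.031 expresses.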
  • 75. Statistics and networks Survey sampling Extending the sampling design Application to Twitter data The problem Results Model-assisted sampling Results Mean number of tweets @StarWars per user : 1.18 ± 0.07 Suggests that bots are not responsible for this very large number of tweets (see [4], [3]) ! Adaptive sampling did not converge. Antoine Rebecq Sampling designs for graphs
  • 76. Subsection 3: Model-assisted sampling
  • 77. Auxiliary information for the Barabási-Albert model. Columns: Degree, Centrality, Local clustering, Mean path, Max path. Rows, with entries in the order shown on the slide:
      Degree: ++ - - - -
      Centrality: - - - -
      Local clustering: + +
      Mean path: ++
      Max path:
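For readers unfamiliar with the Barabási-Albert model, the following pure-Python sketch generates a preferential-attachment graph and computes vertex degrees, the first auxiliary variable in the matrix above. It is an illustration only, not the simulation code behind the slide (the references cite the graph-tool library). The function names and the choice of starting from m isolated vertices are assumptions.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Return the edge list of a preferential-attachment graph on n vertices.

    Starts from m isolated vertices. The first new vertex attaches to all of
    them; every later vertex attaches to m distinct existing vertices chosen
    with probability proportional to their current degree.
    """
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # neighbours of the next new vertex
    repeated = []             # each vertex appears once per unit of degree
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            # Drawing uniformly from `repeated` is degree-proportional sampling.
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


def degrees(n, edges):
    """Degree of each vertex 0..n-1 given an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

Other auxiliary variables in the matrix (centrality, local clustering, path lengths) would be computed on the same edge list, typically with a graph library.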
  • 78. Future work: combine all these (optimal allocations, etc.); asymptotics.
  • 79. Conclusion: Thank you! http://nc233.com/madstat2017 @nc233
  • 80. [1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
      [2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
      [3] Emilio Ferrara. "Manipulation and abuse on social media" by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.
      [4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
  • 81. [5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.
      [6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
      [7] Art B. Owen. Empirical likelihood. CRC Press, 2010.
      [8] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.
  • 82. [9] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
      [10] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.