Dr. Claudia Wagner
http://claudiawagner.info/
Web Science Summer School WS3, Southampton, UK, 21st July 2014
source: Twitter
2
• Statistical computing is central, but data science is more than statistics
• Activities of data scientists:
  ◦ collection and generation,
  ◦ preparation,
  ◦ analysis,
  ◦ visualization,
  ◦ management and preservation of large collections of data
Jeffrey Stanton, Introduction to Data Science, free e-book
3
• Ask an interesting question
  ◦ Why is it important? Which number answers your question?
• Get or generate the data
  ◦ Which data will help answer your question? How is the data generated? Are there any sampling biases? Ethical issues?
• Analyze the data
  ◦ Are there any anomalies or regularities?
  ◦ Which hidden process has generated the data?
  ◦ Fit a model to the data and validate it
• Visualize and communicate results
  ◦ What does 75% probability mean?
• Preserve and share the data to make results reproducible
4
• Data is a collection of facts
• Facts can be numbers, words, measurements, observations or even just descriptions of things
• Qualitative data (e.g., “it was great”)
• Quantitative data
  ◦ Discrete (e.g., 5)
  ◦ Continuous (e.g., 3.723)
5
6
Stevens, S. S. (1946). "On the Theory of Scales of Measurement". Science 103 (2684): 677–680.
Nominal (e.g., ethnic group, sex, nationality): observations are only named
Ordinal (e.g., status): observations can be ordered
Interval (e.g., temperature in Celsius): distance is meaningful
Ratio (e.g., weight): absolute zero
7
• Random sample of Twitter users
  ◦ Random sample of tweets from the public timeline
  ◦ More active users are more likely to be included
• Friendship Paradox
  ◦ Select a random sample of people and ask them to list the people they know. Contact a sample of the listed friends and repeat the survey.
  ◦ Sampling bias: people with more friends are more likely to show up in the friend lists generated at the first stage
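The friend-list sampling bias can be made concrete with a small sketch. The graph below is hypothetical illustration data, not taken from the slides; it compares the average number of friends when people are sampled directly with the average when they are sampled via friend lists.

```python
# Hypothetical undirected friendship graph (illustrative data only).
friends = {
    "ann": ["bob", "cat", "dan", "eve"],
    "bob": ["ann", "cat"],
    "cat": ["ann", "bob"],
    "dan": ["ann"],
    "eve": ["ann"],
}

degree = {person: len(fs) for person, fs in friends.items()}

# Sampling people uniformly: every person counts exactly once.
mean_degree = sum(degree.values()) / len(degree)

# Sampling via friend lists: a person is listed once per friend they have,
# so people with many friends are over-represented.
listed = [friend for fs in friends.values() for friend in fs]
mean_degree_of_listed = sum(degree[p] for p in listed) / len(listed)

print(mean_degree, mean_degree_of_listed)
```

In this toy graph the people reached through friend lists have more friends on average than a uniformly sampled person, which is the bias described above.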
8
• A study found that the profession with the lowest average age of death was student.
  ◦ Being a student does not cause you to die at an early age. Being a student means you are young. This is what makes the average age of those who die so low.
• Amount of ice cream consumed per day is highly correlated with number of drownings per day
  ◦ Both variables are correlated with the daily temperature
9
"Teaching Statistics: A Bag of Tricks," by Gelman and Nolan (2002)
• A study found that only 1.5% of drivers in accidents reported that they were using a cell phone, whereas 10.9% reported that they were distracted by another occupant in the car.
• Can we conclude that using a cell phone is safer than speaking with another occupant?
  ◦ P(cellphone | accident) != P(accident | cellphone)
  ◦ Compare P(accident | cellphone) and P(accident | occupant)
  ◦ We need to know the prevalence of cell phone use
  ◦ It is likely that many more people talk to another occupant in the car while driving than talk on a cell phone
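A rough sketch of why the prevalence matters, using Bayes' rule. The two accident percentages come from the slide; the two prevalence values are made-up placeholders, since the slide does not give them.

```python
# From the slide: share of accident drivers reporting each distraction.
p_cell_given_accident = 0.015
p_occupant_given_accident = 0.109

# HYPOTHETICAL prevalences among all drivers (not from the slide).
p_cell = 0.01       # assumed share of driving time spent on a cell phone
p_occupant = 0.20   # assumed share of driving time spent talking to an occupant

# Bayes: P(accident | x) = P(x | accident) * P(accident) / P(x).
# P(accident) cancels when the two conditional risks are divided.
relative_risk = (p_cell_given_accident / p_cell) / (
    p_occupant_given_accident / p_occupant
)

print(round(relative_risk, 2))
```

Under these assumed prevalences the cell phone comes out as the riskier distraction, even though it appears less often among the reported accidents.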
10
Jessica Utts, What Educated Citizens Should Know about Statistics and Probability, The American Statistician, Vol. 57, No. 2 (May, 2003), pp. 74-79
• Ecological Fallacy
• Illiteracy rate in each US state and the proportion of immigrants per state
  ◦ Negative correlation of −0.53
    ▪ The greater the proportion of immigrants in a state, the lower its average illiteracy.
• When individuals are considered, the correlation was +0.12: immigrants were on average more illiterate than native citizens.
11
Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15 (3): 351–357.
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation
• Found data or observational data
  ◦ Are observational data enough?
  ◦ Are such data available?
• Generate data
  ◦ Design the data generation process
    ▪ E.g., via surveys, experiments, crowdsourcing
13
14
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Two general types of traces:
15
Accretion - a build-up of physical traces
Erosion - the wearing away of material
Webb, Eugene J. et al. Unobtrusive Measures: nonreactive research in the social
sciences. Chicago: Rand McNally, 1966
• Bulk downloads
  ◦ Wikipedia, IMDB, Million Song Database, etc.
• API access
  ◦ NYTimes, Twitter, Facebook, Foursquare, etc.
• Web scraping
  ◦ Tools, e.g., http://scrapy.org/
  ◦ What data is OK to scrape?
    ▪ Public, non-sensitive, anonymized, fully referenced information. Check the terms and conditions!
16
• Takes time to accumulate
• Conservative estimate
  ◦ Only what happened counts! Intentions, motivations or internal states don’t count.
• Inferentially weak
  ◦ Cannot answer “what-if” questions
17
• Surveys
• Simulations
  ◦ Model behavior of users/agents on a micro-level
  ◦ Simulate what happens under different conditions
  ◦ Empirical validation
• Experiments
  ◦ Keep all variables constant and manipulate only one variable (e.g., emotions)
18
• Simulations
  ◦ Study of macro-phenomena
  ◦ Difficult to validate empirically
• Surveys and/or experiments
  ◦ We only get data from those who are accessible and willing to respond or participate
  ◦ Respondents provide answers that are in line with their self-image and the researcher’s expectations
  ◦ Hawthorne effect, etc.
19
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation
21
• Data cleaning
  ◦ Fill in missing values
  ◦ Smooth noisy data
  ◦ Identify or remove outliers
  ◦ Resolve inconsistencies
• Data integration
  ◦ Integration of multiple databases or files
22
• Data transformation
  ◦ Normalization: values are scaled to fall within a small, specified range
  ◦ Standardization: how many standard deviations each data point lies from the mean
  ◦ Discretization: divide the range of a continuous attribute into intervals → some algorithms require discrete attributes.
• Data reduction
  ◦ Dimensionality reduction (remove unimportant attributes via feature selection, group features into factors, e.g. PCA, SVD)
  ◦ Aggregation and clustering
  ◦ Sampling
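The three transformations can be sketched in a few lines of plain Python. The attribute values and the bin edges below are hypothetical, chosen only so the results are easy to check by hand; min-max scaling to [0, 1] is one common choice of normalization, not the only one.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute

# Normalization (min-max): rescale into the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): distance from the mean in standard deviations.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Discretization: map each value to the index of its interval.
# The edges 4.0 and 6.0 are arbitrary, assumed cut points.
def discretize(v, edges=(4.0, 6.0)):
    return sum(v >= e for e in edges)

bins = [discretize(v) for v in values]

print(normalized, standardized, bins)
```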
Data Collection
Data Preprocessing
Data Analysis → Data Mining, Statistical Inference, Machine Learning
Data Visualization
Data Preservation
• Problem:
  ◦ Given a high-dimensional space (e.g., fb-users described via various attributes such as the locations they visited)
  ◦ Find pairs of data points (x, y) that are within some distance threshold d(x, y) ≤ s
• We first need to decide what “distance” means
24
• Distance Measures
  ◦ Jaccard similarity between two sets of items I1, I2:
    $\mathrm{sim}(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$
    $\mathrm{dist}(I_1, I_2) = 1 - \mathrm{sim}(I_1, I_2)$
  ◦ Euclidean distance, Hamming distance, cosine similarity, etc.
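A minimal sketch of the Jaccard similarity and distance on sets. The two sets of visited locations are made up for illustration, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1 - jaccard_similarity(a, b)


# Hypothetical locations visited by two fb-users.
visited_1 = {"bar", "cinema", "gym"}
visited_2 = {"bar", "gym", "library", "park"}

sim = jaccard_similarity(visited_1, visited_2)   # 2 shared / 5 distinct
dist = jaccard_distance(visited_1, visited_2)
print(sim, dist)
```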
25
• Goal: Given a set of items, group the items into some number of clusters, so that
  ◦ Members of a cluster are similar to each other
  ◦ Members of different clusters are dissimilar
26
Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press
• Non-hierarchical / point assignment:
  ◦ Maintain a set of clusters
  ◦ Each point belongs to the “nearest” cluster
• Hierarchical:
  ◦ Agglomerative (bottom up):
    ▪ Initially, each point is a cluster
    ▪ Repeatedly combine the two “nearest” clusters into one
  ◦ Divisive (top down):
    ▪ Start with one cluster and recursively split it
27
Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press
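The agglomerative (bottom-up) variant could be sketched as follows. This is a deliberately naive version under assumptions not made on the slides: one-dimensional points, "nearest" measured as the distance between cluster centroids, and a fixed target number of clusters k.

```python
def centroid(cluster):
    return sum(cluster) / len(cluster)


def agglomerate(points, k):
    """Merge the two nearest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # initially, each point is a cluster
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine the nearest pair
        del clusters[j]
    return clusters


points = [1.0, 2.0, 9.0, 10.0, 25.0]  # hypothetical data
print(agglomerate(points, 3))
print(agglomerate(points, 2))
```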
28
29
30
31
32
• Try different k, looking at the change in the average distance to centroid as k increases
• Average falls rapidly until the right k, then changes little
33
[Figure: average diameter plotted against k; the best k is at the bend of the curve]
• Aim: Find hidden concepts/groups in a matrix
• Method: Singular Value Decomposition (SVD)
34
Leskovec et al., Mining of Massive Datasets, p. 418
• Rank = 2
• Rank denotes the information content of the matrix.
• For instance, a rank-1 matrix can be written as the product of one column vector and one row vector
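As a small illustration of the rank-1 case, the sketch below builds a matrix as the product of one column vector and one row vector; the numbers are arbitrary. Every row is then a multiple of the same row vector, so each 2×2 determinant inside the matrix is zero.

```python
column = [1, 2, 3]   # arbitrary column vector
row = [4, 5]         # arbitrary row vector

# Outer product: entry (i, j) is column[i] * row[j].
rank_one = [[c * r for r in row] for c in column]
print(rank_one)

# Any two rows are proportional, so every 2x2 determinant is zero.
dets = [
    rank_one[i][0] * rank_one[j][1] - rank_one[i][1] * rank_one[j][0]
    for i in range(3)
    for j in range(i + 1, 3)
]
print(dets)
```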
35
36
37
Leskovec et al., Mining of Massive Datasets, p. 418
[Figure: SVD of the user-movie matrix into three factors]
Relates users to concepts
Strength of concepts
Relates movies to concepts
Data Collection
Data Preprocessing
Data Analysis → Data Mining, Statistical Inference, Machine Learning
Data Visualization
Data Preservation
• Estimate population parameters from sample statistics
• Sampling distribution of a statistic:
  ◦ Draw a finite sample of size n from the population
  ◦ Compute the statistic on the sample
  ◦ Repeat this process
• The mean of the sampling distribution is the expected value of the statistic in the true population
• The SD of the sampling distribution is the standard error
39
40
• Some descriptive statistics such as mean or median are unbiased estimators of central tendency
  ◦ Expected value of the statistic is the true population parameter
• Expected value of dispersion in a sample is an underestimate of the true population value
41
• True population size is N
• Sample size n < N (e.g., n=100)
• Correction factor: $\frac{n}{n-1}$
  ◦ For n=100 the correction factor is ~1.01
  ◦ For n=100,000 the correction factor is ~1.00001
• Estimate of the population variance: $\frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
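A quick check of the correction factor in Python. The sample is hypothetical; `statistics.pvariance` divides by n and `statistics.variance` divides by n − 1, so multiplying the former by n/(n − 1) should reproduce the latter.

```python
from statistics import pvariance, variance

sample = [2, 3, 8, 8]  # hypothetical sample
n = len(sample)

uncorrected = pvariance(sample)          # divides by n
corrected = uncorrected * n / (n - 1)    # apply the factor n / (n - 1)

print(uncorrected, corrected, variance(sample))

# The factor shrinks towards 1 as the sample grows.
factors = {size: size / (size - 1) for size in (100, 100_000)}
print(factors)
```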
42
• Specify the range of values that has a high probability of containing the true population parameter
• Confidence level 1 − α: the probability that the confidence interval contains the true population parameter
43
• CI = sample statistic ± MOE
• MOE = SE * critical value
• MOE = $\frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2}$
• Critical value: how far away from the mean must a point lie in order to be considered “extreme” or “unexpected”?
44
n … sample size
σ … standard deviation
$z_{\alpha/2}$ … confidence coefficient
45
[Figure: standard normal curve with a shaded area of 0.475 under the curve. What’s the z-score?]
46
• Select 1000 fb-users randomly
• Average number of bar visits per year X = 78
• Standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = 30$
• Confidence level is 95% → divide 0.95 by 2 to get 0.475
• Check the z table → z = 1.96
• MOE = $\frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2} = \frac{30}{\sqrt{1000}} \cdot 1.96 = 1.86$
• 78 ± 1.86, CI: [76.14; 79.86]
47
• An exact CI can only be computed when the sampling distribution and the SD of the sampling distribution (i.e., the SE) are known
• Otherwise we have to estimate the standard error (SE) → Bootstrap
48
• Sampling with replacement
• Population is unknown
  ◦ But we observe one sample of size n=4 from the population: {2, 3, 8, 8}
  ◦ We use this sample to generate a large number of bootstrap samples of size n:
    ▪ 8, 8, 8, 3
    ▪ 3, 3, 8, 2
    ▪ …
• Compute the statistic (e.g., mean) for each bootstrap sample
• Estimate the SE from the bootstrap distribution
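The bootstrap steps above can be sketched in Python for the sample {2, 3, 8, 8}. The number of bootstrap samples (10,000) and the fixed random seed are choices made here for reproducibility, not part of the slides.

```python
import random
from statistics import mean, stdev

sample = [2, 3, 8, 8]        # the observed sample from the slide
rng = random.Random(0)       # fixed seed so the run is repeatable (assumption)

# Draw bootstrap samples of size n with replacement and record their means.
boot_means = [
    mean(rng.choices(sample, k=len(sample)))
    for _ in range(10_000)
]

# The SD of the bootstrap distribution estimates the standard error.
se = stdev(boot_means)
moe_95 = 2 * se              # rule of thumb: MOE for a 95% CI is about 2 * SE

print(round(se, 2), round(mean(sample) - moe_95, 2), round(mean(sample) + moe_95, 2))
```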
49
50
[Figure: bootstrap procedure]
Population → Sample → Bootstrap samples
Calculate the statistic for each bootstrap sample → Bootstrap distribution
Standard Error (SE): SD of the bootstrap distribution
Statistic ± MOE, where MOE for a 95% CI = 2 * SE
• Randomly selected sample of fb-users
• Have they ever checked in at a nightclub?
  ◦ Democrats: 100/1000 yes
  ◦ Republicans: 90/1000 yes
• Do the nightlife preferences differ significantly across political parties?
  ◦ Give a 95% CI for the difference in proportions
51
# 0/1 indicator vectors: 1 = has checked in at a nightclub
dems = rep( c(0,1), c(1000-100, 100) )
repubs = rep( c(0,1), c(1000-90, 90) )
mean(dems)    # 0.1
mean(repubs)  # 0.09
del.p = mean(dems) - mean(repubs)  # 0.01 (point estimate)
# Bootstrap: resample both groups with replacement, 1000 times
reps = replicate( 1000, {
  ds = sample( dems, 1000, replace=TRUE )
  rs = sample( repubs, 1000, replace=TRUE )
  mean( ds ) - mean( rs )
} )
SE = sd( reps )  # 0.0131 (varies slightly from run to run)
c( del.p - 2*SE, del.p + 2*SE )  # -0.0162 0.0362 (interval estimate)
52
• H1: political party affects nightlife preferences
• H0: political party does not affect nightlife preferences
• Proportion of users who visited nightclubs, no matter which party they belong to: 190/2000 = 0.095
• If political affinities have no effect, we would expect the following frequencies:
53
Observed frequencies:
      Democrats  Republicans  Total
yes   100        90           190
no    900        910          1810
Expected frequencies under H0:
      Democrats  Republicans  Total
yes   95         95           190
no    905        905          1810
• $\chi^2 = \sum \frac{(o - e)^2}{e} = 0.5815$
• DF = (number of rows − 1) × (number of columns − 1) = 1
• Critical value of χ2 at 5% significance and 1 DF is 3.84
• Our χ2 does not exceed the critical value
• We cannot reject H0
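The χ2 computation can be reproduced from the observed table alone. The sketch below derives the expected frequencies from the row and column totals and compares the statistic with the critical value 3.84 quoted above; variable names are illustrative.

```python
# Observed counts: rows = yes/no, columns = Democrats/Republicans.
observed = [[100, 90], [900, 910]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under H0 = row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 3.84   # chi-square, 5% significance, 1 DF (from the slide)
reject_h0 = chi2 > critical_value

print(expected, round(chi2, 4), df, reject_h0)
```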
54
      Democrats  Republicans  Total
yes   100        90           190
no    900        910          1810
• If α=0.05 then 95% of all values fall in this interval
• Two-tailed test:
  ◦ The 2.5% of values in the upper tail and the 2.5% in the lower tail are considered so extreme that we reject H0 if we observe them
55
• Test if Democrats on fb have, on average, more than 60 bar visits per year
• H1: µ > 60
• H0: µ <= 60
• Random sample of 20 Democratic fb-users:
  ◦ {65 73 51 67 48 80 69 53 59 62 71 67 64 78 65 49 80 60 51 70}
• Sample mean $\hat{\mu} = 64.1$
• Assume we know the SD in the population = 10
• $z = \frac{\hat{\mu} - \mu}{SE}$, $SE = \frac{SD}{\sqrt{n}}$, $z = \frac{64.1 - 60}{10 / \sqrt{20}} = 1.8336$
56
• Would we expect that? How extreme is this observation?
• If H0 is true (mean <= 60) → in which area around the mean do 95% of all points lie?
• Pick alpha level α=0.05 → that’s the maximum probability of rejecting the null hypothesis if the null hypothesis is true
• Right-tailed test: find the critical value for an area of 0.45 using the z-distribution
• If the z-score of our observed data exceeds this value, we reject H0
57
1.8336 > 1.645 → reject the null hypothesis
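A sketch of the right-tailed z-test in Python. It assumes the sample has 20 values with the sixteenth read as 49, which is the reading that reproduces the stated sample mean of 64.1; the p-value is an addition not shown on the slides.

```python
from math import sqrt
from statistics import NormalDist, mean

# Sample of 20 users (assumption: the sixteenth value is 49).
visits = [65, 73, 51, 67, 48, 80, 69, 53, 59, 62,
          71, 67, 64, 78, 65, 49, 80, 60, 51, 70]

mu_0 = 60          # mean under H0
sd = 10            # population SD, assumed known as on the slide
alpha = 0.05

sample_mean = mean(visits)
se = sd / sqrt(len(visits))
z = (sample_mean - mu_0) / se

z_critical = NormalDist().inv_cdf(1 - alpha)   # right-tailed critical value
p_value = 1 - NormalDist().cdf(z)
reject_h0 = z > z_critical

print(sample_mean, round(z, 4), round(z_critical, 3), reject_h0)
```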
• Large effects, small samples:
  ◦ In small samples it is easy to overestimate an effect that might have happened by chance
• Small effects, large samples:
  ◦ The smaller the effect you want to measure, the larger the sample size you need to show that it is significant!
• Example: Assume a coin is biased: 10% heads and 90% tails
  ◦ Tossing the coin 10 times should be enough to convince people that the coin is biased.
• Example: Assume a coin is biased: 51% heads and 49% tails
  ◦ The minimum sample size increases as the effect size one wants to demonstrate decreases
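To get a feeling for the two coin examples, here is a back-of-the-envelope sketch. It uses a simple heuristic that is an assumption of this example, not a formal power analysis: toss often enough that the 95% margin of error around a fair coin (z · 0.5 / √n) is smaller than the bias one wants to show.

```python
from math import ceil
from statistics import NormalDist


def tosses_needed(p_heads, confidence=0.95):
    """Heuristic: smallest n with z * 0.5 / sqrt(n) <= |p_heads - 0.5|."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    effect = abs(p_heads - 0.5)
    return ceil((z * 0.5 / effect) ** 2)


print(tosses_needed(0.10))   # strongly biased coin
print(tosses_needed(0.51))   # barely biased coin
```

Under this heuristic a handful of tosses suffices for the 10%/90% coin, while the 51%/49% coin needs on the order of ten thousand.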
58
• The more we analyze, the more we find by chance!
  ◦ If you calculate the correlations between 10 variables (i.e., 45 different correlation coefficients), you should expect about 2 correlations to be significant with p < 0.05 by chance (one in every 20)
• Corrections or adjustments for the total number of comparisons are needed!
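The arithmetic behind this warning, as a short sketch. The Bonferroni adjustment is shown as one common example of such a correction (the slide does not name a specific one), and the probability of at least one false positive assumes independent tests.

```python
from math import comb

n_variables = 10
alpha = 0.05

n_tests = comb(n_variables, 2)                   # number of pairwise correlations
expected_false_positives = n_tests * alpha       # expected chance findings

# Assumes the tests are independent, which correlations rarely are exactly.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni: test each correlation at alpha / number of tests.
bonferroni_alpha = alpha / n_tests

print(n_tests, expected_false_positives, round(p_at_least_one, 2), bonferroni_alpha)
```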
59
• Many tests such as the z-test, t-test and ANOVA make the normality assumption.
  ◦ If the true population is very skewed (e.g., power law), the sampling distribution of the statistic will not be normal
• Nonparametric methods like the sign test use, e.g., the median rather than the mean
  ◦ Hypothesis about the median of the true population (e.g., H1: median < 100, H0: median = 100)
  ◦ Count the number of measurements that favor the null hypothesis
  ◦ If H0 is true, half of the measurements should fall on each side.
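A sketch of the sign test for H0: median = 100 against H1: median < 100. The measurements are invented for illustration; values exactly equal to 100 would be dropped, which is a common convention assumed here.

```python
from math import comb

measurements = [82, 95, 104, 77, 91, 99, 88, 110, 93, 85, 97, 90]  # hypothetical
median_h0 = 100

below = sum(m < median_h0 for m in measurements)
above = sum(m > median_h0 for m in measurements)
n = below + above   # ties with the hypothesized median are ignored

# Under H0 each measurement falls below the median with probability 0.5,
# so the count below follows Binomial(n, 0.5).
# One-sided p-value: probability of at least `below` values below the median.
p_value = sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n

print(below, above, round(p_value, 4))
```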
60
Data Collection
Data Preprocessing
Data Analysis → Data Mining, Statistical Inference, Machine Learning
Data Visualization
Data Preservation
• Aim
  ◦ Find a function that describes the relation between X (e.g., bar visits) and Y (e.g., new friends)
  ◦ Given X, predict Y
• Problem
  ◦ Infinite number of ways X and Y could be related
• Idea
  ◦ Reduce the space of possible functions and start with the simplest one (linear relation)
  ◦ $Y = b_0 + b_1 X$
62
• Y = 2 + 0.5 X
63
[Figure: plot of the line Y = 2 + 0.5 X, with X from 0 to 8 and Y from 0 to 6]
• Use gradient descent to minimize the cost function $C(b_0, b_1)$
  ◦ $C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$
  ◦ $C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$
• Start with some guess for $b_0, b_1$
• Keep changing $b_0, b_1$ to reduce $C(b_0, b_1)$ until we hopefully end up at a minimum
64
• $b_0 := b_0 - \alpha \frac{\partial}{\partial b_0} C(b_0, b_1)$
• $b_1 := b_1 - \alpha \frac{\partial}{\partial b_1} C(b_0, b_1)$
• Simultaneous updates of b0 and b1
65
α: learning rate
The derivative of the cost function tells us the slope of the cost function
66
[Figure: cost function C(b) plotted against b]
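A minimal gradient descent sketch for the cost function C(b0, b1). The data are generated from the example line Y = 2 + 0.5 X so the result is easy to check; the learning rate of 0.05 and the 5000 iterations are choices made for this illustration.

```python
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 + 0.5 * x for x in xs]      # noise-free points on Y = 2 + 0.5 X


def cost(b0, b1):
    """C(b0, b1) = 1/(2N) * sum of squared residuals."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def gradient(b0, b1):
    """Partial derivatives of C with respect to b0 and b1."""
    n = len(xs)
    residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    d_b0 = -sum(residuals) / n
    d_b1 = -sum(r * x for r, x in zip(residuals, xs)) / n
    return d_b0, d_b1


b0, b1 = 0.0, 0.0    # initial guess
alpha = 0.05         # learning rate (assumed; too large a value diverges)

for _ in range(5000):
    d_b0, d_b1 = gradient(b0, b1)
    # Simultaneous update: both gradients were computed before either change.
    b0, b1 = b0 - alpha * d_b0, b1 - alpha * d_b1

print(round(b0, 4), round(b1, 4), cost(b0, b1))
```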
• Residuals: deviation between the observed and the predicted values
• Residual sum of squares: $RSS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
67
Is this a good measure? No, it depends on the number of observations N. What if we multiply it by 1/N?
• $y_i$ … observed value
• $\hat{y}_i$ … value predicted by the model
• $\bar{y}$ … mean of observed data
68
Total variability in the outcome that needs to be explained
Unexplained variability! Residuals: difference between the observed value and the estimated value
Proportion of the total variability unexplained by the model
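These quantities combine into R-squared, which is one minus the proportion of the total variability left unexplained by the model. The sketch below uses hypothetical observations and the predictions of the example line Y = 2 + 0.5 X.

```python
xs = [0, 2, 4, 6, 8]
observed = [2.5, 2.5, 4.0, 5.5, 5.5]       # hypothetical observations
predicted = [2 + 0.5 * x for x in xs]      # model: Y = 2 + 0.5 X

mean_observed = sum(observed) / len(observed)

# Unexplained variability: residual sum of squares.
rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

# Total variability in the outcome that needs to be explained.
tss = sum((y - mean_observed) ** 2 for y in observed)

r_squared = 1 - rss / tss

print(rss, tss, round(r_squared, 4))
```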
• Dependent variable is binary (e.g., went to nightclub or not)
• We can group users by number of new friends per year (20-25, 25-30, 30-35, etc.) and compute, for each group, the proportion of people who went to a nightclub (the “nightclub probability”)
69
• Logistic regression:
• Maximum likelihood estimator
  ◦ Estimate the unknown coefficients by maximizing the log-likelihood function
• The coefficient is interpreted as the rate of change in the "log odds" as X changes
70
$\ln \frac{P(Y = 1)}{1 - P(Y = 1)} = b_0 + b_1 X + \epsilon$
Simple example:
You have a coin that you know is biased towards heads and you want to know what the probability of heads (p) is.
We want to estimate the unknown parameter p!
71
You flip the coin 10 times and the coin comes up heads 7 times.
What’s your best guess for p?
72
The probability of observing 7 heads when tossing a coin 10 times is given by the binomial distribution:
$P(7 \text{ heads}) = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$
Find the value of p that makes our data most likely!
73
$\text{Likelihood} = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$
$\log \text{Likelihood} = \log \frac{10!}{7!\,3!} + 7 \log p + 3 \log(1-p)$
Derivative with respect to p:
$\frac{d}{dp} \log \text{Likelihood} = 0 + \frac{7}{p} - \frac{3}{1-p}$
Set the derivative equal to 0 and solve for p:
$\frac{7}{p} - \frac{3}{1-p} = 0$
$\frac{7(1-p) - 3p}{p(1-p)} = 0$
$7(1-p) = 3p$
$7 - 7p = 3p$
$7 = 10p$
$p = \frac{7}{10}$
* derivative of a constant is 0
* derivative of 7 f(x) is 7 f'(x)
* derivative of log x is 1/x
74
web.stanford.edu/~kcobb/hrp261/lecture4.ppt
Likelihood of observing 7 heads when tossing a biased coin with p(head) = 0.7 and p(tail) = 0.3 10 times:
$\text{Value of the likelihood} = \binom{10}{7} (0.7)^7 (0.3)^3 = 120 \cdot (0.7)^7 (0.3)^3 = 0.267$
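The maximum likelihood estimate can also be found numerically. This sketch evaluates the binomial likelihood of 7 heads in 10 tosses on a grid of candidate values for p; the grid resolution of 0.001 is an arbitrary choice for illustration.

```python
from math import comb


def likelihood(p, heads=7, tosses=10):
    """Binomial probability of the observed number of heads given p."""
    return comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)


# Candidate values 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the highest likelihood.
p_hat = max(grid, key=likelihood)

print(p_hat, round(likelihood(p_hat), 3))
```

The grid search lands on the same value as the analytical derivation, p = 7/10.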
75
• Linear regression (R-squared)
• Logistic regression (pseudo R-squared)
76
you can “prove” anything with graphics
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation
78
79
http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition
80
http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition
• Be careful when drawing conclusions from graphs
• Size of effect shown in graphic != size of effect in sample data != size of the effect in the true population
  ◦ Scale distortion (e.g., bar charts not starting at zero)
  ◦ Snapshot
  ◦ …
81
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation
• GESIS Data Archives & Data Centers
  ◦ Preserve research data and make them accessible for reuse.
  ◦ Competencies and infrastructure
    ▪ e.g. https://datorium.gesis.org/xmlui/
• CESSDA:
  ◦ umbrella organisation for the European national data archives (http://www.cessda.net/)
• Re3data
  ◦ browse data archives by topic: http://www.re3data.org/
83
DPC Digital Preservation Handbook:
http://www.dpconline.org/advice/preservationhandbook
• Legal and regulatory framework
  ◦ including open access and licenses
• Incentives to share data
  ◦ Credentials? Citation principles under development (see e.g. http://www.datacite.org/).
• Long-term preservation strategies
  ◦ software and hardware changes, documentation, metadata and retrieval/access
Data preservation starts at an individual level
Reasons for data loss are often on an individual level, e.g. broken hardware, researchers leaving a group.
84
http://claudiawagner.info/teaching/WebSciSS2014/
• Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
• Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
• Jeffrey Stanton, Introduction to Data Science (free download)
• Steffen Staab, Data Science Course University Koblenz-Landau, https://www.uni-koblenz-landau.de/campus-koblenz/fb4/west/teaching/ss14/data-science/data-science1
• Serious Stats, Thom Baguley
86

Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Dernier (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

Datascience Introduction WebSci Summer School 2014

  • 1. Dr. Claudia Wagner http://claudiawagner.info/ Web Science Summer School WS3, Southampton, UK, 21st July 2014
  • 3.  Statistical computing is central, but data science is more than statistics  Activities of data scientists:  collection and generation,  preparation,  analysis,  visualization,  management and preservation of large collections of data (Jeffrey Stanton, Introduction to Data Science, free e-book)
  • 4.  Ask an interesting question  Why is it important? Which number answers your question?  Get or generate the data  Which data will help answer your question? How is the data generated? Are there any sampling biases? Ethical issues?  Analyze the data  Are there any anomalies or regularities?  Which hidden process has generated the data?  Fit a model to the data and validate it  Visualize and communicate results  What does 75% probability mean?  Preserve and share the data to make results reproducible
  • 5.  Data is a collection of facts  Facts can be numbers, words, measurements, observations or even just descriptions of things  Qualitative data (e.g., “it was great”)  Quantitative data  Discrete (e.g., 5)  Continuous (e.g., 3.723)
  • 6.  Nominal (e.g., ethnic group, sex, nationality): observations are only named  Ordinal (e.g., status): observations can be ordered  Interval (e.g., temperature in Celsius): distance is meaningful  Ratio (e.g., weight): absolute zero (Stevens, S. S. (1946). "On the Theory of Scales of Measurement". Science 103 (2684): 677–680.)
  • 8.  Random sample of Twitter users  Random sample of tweets from the public timeline  More active users are more likely to be included  Friendship Paradox  Select a random sample of people and ask them to list the people they know. Contact a sample of the listed friends and repeat the survey.  Sampling bias: people with more friends are more likely to show up in the friend lists which we generate at the first stage
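The sampling bias behind the friendship paradox can be checked on a toy network. The sketch below is only an illustration: the five-person network and both function names are made up for this example. It compares the average number of friends of a uniformly sampled person with the average for a person drawn from the friend lists.

```python
# Toy friendship network (hypothetical): person -> set of friends, symmetric.
friends = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"a", "c"},
    "e": {"a"},
}


def mean_degree(graph):
    """Average number of friends of a uniformly sampled person."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def mean_degree_of_listed_friends(graph):
    """Average number of friends of a person drawn from the friend lists.

    Every person appears once in the list of each of their friends, so a
    person with d friends is d times as likely to be drawn.
    """
    listed = [friend for nbrs in graph.values() for friend in nbrs]
    return sum(len(graph[f]) for f in listed) / len(listed)


print(mean_degree(friends))                    # uniform sample of people
print(mean_degree_of_listed_friends(friends))  # sample via friend lists
```

In this toy network the second average is larger because the well-connected person "a" shows up in four of the five friend lists.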
  • 9.  A study found that the profession with the lowest average age of death was student.  Being a student does not cause you to die at an early age. Being a student means you are young. This is what makes the average age of those who die so low.  The amount of ice cream consumed per day is highly correlated with the number of drownings per day  Both variables are correlated with the daily temperature (Gelman and Nolan, "Teaching Statistics: A Bag of Tricks", 2002)
  • 10.  A study found that only 1.5% of drivers in accidents reported that they were using a cell phone, whereas 10.9% reported that they were distracted by another occupant in the car.  Can we conclude that using a cell phone is safer than speaking with another occupant?  P(cellphone | accident) != P(accident | cellphone)  Compare P(accident | cellphone) and P(accident | occupant)  We need to know the prevalence of cell phone use  It is likely that many more people talk to another occupant while driving than talk on a cell phone (Jessica Utts, What Educated Citizens Should Know about Statistics and Probability, The American Statistician, Vol. 57, No. 2 (May 2003), pp. 74-79)
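Bayes' rule shows why prevalence matters: P(accident | cellphone) = P(cellphone | accident) * P(accident) / P(cellphone). When the two conditions are compared, P(accident) cancels. The sketch below uses the 1.5% and 10.9% from the study, but the two prevalence values are invented purely for illustration and the function name is hypothetical.

```python
def relative_risk(p_a_given_acc, p_b_given_acc, p_a, p_b):
    """P(accident | a) / P(accident | b) via Bayes' rule.

    P(accident) appears in both numerator and denominator and cancels.
    """
    return (p_a_given_acc / p_a) / (p_b_given_acc / p_b)


# From the study: share of accident drivers reporting each distraction.
P_CELL_GIVEN_ACC = 0.015
P_OCC_GIVEN_ACC = 0.109

# ASSUMED prevalences (not from the source): share of all drivers who are
# on the phone / talking to an occupant at a given moment.
P_CELL = 0.01
P_OCC = 0.20

ratio = relative_risk(P_CELL_GIVEN_ACC, P_OCC_GIVEN_ACC, P_CELL, P_OCC)
print(f"P(accident|cellphone) / P(accident|occupant) = {ratio:.2f}")
```

Under these assumed prevalences the cell phone would be the riskier distraction, even though it is reported far less often among accident drivers.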
  • 11.  Ecological Fallacy  Illiteracy rate in each US state and the proportion of immigrants per state  Negative correlation of −0.53 ▪ The greater the proportion of immigrants in a state, the lower its average illiteracy.  When individuals are considered, the correlation is +0.12: immigrants were on average more illiterate than native citizens. (Robinson, W. S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15 (3): 351–357.)
  • 13.  Found data or observational data  Are observational data enough?  Are such data available?  Generate data  Design the data generation process ▪ E.g., via surveys, experiments, crowdsourcing
  • 15. Two general types of traces:  Accretion: a build-up of physical traces  Erosion: the wearing away of material (Webb, Eugene J. et al. Unobtrusive Measures: Nonreactive Research in the Social Sciences. Chicago: Rand McNally, 1966)
  • 16.  Bulk downloads  Wikipedia, IMDB, Million Song Database, etc.  API access  NYTimes, Twitter, Facebook, Foursquare, etc.  Web scraping  Tools, e.g., http://scrapy.org/  What data is OK to scrape? ▪ Public, non-sensitive, anonymized, fully referenced information. Check the terms and conditions!
  • 17.  Takes time to accumulate  Conservative estimate  Only what happened counts! Intentions, motivations or internal states don’t count.  Inferentially weak  Cannot answer “what-if” questions
  • 18.  Surveys  Simulations  Model behavior of users/agents on a micro-level  Simulate what happens under different conditions  Empirical validation  Experiments  Keep all variables constant and only manipulate one variable (e.g., emotions)
  • 19.  Simulations  Study of macro-phenomena  Difficult to validate empirically  Surveys and/or Experiments  We only get data from those who are accessible and willing to respond or participate  Responders provide answers that are in line with their self-image and the researcher’s expectations  Hawthorne effect, etc.
  • 21.  Data cleaning  Fill in missing values  Smooth noisy data  Identify or remove outliers  Resolve inconsistencies  Data integration  Integration of multiple databases or files
  • 22.  Data transformation  Normalization: values are scaled to fall within a small, specified range  Standardization: how many standard deviations each data point lies from the mean  Discretization: divide the range of a continuous attribute into intervals  some algorithms require discrete attributes.  Data reduction  Dimensionality reduction (remove unimportant attributes via feature selection, group features into factors, e.g. PCA, SVD)  Aggregation and clustering  Sampling
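The three transformations can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the equal-width binning strategy and the example values are assumptions made for this sketch.

```python
import statistics


def min_max_normalize(values, lo=0.0, hi=1.0):
    """Normalization: rescale values linearly into the range [lo, hi]."""
    v_min, v_max = min(values), max(values)
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]


def standardize(values):
    """Standardization: distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def discretize(values, n_bins):
    """Discretization: map each value to one of n_bins equal-width bins."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / n_bins
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - v_min) / width), n_bins - 1) for v in values]


visits = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical bar visits per month

print(min_max_normalize(visits))
print(standardize(visits))
print(discretize(visits, n_bins=3))
```

Equal-width bins are only one option; equal-frequency bins are a common alternative when the attribute is skewed.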
  • 23. Data Collection, Data Preprocessing, Data Analysis (Data Mining, Statistical Inference, Machine Learning), Data Visualization, Data Preservation
  • 24.  Problem:  Given a high-dimensional space (e.g., fb-users who are described via various attributes such as locations they visited)  Find pairs of data points $(x, y)$ that are within some distance threshold $d(x, y) \le s$  We first need to decide what “distance” means
  • 25.  Distance Measures  Jaccard similarity between 2 sets of items $I_1$, $I_2$: $\mathrm{sim}(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$, $\mathrm{dist}(I_1, I_2) = 1 - \mathrm{sim}(I_1, I_2)$  Euclidean distance, Hamming distance, Cosine Similarity, etc.
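The Jaccard similarity and distance translate directly into set operations. The sketch below is a minimal example; the two users and their visited locations are hypothetical, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """1 - similarity, so identical sets have distance 0."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical fb-users described by the locations they visited.
user_1 = {"bar", "cinema", "gym"}
user_2 = {"bar", "gym", "library", "park"}

print(jaccard_similarity(user_1, user_2))  # 2 shared out of 5 distinct places
print(jaccard_distance(user_1, user_2))
```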
  • 26.  Goal: Given a set of items, group the items into some number of clusters, so that  Members of a cluster are similar to each other  Members of different clusters are dissimilar (Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press)
  • 27.  Non-hierarchical / point assignment:  Maintain a set of clusters  Each point belongs to the “nearest” cluster  Hierarchical:  Agglomerative (bottom up): ▪ Initially, each point is a cluster ▪ Repeatedly combine the two “nearest” clusters into one  Divisive (top down): ▪ Start with one cluster and recursively split it (Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press)
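The agglomerative (bottom-up) procedure can be sketched for one-dimensional points. This is a toy illustration under assumptions of this sketch: "nearest" is taken to mean the smallest distance between cluster centroids, merging stops once k clusters remain, and the data points are invented.

```python
def centroid(cluster):
    """Mean of the points in a cluster."""
    return sum(cluster) / len(cluster)


def agglomerative(points, k):
    """Merge the two nearest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # initially, each point is a cluster
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine the pair into one
        del clusters[j]
    return clusters


points = [1.0, 1.5, 2.0, 8.0, 8.5, 15.0]  # hypothetical 1-D data
print(agglomerative(points, k=3))
```

The pairwise search makes every merge quadratic in the number of clusters, which is fine for a sketch but not for large datasets.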
  • 33.  Try different k, looking at the change in the average distance to centroid as k increases  The average falls rapidly until the right k, then changes little (plot: average diameter against k, with the best k at the bend of the curve)
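The effect can be reproduced with a small point-assignment clustering on one-dimensional data. The sketch below is an assumption-laden illustration: it uses a basic k-means loop with a deterministic, evenly spaced initialisation, invented data with three obvious groups, and hypothetical function names.

```python
def kmeans_1d(points, k, iterations=100):
    """Basic k-means on 1-D data with evenly spaced initial centroids."""
    pts = sorted(points)
    n = len(pts)
    centroids = [pts[int((i + 0.5) * n / k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters fixed
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


def average_distance_to_centroid(points, k):
    """Mean distance of each point to the centroid of its cluster."""
    centroids, clusters = kmeans_1d(points, k)
    total = sum(abs(p - centroids[i]) for i, c in enumerate(clusters) for p in c)
    return total / len(points)


points = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # three well-separated groups
curve = {k: average_distance_to_centroid(points, k) for k in range(1, 5)}
for k, avg in curve.items():
    print(k, round(avg, 3))
```

For this data the average drops sharply up to k = 3 and barely changes at k = 4, which is the bend one looks for.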
  • 34.  Aim: Find hidden concepts/groups in a matrix  Method: Singular Value Decomposition (SVD) (Leskovec et al., Mining of Massive Datasets, p. 418)
  • 35.  Rank = 2  Rank denotes the information content of the matrix.  For instance, a rank-1 matrix can be written as the product of one column vector and one row vector
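The rank-1 statement can be illustrated with an outer product. In the sketch below the vectors are arbitrary examples and both helper names are hypothetical; the check relies on the fact that in a rank-1 matrix every row is a multiple of one non-zero row.

```python
def outer_product(column, row):
    """Matrix whose entry (i, j) is column[i] * row[j]."""
    return [[c * r for r in row] for c in column]


def is_rank_one(matrix):
    """True if all rows are multiples of the first non-zero row."""
    basis = next((row for row in matrix if any(row)), None)
    if basis is None:
        return False  # the zero matrix has rank 0
    n = len(basis)
    # Rows are proportional to basis exactly when all 2x2 minors vanish.
    return all(
        row[i] * basis[j] == row[j] * basis[i]
        for row in matrix
        for i in range(n)
        for j in range(i + 1, n)
    )


m = outer_product([1, 2, 3], [4, 5])
print(m)
print(is_rank_one(m))                 # built from one column and one row
print(is_rank_one([[1, 0], [0, 1]]))  # the identity matrix has rank 2
```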
  • 37. The three factors of the SVD:  one matrix relates users to concepts  one matrix relates movies to concepts  one matrix holds the strength of the concepts (Leskovec et al., Mining of Massive Datasets, p. 418)
  • 39.  Estimate population parameter from sample statistics  Sampling distribution of a statistic:  Draw a sample of size n from the population  Compute the statistic on the sample  Repeat this process many times  The mean of the sampling distribution is the expected value of the statistic  The SD of the sampling distribution is the standard error
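The procedure can be simulated when the population is known. The sketch below is an illustration only: the synthetic population (normally distributed around 78 with SD 30), the sample size and the number of repetitions are all assumptions, and the seed is fixed so the run is repeatable.

```python
import math
import random
import statistics


def sampling_distribution(population, n, repetitions, rng):
    """Means of `repetitions` random samples of size n (without replacement)."""
    return [
        statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)
    ]


rng = random.Random(42)
# Synthetic population, e.g. yearly bar visits of 20,000 users (assumed).
population = [rng.gauss(78, 30) for _ in range(20_000)]

means = sampling_distribution(population, n=100, repetitions=2_000, rng=rng)

mu = statistics.fmean(population)
sigma = statistics.pstdev(population)
print("population mean:       ", round(mu, 2))
print("mean of sample means:  ", round(statistics.fmean(means), 2))
print("SD of sample means:    ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):       ", round(sigma / math.sqrt(100), 2))
```

The SD of the simulated sample means should land close to sigma / sqrt(n), which is the usual formula for the standard error of the mean.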
  • 41.  Some descriptive statistics such as mean or median are unbiased estimators of central tendency  Expected value of the statistic is the true population parameter  Expected value of dispersion in a sample is an underestimate of the true population value
  • 42.  True population size is N  Sample size n < N (e.g., n=100)  Correction factor: $\frac{n}{n-1}$  For n=100 the correction factor is ~1.01  For n=100,000 the correction factor is ~1.00001  Estimate of the population variance: $\frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
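The correction can be checked against Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The five-value sample below is made up for the illustration.

```python
import statistics

sample = [4, 7, 9, 10, 15]  # hypothetical sample, n = 5
n = len(sample)

uncorrected = statistics.pvariance(sample)  # sum of squares divided by n
correction_factor = n / (n - 1)
corrected = correction_factor * uncorrected

print("uncorrected variance:", uncorrected)
print("correction factor:   ", correction_factor)
print("corrected variance:  ", corrected)
print("statistics.variance: ", statistics.variance(sample))  # divides by n - 1

# The factor shrinks towards 1 as the sample grows.
for size in (100, 100_000):
    print(size, size / (size - 1))
```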
  • 43.  Specify the range of values that has a high probability of containing the true population parameter  Confidence level 1 − α: the probability that the confidence interval contains the true population parameter (e.g., α = 0.05 gives a 95% confidence level)
  • 44.  CI = sample statistic +/- MOE  MOE = SE * critical value  MOE = $\frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2}$, where n is the sample size, σ the standard deviation and $z_{\alpha/2}$ the confidence coefficient  Critical value: how far away from the mean must a point lie in order to be considered “extreme” or “unexpected”?
  • 47.  Select 1000 fb-users randomly  Average number of bar visits per year: $\bar{X} = 78$  Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = 30$  Confidence level is 95%  divide 0.95 by 2 to get 0.475  Look up 0.475 in the z table  z = 1.96  MOE = $\frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2} = \frac{30}{\sqrt{1000}} \cdot 1.96 \approx 1.86$  78 +/- 1.86, CI: [76.14; 79.86]
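The same interval can be computed without a z table. In this sketch the critical value comes from the inverse CDF of the standard normal distribution (`statistics.NormalDist`); the function name is hypothetical, and the inputs are the sample mean, SD and sample size from the example.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Two-sided z interval: mean +/- z * sd / sqrt(n)."""
    # For 95% confidence the upper cut-off is the 97.5th percentile.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    moe = z * sd / math.sqrt(n)
    return mean - moe, mean + moe, z, moe


low, high, z, moe = z_confidence_interval(mean=78, sd=30, n=1000)
print(f"z = {z:.3f}, MOE = {moe:.2f}, CI = [{low:.2f}; {high:.2f}]")
```

Raising the confidence level widens the interval, while a larger sample narrows it with the square root of n.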
  • 48.  Exact CI can only be computed when the sampling distribution and SD of sampling distribution (i.e., SE) are known  Otherwise we have to estimate the Standard Error (SE)  Bootstrap
  • 49.  Sampling with replacement  Population is unknown  But we observe one sample of size n=4 from the population: {2, 3, 8, 8}  We use this sample to generate a large number of bootstrap samples of size n: ▪ 8, 8, 8, 3 ▪ 3, 3, 8, 2 ▪ …  Compute the statistic (e.g., mean) for each bootstrap sample  Estimate SE from the bootstrap distribution
  • 50. Population  Sample  Bootstrap samples  Calculate the statistic for each bootstrap sample  Bootstrap distribution  Standard Error (SE): SD of the bootstrap distribution  95% CI: statistic +/- MOE, with MOE = 2 * SE
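The bootstrap steps can be sketched for the four-value sample {2, 3, 8, 8}. The number of bootstrap samples and the seed are choices made for this sketch, and the function name is hypothetical; the 2 * SE margin follows the rule of thumb stated above.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_boot, rng):
    """SD of the statistic over n_boot resamples drawn with replacement."""
    n = len(sample)
    boot_stats = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.stdev(boot_stats)


sample = [2, 3, 8, 8]
rng = random.Random(0)  # fixed seed so the run is repeatable

se = bootstrap_se(sample, statistics.fmean, n_boot=5_000, rng=rng)
mean = statistics.fmean(sample)
moe = 2 * se  # rule-of-thumb margin of error for a 95% CI

print(f"mean = {mean}, SE = {se:.3f}, 95% CI = [{mean - moe:.2f}; {mean + moe:.2f}]")
```

With only four observations the interval is very wide, which is the honest answer for such a small sample.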
  • 51.  Randomly selected sample of fb-users  Have they ever checked in at a nightclub?  Democrats: 100/1000 yes  Republicans: 90/1000 yes  Do the nightlife preferences differ significantly across political parties?  Give 95% CI for difference in proportions
  • 52. Bootstrap CI for the difference in proportions, in R:
      dems = rep(c(0, 1), c(1000 - 100, 100))    # 900 "no", 100 "yes"
      repubs = rep(c(0, 1), c(1000 - 90, 90))    # 910 "no", 90 "yes"
      mean(dems)                                 # 0.1
      mean(repubs)                               # 0.09
      del.p = mean(dems) - mean(repubs)          # 0.01 (point estimate)
      reps = replicate(1000, {
        ds = sample(dems, 1000, replace = TRUE)  # bootstrap sample of democrats
        rs = sample(repubs, 1000, replace = TRUE)
        mean(ds) - mean(rs)
      })
      SE = sd(reps)                              # about 0.0131 (varies between runs)
      c(del.p - 2 * SE, del.p + 2 * SE)          # about -0.0162 0.0362 (interval estimate)
  • 53.  H1: political party affects the nightlife preferences  H0: political party does not affect the nightlife preferences  Proportion of users who visited nightclubs, no matter which party they belong to: 190/2000 = 0.095  If political affinities have no effect, we would expect the following frequencies:
      Observed     Democrats   Republicans   Total
        yes        100         90            190
        no         900         910           1810
      Expected     Democrats   Republicans   Total
        yes        95          95            190
        no         905         905           1810
  • 54.  $\chi^2 = \sum \frac{(o - e)^2}{e} = 0.5815$, with observed counts o (Democrats: 100 yes, 900 no; Republicans: 90 yes, 910 no) and expected counts e  DF = (number of rows – 1) x (number of columns – 1) = 1  Critical value of $\chi^2$ at 5% significance and 1 DF is 3.84  Our $\chi^2$ does not exceed the critical value  We cannot reject H0
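The test statistic can be recomputed from the observed table alone. In the sketch below the function name is hypothetical; the expected counts are derived from the row and column totals, and the critical value 3.84 is the one quoted above.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under H0
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


#            Democrats  Republicans
observed = [[100, 90],    # yes
            [900, 910]]   # no

stat, dof = chi_square(observed)
CRITICAL_VALUE = 3.84  # 5% significance, 1 degree of freedom
print(f"chi2 = {stat:.4f}, DF = {dof}, reject H0: {stat > CRITICAL_VALUE}")
```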
  • 55.  If α=0.05 then 95% of all values fall in this interval  Two-tail test:  2.5% of values in the upper tail and 2.5% of values in the lower tail are considered so extreme that we reject H0 if we observe them
  • 56.  Test if democrats on fb have, on average, more than 60 bar visits per year  H1: µ > 60  H0: µ <= 60  Random sample of 20 democratic fb-users:  {65 73 51 67 48 80 69 53 59 62 71 67 64 78 65 49 80 60 51 70}  Sample mean $\hat{\mu} = 64.1$  Assume we know the SD in the population = 10  $z = \frac{\hat{\mu} - \mu}{SE}$, $SE = \frac{SD}{\sqrt{n}}$, $z = \frac{64.1 - 60}{10 / \sqrt{20}} = 1.8336$
  • 57.  Would we expect that? How extreme is this observation?  If H0 is true (mean <= 60)  in which area around the mean do 95% of all points lie?  Pick alpha level α=0.05  that is the maximum probability of rejecting the null hypothesis when the null hypothesis is true  Right-tail test: find our critical value for 0.45 using the z-distribution  If the z-score of our observed data exceeds this value we have to reject H0  1.8336 > 1.645, so we reject the null hypothesis
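The right-tail z-test can be sketched as follows. The sample is the one listed for the 20 users, with the sixteenth value taken as 49 (an assumption of this sketch: it is the reading under which the 20 values average to the stated 64.1). The critical value is computed from the standard normal distribution instead of a table, and the variable names are hypothetical.

```python
import math
from statistics import NormalDist, fmean

visits = [65, 73, 51, 67, 48, 80, 69, 53, 59, 62,
          71, 67, 64, 78, 65, 49, 80, 60, 51, 70]

MU_0 = 60         # mean under H0
POPULATION_SD = 10  # assumed known, as in the example
ALPHA = 0.05

sample_mean = fmean(visits)
standard_error = POPULATION_SD / math.sqrt(len(visits))
z = (sample_mean - MU_0) / standard_error

critical_value = NormalDist().inv_cdf(1 - ALPHA)  # right-tail cut-off
p_value = 1 - NormalDist().cdf(z)

print(f"mean = {sample_mean}, z = {z:.4f}, critical = {critical_value:.3f}")
print(f"p = {p_value:.4f}, reject H0: {z > critical_value}")
```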
  • 58.  Large Effects, Small Samples:  In small samples it is easy to overestimate an effect that might have happened by chance  Small Effects, Large Samples:  The smaller the effect you want to measure, the larger the sample size you need to show that it is significant!  Example: Assume a coin is biased: 10% heads and 90% tails  Tossing the coin 10 times should be enough to convince people that the coin is biased.  Example: Assume a coin is biased: 51% heads and 49% tails  The minimum sample size increases as the effect size one wants to demonstrate decreases
  • 59.  The more we analyze, the more we find by chance!  If you calculate the correlations between 10 variables (i.e., 45 different correlation coefficients) you should expect about 2 correlations to be significant with p < 0.05 by chance (one in every 20)  Corrections or adjustments for the total number of comparisons are needed!
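The arithmetic behind the multiple-comparisons problem is short. The sketch below counts the pairwise correlations among 10 variables, and, assuming the tests are independent (a simplification), estimates the chance of at least one false positive. The Bonferroni adjustment shown is one common correction; the slide does not name a specific one.

```python
import math

N_VARIABLES = 10
ALPHA = 0.05

n_tests = math.comb(N_VARIABLES, 2)       # number of distinct pairs
expected_false_positives = n_tests * ALPHA

# Assuming independent tests (a simplification for this sketch).
p_at_least_one = 1 - (1 - ALPHA) ** n_tests

# Bonferroni: test each correlation at alpha / number of tests.
bonferroni_alpha = ALPHA / n_tests

print("number of correlations:     ", n_tests)
print("expected false positives:   ", expected_false_positives)
print("P(at least one by chance):  ", round(p_at_least_one, 3))
print("Bonferroni per-test alpha:  ", round(bonferroni_alpha, 5))
```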
  • 60.  Many tests such as z-test, t-test, ANOVA make the normality assumption.  If the true population is very skewed (e.g. power law) the sampling distribution of the statistic will not be normal  Nonparametric methods like the sign test use e.g. the median rather than the mean  Hypothesis about the median of the true population (e.g. H1: median < 100, H0: median = 100)  Count the number of measurements that favor the null hypothesis  If H0 is true, half of the measurements should fall on each side.
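A sign test for H1: median < 100 can be sketched with a binomial tail probability. The ten measurements are invented for the illustration, the function name is hypothetical, and dropping values equal to the hypothesised median is a convention assumed here.

```python
import math


def sign_test_p_value(values, median0):
    """One-sided sign test for H1: median < median0.

    Under H0 each value falls below median0 with probability 0.5, so the
    number of values below follows a Binomial(n, 0.5) distribution.
    """
    below = sum(1 for v in values if v < median0)
    above = sum(1 for v in values if v > median0)
    n = below + above  # values equal to median0 are dropped (assumed convention)
    # Probability of seeing at least `below` values under the median by chance.
    return sum(math.comb(n, k) for k in range(below, n + 1)) / 2 ** n


measurements = [82, 95, 101, 77, 99, 120, 64, 91, 88, 97]  # hypothetical data
p_value = sign_test_p_value(measurements, median0=100)
print(f"p = {p_value:.4f}, reject H0 at 0.05: {p_value < 0.05}")
```

Eight of the ten values lie below 100, yet the p-value stays just above 0.05, which shows how little power the test has with ten observations.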
  • 62.  Aim  Find a function that describes the relation between X (e.g. bar visits) and Y (e.g. new friends)  Given X, predict Y  Problem  Infinite number of ways X and Y could be related  Idea  Reduce the space of possible functions and start with the simplest one (linear relation)  $Y = b_0 + b_1 X$
  • 63.  Y = 2 + 0.5 X (plot of the line, X axis from 0 to 8, Y axis from 0 to 6)
  • 64.  Use Gradient Descent to minimize the cost function $C(b_0, b_1)$  $C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$  $C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$  Start with some guess for $b_0, b_1$  Keep changing $b_0, b_1$ to reduce $C(b_0, b_1)$ until we hopefully end up at a minimum
  • 65.  $b_0 := b_0 - \alpha \frac{\partial}{\partial b_0} C(b_0, b_1)$  $b_1 := b_1 - \alpha \frac{\partial}{\partial b_1} C(b_0, b_1)$  Simultaneous updates of $b_0$ and $b_1$  $\alpha$ is the learning rate  The derivative of the cost function informs us about the slope of the cost function
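The update rule can be sketched for simple linear regression. For the squared-error cost with the 1/(2N) factor, the partial derivatives are the mean error (for b0) and the mean of error times X (for b1). The data below are generated from the line Y = 2 + 0.5 X; the learning rate, the iteration count and the function name are choices made for this sketch.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5_000):
    """Fit Y = b0 + b1 * X by gradient descent on the squared-error cost."""
    n = len(xs)
    b0, b1 = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = sum(errors) / n                             # dC/db0
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / n  # dC/db1
        # Simultaneous update: both gradients use the old b0 and b1.
        b0, b1 = b0 - alpha * grad_b0, b1 - alpha * grad_b1
    return b0, b1


xs = [0, 2, 4, 6, 8]
ys = [2 + 0.5 * x for x in xs]  # points exactly on the line Y = 2 + 0.5 X

b0, b1 = gradient_descent(xs, ys)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

A learning rate that is too large makes the updates diverge, while a very small one needs many more iterations; 0.05 happens to work for this data range.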
  • 67.  Residuals: deviation between the observed and the predicted values  Residual sum of squares: $\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$  Is this a good measure? No, it depends on the number of observations N  What if we multiply it by 1/N?
  • 68.  $y_i$ … observed value  $\hat{y}_i$ … value predicted by the model  $\bar{y}$ … mean of observed data  Total variability in the outcome that needs to be explained  Unexplained variability: the residuals, i.e. the difference between the observed value and the estimated value  Proportion of the total variability unexplained by the model
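These quantities combine into the usual R-squared: one minus the proportion of the total variability left unexplained. The sketch below assumes that standard definition (the residual sum of squares divided by the total sum of squares); the observed values are invented and the predictions come from the line Y = 2 + 0.5 X.

```python
def r_squared(observed, predicted):
    """1 - (unexplained variability / total variability)."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


xs = [0, 2, 4, 6, 8]
observed = [2.5, 2.5, 4.5, 4.5, 6.5]    # hypothetical measurements
predicted = [2 + 0.5 * x for x in xs]   # model: Y = 2 + 0.5 X

print(round(r_squared(observed, predicted), 4))
print(r_squared(predicted, predicted))  # a perfect fit explains everything
```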
  • 69.  Dependent variable is binary (e.g., went to nightclub or not)  We can group users by number of new friends per year (20-25, 25-30, 30-35, etc.) and compute the proportion of nightclub visitors in each group (the “nightclub probability”)
  • 70.  Logistic Regression: $\ln \frac{P(Y = 1)}{1 - P(Y = 1)} = b_0 + b_1 X + \epsilon$  Maximum Likelihood Estimator  Estimate unknown coefficients by maximizing the log likelihood function  Coefficient is interpreted as the rate of change in the "log odds" as X changes
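The "log odds" interpretation can be made concrete by inverting the equation to get a probability. In the sketch below the coefficients are invented (they are not fitted to any data), the error term is left out, and the function names are hypothetical.

```python
import math


def predicted_probability(x, b0, b1):
    """P(Y = 1 | x), obtained by inverting ln(p / (1 - p)) = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_odds(p):
    """ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


B0, B1 = -3.0, 0.1  # hypothetical coefficients

p_30 = predicted_probability(30, B0, B1)  # e.g. 30 new friends per year
p_31 = predicted_probability(31, B0, B1)

print(f"P(Y=1 | X=30) = {p_30:.3f}")
print(f"P(Y=1 | X=31) = {p_31:.3f}")
# One extra unit of X shifts the log odds by exactly b1.
print(f"change in log odds = {log_odds(p_31) - log_odds(p_30):.3f}")
```

The probability itself does not change by a fixed amount per unit of X; only the log odds do, which is why the coefficient is read on that scale.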
  • 71. Simple Example: You have a coin that you know is biased towards heads and you want to know what the probability of heads (p) is. We want to estimate the unknown parameter p!
  • 72. You flip the coin 10 times and the coin comes up heads 7 times. What’s your best guess for p?
  • 73. The probability of observing 7 heads when tossing a coin 10 times is given by the binomial distribution: $P(7\ \text{heads}) = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$. Find the value for p that makes our data most likely!
  • 74. $\text{Likelihood} = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$, so $\log \text{Likelihood} = \log \frac{10!}{7!\,3!} + 7 \log p + 3 \log(1-p)$. Derivative with respect to p: $\frac{d}{dp} \log \text{Likelihood} = 0 + \frac{7}{p} - \frac{3}{1-p}$ (the derivative of a constant is 0, the derivative of 7f(x) is 7f'(x), the derivative of log x is 1/x). Set the derivative equal to 0 and solve for p: $\frac{7}{p} - \frac{3}{1-p} = 0 \Rightarrow \frac{7(1-p) - 3p}{p(1-p)} = 0 \Rightarrow 7(1-p) = 3p \Rightarrow 7 - 7p = 3p \Rightarrow 7 = 10p \Rightarrow p = \frac{7}{10}$ (web.stanford.edu/~kcobb/hrp261/lecture4.ppt)
  • 75. The likelihood of observing 7 heads when tossing a biased coin with p(head) = 0.7 and p(tail) = 0.3 10 times is: $\binom{10}{7} (.7)^7 (.3)^3 = 120\,(.7)^7 (.3)^3 = .267$
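The analytic result can be cross-checked numerically by evaluating the binomial likelihood on a grid of candidate values for p and keeping the best one. The grid resolution and the function name are choices made for this sketch.

```python
import math


def binomial_likelihood(p, heads=7, flips=10):
    """Probability of `heads` heads in `flips` tosses when P(head) = p."""
    return math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)


# Candidate values 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binomial_likelihood)

print("maximum likelihood estimate:", p_hat)
print("likelihood at the estimate: ", round(binomial_likelihood(p_hat), 3))
```

A grid search like this only works for one or two parameters; with more parameters one maximizes the log likelihood with calculus or a numerical optimizer.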
  • 76.  Linear Regression (R-squared)  Logistic Regression (pseudo R-squared)
  • 77. You can “prove” anything with graphics. Roadmap: Data Collection, Data Preprocessing, Data Analysis, Data Visualization, Data Preservation
  • 81.  Be careful when drawing conclusions from graphs  Size of effect shown in graphic != size of effect in sample data != size of the effect in the true population  Scale distortion (e.g., bar charts not starting at zero)  Snapshot  …
  • 83.  GESIS Data Archives & Data Centers  Preserve research data and make them accessible for reuse.  Competencies and infrastructure ▪ e.g. https://datorium.gesis.org/xmlui/  CESSDA:  umbrella organisation for the European national data archives (http://www.cessda.net/)  Re3data  browse data archives by topic: http://www.re3data.org/  DPC Digital Preservation Handbook: http://www.dpconline.org/advice/preservationhandbook
  • 84.  Legal and regulatory framework  including open access and licenses  Incentives to share data  Credentials? Citation principles under development (see e.g. http://www.datacite.org/).  Long term preservation strategies  software and hardware changes, documentation, metadata and retrieval/access  Data preservation starts at an individual level  Reasons for data loss are often found at an individual level, e.g. broken hardware, researchers leaving a group.
  • 86.  Vasant Dhar, Data Science and Prediction, Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73  Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)  Jeffrey Stanton, Introduction to Data Science (free download)  Steffen Staab, Data Science Course, University Koblenz-Landau, https://www.uni-koblenz-landau.de/campus-koblenz/fb4/west/teaching/ss14/data-science/data-science1  Thom Baguley, Serious Stats