Frontiers of
Computational Journalism
Columbia Journalism School
Week 1: High Dimensional Data
September 12, 2018
“Stories will emerge from stacks of financial disclosure forms,
court records, legislative hearings, officials' calendars or meeting
notes, and regulators' email messages that no one today has
time or money to mine. With a suite of reporting tools, a
journalist will be able to scan, transcribe, analyze, and visualize
the patterns in these documents.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011
Computational Journalism: Definitions
Surgeon Scorecard, ProPublica 2015
Computational Journalism: Definitions
“Broadly defined, it can involve changing how stories are
discovered, presented, aggregated, monetized, and archived.
Computation can advance journalism by drawing on innovations
in topic detection, video analysis, personalization, aggregation,
visualization, and sensemaking.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011
Journalism & Technology: Big Data, Personalization & Automation
Shailesh Prakash, The Washington Post
Kony 2012 early network, Gilad Lotan
We are now living in a world where algorithms, and the data that feed
them, adjudicate a large array of decisions in our lives: not just search
engines and personalized online news systems, but educational
evaluations, the operation of markets and political campaigns, the design
of urban public spaces, and even how social services like welfare and
public safety are managed.
…
Journalists are beginning to adapt their traditional watchdogging and
accountability functions to this new wellspring of power in society. They are
investigating algorithms in order to characterize their power and delineate
their mistakes and biases.
- Nick Diakopoulos, Algorithmic Accountability, 2015
Computational Journalism: Definitions
Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
“Computational Journalism” in this course
Reporting on society, using computation
and
Reporting on computation in society
Natural Language Processing
Visualization
Sociology
Artificial Intelligence
Cognitive Science
Statistics
Graph Theory
Text Analysis
Filter Design
Inference
Algorithmic Accountability
Network Analysis
Disinformation
Privacy and Security
Information Retrieval
Epistemology
Course Topics
Administration
Assignments
Some assignments require programming, but your writing counts for more than your code!
Final project
Code, story, or research
Course blog
http://compjournalism.com
Grading
40% assignments
40% final project
20% class participation
This class
• Introduction
• High dimensional data
• Text analysis in journalism
• The Document Vector Space model
• The Overview document mining platform
High dimensional data
Vector representation of objects
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$
Fundamental representation for (almost) all data mining, clustering,
machine learning, visualization, NLP, etc. algorithms.
Interpreting High Dimensional Data
UK House of Lords voting record, 2000-2012.
N = 1043 lords by M = 1630 votes
2 = aye, 4 = nay, -9 = didn't vote
Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of
dimensions. The screen is two-dimensional.
We have to go from
$x \in \mathbb{R}^N$
to much lower dimensional points
$y \in \mathbb{R}^K$, with $K \ll N$
Probably K=2 or K=3.
Projection
Projection from 3 to 2 dimensions
Which direction should we look from?
Principal components analysis: find a linear projection that
preserves greatest variance
Take first K eigenvectors of covariance matrix corresponding to
largest eigenvalues. This gives a K-dimensional sub-space for
projection.
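A minimal sketch of the eigenvector recipe above, in Python with NumPy. The function name `pca_project` and the random matrix `X` are illustrative assumptions; this is not the course notebook's code, and real voting data would replace the made-up array.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X (objects x dimensions) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each dimension
    cov = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric input, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    P = eigvecs[:, order].T                 # K x N projection matrix
    return Xc @ P.T, eigvals[order]

# Made-up data: 100 objects in 5 dimensions with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Y, variances = pca_project(X, k=2)
print(Y.shape)  # (100, 2)
```

Each entry of `variances` is the variance of the data along the corresponding projected axis, which is the sense in which the projection "preserves greatest variance."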
PCA on House of Lords data
UK House of Lords PCA notebook
Classification and Clustering
Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the foundation
not only for conceptualization, language, and speech, but
also for mathematics, statistics, and data analysis in
general.
Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to
Classification Techniques
Distance metric
d(x, y) ≥ 0
- distance is never negative
d(x, x) = 0
- “reflexivity”: zero distance to self
d(x, y) = d(y, x)
- “symmetry”: x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z)
- “triangle inequality”: going direct is never longer
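The four properties can be checked numerically on a sample of points. The sketch below is an illustration, not a proof: the helper name `satisfies_metric_axioms` and the test points are assumptions made for this example. It confirms the properties for Euclidean distance and shows that squared Euclidean distance, a common near-miss, breaks the triangle inequality.

```python
import itertools
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def satisfies_metric_axioms(d, points, tol=1e-9):
    """Check the four properties on every pair and triple drawn from `points`."""
    for x in points:
        if abs(d(x, x)) > tol:                    # zero distance to self
            return False
    for x, y in itertools.combinations(points, 2):
        if d(x, y) < 0:                           # never negative
            return False
        if abs(d(x, y) - d(y, x)) > tol:          # symmetry
            return False
    for x, y, z in itertools.permutations(points, 3):
        if d(x, z) > d(x, y) + d(y, z) + tol:     # triangle inequality
            return False
    return True

rng = np.random.default_rng(1)
points = [rng.normal(size=4) for _ in range(8)]
ok_euclid = satisfies_metric_axioms(euclidean, points)

# Squared distance on the points 0, 1, 2: d(0, 2) = 4 but d(0, 1) + d(1, 2) = 2
line = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
ok_squared = satisfies_metric_axioms(lambda x, y: euclidean(x, y) ** 2, line)
print(ok_euclid, ok_squared)  # True False
```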
Distance matrix
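Data matrix for M objects of N dimensions
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{M,1} & x_{M,2} & \cdots & x_{M,N} \end{bmatrix}$$
Distance matrix
$$D_{ij} = D_{ji} = d(x_i, x_j), \qquad D = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M,1} & d_{M,2} & \cdots & d_{M,M} \end{bmatrix}$$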
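As a small illustration of going from a data matrix to a distance matrix, the sketch below computes all pairwise Euclidean distances with NumPy broadcasting. The choice of Euclidean distance and the three sample points are assumptions for the example; any function satisfying the distance properties could be substituted.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X (M objects x N dimensions)."""
    diff = X[:, None, :] - X[None, :, :]      # M x M x N array of coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))  # M x M matrix with D[i, j] = d(x_i, x_j)

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = distance_matrix(X)
print(D)
```

The result is symmetric with zeros on the diagonal, as the distance properties require: here the off-diagonal distances are 5, 10 and 5.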
Different clustering algorithms
• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means
o Also LDA and many Bayesian models, from a certain perspective
• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
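To make "keep adjusting clusters until convergence" concrete, here is a minimal K-means sketch using the standard assign-then-update iteration. The function name, the random initialization, and the two synthetic blobs are assumptions for illustration; library implementations add better initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random starting centers
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # nothing moved: converged
            break
        centers = new_centers
    return labels, centers

# Made-up data: two blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```

On well-separated data like this the two groups are usually recovered, but the outcome depends on the random starting centers, so different seeds can give different clusterings.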
K-means demo
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
UK House of Lords voting clusters
Algorithm instructed to separate Lords into five clusters. Output:
1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
.
.
Voting clusters with parties
LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
No unique “right” clustering
Different distance metrics and clustering algorithms give different
results.
Should we sort incident reports by location, time, actor, event type,
author, cost, casualties…?
There is only context-specific categorization.
And the computer doesn’t understand your context.
Different libraries,
different categories
Clustering Algorithm
Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.
Visualization
Input: data points (feature vectors).
Output: a picture of the points.
Linear projections (like PCA)
Projects in a straight line to the
closest point on the “screen.”
y = Px
where P is a K by N matrix.
Projection from 2 to 1 dimensions
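A tiny worked instance of y = Px for the 2-to-1 case. It assumes the rows of P are orthonormal, which is what makes the result the closest point on the "screen"; the direction vector and the sample point are made up for the example.

```python
import numpy as np

# The 1-D "screen" lies along this unit vector; P is K x N with K = 1, N = 2
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
P = direction[None, :]            # shape (1, 2)

x = np.array([3.0, 1.0])
y = P @ x                         # coordinate of x along the screen
closest = P.T @ y                 # the same point expressed back in 2-D
print(y, closest)
```

Here `closest` is the point (2, 2), and the leftover `x - closest` is perpendicular to the screen direction, which is the "straight line" the projection follows.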
Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now
y = f(x)
for some function f(), not linear. So, may not preserve relative
distances, angles, etc.
Fish-eye projection from 3 to 2 dimensions
Multidimensional scaling
Idea: try to preserve distances between points "as much as possible."
If we have the distances between all points in a distance matrix,
$D_{ij} = \lVert x_i - x_j \rVert$ for all i, j,
we can recover the original {x_i} coordinates exactly (up to rigid
transformations). Like working out a country map if you know how far away
each city is from every other.
But notice that the original dimension is not encoded in the matrix… we
can re-project to any number of dimensions.
Multidimensional scaling
Torgerson's "classical MDS" algorithm (1952)
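The slide's formula did not survive extraction, so the sketch below uses the textbook double-centering formulation of classical MDS rather than a transcription of the slide. The function name and the unit-square test data are assumptions for illustration.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions from an M x M matrix of pairwise distances."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale          # M x k coordinates

# Four corners of a unit square, known only through their distances
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Y = classical_mds(D, k=2)
D_recovered = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
```

The recovered coordinates `Y` may be rotated or reflected relative to `X`, but `D_recovered` matches `D`, which illustrates recovery "up to rigid transformations."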
MDS Stress minimization
The formula actually minimizes “stress”
Think of “springs” between every pair of points. The spring between x_i and x_j
has rest length d_ij.
Stress is zero if all high-dimensional distances are matched exactly in low
dimension.
$$\mathrm{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2$$
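The stress sum is short enough to compute directly. In this sketch the function name and the three collinear points are assumptions for illustration: one layout matches the target distances exactly, the other squashes them by half.

```python
import numpy as np

def stress(Y, D):
    """Sum over all ordered pairs (i, j) of (|y_i - y_j| - d_ij)^2."""
    low = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return float(((low - D) ** 2).sum())

# Target distances: three points on a line, spaced 1 apart
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
exact = np.array([[0.0], [1.0], [2.0]])
squashed = np.array([[0.0], [0.5], [1.0]])
print(stress(exact, D), stress(squashed, D))  # 0.0 3.0
```

A stress-minimizing MDS would move the low-dimensional points, for example by gradient descent, until this number stops decreasing.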
Multi-dimensional Scaling
Like "flattening" a stretchy structure into 2D, so that distances
between points are preserved ("as much as possible").
House of Lords MDS plot
Robustness of results
Regarding these analyses of legislative voting, we could still ask:
• Are we modeling the right thing? (What about other legislative
work, e.g. in committee?)
• Are our underlying assumptions correct? (do representatives
really have “ideal points” in a preference space?)
• What are we trying to argue? What will be the effect of pointing
out this result?
Text Analysis in Journalism
Count incident types by date. For Level 14, ProPublica, 2015
The Child Exchange, Reuters, 2014
USA Today/Twitter Political Issues Index
Politico analysis of GOP primary, 2012
The Post obtained draft versions of 12 audits by the inspector general’s office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general’s office and interviewed former and current employees.
E-mails and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
Sentiment analysis used by Washington Post, 2014
LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years
Los Angeles Times
The Times analyzed Los Angeles Police Department violent crime
data from 2005 to 2012. Our analysis found that the Los Angeles
Police Department misclassified an estimated 14,000 serious assaults
as minor offenses, artificially lowering the city’s crime levels. To
conduct the analysis, The Times used an algorithm that combined
two machine learning classifiers. Each classifier read in a brief
description of the crime, which it used to determine if it was a minor
or serious assault.
An example of a minor assault reads: "VICTS AND SUSPS BECAME
INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS
IN THE FACE.”
We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful getting cases before the
court were far more likely to represent the
employers rather than the employees.
The Echo Chamber, Reuters
Document Vector Space Model
Message Machine clusters emails
Using TF-IDF document vectors
Document vectors in journalism
- Text clustering for stories, e.g. Message Machine
- Find “key words” or “most important words”
- Topic analysis, e.g. ProPublica’s legislative tracker
- Key component of filtering algorithms, e.g. Google News
- Standard representation for document classification.
- Basis of all text search engines.
A text analysis building block.
What is this document "about"?
The most commonly occurring words are a pretty good indicator.
30 the
23 to
19 and
19 a
18 animal
17 cruelty
15 of
15 crimes
14 in
14 for
11 that
8 crime
7 we
Features = words works fine
Encode each document as the list of words it contains.
Dimensions = vocabulary of document set.
Value on each dimension = # of times word appears in document
Example
D1 = “I like databases”
D2 = “I hate hate databases”
Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = “term frequency”
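The two example documents can be encoded in a few lines. This sketch assumes a naive lowercase-and-split tokenizer and sorts the vocabulary alphabetically to fix the column order; both choices are for illustration only.

```python
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# Count words per document (lowercase, split on whitespace)
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word; entries are tf(t, d)
matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)             # ['databases', 'hate', 'i', 'like']
for name, row in matrix.items():
    print(name, row)          # D1 [1, 0, 1, 1] then D2 [1, 2, 1, 0]
```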
Aka “Bag of words” model
Throws out word order.
e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded
identically.
Tokenization
The documents come to us as long strings, not individual words.
Tokenization is the process of converting the string into individual
words, or "tokens."
For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces
Note that this won't work at all for Chinese. It will fail in many
ways even for English. How?
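The three-step strategy above can be sketched with the Python standard library. The function name and the sample sentence are assumptions for illustration; the sentence was picked so that a couple of the English failure cases show up in the output.

```python
import string

def tokenize(text):
    text = text.lower()                                               # lowercase everything
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop ASCII punctuation
    return text.split()                                               # split on whitespace

tokens = tokenize("The U.S. didn't sign; state-of-the-art talks failed.")
print(tokens)
# ['the', 'us', 'didnt', 'sign', 'stateoftheart', 'talks', 'failed']
```

Note how "U.S." collapses into the pronoun "us" and the hyphenated phrase fuses into a single token. `string.punctuation` also covers only ASCII characters, so curly quotes and similar marks would pass through untouched.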
Distance metric for text
Useful for:
• clustering documents
• finding docs similar to example
• matching a search query
Basic idea: look for overlapping terms
Cosine similarity
Given document vectors a, b define
$$\mathrm{similarity}(a,b) \equiv a \cdot b$$
If each word occurs at most once in each document, equivalent
to counting overlapping words.
Note: not a distance function, as similarity increases when
documents are… similar. (What part of the definition of a
distance function is violated here?)
Problem: long documents always win
Let a = “This car runs fast.”
Let b = “My car is old. I want a new car, a shiny car”
Let query = “fast car”
   this  car   runs  fast  my    is    old   I     want  a     new   shiny
a  1     1     1     1     0     0     0     0     0     0     0     0
b  0     3     0     0     1     1     1     1     1     1     1     1
q  0     1     0     1     0     0     0     0     0     0     0     0
similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3
Longer document more “similar”, by virtue of repeating words.
Normalize document vectors
$$\mathrm{similarity}(a,b) \equiv \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \cos(\theta)$$
returns result in [0,1]
Normalized query example
   this  car   runs  fast  my    is    old   I     want  a     new   shiny
a  1     1     1     1     0     0     0     0     0     0     0     0
b  0     3     0     0     1     1     1     1     1     1     1     1
q  0     1     0     1     0     0     0     0     0     0     0     0
$$\mathrm{similarity}(a,q) = \frac{2}{\sqrt{4}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707$$
$$\mathrm{similarity}(b,q) = \frac{3}{\sqrt{17}\,\sqrt{2}} \approx 0.514$$
Cosine similarity
$$\cos\theta = \mathrm{similarity}(a,b) \equiv \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Cosine distance (finally)
$$\mathrm{dist}(a,b) \equiv 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
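The "fast car" query example can be reproduced in a few lines of NumPy. The count vectors are copied from the table as given on the slides; the function names are assumptions for illustration.

```python
import numpy as np

# Columns: this car runs fast my is old I want a new shiny
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

print(a @ q, b @ q)   # 2.0 3.0  (raw dot product: the longer document wins)
print(round(cosine_similarity(a, q), 3), round(cosine_similarity(b, q), 3))  # 0.707 0.514
```

After normalization the short document `a` is the better match for the query, and `cosine_distance` turns that into a smaller-is-closer number.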
Problem: common words
We want to look at words that “discriminate” among documents.
Stopwords: if all documents contain “the,” are all documents
similar?
Common words: if most documents contain “car” then car
doesn’t tell us much about (contextual) similarity.
Context matters
[Figure: two document collections, Car Reviews and General News; markers distinguish documents that contain “car” from those that do not]
Document Frequency
Idea: de-weight common words
Common = appears in many documents
“document frequency” = fraction of docs containing term
$$\mathrm{df}(t,D) = \frac{|\{d \in D : t \in d\}|}{|D|}$$
Inverse Document Frequency
Invert (so more common = smaller weight) and take log
$$\mathrm{idf}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$
TF-IDF
Multiply term frequency by inverse document frequency
n(t,d) = number of times term t appears in doc d
n(t,D) = number of docs in D containing t
$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D) = n(t,d) \times \log\frac{|D|}{n(t,D)}$$
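A direct transcription of this formula into Python, using raw counts for tf and the unsmoothed log for idf. The function name and the three toy documents are assumptions for illustration; production libraries typically use smoothed or normalized variants.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()                       # n(t, D): number of docs containing t
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        term_counts = Counter(tokens)          # n(t, d): count of t in this doc
        vectors.append({t: n * math.log(n_docs / doc_freq[t])
                        for t, n in term_counts.items()})
    return vectors

# Made-up toy corpus
docs = [
    "the car runs fast".split(),
    "the car is old".split(),
    "the dog runs".split(),
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document "the" scores exactly 0 because it appears in every document, while "fast" gets the highest score, log 3, because no other document contains it.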
TF-IDF depends on entire corpus
The TF-IDF vector for a document changes if we add
another document to the corpus.
TF-IDF is sensitive to context. The context is all other documents.
$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)$$
If we add a document, D changes!
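The corpus dependence is easy to see with the idf term alone. In this sketch the helper `idf` and the tiny token lists are assumptions for illustration; the term must occur in at least one document, otherwise the division fails.

```python
import math

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise division by zero)
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing)

corpus = [["car", "fast"], ["car", "old"]]
before = idf("fast", corpus)          # log(2 / 1)
corpus.append(["train", "fast"])      # add one more document
after = idf("fast", corpus)           # log(3 / 2)
print(before, after)
```

The weight of "fast" drops once a second document containing it joins the corpus, so every TF-IDF vector that includes "fast" changes, including those of documents that were not touched.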
What is this document "about"?
Each document is now a vector of TF-IDF scores for every word in the
document. We can look at which words have the top scores.
crimes 0.0675591652263963
cruelty 0.0585772393867342
crime 0.0257614113616027
reporting 0.0208838148975406
animals 0.0179258756717422
michael 0.0156575858658684
category 0.0154564813388897
commit 0.0137447439653709
criminal 0.0134312894429112
societal 0.0124164973052386
trends 0.0119505837811614
conviction 0.0115699047136248
patterns 0.011248045148093
Salton’s description of tf-idf
- from Salton et al, A Vector Space Model for Automatic Indexing, 1975
TF
nj-sentator-menendez corpus, Overview sample files
color = human tags generated from TF-IDF clusters
TF-IDF
Cluster Hypothesis
“documents in the same cluster behave similarly with respect to
relevance to information needs”
- Manning, Raghavan, Schütze, Introduction to Information Retrieval
Not really a precise statement – but the crucial link between human
semantics and mathematical properties.
Articulated as early as 1971, has been shown to hold at web scale,
widely assumed.
Bag of words + TF-IDF widely used
Practical win: good precision-recall metrics in tests with human-tagged
document sets.
Still the dominant text indexing scheme used today. (Lucene, FAST,
Google…) Many variants and extensions.
Some, but not much, theory to explain why this works. (E.g. why that
particular IDF formula? why doesn’t indexing bigrams improve
performance?)
Collectively: the vector space document model

 
Frontiers of Computational Journalism week 6 - Quantitative Fairness
Frontiers of Computational Journalism week 6 - Quantitative FairnessFrontiers of Computational Journalism week 6 - Quantitative Fairness
Frontiers of Computational Journalism week 6 - Quantitative Fairness
 
Frontiers of Computational Journalism week 5 - Algorithmic Accountability and...
Frontiers of Computational Journalism week 5 - Algorithmic Accountability and...Frontiers of Computational Journalism week 5 - Algorithmic Accountability and...
Frontiers of Computational Journalism week 5 - Algorithmic Accountability and...
 
Frontiers of Computational Journalism - Final project suggestions
Frontiers of Computational Journalism - Final project suggestionsFrontiers of Computational Journalism - Final project suggestions
Frontiers of Computational Journalism - Final project suggestions
 
Frontiers of Computational Journalism week 4 - Statistical Inference
Frontiers of Computational Journalism week 4 - Statistical InferenceFrontiers of Computational Journalism week 4 - Statistical Inference
Frontiers of Computational Journalism week 4 - Statistical Inference
 
Frontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter DesignFrontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter Design
 

Dernier

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 

Dernier (20)

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Frontiers of Computational Journalism week 1 - Introduction and High Dimensional Data

  • 1. Frontiers of Computational Journalism Columbia Journalism School Week 1: High Dimensional Data September 12, 2018
  • 2. “Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.” - Cohen, Hamilton, Turner, Computational Journalism, 2011 Computational Journalism: Definitions
  • 5. Computational Journalism: Definitions “Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking.” - Cohen, Hamilton, Turner, Computational Journalism, 2011
  • 6. Journalism & Technology: Big Data, Personalization & Automation Shailesh Prakash, The Washington Post
  • 7. Kony 2012 early network, Gilad Lotan
  • 8. We are now living in a world where algorithms, and the data that feed them, adjudicate a large array of decisions in our lives: not just search engines and personalized online news systems, but educational evaluations, the operation of markets and political campaigns, the design of urban public spaces, and even how social services like welfare and public safety are managed. … Journalists are beginning to adapt their traditional watchdogging and accountability functions to this new wellspring of power in society. They are investigating algorithms in order to characterize their power and delineate their mistakes and biases. - Nick Diakopoulos, Algorithmic Accountability, 2015 Computational Journalism: Definitions
  • 9. Websites Vary Prices, Deals Based on Users' Information Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
  • 10. Message Machine Jeff Larson, Al Shaw, ProPublica, 2012
  • 11. “Computational Journalism” in this course Reporting on society, using computation and Reporting on computation in society
  • 12. Natural Language Processing Visualization Sociology Artificial Intelligence Cognitive Science Statistics Graph Theory Text Analysis Filter Design Inference Algorithmic Accountability Network Analysis Disinformation Privacy and Security Information Retrieval Epistemology Course Topics
  • 13. Administration. Assignments: some assignments require programming, but your writing counts for more than your code! Final project: code, story, or research. Course blog: http://compjournalism.com Grading: 40% assignments, 40% final project, 20% class participation.
  • 14. This class • Introduction • High dimensional data • Text analysis in journalism • The Document Vector Space model • The Overview document mining platform
  • 16. Vector representation of objects $x = [x_1, x_2, x_3, \ldots, x_N]^T$ Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
  • 17. Interpreting High Dimensional Data UK House of Lords voting record, 2000-2012. N = 1043 lords by M = 1630 votes 2 = aye, 4 = nay, -9 = didn't vote
  • 18. Dimensionality reduction Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional. We have to go from $x \in \mathbb{R}^N$ to much lower dimensional points $y \in \mathbb{R}^K$ with $K \ll N$. Probably K=2 or K=3.
  • 19. Projection Projection from 3 to 2 dimensions
  • 20. Which direction should we look from? Principal components analysis: find a linear projection that preserves greatest variance Take first K eigenvectors of covariance matrix corresponding to largest eigenvalues. This gives a K-dimensional sub-space for projection.
  • 21. PCA on House of Lords data
  • 22. UK House of Lords PCA notebook
  • 23. Classification and Clustering Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general. Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques
  • 24. Distance metric d(x, y) ≥ 0 - distance is never negative d(x, x) = 0 - “reflexivity”: zero distance to self d(x, y) = d(y, x) - “symmetry”: x to y same as y to x d(x, z) ≤ d(x, y) + d(y, z) - “triangle inequality”: going direct is shorter
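  • 25. Distance matrix Data matrix for M objects of N dimensions: $X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{M,1} & x_{M,2} & \cdots & x_{M,N} \end{bmatrix}$ Distance matrix: $D_{ij} = D_{ji} = d(x_i, x_j)$, that is $D = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M,1} & d_{M,2} & \cdots & d_{M,M} \end{bmatrix}$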
  • 26. Different clustering algorithms • Partitioning: keep adjusting clusters until convergence, e.g. K-means; also LDA and many Bayesian models, from a certain perspective • Agglomerative hierarchical: start with leaves, repeatedly merge clusters, e.g. MIN and MAX approaches • Divisive hierarchical: start with root, repeatedly split clusters, e.g. binary split
  • 28. UK House of Lords voting clusters Algorithm instructed to separate lords into five clusters. Output: 1 2 2 1 3 2 2 2 1 4 1 1 1 1 1 1 5 2 1 1 2 2 1 2 3 2 2 4 2 1 2 3 2 1 3 1 1 2 1 2 1 5 2 1 4 2 2 1 2 1 1 4 1 1 4 1 2 2 1 5 1 1 1 2 3 3 2 2 2 5 2 3 1 2 1 4 1 1 4 4 1 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 3 2 1 1 2 2 1 2 3 4 2 2 2 ...
  • 29. Voting clusters with parties (each row of party labels is followed by the cluster numbers for the same ten lords)
      LDem XB   Lab  LDem XB   Lab  XB   Lab  Con  XB
      1    2    2    1    3    2    2    2    1    4
      Con  Con  LDem Con  Con  Con  LDem Lab  Con  LDem
      1    1    1    1    1    1    5    2    1    1
      Lab  Lab  Con  Lab  XB   XB   Lab  XB   Lab  Con
      2    2    1    2    3    2    2    4    2    1
      Lab  XB   Lab  Con  XB   XB   LDem Lab  XB   Lab
      2    3    2    1    3    1    1    2    1    2
      Con  Con  Lab  Con  XB   Lab  Lab  Con  XB   XB
      1    5    2    1    4    2    2    1    2    1
      Con  XB   Con  Con  XB   Con  Lab  XB   LDem Con
      1    4    1    1    4    1    2    2    1    5
      Con  Con  Con  Lab  Bp   XB   Lab  Lab  Lab  LDem
      1    1    1    2    3    3    2    2    2    5
      Lab  XB   Con  Lab  Con  XB   Con  Con  XB   XB
      2    3    1    2    1    4    1    1    4    4
      Con  Con  Lab  Con  Con  XB   Lab  Lab  Lab  Con
      1    1    2    1    1    2    2    2    2    1
      Lab  LDem Lab  Con  Lab  Lab  Con  XB   Lab  Con
      2    1    2    1    2    2    1    3    2    1
      Con  Lab  XB   Con  XB   XB   XB   Lab  Lab  Lab
      1    2    2    1    2    3    4    2    2    2
  • 30. No unique “right” clustering Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn’t understand your context.
  • 32. Clustering Algorithm Input: data points (feature vectors). Output: a set of clusters, each of which is a set of points. Visualization Input: data points (feature vectors). Output: a picture of the points.
  • 33. Linear projections (like PCA) Projects in a straight line to closest point on “screen.” y = Px where P is a K by N matrix. Projection from 2 to 1 dimensions
  • 34. Nonlinear projections Still going from high-dimensional x to low-dimensional y, but now y = f(x) for some function f(), not linear. So, may not preserve relative distances, angles, etc. Fish-eye projection from 3 to 2 dimensions
  • 35. Multidimensional scaling Idea: try to preserve distances between points "as much as possible." If we have the distances between all points in a distance matrix, $D_{ij} = \|x_i - x_j\|$ for all i, j, we can recover the original $\{x_i\}$ coordinates exactly (up to rigid transformations). Like working out a country map if you know how far away each city is from every other. But notice that the original dimension is not encoded in the matrix… we can re-project to any number of dimensions.
  • 37. MDS Stress minimization The formula actually minimizes “stress”: $\mathrm{stress}(x) = \sum_{i,j} \left( \|x_i - x_j\| - d_{ij} \right)^2$ Think of “springs” between every pair of points. Spring between $x_i$, $x_j$ has rest length $d_{ij}$. Stress is zero if all high-dimensional distances are matched exactly in low dimension.
  • 38. Multi-dimensional Scaling Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible)
  • 39. House of Lords MDS plot
  • 40. Robustness of results Regarding these analyses of legislative voting, we could still ask: • Are we modeling the right thing? (What about other legislative work, e.g. in committee?) • Are our underlying assumptions correct? (do representatives really have “ideal points” in a preference space?) • What are we trying to argue? What will be the effect of pointing out this result?
  • 41. Text Analysis in Journalism
  • 42. Count incident types by date. For Level 14, ProPublica, 2015
  • 43. The Child Exchange, Reuters, 2014
  • 45. Politico analysis of GOP primary, 2012
  • 46. The Post obtained draft versions of 12 audits by the inspector general’s office, covering projects from the Caribbean to Pakistan to the Republic of Georgia between 2011 and 2013. The drafts are confidential and rarely become public. The Post compared the drafts with the final reports published by the inspector general’s office and interviewed former and current employees. E-mails and other internal records also were reviewed. The Post tracked changes in the language that auditors used to describe USAID and its mission offices. The analysis found that more than 400 negative references were removed from the audits between the draft and final versions. Sentiment analysis used by Washington Post, 2014
  • 47. LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years Los Angeles Times
  • 48. The Times analyzed Los Angeles Police Department violent crime data from 2005 to 2012. Our analysis found that the Los Angeles Police Department misclassified an estimated 14,000 serious assaults as minor offenses, artificially lowering the city’s crime levels. To conduct the analysis, The Times used an algorithm that combined two machine learning classifiers. Each classifier read in a brief description of the crime, which it used to determine if it was a minor or serious assault. An example of a minor assault reads: "VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE.”
  • 49. We used a machine-learning method known as latent Dirichlet allocation to identify the topics in all 14,400 petitions and to then categorize the briefs. This enabled us to identify which lawyers did which kind of work for which sorts of petitioners. For example, in cases where workers sue their employers, the lawyers most successful getting cases before the court were far more likely to represent the employers rather than the employees. The Echo Chamber, Reuters
  • 51. Message Machine clusters emails Using TF-IDF document vectors
  • 52. Document vectors in journalism - Text clustering for stories, e.g. Message Machine - Find “key words” or “most important words” - Topic analysis, e.g. ProPublica’s legislative tracker - Key component of filtering algorithms, e.g. Google News - Standard representation for document classification. - Basis of all text search engines. A text analysis building block.
  • 53. What is this document "about"? The most commonly occurring words are a pretty good indicator. Word counts: 30 the, 23 to, 19 and, 19 a, 18 animal, 17 cruelty, 15 of, 15 crimes, 14 in, 14 for, 11 that, 8 crime, 7 we
  • 55. Features = words works fine Encode each document as the list of words it contains. Dimensions = vocabulary of document set. Value on each dimension = # of times word appears in document
  • 56. Example D1 = “I like databases” D2 = “I hate hate databases” Each row = document vector All rows = term-document matrix Individual entry = tf(t,d) = “term frequency”
  • 57. Aka “Bag of words” model Throws out word order. e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded identically.
  • 58. Tokenization The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens." For this course, we will assume a very simple strategy: (1) convert all letters to lowercase, (2) remove all punctuation characters, (3) separate words based on spaces. Note that this won't work at all for Chinese. It will fail in many ways even for English. How?
  • 59. Distance metric for text Useful for: • clustering documents • finding docs similar to example • matching a search query Basic idea: look for overlapping terms
  • 60. Cosine similarity Given document vectors a, b define $\mathrm{similarity}(a,b) \equiv a \cdot b$ If each word occurs exactly once in each document, equivalent to counting overlapping words. Note: not a distance function, as similarity increases when documents are… similar. (What part of the definition of a distance function is violated here?)
  • 61. Problem: long documents always win Let a = “This car runs fast.” Let b = “My car is old. I want a new car, a shiny car” Let query = “fast car”
          this car runs fast my is old I want a new shiny
      a   1    1   1    1    0  0  0   0 0    0 0   0
      b   0    3   0    0    1  1  1   1 1    1 1   1
      q   0    1   0    1    0  0  0   0 0    0 0   0
  • 62. Problem: long documents always win similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2; similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3. Longer document more “similar”, by virtue of repeating words.
  • 63. Normalize document vectors $\mathrm{similarity}(a,b) \equiv \frac{a \cdot b}{\|a\| \, \|b\|} = \cos(\theta)$ returns result in [0,1]
  • 64. Normalized query example
          this car runs fast my is old I want a new shiny
      a   1    1   1    1    0  0  0   0 0    0 0   0
      b   0    3   0    0    1  1  1   1 1    1 1   1
      q   0    1   0    1    0  0  0   0 0    0 0   0
      $\mathrm{similarity}(a,q) = \frac{2}{\sqrt{4}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707$
      $\mathrm{similarity}(b,q) = \frac{3}{\sqrt{17}\,\sqrt{2}} \approx 0.514$
  • 65. Cosine similarity $\cos\theta = \mathrm{similarity}(a,b) \equiv \frac{a \cdot b}{\|a\| \, \|b\|}$
  • 67. Problem: common words We want to look at words that “discriminate” among documents. Stopwords: if all documents contain “the,” are all documents similar? Common words: if most documents contain “car” then car doesn’t tell us much about (contextual) similarity.
  • 68. Context matters [Figure: two document collections, Car Reviews and General News; the legend marks documents that contain “car” and documents that do not contain “car”]
  • 69. Document Frequency Idea: de-weight common words. Common = appears in many documents. “Document frequency” = fraction of docs containing term: $\mathrm{df}(t,D) = \frac{|\{d \in D : t \in d\}|}{|D|}$
  • 70. Inverse Document Frequency Invert (so more common = smaller weight) and take log: $\mathrm{idf}(t,D) = \log\left( \frac{|D|}{|\{d \in D : t \in d\}|} \right)$
  • 71. TF-IDF Multiply term frequency by inverse document frequency. n(t,d) = number of times term t appears in doc d; n(t,D) = number of docs in D containing t. $\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D) = n(t,d) \times \log\left( \frac{|D|}{n(t,D)} \right)$
  • 72. TF-IDF depends on entire corpus The TF-IDF vector for a document changes if we add another document to the corpus. $\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)$: if we add a document, D changes! TF-IDF is sensitive to context. The context is all other documents.
  • 73. What is this document "about"? Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores. crimes 0.0675591652263963 cruelty 0.0585772393867342 crime 0.0257614113616027 reporting 0.0208838148975406 animals 0.0179258756717422 michael 0.0156575858658684 category 0.0154564813388897 commit 0.0137447439653709 criminal 0.0134312894429112 societal 0.0124164973052386 trends 0.0119505837811614 conviction 0.0115699047136248 patterns 0.011248045148093
  • 76. Salton’s description of tf-idf - from Salton et al, A Vector Space Model for Automatic Indexing, 1975
  • 77. [Figure: two panels, TF and TF-IDF] nj-sentator-menendez corpus, Overview sample files; color = human tags generated from TF-IDF clusters
  • 78. Cluster Hypothesis “documents in the same cluster behave similarly with respect to relevance to information needs” - Manning, Raghavan, Schütze, Introduction to Information Retrieval Not really a precise statement – but the crucial link between human semantics and mathematical properties. Articulated as early as 1971, has been shown to hold at web scale, widely assumed.
  • 79. Bag of words + TF-IDF widely used Practical win: good precision-recall metrics in tests with human-tagged document sets. Still the dominant text indexing scheme used today. (Lucene, FAST, Google…) Many variants and extensions. Some, but not much, theory to explain why this works. (E.g. why that particular IDF formula? why doesn’t indexing bigrams improve performance?) Collectively: the vector space document model
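
The PCA recipe on slide 20 (take the first K eigenvectors of the covariance matrix, ordered by largest eigenvalue, and project onto them) can be sketched with NumPy as follows. This is only an illustrative sketch, not the House of Lords notebook: the function name `pca_project` and the toy vote matrix are assumptions, and the votes are recoded as +1 / -1 / 0 rather than the 2 / 4 / -9 coding mentioned on slide 17.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the top-k principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)             # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)      # N x N covariance matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    P = eigvecs[:, order].T                   # K x N projection matrix, as in y = Px
    return centered @ P.T, eigvals[order]

# Hypothetical toy data: 6 voters x 4 votes (+1 = aye, -1 = nay, 0 = didn't vote)
votes = np.array([
    [ 1,  1, -1, -1],
    [ 1,  1, -1,  0],
    [ 1,  0, -1, -1],
    [-1, -1,  1,  1],
    [-1, -1,  1,  0],
    [-1,  0,  1,  1],
])

Y, variances = pca_project(votes, k=2)
print(Y.shape)   # one 2-D point per voter
```

On this toy matrix the first coordinate of `Y` puts the first three voters on one side of zero and the last three on the other, which is the kind of separation the projection is meant to make visible. The sign of each component is arbitrary, so which bloc lands on the positive side is not meaningful.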
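
The "long documents always win" example from slides 61 to 64 can be checked with a few lines of Python. The sketch below uses the count vectors exactly as they appear in the slide's table (which records the word "a" once for document b, although the sentence contains it twice), so the numbers agree with the slide. The helper names `dot` and `cosine_similarity` are illustrative assumptions.

```python
import math

def dot(u, v):
    """Plain dot product: the un-normalized similarity from slide 60."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the vector lengths (slide 63)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention for empty documents
    return dot(u, v) / (norm_u * norm_v)

# Columns: this car runs fast my is old I want a new shiny
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

raw_a, raw_b = dot(a, q), dot(b, q)                              # 2 and 3: b "wins"
cos_a, cos_b = cosine_similarity(a, q), cosine_similarity(b, q)  # a now ranks first

print(raw_a, raw_b)
print(round(cos_a, 3), round(cos_b, 3))
```

With the raw dot product the longer document b scores higher (3 versus 2); after normalization, a scores about 0.707 and b about 0.514, so the short document that actually contains both query words ranks first.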

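The TF-IDF weighting on slide 71 and the corpus dependence noted on slide 72 can be illustrated with the two example documents from slide 56. This is a minimal sketch under stated assumptions: the tokenizer follows the simple strategy of slide 58 but strips ASCII punctuation only, the logarithm is taken to be the natural log (the slides do not specify a base), and the names `tokenize` and `tfidf_vectors` as well as the third document are made up for illustration.

```python
import math
import string
from collections import Counter

def tokenize(text):
    """Lowercase, drop ASCII punctuation, split on whitespace (slide 58)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def tfidf_vectors(docs):
    """Return one {term: tf-idf} dict per document, using n(t,d) * log(|D| / n(t,D))."""
    counts = [Counter(tokenize(d)) for d in docs]    # n(t,d) for each document
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())                    # n(t,D): docs containing t
    n_docs = len(docs)
    return [
        {t: n * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        for c in counts
    ]

corpus = ["I like databases", "I hate hate databases"]
before = tfidf_vectors(corpus)

# Adding a (hypothetical) third document changes the weights of the first two.
after = tfidf_vectors(corpus + ["I like cats"])

print(before[0]["like"], after[0]["like"])
```

In the two-document corpus, "i" and "databases" appear everywhere and get weight zero, while "like" gets log 2. Once the third document is added, "like" is no longer distinctive and its weight in the first document drops to log 1.5, even though that document itself did not change.
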
Editor's notes

  1. To open: House of lords notebook and blog post http://www.compjournalism.com/?p=13 https://github.com/jstray/compjournalism2018/blob/master/uk-lords-votes.ipynb Rotating projection cube http://1.bp.blogspot.com/-pgMAHiIWvuw/Tql5HIXNdRI/AAAAAAAABLI/I2zPF5cLRwQ/s1600/clust.gif K-means demo https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ Overview prototype https://blog.overviewdocs.com/2012/03/16/video-document-mining-with-the-overview-prototype/
  2. http://ww2.kqed.org/futureofyou/wp-content/uploads/sites/13/2015/07/Screen-Shot-2015-07-15-at-1.05.15-PM.png
  3. https://youtu.be/PqMvxo89AQ4?t=17m17s
  4. http://1.bp.blogspot.com/-pgMAHiIWvuw/Tql5HIXNdRI/AAAAAAAABLI/I2zPF5cLRwQ/s1600/clust.gif
  5. http://localhost:8888/notebooks/compjournalism2018/uk-lords-votes.ipynb
  6. https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
  7. http://www.compjournalism.com/?p=13
  8. https://www.documentcloud.org/documents/1203111-2012-24-hour-ccl-reports-pbs-final-copy-redacted.html
  9. http://www.reuters.com/investigates/adoption/images/part1/charts.png
  10. http://www.gannett-cdn.com/experiments/usatoday/2014/10/election-topics/
  11. http://www.politico.com/story/2012/01/facebook-users-interest-in-primary-waning-071477
  12. https://www.washingtonpost.com/investigations/whistleblowers-say-usaids-ig-removed-critical-details-from-public-reports/2014/10/22/68fbc1a0-4031-11e4-b03f-de718edeb92f_story.html
  13. http://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html
  14. https://github.com/datadesk/lapd-crime-classification-analysis/blob/master/classifiers.ipynb
  15. http://www.reuters.com/investigates/special-report/scotus/
  16. https://www.menendez.senate.gov/news-and-events/press/on-day-of-michael-vicks-sentencing-legislation-introduced-in-us-senate-for-better-tracking-of-animal-cruelty-crimes