1
Jason S. Kessler | Data Day Texas, January 14, 2017
@jasonkessler
Scattertext: A Tool for
Visualizing Differences in
Language
2
Word frequency
• Women and men tend to use different terms on Facebook.
• As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the
presidential debate.
• Reveal differences in
• content,
• perceived strengths and weaknesses
• communication style
• These are often obvious after being surfaced
3
Outline
• Previous work
• Ways of visualizing word association
• Scattertext
• Open-source Python/D3 framework for visualizing these
differences
• Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell
cars.
• We’re hiring senior data scientists + devs in Austin and Seattle.
4
OKCupid: an online dating site
hobos
almond
butter
100 Years of
Solitude
Bikram yoga
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Which words
and phrases
statistically
distinguish
ethnic groups
and genders?
5
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
High distance: white men
ignore k-pop
Low distance: white men
disproportionately mention
Phish
The smaller the
distance from the top
left, the higher the
association with
white men.
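The distance-from-the-corner reading of this plot can be sketched in a few lines. This is a minimal illustration, not Rudder's actual method: it assumes each term is placed by its percentile frequency rank among everyone else (x axis) and within the group (y axis), and all counts below are hypothetical.

```python
import math


def percentile_ranks(counts):
    """Map each term to a rank in [0, 1]; 1.0 is the most frequent term."""
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda term: (counts[term], term))
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    return {term: i / (len(ordered) - 1) for i, term in enumerate(ordered)}


def distance_from_top_left(group_counts, other_counts):
    """Distance of each term from the corner (x=0, y=1).

    x is the term's rank among everyone else, y its rank within the group.
    A small distance means: common in the group, rare elsewhere.
    """
    vocab = set(group_counts) | set(other_counts)
    group = {t: group_counts.get(t, 0) for t in vocab}
    other = {t: other_counts.get(t, 0) for t in vocab}
    x = percentile_ranks(other)
    y = percentile_ranks(group)
    return {t: math.hypot(x[t], 1.0 - y[t]) for t in vocab}


# Hypothetical term counts, for illustration only.
group_counts = {"phish": 40, "k-pop": 1, "music": 50, "the": 100}
other_counts = {"phish": 2, "k-pop": 30, "music": 45, "the": 100}

distances = distance_from_top_left(group_counts, other_counts)
for term in sorted(distances, key=distances.get):
    print(f"{term:6s} {distances[term]:.3f}")
```

With these toy counts, "phish" lands closest to the top-left corner and "k-pop" farthest from it, mirroring the two annotations on the slide.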
6
Source: Christian Rudder. Dataclysm. 2014.
my blue eyes
7
Scattertext: Democrats vs Republicans: 2012 Convention Speeches
8
Word Use Reflecting Gender and Personality in Facebook
Statuses
• Objective:
• Find words, phrases, and topics that correlate to
• gender, and
• Big 5 personality type
• Data source:
• My Personality App
• 75k voluntary participants in a Facebook-based survey, >300 million
words
• Agreed to give researchers access to statuses.
• Scoring algorithm
• Linear regression weights, 2000 LDA topics.
Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary
Approach. PLoS ONE. 2013.
Lyle Ungar
2013 AAAI
Tutorial
9
Lyle Ungar
2013 AAAI
Tutorial
The good:
• Word clouds force
you to hunt for the
most impactful
terms
• You end up
examining the long
tail in the process
• Compactly
represent a lot of
phrases and topics
10
Lyle Ungar
2013 AAAI
Tutorial
The bad:
• “Mullets of the
Internet” --Jeffrey
Zeldman, 2005
• Longer phrases are
more prominent.
• Ranking is unclear
• Does size indicate
higher frequency?
11
Lyle Ungar
2013 AAAI
Tutorial
12
Lyle Ungar
2013 AAAI
Tutorial
13
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
NYT: 2012 Political Convention Word Use by Party
14
Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
15
Monroe, Burt L., Michael P. Colaresi, and
Kevin M. Quinn. "Fightin' words: Lexical
feature selection and evaluation for identifying
the content of political conflict." Political
Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
Log-odds for word w, categories A, B:
$$\log \frac{\#(w, A)}{|A| - \#(w, A)} - \log \frac{\#(w, B)}{|B| - \#(w, B)}$$
Log-odds w/ Dirichlet prior, given background corpus C:
$$\log \frac{\#(w, A) + \#(w, C)}{|A| + |C| - \#(w, A) - \#(w, C)} - \cdots$$
• Difference in z-score accounts for
variation in word frequencies.
• Words with differences < 1.96 are greyed
out.
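The scoring on this slide can be sketched in plain Python. This is a rough illustration under stated assumptions, not Scattertext's implementation: the background counts #(w, C) are used directly as the Dirichlet prior, the variance of the log-odds difference uses the approximation from Monroe et al. (the sum of the reciprocals of the two smoothed counts), "differences < 1.96" is read as an absolute value, and all term counts are hypothetical.

```python
import math


def log_odds_z_scores(counts_a, counts_b, counts_bg):
    """z-score of the prior-smoothed log-odds difference for each word.

    counts_a, counts_b: word -> count in categories A and B.
    counts_bg: word -> count in the background corpus C (used as the prior).
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n_bg = sum(counts_bg.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        prior = counts_bg.get(word, 0)
        y_a = counts_a.get(word, 0) + prior
        y_b = counts_b.get(word, 0) + prior
        rest_a = n_a + n_bg - y_a
        rest_b = n_b + n_bg - y_b
        # A word absent from the background and from one category would
        # give log(0) or a division by zero, so it is skipped here.
        if min(y_a, y_b, rest_a, rest_b) <= 0:
            continue
        delta = math.log(y_a / rest_a) - math.log(y_b / rest_b)
        variance = 1.0 / y_a + 1.0 / y_b
        scores[word] = delta / math.sqrt(variance)
    return scores


# Hypothetical counts, for illustration only.
counts_a = {"jobs": 30, "auto": 20, "the": 200}
counts_b = {"jobs": 10, "tax": 40, "the": 200}
counts_bg = {"jobs": 50, "auto": 5, "tax": 40, "the": 5000}

z = log_odds_z_scores(counts_a, counts_b, counts_bg)
for word in sorted(z, key=z.get, reverse=True):
    flag = "" if abs(z[word]) >= 1.96 else "  (greyed out)"
    print(f"{word:5s} {z[word]:+.2f}{flag}")
```

Positive scores lean toward category A and negative scores toward category B. In this toy data "the" is used identically in both categories and scores zero, while "jobs" differs in raw count but stays under the 1.96 threshold once the prior is applied.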
16
Monroe, Burt L., Michael P. Colaresi, and
Kevin M. Quinn. "Fightin' words: Lexical
feature selection and evaluation for identifying
the content of political conflict." Political
Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
• Pros:
• Popular among major CL
researchers (3rd edition of J+M)
• Favors words which appear less
frequently in the background corpus.
• Natural linear word listing
• Cons:
• You have to pick a
representative, large background
corpus.
• If the corpus is small, division
by zero can occur
• Probably only practical for
unigrams
• Inefficient use of space on chart
17
Repo: https://github.com/JasonKessler/scattertext
$ pip install scattertext
Why the plots look the way they do:
http://bit.ly/scattertextdevelopment
Topic models, word vectors, and The Lasso:
http://bit.ly/scattertext2016debates
Movie revenue and practical use:
http://bit.ly/scattertextrevenuemovie
Hands-on Tutorial
18
CDK Global: Finding Words that Sell Cars
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Conversion rate of everyone who read review: 15/20 = 75%
Median conversion rate: 22%
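The arithmetic on this slide is small enough to write out. The sketch below uses only the numbers shown on the slide (20 readers, 15 converters, a 22% median); the function names are hypothetical and are not taken from any CDK Global code.

```python
MEDIAN_CONVERSION_RATE = 0.22  # median conversion rate quoted on the slide


def conversion_rate(readers, converters):
    """Share of a review's readers who went on to visit a dealer website."""
    if readers <= 0:
        raise ValueError("a review needs at least one reader")
    return converters / readers


def lift_over_median(rate, median_rate=MEDIAN_CONVERSION_RATE):
    """How many times higher (or lower) a rate is than the median rate."""
    return rate / median_rate


cruze_rate = conversion_rate(readers=20, converters=15)
cruze_lift = lift_over_median(cruze_rate)
print(f"conversion rate: {cruze_rate:.0%}, lift over median: {cruze_lift:.1f}x")
```

The example review converts at 75%, a little over three times the median rate, which is what makes its wording worth inspecting.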
19
CDK Global: Finding Words that Sell Cars
5 star review words
Love
Comfortable
Features
Solid
Amazing
<3 star review words
Transmission
Problem
Issue
Dealership
Times
20
CDK Global: Finding Words that Sell Cars
5 star review words | High conversion words
Love                | Comfortable
Comfortable         | Front [Seats]
Features            | Acceleration
Solid               | Free [Car Wash, Oil Change]
Amazing             | Quiet
<3 star review words
Transmission
Problem
Issue
Dealership
Times
21
CDK Global: Finding Words that Sell Cars
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet

<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]
22
CDK Global: Finding Words that Sell Cars (SUV Specific)
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet

<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]
The worst thing you can say about an
SUV may be:
I saved money and got all these
amazing features!
23
Thank you.
[first].[last]@gmail.com
Please see https://github.com/JasonKessler/scattertext
for more info on this project.
We are hiring data scientists and developers in Seattle and
Austin! Please contact me if you’d like to know more.
https://jobs.cdkglobal.com/

Contenu connexe

Tendances

Getting to the People Behind The Keywords
Getting to the People Behind The KeywordsGetting to the People Behind The Keywords
Getting to the People Behind The Keywords
Carmen Mardiros
 
SMX East 2015
SMX East 2015SMX East 2015
SMX East 2015
Grant Simmons
 
Global Travel Network - Design
Global Travel Network  - DesignGlobal Travel Network  - Design
Global Travel Network - DesignSymantec
 
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14Jordan Godbey
 
Search Engine Marketing: How Insurance Agents Can Take Advantage
Search Engine Marketing: How Insurance Agents Can Take AdvantageSearch Engine Marketing: How Insurance Agents Can Take Advantage
Search Engine Marketing: How Insurance Agents Can Take Advantage
mmahan
 
Business Intelligence | Competitive Intelligence | Business Intelligence Tools
Business Intelligence | Competitive Intelligence | Business Intelligence ToolsBusiness Intelligence | Competitive Intelligence | Business Intelligence Tools
Business Intelligence | Competitive Intelligence | Business Intelligence Tools
Roland Frasier
 
What? How? Why? Building Query Personas to Power Your Content Strategy
What? How? Why? Building Query Personas to Power Your Content StrategyWhat? How? Why? Building Query Personas to Power Your Content Strategy
What? How? Why? Building Query Personas to Power Your Content Strategy
Grant Simmons
 
Optimization advice-for-watertownbuysellgold-com-just-sell-gold
Optimization advice-for-watertownbuysellgold-com-just-sell-goldOptimization advice-for-watertownbuysellgold-com-just-sell-gold
Optimization advice-for-watertownbuysellgold-com-just-sell-gold
Brian Bateman
 
Killer Link Building Strategies - Christoph Cemper
Killer Link Building Strategies - Christoph CemperKiller Link Building Strategies - Christoph Cemper
Killer Link Building Strategies - Christoph Cemper
auexpo Conference
 
BPAA PD Day: BNC Research
BPAA PD Day: BNC ResearchBPAA PD Day: BNC Research
BPAA PD Day: BNC ResearchBookNet Canada
 

Tendances (10)

Getting to the People Behind The Keywords
Getting to the People Behind The KeywordsGetting to the People Behind The Keywords
Getting to the People Behind The Keywords
 
SMX East 2015
SMX East 2015SMX East 2015
SMX East 2015
 
Global Travel Network - Design
Global Travel Network  - DesignGlobal Travel Network  - Design
Global Travel Network - Design
 
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14
Cincinnati AMA - How to Align Content and Keywords to the Buying Cycle - 2-12-14
 
Search Engine Marketing: How Insurance Agents Can Take Advantage
Search Engine Marketing: How Insurance Agents Can Take AdvantageSearch Engine Marketing: How Insurance Agents Can Take Advantage
Search Engine Marketing: How Insurance Agents Can Take Advantage
 
Business Intelligence | Competitive Intelligence | Business Intelligence Tools
Business Intelligence | Competitive Intelligence | Business Intelligence ToolsBusiness Intelligence | Competitive Intelligence | Business Intelligence Tools
Business Intelligence | Competitive Intelligence | Business Intelligence Tools
 
What? How? Why? Building Query Personas to Power Your Content Strategy
What? How? Why? Building Query Personas to Power Your Content StrategyWhat? How? Why? Building Query Personas to Power Your Content Strategy
What? How? Why? Building Query Personas to Power Your Content Strategy
 
Optimization advice-for-watertownbuysellgold-com-just-sell-gold
Optimization advice-for-watertownbuysellgold-com-just-sell-goldOptimization advice-for-watertownbuysellgold-com-just-sell-gold
Optimization advice-for-watertownbuysellgold-com-just-sell-gold
 
Killer Link Building Strategies - Christoph Cemper
Killer Link Building Strategies - Christoph CemperKiller Link Building Strategies - Christoph Cemper
Killer Link Building Strategies - Christoph Cemper
 
BPAA PD Day: BNC Research
BPAA PD Day: BNC ResearchBPAA PD Day: BNC Research
BPAA PD Day: BNC Research
 

Similaire à Scattertext: A Tool for Visualizing Differences in Language

Why Inbound Marketing for Online Business - EBriks Infotech
Why Inbound Marketing for Online Business - EBriks InfotechWhy Inbound Marketing for Online Business - EBriks Infotech
Why Inbound Marketing for Online Business - EBriks Infotech
EBriks Infotech Pvt. Ltd.
 
Get Better Content with Analytics and User Testing
Get Better Content with Analytics and User TestingGet Better Content with Analytics and User Testing
Get Better Content with Analytics and User Testing
Michael Powers
 
Grade 7 Reflective Essay Composition Writing Skill -
Grade 7 Reflective Essay  Composition Writing Skill -Grade 7 Reflective Essay  Composition Writing Skill -
Grade 7 Reflective Essay Composition Writing Skill -
Sean Flores
 
Tips On Improving Word Choice For Your Essay
Tips On Improving Word Choice For Your EssayTips On Improving Word Choice For Your Essay
Tips On Improving Word Choice For Your Essay
Linda Torres
 
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
Aggregage
 
Own it 5 steps - mceea - for upload
Own it   5 steps - mceea - for uploadOwn it   5 steps - mceea - for upload
Own it 5 steps - mceea - for upload
Scott Patchin
 
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
ExoLeaders.com
 
The Dynamic Dozen (plus three) Strategic Tools (pdf version)
The Dynamic Dozen (plus three) Strategic Tools (pdf version)The Dynamic Dozen (plus three) Strategic Tools (pdf version)
The Dynamic Dozen (plus three) Strategic Tools (pdf version)
Strategic Counselor; Rowan University-Associate Professor, Ret.
 
Cover LetterOne aspect of strategic planning is to develop a str.docx
Cover LetterOne aspect of strategic planning is to develop a str.docxCover LetterOne aspect of strategic planning is to develop a str.docx
Cover LetterOne aspect of strategic planning is to develop a str.docx
marilucorr
 
Contoh Ielts Writing Task Micin Ilmu - Riset
Contoh Ielts Writing Task Micin Ilmu - RisetContoh Ielts Writing Task Micin Ilmu - Riset
Contoh Ielts Writing Task Micin Ilmu - Riset
Kristen Carter
 
Native American Dilemma
Native American DilemmaNative American Dilemma
Native American Dilemmaguest1f534f
 
Native American Dilemma
Native American DilemmaNative American Dilemma
Native American Dilemmadsvaldi
 
Thomas Baker Leadership Assessment: Envision The Future
Thomas Baker Leadership Assessment: Envision The FutureThomas Baker Leadership Assessment: Envision The Future
Thomas Baker Leadership Assessment: Envision The Future
Baker Publishing Company
 
Template Leading Mathematical Discussions Performance-Based.docx
Template Leading Mathematical Discussions Performance-Based.docxTemplate Leading Mathematical Discussions Performance-Based.docx
Template Leading Mathematical Discussions Performance-Based.docx
rhetttrevannion
 
Just How To Write An Autobiographical Essay - E
Just How To Write An Autobiographical Essay - EJust How To Write An Autobiographical Essay - E
Just How To Write An Autobiographical Essay - E
Holly Vega
 
Topics For Argument Essays
Topics For Argument EssaysTopics For Argument Essays
Topics For Argument Essays
Karen Mosley
 
Page 135Use verbs to present the information more forceful.docx
Page 135Use verbs to present the information more forceful.docxPage 135Use verbs to present the information more forceful.docx
Page 135Use verbs to present the information more forceful.docx
bunyansaturnina
 
Dean r berry real life problems drunk family man kills family
Dean r berry real life problems drunk family man kills familyDean r berry real life problems drunk family man kills family
Dean r berry real life problems drunk family man kills family
Riverside County Office of Education
 
The 4Cs of Powerful Learning
The 4Cs of Powerful LearningThe 4Cs of Powerful Learning
The 4Cs of Powerful Learning
Jonathan Milner
 
12Organization DevelopmentFifth Edition
12Organization DevelopmentFifth Edition12Organization DevelopmentFifth Edition
12Organization DevelopmentFifth Edition
ChantellPantoja184
 

Similaire à Scattertext: A Tool for Visualizing Differences in Language (20)

Why Inbound Marketing for Online Business - EBriks Infotech
Why Inbound Marketing for Online Business - EBriks InfotechWhy Inbound Marketing for Online Business - EBriks Infotech
Why Inbound Marketing for Online Business - EBriks Infotech
 
Get Better Content with Analytics and User Testing
Get Better Content with Analytics and User TestingGet Better Content with Analytics and User Testing
Get Better Content with Analytics and User Testing
 
Grade 7 Reflective Essay Composition Writing Skill -
Grade 7 Reflective Essay  Composition Writing Skill -Grade 7 Reflective Essay  Composition Writing Skill -
Grade 7 Reflective Essay Composition Writing Skill -
 
Tips On Improving Word Choice For Your Essay
Tips On Improving Word Choice For Your EssayTips On Improving Word Choice For Your Essay
Tips On Improving Word Choice For Your Essay
 
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
Trends in Recruiting and HR: Think Outside of the Proverbial Box: Harness the...
 
Own it 5 steps - mceea - for upload
Own it   5 steps - mceea - for uploadOwn it   5 steps - mceea - for upload
Own it 5 steps - mceea - for upload
 
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
Survival Tips for the Lone Product Manager - Kristin Bolton-Keys and Alicia D...
 
The Dynamic Dozen (plus three) Strategic Tools (pdf version)
The Dynamic Dozen (plus three) Strategic Tools (pdf version)The Dynamic Dozen (plus three) Strategic Tools (pdf version)
The Dynamic Dozen (plus three) Strategic Tools (pdf version)
 
Cover LetterOne aspect of strategic planning is to develop a str.docx
Cover LetterOne aspect of strategic planning is to develop a str.docxCover LetterOne aspect of strategic planning is to develop a str.docx
Cover LetterOne aspect of strategic planning is to develop a str.docx
 
Contoh Ielts Writing Task Micin Ilmu - Riset
Contoh Ielts Writing Task Micin Ilmu - RisetContoh Ielts Writing Task Micin Ilmu - Riset
Contoh Ielts Writing Task Micin Ilmu - Riset
 
Native American Dilemma
Native American DilemmaNative American Dilemma
Native American Dilemma
 
Native American Dilemma
Native American DilemmaNative American Dilemma
Native American Dilemma
 
Thomas Baker Leadership Assessment: Envision The Future
Thomas Baker Leadership Assessment: Envision The FutureThomas Baker Leadership Assessment: Envision The Future
Thomas Baker Leadership Assessment: Envision The Future
 
Template Leading Mathematical Discussions Performance-Based.docx
Template Leading Mathematical Discussions Performance-Based.docxTemplate Leading Mathematical Discussions Performance-Based.docx
Template Leading Mathematical Discussions Performance-Based.docx
 
Just How To Write An Autobiographical Essay - E
Just How To Write An Autobiographical Essay - EJust How To Write An Autobiographical Essay - E
Just How To Write An Autobiographical Essay - E
 
Topics For Argument Essays
Topics For Argument EssaysTopics For Argument Essays
Topics For Argument Essays
 
Page 135Use verbs to present the information more forceful.docx
Page 135Use verbs to present the information more forceful.docxPage 135Use verbs to present the information more forceful.docx
Page 135Use verbs to present the information more forceful.docx
 
Dean r berry real life problems drunk family man kills family
Dean r berry real life problems drunk family man kills familyDean r berry real life problems drunk family man kills family
Dean r berry real life problems drunk family man kills family
 
The 4Cs of Powerful Learning
The 4Cs of Powerful LearningThe 4Cs of Powerful Learning
The 4Cs of Powerful Learning
 
12Organization DevelopmentFifth Edition
12Organization DevelopmentFifth Edition12Organization DevelopmentFifth Edition
12Organization DevelopmentFifth Edition
 

Plus de Jason Kessler

Visualizing Words and Topics with Scattertext
Visualizing Words and Topics with ScattertextVisualizing Words and Topics with Scattertext
Visualizing Words and Topics with Scattertext
Jason Kessler
 
Natural Language Visualization with Scattertext
Natural Language Visualization with ScattertextNatural Language Visualization with Scattertext
Natural Language Visualization with Scattertext
Jason Kessler
 
Lexicon Mining for Semiotic Squares: Exploding Binary Classification
Lexicon Mining for Semiotic Squares: Exploding Binary ClassificationLexicon Mining for Semiotic Squares: Exploding Binary Classification
Lexicon Mining for Semiotic Squares: Exploding Binary Classification
Jason Kessler
 
Jason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with TwitterJason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with Twitter
Jason Kessler
 
The 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainThe 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainJason Kessler
 
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Jason Kessler
 
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Jason Kessler
 

Plus de Jason Kessler (7)

Visualizing Words and Topics with Scattertext
Visualizing Words and Topics with ScattertextVisualizing Words and Topics with Scattertext
Visualizing Words and Topics with Scattertext
 
Natural Language Visualization with Scattertext
Natural Language Visualization with ScattertextNatural Language Visualization with Scattertext
Natural Language Visualization with Scattertext
 
Lexicon Mining for Semiotic Squares: Exploding Binary Classification
Lexicon Mining for Semiotic Squares: Exploding Binary ClassificationLexicon Mining for Semiotic Squares: Exploding Binary Classification
Lexicon Mining for Semiotic Squares: Exploding Binary Classification
 
Jason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with TwitterJason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with Twitter
 
The 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainThe 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive Domain
 
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
 
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
 

Dernier

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 

Dernier (20)

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...

Scattertext: A Tool for Visualizing Differences in Language

  • 1. Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler • Scattertext: A Tool for Visualizing Differences in Language
  • 2. Word frequency • Women and men tend to use different terms on Facebook. • As do introverts and extroverts. • Hillary Clinton and Donald Trump used different terms in the presidential debate. • Reveal differences in • content, • perceived strengths and weaknesses • communication style • These are often obvious after being surfaced
  • 3. Outline • Previous work • Ways of visualizing word association • Scattertext • Open-source Python/D3 framework for visualizing these differences • Inspecting LDA, word2vec, sparse classification models • How CDK Global is using this to help dealerships better sell cars. • We’re hiring senior data scientists + devs in Austin and Seattle.
  • 4. OKCupid: an online dating site • Which words and phrases statistically distinguish ethnic groups and genders? • Terms shown: hobos, almond butter, 100 Years of Solitude, Bikram yoga • Source: Christian Rudder, http://blog.okcupid.com/index.php/page/7/
  • 5. Source: Christian Rudder. Dataclysm. 2014. • Ranking with everyone else • High distance: white men ignore k-pop • Low distance: white men disproportionately mention Phish • The smaller the distance from the top left, the higher the association with white men.
  • 7. Scattertext: Democrats vs Republicans: 2012 Convention Speeches
  • 8. Word Use Reflecting Gender and Personality in Facebook Statuses • Objective: • Find words, phrases, and topics that correlate with • gender, and • Big 5 personality type • Data source: • My Personality App • 75k voluntary participants in Facebook-based survey, >300 million words • Agreed to give researchers access to statuses. • Scoring algorithm • Linear regression weights, 2000 LDA topics. • Sources: Lyle Ungar, 2013 AAAI Tutorial; Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLOS ONE. 2013.
  • 9. Lyle Ungar 2013 AAAI Tutorial • The good: • Word clouds force you to hunt for the most impactful terms • You end up examining the long tail in the process • Compactly represent a lot of phrases and topics
  • 10. Lyle Ungar 2013 AAAI Tutorial • The bad: • “Mullets of the Internet” (Jeffrey Zeldman, 2005) • Longer phrases are more prominent. • Ranking is unclear • Does size indicate higher frequency?
  • 13. NYT: 2012 Political Convention Word Use by Party • Source: Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  • 15. Difference in z-scores of log-odds w/ prior • Log-odds for word w, categories A, B: $\log \frac{\#(w, A)}{|A| - \#(w, A)} - \log \frac{\#(w, B)}{|B| - \#(w, B)}$ • Log-odds w/ Dirichlet prior, given background corpus C: $\log \frac{\#(w, A) + \#(w, C)}{|A| + |C| - \#(w, A) - \#(w, C)} - \cdots$ • Difference in z-score accounts for variation in word frequencies. • Words with differences < 1.96 are greyed out. • Source: Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
  • 16. Difference in z-scores of log-odds w/ prior • Pros: • Popular among major CL researchers (3rd edition of J+M) • Favors words which appear less frequently in the background corpus. • Natural linear word listing • Cons: • You have to pick a representative, large background corpus. • If the corpus is small, division-by-zero issues arise • Probably only practical for unigrams • Inefficient use of space on chart • Source: Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
  • 17. Hands-on Tutorial • Repo: https://github.com/JasonKessler/scattertext • Install: $ pip install scattertext • Why the plots look the way they do: http://bit.ly/scattertextdevelopment • Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates • Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
  • 18. CDK Global: Finding Words that Sell Cars • Example review appearing on a 3rd-party automotive site • Car reviewed: Chevy Cruze • Rating: 4.4/5 stars • Text: "…I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…" • # of users who read review: 20 • # who went on to visit a Chevy dealer’s website: 15 • Conversion rate of everyone who read review: 15/20 = 75% • Median conversion rate: 22%
  • 19. CDK Global: Finding Words that Sell Cars • 5-star review words: Love, Comfortable, Features, Solid, Amazing • <3-star review words: Transmission, Problem, Issue, Dealership, Times
  • 20. CDK Global: Finding Words that Sell Cars • 5-star review words: Love, Comfortable, Features, Solid, Amazing • High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet • <3-star review words: Transmission, Problem, Issue, Dealership, Times
  • 21. CDK Global: Finding Words that Sell Cars • 5-star review words: Love, Comfortable, Features, Solid, Amazing • High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet • <3-star review words: Transmission, Problem, Issue, Dealership, Times • Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]
  • 22. CDK Global: Finding Words that Sell Cars (SUV Specific) • 5-star review words: Love, Comfortable, Features, Solid, Amazing • High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet • <3-star review words: Transmission, Problem, Issue, Dealership, Times • Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive] • The worst thing you can say about an SUV may be: I saved money and got all these amazing features!
  • 23. Thank you. • [first].[last]@gmail.com • Please see https://github.com/JasonKessler/scattertext for more info on this project. • We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more. • https://jobs.cdkglobal.com/

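The conversion-rate arithmetic on slide 18 (15 of the 20 readers went on to a dealer site, so 15/20 = 75%) can be sketched as follows. The function name, the review identifiers, and every count other than the slide's 20 readers and 15 visits are hypothetical and only serve to show how a per-review rate compares with a median across reviews.

```python
from statistics import median


def conversion_rate(readers, dealer_visits):
    """Share of a review's readers who went on to visit a dealer website."""
    if readers <= 0:
        raise ValueError("a review needs at least one reader")
    return dealer_visits / readers


# (readers, dealer-site visits) per review. Only the first pair comes from
# the slide; the other two are made-up illustration values.
reviews = {
    "cruze_review": (20, 15),
    "hypothetical_review_b": (50, 10),
    "hypothetical_review_c": (40, 6),
}

rates = {name: conversion_rate(r, v) for name, (r, v) in reviews.items()}
median_rate = median(rates.values())

for name, rate in rates.items():
    side = "above" if rate > median_rate else "at or below"
    print(f"{name}: {rate:.0%} ({side} the median of {median_rate:.0%})")
```
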
Editor's notes

  8. You’re left to your own devices to try and make sense of the differences.
  9. Similar scatter plot to Rudder's. Less efficient use of space. Natural listing of words on left-hand side.
  10. Similar scatter plot to Rudder's. Less efficient use of space. Natural listing of words on left-hand side.