Not-So-Obvious Online
Data Sources for
Demographic Research
Ingmar Weber
@ingmarweber
https://sites.google.com/site/smdrworkshop/
Targeted Advertising as a Digital Census
All the Internet giants make money with targeted advertising
It’s in their commercial interest to “understand” their users
Rich data on both demographic and behavioral attributes
Usually not available for outside researchers, but …
Some aggregate “audience estimates” available for advertisers:
How many users/impressions match criteria X?
Supported by (at least) Facebook, Twitter, and Google
Facebook’s Advertising Reach Estimates
https://www.facebook.com/ads/manager/creation/creation/
https://developers.facebook.com/docs/marketing-api/buying-api/targeting/v2.8
Easy-to-Use Python code
https://github.com/maraujo/pySocialWatcher
Created by Matheus Araujo at QCRI
Contact me if you want to (i) know about important
details, and (ii) know what’s in the pipeline.
Sneak Preview: Estimating Stocks of Migrants
Joint work with Emilio Zagheni and Krishna Gummadi. Currently under review.
Twitter’s Advertising Reach Estimates
https://dev.twitter.com/ads/reference/1/get/accounts/%3Aaccount_id/reach_estimate
https://ads.twitter.com/login
Google’s Advertising Reach Estimates
https://support.google.com/adwords/answer/2475441?hl=en
https://developers.google.com/adwords/api/docs/guides/traffic-estimator-service
http://adwords.google.com/
Using Online Ads to Reach Migrants
So far, online ads were only described as a passive data source. But they can also
be used as an active outreach channel. Examples below.
“Migrant Sampling Using Facebook Advertisements: A Case Study of Polish
Migrants in Four European Countries”; S. Pötzschke, M. Braun; 2016
“Using Internet to Recruit Immigrants with Language and Culture Barriers for
Tobacco and Alcohol Use Screening: A Study Among Brazilians”; B. H. Carlini, L.
Safioti, T. C. Rue, L. Miles; 2014
“Reaching and recruiting Turkish migrants for a clinical trial through Facebook: A
process evaluation”; B. Ü. Ince, P. Cuijpers, E. van 't Hof, H. Riper; 2014
Google Trends on Steroids
Google Trends does not provide demographic information
Get DMA-level demographic information (race, income, …)
Join with DMA-level Google Trends information
Can potentially give “average income of a web search query over time”
But often sparsity problems, with data only showing for bigger cities (=> bias)
See “The cost of racial animus on a black candidate: Evidence using Google
search data”, Seth Stephens-Davidowitz; Journal of Public Economics; 2014
Also: “Demographic information flows”, Ingmar Weber, Alejandro Jaimes; CIKM 2010
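The join described on this slide can be sketched as follows. This is a minimal illustration, not the method of the cited papers: the DMA names, income figures, and interest values are made up, and `average_income_of_query` is a hypothetical helper. Google Trends reports relative interest rather than absolute volume, so a real analysis would likely also weight by DMA population.

```python
# Hypothetical DMA-level inputs; names and numbers are made up for illustration.
dma_income = {"New York": 72000, "Los Angeles": 65000, "Birmingham": 48000}

# Relative search interest per DMA for one query. None marks DMAs where
# no data is shown (the sparsity problem noted on the slide).
query_interest = {"New York": 80, "Los Angeles": 100, "Birmingham": None}


def average_income_of_query(interest, income):
    """Interest-weighted mean income over DMAs that have both values."""
    weighted_sum = 0.0
    total_weight = 0.0
    for dma, weight in interest.items():
        if weight is None or dma not in income:
            continue  # dropped DMAs are where the big-city bias enters
        weighted_sum += weight * income[dma]
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


avg = average_income_of_query(query_interest, dma_income)
print(avg)
```

Repeating the computation per time window would give the "average income of a query over time" series.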
“Fertility and its Meaning: Evidence from Search Behavior”
Jussi Ojala, Emilio Zagheni, Francesco C. Billari, Ingmar Weber
ICWSM; 2017
https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15579
Example study using Google Correlate
Study Goals
(i) detect evidence for different contexts surrounding different types of fertility;
Teen, low/high income, (un-)married, …
(ii) model regional variation across states for different fertility levels;
What distinguishes Alabama from California from New York?
(iii) track temporal changes in fertility across time.
Train a model across space, predict across time.
Different Contexts of Fertility
Discover search terms correlated with different fertility rates across US states
https://www.google.com/trends/correlate/search?e=id:f7PU4mFDWV-&t=all
Remove terms with no conceivable link to sex, pregnancy or maternity
Predicting Spatial Variability
Performance of the regression models using
leave-one-out cross-validation. SMAPE is in [%], RMSE
values are multiplied by 1,000.
Use the previous terms to build models
predicting state-level fertility rates
All these models make predictions based on
linear combinations of search intensity
Goal: apply these spatial models across time
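A minimal sketch of the evaluation setup named above: leave-one-out cross-validation of a linear model, scored with SMAPE and RMSE. Assumptions: a single search-intensity predictor instead of a combination of several terms, made-up data, and one common SMAPE definition (the slides do not state which variant the paper uses).

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out(xs, ys):
    """Predict each state from a model fitted on all other states."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        preds.append(intercept + slope * xs[i])
    return preds


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [abs(p - a) / ((abs(a) + abs(p)) / 2) for a, p in zip(actual, predicted)]
    return 100 * sum(terms) / len(terms)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up search intensities and fertility rates for five states.
search_intensity = [0.2, 0.4, 0.5, 0.7, 0.9]
fertility_rate = [0.055, 0.060, 0.062, 0.068, 0.072]

predictions = leave_one_out(search_intensity, fertility_rate)
print(smape(fertility_rate, predictions), rmse(fertility_rate, predictions))
```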
Learning Across Space, Predicting Across Time
Temporal trend when applying the “teen” model across
time. Values are rescaled to a maximum of 1.0.
Pearson r correlation across 2010-2015 when
using the spatial model to predict trends across
time.
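The two operations mentioned in these captions, rescaling a series to a maximum of 1.0 and computing Pearson's r between a predicted and an observed trend, can be sketched as below. The yearly values are invented for illustration and are not taken from the paper.

```python
import math


def rescale_to_max_one(values):
    """Divide by the maximum so that the largest value becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly values for 2010-2015.
predicted_trend = [34.0, 31.0, 29.5, 26.0, 24.0, 22.0]
observed_rate = [34.2, 31.3, 29.4, 26.5, 24.2, 22.3]

r = pearson_r(rescale_to_max_one(predicted_trend), rescale_to_max_one(observed_rate))
print(r)
```

Pearson's r is unaffected by the rescaling, so the rescaling matters only for plotting the two series on a common axis.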
“Quantitative analysis of population-scale family trees using
millions of relatives”
Joanna Kaplanis, Assaf Gordon, Mary Wahl, Michael Gershovits, Barak Markus,
Mona Sheikh, Melissa Gymrek, Gaurav Bhatia, Daniel G MacArthur, Alkes Price,
Yaniv Erlich
bioRxiv; 2017
http://biorxiv.org/content/early/2017/02/07/106427
Example study using an online genealogy database
Online Genealogy Data - Again
13 million people, after
cleaning, in a single pedigree
Small sample of mitochondrial
and Y-STR haplotypes (not
discussed)
Also location information.
Cleaned, de-identified data
available at:
http://familinx.org/
Geographical Distribution of Data (Place of Birth)
Pre 1800 Post 1800
Mortality and City Growth
Their model (red) validated against
previous models (Oeppen & Vaupel, black)
Mobility Over Time
And a lot more! Check out the paper.
Median migration distance in North American
born individuals as a function of time.
Red: mother-offspring,
blue: father-offspring,
black: marital radius.
Dots represent the data before smoothing.
“A novel web informatics approach for automated
surveillance of cancer mortality trends”
Georgia Tourassi, Hong-Jun Yoon, Songhua Xu
Journal of Biomedical Informatics; 2016
http://www.sciencedirect.com/science/article/pii/S1532046416300181
Example study using online obituaries
Crawling Cancer-Related Obituaries
Use a web search engine to get seeds
for queries such as “breast cancer
obituary, New York”
Example
Then post-filter
Then lung vs. breast cancer
Then infer age and gender
Cancer Mortality Rates from Online Obituaries
Percent of lung cancer deaths per age
group based on SEER data and
obituaries for both genders.
Annual female breast cancer death rates based on
obituaries and on National Vital Statistics Report
(NVSR) for 2008–2012.
“From Migration Corridors to Clusters: The Value of Google+
Data for Migration Studies”
Johnnatan Messias, Fabricio Benevenuto, Ingmar Weber, Emilio Zagheni
ASONAM; 2016
http://ieeexplore.ieee.org/document/7752269/
Example study using public Google Plus profiles
Beyond Origin-Destination Migration Analysis
I’m a German citizen living in Qatar. So did I migrate from Germany to Qatar?
Yes, according to Qatari border control.
But: Germany (78->99), United Kingdom (99->03),
Germany (03->07), Switzerland (07->09),
Spain (09->12), Qatar (12->now)
Use the “places lived” on Google+
In 2012, there was no “currently” field, just a set of places
Get tuples of co-lived countries
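Extracting tuples of co-lived countries from unordered "places lived" sets could look like the sketch below. The profiles are hypothetical and `count_tuples` is an illustrative helper, not code from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical profiles: each is the unordered set of countries from "places lived".
profiles = [
    {"Germany", "United Kingdom", "Switzerland", "Spain", "Qatar"},
    {"Germany", "Spain"},
    {"Brazil", "USA"},
]


def count_tuples(profiles, size):
    """Count how many profiles contain each unordered country tuple of a given size."""
    counts = Counter()
    for countries in profiles:
        # Sorting gives every tuple one canonical order, since direction is unknown.
        for combo in combinations(sorted(countries), size):
            counts[combo] += 1
    return counts


pair_counts = count_tuples(profiles, 2)
triple_counts = count_tuples(profiles, 3)
print(pair_counts[("Germany", "Spain")], len(triple_counts))
```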
Flows/Corridors vs. Tuples/Clusters
This is what border
control can obtain
(with directionality)
This is what the Google+ “places lived” provides
Expected Cluster Frequencies
Lots of migrant flows on (A,B), (A,C) and (B,C) => expect lots on (A,B,C)
“Expect” = rank clusters according to:
min(freqAB; freqAC; freqBC) * mean(freqAB; freqAC; freqBC)
Best performing ranking approximation (Kendall .565, Spearman .754)
Look at outliers and try to explain those
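The ranking score on this slide, min times mean of the three pairwise frequencies, is short enough to write out. The frequencies below are invented, and the function and variable names are illustrative.

```python
def expected_score(freq_ab, freq_ac, freq_bc):
    """min * mean of the three pairwise frequencies, as stated on the slide."""
    pair_freqs = (freq_ab, freq_ac, freq_bc)
    return min(pair_freqs) * (sum(pair_freqs) / 3)


# Made-up pairwise frequencies (AB, AC, BC order) for three candidate triples.
candidates = {
    ("A", "B", "C"): (120, 90, 60),
    ("A", "B", "D"): (120, 10, 200),
    ("B", "C", "D"): (60, 200, 30),
}

# Expected ranking: highest score first.
ranked = sorted(candidates, key=lambda triple: expected_score(*candidates[triple]), reverse=True)
print(ranked)
```

The `min` factor penalizes triples where one of the three corridors is weak, even if the other two are busy, which is why (A, B, D) ranks last here.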
Outlier Frequencies
Look at “expected rank – actual rank”
Middle 20%: “close to expected”
Top 20%: “higher than expected”
Low 20%: “lower than expected”
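One possible reading of the three classes is sketched below. It assumes that rank 1 is the most frequent cluster, so a large positive "expected rank – actual rank" means the cluster occurs more often than expected. How the paper handles ties and band boundaries is not given on the slide, so those details are assumptions.

```python
def label_outliers(rank_diffs):
    """Label clusters by 'expected rank - actual rank'.

    rank_diffs maps cluster -> expected rank minus actual rank, with rank 1
    being the most frequent. Clusters outside the three 20% bands get no label.
    """
    ordered = sorted(rank_diffs, key=rank_diffs.get, reverse=True)
    n = len(ordered)
    band = n // 5  # 20% of the clusters
    mid_start = (n - band) // 2
    labels = {}
    for cluster in ordered[:band]:
        labels[cluster] = "higher than expected"
    for cluster in ordered[mid_start:mid_start + band]:
        labels[cluster] = "close to expected"
    for cluster in ordered[n - band:]:
        labels[cluster] = "lower than expected"
    return labels


# Ten made-up clusters with their rank differences.
diffs = {f"c{i}": d for i, d in enumerate([9, 7, 5, 3, 1, -1, -3, -5, -7, -9])}
labels = label_outliers(diffs)
print(labels)
```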
Feature Analysis
More than expected:
(Spain, France, Italy)
(UAE, India, Singapore)
Less than expected:
(Brazil, Mexico, USA)
(Canada, China, UK)
Most discriminative features for 3-class distinction
Enriching Your Data
Demographic Inference 101
Demographic Inference – Name Dictionaries
First name gender dictionaries:
https://ideas.repec.org/c/wip/eccode/10.html
http://gender.io/
Contact me for dictionary in “International Gender Differences and Gaps in Online
Social Networks”
Ethnicity Dictionary:
https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
Also see “Inferring Nationalities of Twitter Users and Studying Inter-National Linking”
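A dictionary-based gender lookup can be as simple as the sketch below. The three-entry dictionary is a stand-in; a real one would be loaded from one of the sources listed above, whose file formats are not described here. `infer_gender` is a hypothetical helper.

```python
# Tiny stand-in dictionary; entries are illustrative only.
FIRST_NAME_GENDER = {"maria": "female", "john": "male", "emilio": "male"}


def infer_gender(display_name, dictionary=FIRST_NAME_GENDER):
    """Look up the first token of a display name; return None when unknown."""
    tokens = display_name.strip().split()
    if not tokens:
        return None
    return dictionary.get(tokens[0].lower())


print(infer_gender("Maria Silva"))
print(infer_gender("Xq Zz"))
```

Returning `None` for unknown names keeps the coverage gap visible, which matters because dictionary coverage differs across countries and naming conventions.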
Demographic Inference – Image-Based Inference
Face++ Cognitive Services
https://www.faceplusplus.com/face-detection/
Microsoft Cognitive Services
https://www.microsoft.com/cognitive-services/en-us/computer-vision-api
Demographic Inference – Build Your Training Data
FollowerWonk by Moz
https://moz.com/followerwonk/bio
https://moz.com/followerwonk/bio/?q=(38-yr%7C38-yrs%7C38%20years)%20old%0A%0A
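The bio query in the URL above, `(38-yr|38-yrs|38 years) old`, can be generalized into a regular expression that pulls a self-reported age out of bio text once the profiles have been collected. This is a sketch under that assumption; it does not call FollowerWonk, and `self_reported_age` is a hypothetical helper.

```python
import re

# Generalizes "(38-yr|38-yrs|38 years) old" to any one- to three-digit age.
AGE_PATTERN = re.compile(r"\b(\d{1,3})(?:-yrs?| years) old\b", re.IGNORECASE)


def self_reported_age(bio):
    """Return the first age stated in a bio, or None if no pattern matches."""
    match = AGE_PATTERN.search(bio)
    return int(match.group(1)) if match else None


print(self_reported_age("38 years old, dad of two"))
print(self_reported_age("Coffee lover"))
```

Labels obtained this way are noisy (a bio may state someone else's age), so a manual check of a sample seems advisable before using them as training data.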
Questions, Comments, Thoughts?
https://sites.google.com/site/digitaldemography/
