This document summarizes research using various types of big data including call detail records, political event data, and Twitter data. It discusses how call detail records have been used to study the spread of diseases, optimize transportation networks, and track population displacement. Political event data and mining data have been combined to examine how minerals fuel local conflicts in Africa. Twitter data has been analyzed to map language distributions in Europe and study international migration patterns, such as tracking the migration of Venezuelans during an economic crisis.
Big Data in Economic Research: Twitter, Phone calls and Political events
1. Big Data in Economic Research:
Twitter, Phone calls and Political events
EUI Summer School in Sofia
Julian Hinz
European University Institute &
Kiel Centre for Globalization
2. In this lecture
• Call detail records data: Insecurity and industrial organization
• Political event data: Local conflict and mines
• Twitter data: Spatial distribution of languages and migration
4. Offline services
• Even “offline” services often leave digital logs
• Phone calls, Taxi rides, Transportation networks
• mainly metadata
• Often “surprisingly” open data
5. Call detail records
• (anonymized) phone number of caller and callee
• date and time stamp
• type of interaction recorded (call, SMS, data)
• duration of calls, amount of data
• coordinates of the caller’s/callee’s cell tower
• sometimes further variables: indicator for business customers,
subscription details, billing address...
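A minimal sketch of how a record with the fields listed above might be represented and aggregated to the cell-tower level. All field names here are illustrative assumptions; actual CDR schemas differ across operators, and the sample records are made up.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    # Field names are hypothetical, not an operator's actual schema.
    caller_id: str        # anonymized identifier of the caller
    callee_id: str        # anonymized identifier of the callee
    timestamp: datetime   # date and time stamp
    kind: str             # type of interaction: "call", "sms" or "data"
    duration_s: int       # call duration in seconds (0 for SMS or data)
    caller_tower: tuple   # (lat, lon) of the caller's cell tower
    callee_tower: tuple   # (lat, lon) of the callee's cell tower


def calls_per_tower(records):
    """Count outgoing voice calls by the caller's cell tower."""
    return Counter(r.caller_tower for r in records if r.kind == "call")


# Made-up example records.
records = [
    CallDetailRecord("a1", "b2", datetime(2018, 1, 1, 9, 0), "call", 120,
                     (34.5, 69.2), (34.6, 69.1)),
    CallDetailRecord("a1", "c3", datetime(2018, 1, 1, 9, 5), "sms", 0,
                     (34.5, 69.2), (31.6, 65.7)),
    CallDetailRecord("c3", "a1", datetime(2018, 1, 2, 18, 30), "call", 45,
                     (31.6, 65.7), (34.5, 69.2)),
]
tower_counts = calls_per_tower(records)
```

Because only metadata is stored, tower-level counts like these can be built without any call content.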
6. Call detail records
Data used (mainly) descriptively for a variety of purposes
• Spread of diseases: Wesolowski et al. (2012) on Malaria, Buckee et al.
(2014) on Ebola, ...
• Optimizing transportation networks: Berlingerio et al. (2013) for
Abidjan, Ivory Coast
• Displacement of people: Bengtsson et al. (2010, 2011) on earthquake
and Cholera outbreak in Haiti, Wilson et al. (2016) on earthquake in
Nepal
• Geography of social networks: Phithakkitnukoon et al. (2012)
• Effect of geography on social networks: Büchel and Ehrlich (2017)
exploit exogenous travel time increase
14. Blumenstock et al. (2018):
Insecurity and Industrial Organization
• Most of textbook economics does not take physical insecurity into
account
→ not much of the literature either: lack of data
• Blumenstock et al. (2018): Combine CDR data from Afghanistan with
geolocalized conflict data
• firms reduce presence in districts following major increases in violence
• effects persist for up to six months
• larger firms are more responsive to violence
19. Political event data
• Political events, Conflict
• Human-coded vs. machine-coded from press coverage
• Probably best known: Correlates of War
→ interstate conflict, treaties, threats, alliances
• Uppsala Conflict Data Program
→ multiple datasets, some georeferenced
• Global Terrorism Database
→ geocoded, very detailed
20. Political event data
• “Integrated Crisis Early Warning System” (ICEWS)
→ DARPA program, released with 1 year lag
• Global Database of Events, Language, and Tone (GDELT)
→ daily data since 1979
• Phoenix from Open Event Data Alliance (OEDA)
→ near realtime event data
• GDELT and Phoenix provide API
21. Berman et al. (2017): Minerals and local
conflict
• “This Mine is Mine! How Minerals Fuel Conflicts in Africa”, Berman et
al. (2017)
• geolocalized data on conflict events in African countries from 1997
to 2010
• geolocalized data on mining extraction of 14 minerals (Raw Material
Data)
• mining activity increases the incidence of conflicts at the local level
• then spreads violence across territory and time
→ financial capacities of fighting groups increase
29. Online services
• Twitter, LinkedIn, Facebook, Instagram, Tumblr, Airbnb,...
• Content, but also metadata
• Often provide some data access
→ currently in flux
30. Twitter data
• Twitter Streaming API: 1 % random sample of all tweets
→ filters: keyword, geolocation
→ between 40 and 60 per second
• 42 variables: text, username, user_lang, lang, followers, timezone,
latitude, longitude, place, source,...
• Relatively easy to get access to the data: http://dev.twitter.com
36. Twitter data in research
• Obvious: Text-mining
→ Brexit, Trump election,.. Gorodnichenko et al. (2018), De Lyon et al.
(2018), Halberstam and Knight (2016)
• Not so obvious: Metadata
→ Language distribution
→ Migration
37. Hinz and Leromain (2018):
Languages and trade
• Spatial distribution of languages in Europe
• Geolocation from “coordinates”, and “user_lang” or “lang”
→ large heterogeneity across and within countries
• Coordinates provided either by the user’s device’s GPS coordinates, or
a self-assigned location
→ Barratt, Cheshire, and Manley (2013) use similar data for NY
boroughs
44. Bots and human users
• Bots are an issue: following Chu et al. (2012), we keep only tweets sent
from smartphones and the official app
• 6.6 million unique human Twitter users
• 481,720 unique human Twitter users in Europe
• 73 different languages
• 25 % tweet in more than 1 language, in Germany 31 %
• 958,071 unique language-user observations
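The filtering and counting steps above can be sketched as follows. This is a simplified illustration, not the authors' code: the whitelist of client names in `HUMAN_SOURCES` is a hypothetical stand-in for "smartphones and the official app", and the tweets are made-up records using the `username`, `lang` and `source` variables from the Twitter data.

```python
from collections import defaultdict

# Hypothetical whitelist of client names; the actual list is not given in the slides.
HUMAN_SOURCES = {"Twitter for iPhone", "Twitter for Android"}


def languages_by_user(tweets):
    """Map each user to the set of languages of tweets sent from whitelisted clients."""
    langs = defaultdict(set)
    for t in tweets:
        if t["source"] in HUMAN_SOURCES:
            langs[t["username"]].add(t["lang"])
    return dict(langs)


def multilingual_share(langs):
    """Share of users who tweet in more than one language."""
    if not langs:
        return 0.0
    return sum(len(s) > 1 for s in langs.values()) / len(langs)


# Made-up example tweets.
tweets = [
    {"username": "u1", "lang": "de", "source": "Twitter for iPhone"},
    {"username": "u1", "lang": "en", "source": "Twitter for iPhone"},
    {"username": "u2", "lang": "fr", "source": "Twitter for Android"},
    {"username": "bot", "lang": "en", "source": "SomeSchedulerApp"},
]
langs = languages_by_user(tweets)
share = multilingual_share(langs)
n_pairs = sum(len(s) for s in langs.values())  # language-user observations
```

In this toy sample the scheduler account is dropped, one of the two remaining users is multilingual, and there are three language-user observations.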
48. Counterfactual simulations
• Gravity between locations
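\[
X_{od} = G \times \frac{Y_o}{\Phi_o^{-\theta}} \times \frac{E_d}{P_d^{-\theta}} \times \tau_{od}^{-\theta}
\]
with
\[
L_{od} = \sum_l \frac{P_{dl}^{\theta}}{P_d^{\theta}} \, l_{dl} \times \frac{\Phi_{ol}^{-(\gamma-\theta)}}{\Phi_o^{\theta}} \, F_{ol}^{-\left(\frac{\gamma}{\theta}-1\right)}
= \sum_l \frac{P_{dl}^{\theta}}{P_d^{\theta}} \, l_{dl} \times \frac{\Phi_{ol}^{\theta}}{\Phi_o^{\theta}} \, \frac{F_{ol}^{1-\frac{\gamma}{\theta}}}{\Phi_{ol}^{\gamma}}
\]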
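As a rough numerical illustration of the gravity equation on this slide, the sketch below evaluates the bilateral flow for one origin-destination pair and shows how it responds to a lower trade cost. All parameter values are placeholders chosen for the example, not calibrated values from the paper, and the comparison holds the origin and destination terms fixed, which a full counterfactual simulation would presumably recompute.

```python
def gravity_flow(G, Y_o, Phi_o, E_d, P_d, tau_od, theta):
    """Flow from o to d: G * Y_o / Phi_o**(-theta) * E_d / P_d**(-theta) * tau_od**(-theta)."""
    origin_term = Y_o / Phi_o ** (-theta)
    destination_term = E_d / P_d ** (-theta)
    return G * origin_term * destination_term * tau_od ** (-theta)


# Placeholder values (assumptions for illustration only).
theta = 5.0
base = gravity_flow(G=1.0, Y_o=100.0, Phi_o=1.0, E_d=80.0, P_d=1.0,
                    tau_od=1.2, theta=theta)
lower_cost = gravity_flow(G=1.0, Y_o=100.0, Phi_o=1.0, E_d=80.0, P_d=1.0,
                          tau_od=1.1, theta=theta)

# With everything else fixed, the flow ratio depends only on relative trade costs.
ratio = lower_cost / base
```

Here `ratio` equals `(1.2 / 1.1) ** theta`, so with a trade elasticity of 5 a cut in the trade cost factor from 1.2 to 1.1 raises the flow by roughly half in this partial comparison.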
49. Data and calibration
• Two types of “locations”: points in Europe and countries
• Aggregate to 30 arc minutes → 3,408 locations in Europe
→ average distance within location is about 15 kilometers
• Each country outside Europe as a single location
• calibrate production and expenditure in all locations to match external
country-to-country flows
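The aggregation of geolocated points to 30-arc-minute locations can be sketched as snapping each coordinate pair to a 0.5-degree grid cell. How the grid is anchored is not stated in the slides; identifying a cell by its south-west corner is an assumption made here, and the example coordinates are approximate.

```python
import math

CELL_DEG = 0.5  # 30 arc minutes expressed in degrees


def grid_cell(lat, lon, cell=CELL_DEG):
    """Return the south-west corner of the grid cell containing (lat, lon)."""
    return (math.floor(lat / cell) * cell, math.floor(lon / cell) * cell)


# Approximate coordinates, for illustration only.
cell_kiel = grid_cell(54.32, 10.14)
cell_sofia = grid_cell(42.70, 23.32)
cell_london = grid_cell(51.51, -0.13)  # floor() handles negative longitudes
```

Using `math.floor` rather than truncation keeps cells west of the prime meridian the same width as all others.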
50. Data and calibration
• We specify trade costs ϕod to be determined by: distance, RTA,
common currency
• Data on distances from Hinz (2016) or computed
• Data on the languages spoken in countries other than those in Europe
come from Melitz (2014)
• RTA and common currency (CU) indicators set to 1 within country or within location
• Coefficients from meta-analysis by Head et al. (2016)
51. Scenario 1: Common European language
→ Spoken by every inhabitant next to local languages
53. Scenario 2: Impact of within-country language
diversity
→ Welfare impact of eliminating within-country language diversity in
European countries and allowing only the domestic language
55. Scenario 3: Elimination of all foreign languages in
UK
→ Welfare impact of allowing only English to be spoken in the UK
57. Scenario 4: Migration of Arabic-speaking population
to Germany
→ Welfare impact of 10 percent of population speaking Arabic in Germany
59. Hausmann, Hinz and Yıldırım (2018):
Venezuelan emigration
• Economic crisis in Venezuela: Large (?) number of refugees
→ lack of official numbers
• Dataset of geolocalized Tweets of people that tweeted from
Venezuela between February 2017 and May 2018
→ 5.4 million tweets
→ 490,000 tweets from 30,000 human Twitter users
• Idea: What location(s) do they tweet from over time?
62. Migration and social media
• Hawelka et al. (2014): global mobility patterns, tourism flows
• Jurdak et al. (2015) city-to-city travel in Australia
• Morstatter et al. (2013): random sample creates an accurate picture of
the entire population of geolocated Tweets
• Question: How representative are geolocalized tweets?
65. Representativeness of Twitter users in
Venezuela
• “Digital in 2017 Global Overview report”: 44% of Venezuelans use social
media, 35% from a mobile device
• “ Tendencias Digitales”: 56% of internet users in Venezuela use Twitter
or comparable social media services
• Twitter: penetration in Venezuela 26 %
68. How to make sure we don’t capture tourists?
• We narrow sample to users who
→ tweeted from Venezuela exclusively between Feb and May ’17
(Period 1)
→ tweeted from one country exclusively between Feb and May ’18
(Period 2)
• Everyone who is not in Venezuela in period 2: migrant
• reduces sample to 818 (!)
→ Problem: Large heterogeneity in tweet frequency
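The sample restriction above can be sketched as a simple classification rule. This is an illustration under stated assumptions, not the authors' implementation: the exact day boundaries of the two periods, the two-letter country codes, and the example users are all made up.

```python
from datetime import date

# Assumed period boundaries; the slides only say "Feb and May".
PERIOD_1 = (date(2017, 2, 1), date(2017, 5, 31))
PERIOD_2 = (date(2018, 2, 1), date(2018, 5, 31))


def countries_in(user_tweets, period):
    """Set of countries a user tweeted from within the period."""
    start, end = period
    return {country for day, country in user_tweets if start <= day <= end}


def classify(user_tweets, home="VE"):
    """Return 'migrant', 'stayer', or None if the user fails the sample filter."""
    c1 = countries_in(user_tweets, PERIOD_1)
    c2 = countries_in(user_tweets, PERIOD_2)
    # Keep users seen only in Venezuela in period 1 and in exactly one country in period 2.
    if c1 != {home} or len(c2) != 1:
        return None
    return "stayer" if c2 == {home} else "migrant"


# Made-up users: lists of (date, country code) pairs.
users = {
    "stays": [(date(2017, 3, 1), "VE"), (date(2018, 3, 1), "VE")],
    "moves": [(date(2017, 3, 1), "VE"), (date(2018, 4, 2), "CO")],
    "tourist": [(date(2017, 3, 1), "VE"), (date(2017, 4, 1), "US"),
                (date(2018, 3, 1), "VE")],
    "silent": [(date(2017, 3, 1), "VE")],
}
labels = {name: classify(tweets) for name, tweets in users.items()}
```

Users who appear in more than one country within a period, or who do not tweet at all in period 2, are dropped rather than classified, which is one reason the sample shrinks so sharply.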