2. It started with an idle tweet...
https://twitter.com/cbirchall/status/466197512143912961
3. Let’s use Twitter for something (slightly) useful!
The plan:
● Collect geo-tagged tweets from the Twitter Streaming API
● Use them to build a name⇔country DB
● Build a simple search UI as a proof of concept
● (crowbar Spark in there somewhere because it’s cool)
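A rough sketch of what the name⇔country DB boils down to, in plain Python rather than a real database. The function names, the (name, country) record shape and the sample records are all assumptions made up for illustration; this is not the code behind the talk.

```python
from collections import Counter, defaultdict


def build_index(records):
    """Build both directions of the lookup from (name, country) pairs."""
    countries_by_name = defaultdict(Counter)
    names_by_country = defaultdict(Counter)
    for name, country in records:
        key = name.strip().lower()  # normalise so "Chris " and "chris" match
        if not key:
            continue
        countries_by_name[key][country] += 1
        names_by_country[country][key] += 1
    return countries_by_name, names_by_country


def likely_countries(countries_by_name, name, limit=3):
    """Return up to `limit` (country, share) pairs for a name, most common first."""
    counts = countries_by_name.get(name.strip().lower())
    if not counts:
        return []
    total = sum(counts.values())
    return [(country, n / total) for country, n in counts.most_common(limit)]


# Made-up sample records, for illustration only.
sample = [
    ("Chris", "United Kingdom"),
    ("chris", "United States"),
    ("Chris ", "United Kingdom"),
    ("Sarah", "Canada"),
]
by_name, by_country = build_index(sample)
```

A search UI would then only need to call something like `likely_countries(by_name, "chris")` and render the ranked list.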
5. Collecting tweets
● Ran the collector for 13 days
● Collected 285,340 geo-tagged tweets
● 205,798 distinct users
● Only collected names and countries, threw everything else away
Processing
● Used Spark to filter out duplicate users
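The duplicate-user filter is what takes 285,340 tweets down to 205,798 users. Here is a minimal plain-Python sketch of that step, assuming each collected tweet is a dict with hypothetical `user_id`, `name` and `country` keys. The talk used Spark for this; in Spark a comparable approach might key the records by user id and reduce them (for example with `reduceByKey`), but that is a guess at the shape of the job, not the author's code.

```python
def distinct_users(tweets):
    """Keep the first (name, country) seen for each user id."""
    seen = {}
    for tweet in tweets:
        # setdefault only stores the value if this user id is new.
        seen.setdefault(tweet["user_id"], (tweet["name"], tweet["country"]))
    return list(seen.values())


# Made-up sample tweets, for illustration only.
sample_tweets = [
    {"user_id": 1, "name": "Chris Smith", "country": "United Kingdom"},
    {"user_id": 2, "name": "Sarah Jones", "country": "Canada"},
    {"user_id": 1, "name": "Chris Smith", "country": "United Kingdom"},
]
users = distinct_users(sample_tweets)
```

Keeping the first sighting is an arbitrary choice; a user who tweets from several countries would need a tie-breaking rule of some kind.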
6. Stats
Distinct countries = 204
Distinct first names = 40,689
Distinct last names = 81,674

Top 10 countries by user count

 country                     | percentage
-----------------------------+------------
 United States               |       39.4
 United Kingdom              |       10.1
 Indonesia                   |        8.9
 Brasil                      |        8.1
 Türkiye                     |        3.9
 España                      |        2.4
 México                      |        2.2
 Republic of the Philippines |        2.0
 Canada                      |        1.8
 Malaysia                    |        1.8

Most popular first names

 first_name
------------
 chris
 alex
 david
 michael
 sarah

Most popular surnames

 second_name
-------------
 smith
 jones
 garcia
 williams
 johnson
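For the curious, stats like these can be computed with a few counters. The sketch below assumes the deduplicated users are (display name, country) pairs and that a name splits into first and last on whitespace; both the splitting rule and the sample data are assumptions for illustration, and the talk's own queries may well have looked different.

```python
from collections import Counter


def split_name(display_name):
    """Split a display name into (first, last); last is None for one-word names."""
    parts = display_name.lower().split()
    if not parts:
        return None, None
    return parts[0], (parts[-1] if len(parts) > 1 else None)


def country_percentages(users, top=10):
    """Return the `top` countries as (country, percentage of users)."""
    counts = Counter(country for _, country in users)
    total = sum(counts.values())
    return [(c, round(100 * n / total, 1)) for c, n in counts.most_common(top)]


def top_names(users, top=5):
    """Return the most common first names and surnames as two ranked lists."""
    firsts, lasts = Counter(), Counter()
    for name, _ in users:
        first, last = split_name(name)
        if first:
            firsts[first] += 1
        if last:
            lasts[last] += 1
    return firsts.most_common(top), lasts.most_common(top)


# Made-up sample users, for illustration only.
sample_users = [
    ("Chris Smith", "United Kingdom"),
    ("Chris Jones", "United States"),
    ("Sarah Smith", "United States"),
    ("Alex", "Canada"),
]
popular_firsts, popular_lasts = top_names(sample_users)
```

Splitting on whitespace is crude (middle names, handles and emoji all end up somewhere), so the distinct-name counts from a rule like this should be read as rough figures.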
7. Results
It works surprisingly well!
(well, it worked for my name, anyway)
Note for the pedantic:
Since the original data is geo-tagged tweets, strictly speaking we only know
where a user is, not where they come from.