This document summarizes a presentation about using name recognition software to analyze "big data" and recognize trends in Turkish migration. The software classifies names by linguistic characteristics and likely origin. It learns from a training set of Olympic athletes, where backtesting shows roughly 80% accuracy, and has been applied to a dataset of inventors. The software can be used to map diasporas on social media, analyze gender breakdowns and nationalities, and help investment agencies identify potential business contacts overseas with family or cultural ties to their country. While a name alone does not determine nationality or residence, the software provides a useful statistical tool for exploring demographic trends and directions for further research.
Text mining names in ‘Big Data’ to recognize migration trends
1. TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS
NamSor Applied Onomastics
2014-05-30
2. Names Data Mining is just a Tool
Zeynep Değirmencioğlu
Şükrü Kaya
Şükrü Saracoğlu
Elian Carsenat
Hüseyin Yıldız
Mahmut Yıldırım
Fatih Öztürk
Mehmet Bölükbaşı
Mehmet Yılmaz
Elif Yıldırım
Ahmet Yıldırım
Mustafa Yücedağ
Mustafa Uzunyılmaz
Fatih Kılıç
Fatih Yılmaz
Murat Yıldırım
Hüseyin Kılıç
Oğuzhan Yıldız
Mevlüt Çavuşoğlu
… (Source: Freebase)
3. What’s in a name? What’s a name?
Elian Carsenat
@ElianCarsenat (Twitter)
elian.carsenat@namsor.com
elian.carsenat@sfr.fr
tioulpanov (Skype)
NamSor.com
Onomastics = the science of proper names
5. NamSor sorts names: functions, use cases
Functions:
1. Name linguistic classification
2. Name transliteration & matching
3. Named entity extraction, parsing
Use cases:
Multilingual text mining
Control watch lists
Social network analytics
Geo-demographics
6. NamSor supervised learning
FN      LN
Mette   Andersen
Lene    Andersson
Eva     Arndt-Riise
Heidi   Astrup
Mie     Augustesen
Margot  Bærentzen
Louise  Bager Nørgaard
Marie   Bagger Rasmussen
Yutta   Barding
Ulla    Barding-Poulsen

FN      LN
Xian    Dongmei
Zheng   Dongmei
Jin     Dongxiang
Xu      Dongxiang
Li      Dongxiao
Qin     Dongya
Li      Dongying
Han     Duan
Li      Duihong
Jiang   Fan
Step 1 – Learn stereotypes (training set: athletes)
Step 2 – Classify (data set: inventors)
Inventor names before classification:
birgitta agerberth
birgitte l. eriksen
bitao gong
bitten thorengaard
biwang jiang
Inventor names after classification:
bitao gong
biwang jiang

birgitta agerberth
birgitte l. eriksen
bitten thorengaard
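The two steps above can be sketched in a few lines. The slides do not say which learning algorithm NamSor uses, so this sketch swaps in a plain character-bigram naive Bayes classifier as a stand-in. The class labels "DK" and "CN" are assumptions, since the slide shows the two training lists without naming them, and the function names are hypothetical.

```python
import math
from collections import Counter

# Training names copied from the slide. The labels "DK" and "CN" are
# assumptions: the slide does not name the two classes.
TRAINING = {
    "DK": ["Mette Andersen", "Lene Andersson", "Eva Arndt-Riise",
           "Heidi Astrup", "Mie Augustesen", "Margot Bærentzen",
           "Louise Bager Nørgaard", "Marie Bagger Rasmussen",
           "Yutta Barding", "Ulla Barding-Poulsen"],
    "CN": ["Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang", "Xu Dongxiang",
           "Li Dongxiao", "Qin Dongya", "Li Dongying", "Han Duan",
           "Li Duihong", "Jiang Fan"],
}

def ngrams(name, n=2):
    """Character n-grams of the lower-cased name, with start/end markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def learn(training, n=2):
    """Step 1: count character n-grams per class (the 'stereotypes')."""
    return {label: Counter(g for name in names for g in ngrams(name, n))
            for label, names in training.items()}

def classify(name, model, n=2):
    """Step 2: return the class with the highest smoothed log-likelihood."""
    vocab_size = len(set().union(*model.values()))

    def score(counts):
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out a class.
        return sum(math.log((counts[g] + 1) / (total + vocab_size))
                   for g in ngrams(name, n))

    return max(model, key=lambda label: score(model[label]))

model = learn(TRAINING)
for inventor in ["birgitta agerberth", "birgitte l. eriksen", "bitao gong",
                 "bitten thorengaard", "biwang jiang"]:
    print(inventor, "->", classify(inventor, model))
```

With only twenty training names this is a toy: it separates the five inventor names from the slide into the same two groups, but a production classifier would need far more data and many more classes.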
7. Accuracy is measurable ~80%
The very first backtest covered the names of 150,000 Olympic Games athletes
Country      Total  Perf
Japan        3794   97%
Mongolia     260    93%
Greece       1576   92%
Lithuania    262    89%
Italy        4150   89%
Poland       2818   88%
South Korea  2180   87%

Breakdown per country (name class: count):
Japan: Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
Mongolia: Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
Greece: Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
Lithuania: Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
Italy: Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
Poland: Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
South Korea: South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5
European athletes (excl. Anglo & Latin): breakdown accuracy 84%
Ex-Yugoslavia athletes: breakdown accuracy 75%
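The PERF percentages on this slide can be reproduced from the counts. The sketch below assumes that the first count in each country's breakdown (for example 3686 for Japan) is the number of athletes whose name was assigned to their own country, and that PERF is that count divided by TOTAL. The slide does not state the formula, so this reading is an assumption.

```python
# (total athletes, athletes classified under their own country), from the slide.
RESULTS = {
    "Japan": (3794, 3686),
    "Mongolia": (260, 243),
    "Greece": (1576, 1444),
    "Lithuania": (262, 234),
    "Italy": (4150, 3675),
    "Poland": (2818, 2486),
    "South Korea": (2180, 1901),
}

def accuracy(total, correct):
    """Share of athletes whose name class matches their country."""
    return correct / total

for country, (total, correct) in RESULTS.items():
    print(f"{country}: {accuracy(total, correct):.0%}")
```

Under that assumption the rounded values match the PERF column for all seven countries.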
14. Mining 3M Geo-Tweets
Population flows on Twitter
Source          Target  Type      Id   Onoma          Weight
United Kingdom  France  Directed  16   Great Britain  37
Spain           France  Directed  55   Spain          14
United States   France  Directed  75   Great Britain  12
Turkey          France  Directed  79   Turkey         11
Brazil          France  Directed  87   Portugal       10
United Kingdom  France  Directed  112  Ireland        9
Italy           France  Directed  152  Italy          7
Switzerland     France  Directed  226  France         5
Belgium         France  Directed  247  France         5
United Kingdom  France  Directed  258  France         5
Mexico          France  Directed  287  Spain          4
Ireland         France  Directed  317  Great Britain  4
United Kingdom  France  Directed  333  Italy          4
United States   France  Directed  375  France         4
Source: Twitter
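A table like this can be summarized by adding up the weights per column value. The sketch below reads the rows as a directed edge list in which Onoma is the name class and Weight is a count. The slide does not define the columns, so that reading is an assumption, and `weight_by` is a hypothetical helper name.

```python
from collections import defaultdict

# (source, target, onoma, weight) rows copied from the slide.
EDGES = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]

SOURCE, TARGET, ONOMA, WEIGHT = range(4)

def weight_by(edges, column):
    """Sum edge weights grouped by one column (e.g. SOURCE or ONOMA)."""
    totals = defaultdict(int)
    for edge in edges:
        totals[edge[column]] += edge[WEIGHT]
    return dict(totals)

by_onoma = weight_by(EDGES, ONOMA)
by_source = weight_by(EDGES, SOURCE)
for onoma, weight in sorted(by_onoma.items(), key=lambda kv: -kv[1]):
    print(onoma, weight)
```

Grouping by ONOMA shows which name classes dominate the flows into France, while grouping by SOURCE shows which countries they come from.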
15. Mining 150k names in Patents to see where the Turkish ‘brain juice’ flows
17. Can ‘Big Data’ answer any question?
Trash in, gold out? Yes, to some extent
Beware of biases induced by the data source itself
Data access limitations / privacy issues
Open Data vs. Free APIs vs. Commercial Databases
19. Originating FDI leads
NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.
The idea behind it: “As recently as 1986 Ireland was one of the poorest countries in the European
Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct
Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have
attracted huge amounts of money from America – due largely to a century of personal and familial ties –
and they have used this money to build factories”.
Milda Darguzaite, the Managing Director of Invest Lithuania, considers this successful approach relevant
for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian
origin living abroad, there are a good many personal and familial ties to be leveraged to attract new
investment projects to the country. NamSor name recognition software helped discover those ties.
Recognizing names and their origin in global professional databases allows Investment Promotion Agencies
to identify potentially interesting high profile contacts in different countries / industrial sectors and reach out
to them. Another method to accelerate the origination of new leads is to better understand and leverage
the existing network of foreign businessmen in the country itself.
NamSor™ filters data from millions of meaningless elements to a few dozen actionable names.
Domas Girtavicius, a Senior consultant at Invest Lithuania, said "we were impressed by the accuracy of the
name recognition software: it reliably predicts the country of origin and the number of false positives is fully
manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like
seeking a gold needle in a haystack: doable once the right tool exists".
20. Conclusions
We recognize names in any language, any place, any database; we can classify and we can sort
Onomastic class is no ‘hard fact’ like a place of birth, a nationality, etc., but it’s accurate and fine-grained
As a statistics tool, it might be debatable. But as a data mining tool, it’s sharp, simple and efficient: it can help find research directions and discover trends
We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation
21. Thank you!
http://fdimagnet.com/ http://namsor.com/
July 2013, Embassy of Lithuania in Paris
elian.carsenat@namsor.com
+33 6 52 77 99 07
Twitter @NamsSor_com