TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS
NamSor Applied Onomastics
2014-05-30
Names Data Mining is just a Tool
Zeynep Değirmencioğlu
Şükrü Kaya
Şükrü Saracoğlu
Elian Carsenat
Hüseyin Yıldız
Mahmut Yıldırım
Fatih Öztürk
Mehmet Bölükbaşı
Mehmet Yılmaz
Elif Yıldırım
Ahmet Yıldırım
Mustafa Yücedağ
Mustafa Uzunyılmaz
Fatih Kılıç
Fatih Yılmaz
Murat Yıldırım
Hüseyin Kılıç
Oğuzhan Yıldız
Mevlüt Çavuşoğlu
… (Source: Freebase)
What’s in a name? What’s a name?
• Elian Carsenat
• @ElianCarsenat (Twitter)
• elian.carsenat@namsor.com
• elian.carsenat@sfr.fr
• tioulpanov (Skype)
• NamSor.com
• Onomastics = the science of proper names
Onoma != Residence != Nationality
Source: OECD
NamSor sorts names: functions, use cases
Functions:
1. Name Ling. Classification
2. Name Transliteration & Matching
3. Named Entity Extraction, Parsing
Use cases:
• Multilingual Text Mining
• Control Watch Lists
• Social Networks Analytics
• Geo demographics
NamSor supervised learning
Step 1 – Learn stereotypes
Training set: Athletes
FN        LN
Mette     Andersen
Lene      Andersson
Eva       Arndt-Riise
Heidi     Astrup
Mie       Augustesen
Margot    Bærentzen
Louise    Bager Nørgaard
Marie     Bagger Rasmussen
Yutta     Barding
Ulla      Barding-Poulsen

FN        LN
Xian      Dongmei
Zheng     Dongmei
Jin       Dongxiang
Xu        Dongxiang
Li        Dongxiao
Qin       Dongya
Li        Dongying
Han       Duan
Li        Duihong
Jiang     Fan
Step 2 – Classify
Data set: Inventors
bitao gong
biwang jiang
birgitta agerberth
birgitte l. eriksen
bitten thorengaard
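The two steps above can be sketched with a toy classifier. This is only an illustration, not NamSor's method, which the slides do not describe: it swaps in a character-trigram Naive Bayes model, and the labels "Denmark" and "China" are assumptions attached to the two training lists for the purpose of the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=3):
    """Return the character n-grams of a name, with ^ and $ marking its ends."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NameClassifier:
    """Toy multinomial Naive Bayes over character trigrams (uniform class prior)."""

    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # label -> trigram -> count
        self.vocabulary = set()

    def fit(self, labelled_names):
        # Step 1: learn which trigrams are typical of each label.
        for name, label in labelled_names:
            grams = char_ngrams(name)
            self.gram_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def predict(self, name):
        # Step 2: pick the label whose trigram profile best explains the name.
        def log_score(label):
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Add-one smoothing keeps unseen trigrams from zeroing the score.
            return sum(math.log((counts[g] + 1) / denom) for g in char_ngrams(name))

        return max(self.gram_counts, key=log_score)


# Training set: athlete names from the slide; the labels are assumed.
danish = ["Mette Andersen", "Lene Andersson", "Eva Arndt-Riise", "Heidi Astrup",
          "Mie Augustesen", "Margot Bærentzen", "Louise Bager Nørgaard",
          "Marie Bagger Rasmussen", "Yutta Barding", "Ulla Barding-Poulsen"]
chinese = ["Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang", "Xu Dongxiang",
           "Li Dongxiao", "Qin Dongya", "Li Dongying", "Han Duan",
           "Li Duihong", "Jiang Fan"]
training = [(n, "Denmark") for n in danish] + [(n, "China") for n in chinese]

classifier = NameClassifier().fit(training)

# Data set: inventor names from the slide.
inventors = ["bitao gong", "biwang jiang", "birgitta agerberth",
             "birgitte l. eriksen", "bitten thorengaard"]
for inventor in inventors:
    print(inventor, "->", classifier.predict(inventor))
```

With twenty training names the model is far too small to be reliable; it only shows how patterns learned from one labelled source (athletes) can be applied to another (inventors).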
Accuracy is measurable: ~80%
The very first backtest, on the onomastics of 150,000 Olympic Games athletes

Country        Total   Perf
Japan          3794    97%
Mongolia       260     93%
Greece         1576    92%
Lithuania      262     89%
Italy          4150    89%
Poland         2818    88%
South Korea    2180    87%

Top five onomastic classes assigned per country (count):
Japan: Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
Mongolia: Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
Greece: Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
Lithuania: Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
Italy: Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
Poland: Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
South Korea: South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5

Euro athletes (excl. Anglo & Latin): breakdown accuracy 84%
Ex-Yugoslavia athletes: breakdown accuracy 75%
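The per-country percentages can be re-derived from the counts on the slide. The sketch below assumes that "Perf" is the share of a country's athletes whose names were assigned to that same country's onomastic class; the variable and function names are illustrative.

```python
# (total athletes, names assigned to the country's own class), from the slide.
backtest = {
    "Japan": (3794, 3686),
    "Mongolia": (260, 243),
    "Greece": (1576, 1444),
    "Lithuania": (262, 234),
    "Italy": (4150, 3675),
    "Poland": (2818, 2486),
    "South Korea": (2180, 1901),
}


def recognition_rate(total, correct):
    """Share of names assigned to the expected onomastic class."""
    return correct / total


for country, (total, correct) in backtest.items():
    print(f"{country}: {recognition_rate(total, correct):.0%}")

# Pooled rate over these seven countries only (not the whole 150,000 backtest).
pooled = (sum(correct for _, correct in backtest.values())
          / sum(total for total, _ in backtest.values()))
print(f"Pooled: {pooled:.1%}")
```

The pooled figure covers only the seven best-performing countries listed here, which is why it sits above the ~80% quoted for the backtest as a whole.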
Decrypting identity across space/time:
India Geodemographics (1914)
Source: Commonwealth WWI Casualties
Unsupervised learning is fine-grained: Country/Region, …
• Ex. Russian Federation
In progress: Syrian names (backtesting)
Onoma                   Count
Syria                   201
Saudi Arabia            20
Iraq                    8
Kuwait                  4
United Arab Emirates    3
Egypt                   3
Qatar                   2
Bahrain                 2
Sudan                   2
Lebanon                 2
Algeria                 1
Oman                    1
Grand Total             249

طاهر الحريري
عبدالغفار العيدة سليمان
عبدالغفار شحادة
قاسم الأسعد
مؤمن حموده
مفلح محمد الجراد
نزار الحروب
نزار العيدة سليمان
أسامة الحراكي
أنس الصغير
خالد الهبول
وفيق الواحد عبد
إسراء يونس
رشا نزهة
زكريا محمد وهبة
كمال بركات
عيد محمد اللو
[…]
Syrian names are recognized at ~80% (201 of 249).
The other names may in fact be non-Syrian, or generic to the Arab world.
What can you dig up with this tool?
Mining 5M names to recognize gender, broken down by nationality/likely origin
Mining 1M names to map diasporas
Source: Twitter
Mining 3M Geo-Tweets
Population flows on Twitter
Source            Target    Type       Id     Onoma            Weight
United Kingdom    France    Directed   16     Great Britain    37
Spain             France    Directed   55     Spain            14
United States     France    Directed   75     Great Britain    12
Turkey            France    Directed   79     Turkey           11
Brazil            France    Directed   87     Portugal         10
United Kingdom    France    Directed   112    Ireland          9
Italy             France    Directed   152    Italy            7
Switzerland       France    Directed   226    France           5
Belgium           France    Directed   247    France           5
United Kingdom    France    Directed   258    France           5
Mexico            France    Directed   287    Spain            4
Ireland           France    Directed   317    Great Britain    4
United Kingdom    France    Directed   333    Italy            4
United States     France    Directed   375    France           4
Source: Twitter
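The flow table is a directed, weighted edge list, so it can be summarized by grouping the weights. The sketch below copies the rows from the slide; the slide does not define Weight, so reading it as a count of observed flows is an assumption, as are the variable names.

```python
from collections import Counter

# (source, target, id, onoma, weight), as listed on the slide.
edges = [
    ("United Kingdom", "France", 16, "Great Britain", 37),
    ("Spain", "France", 55, "Spain", 14),
    ("United States", "France", 75, "Great Britain", 12),
    ("Turkey", "France", 79, "Turkey", 11),
    ("Brazil", "France", 87, "Portugal", 10),
    ("United Kingdom", "France", 112, "Ireland", 9),
    ("Italy", "France", 152, "Italy", 7),
    ("Switzerland", "France", 226, "France", 5),
    ("Belgium", "France", 247, "France", 5),
    ("United Kingdom", "France", 258, "France", 5),
    ("Mexico", "France", 287, "Spain", 4),
    ("Ireland", "France", 317, "Great Britain", 4),
    ("United Kingdom", "France", 333, "Italy", 4),
    ("United States", "France", 375, "France", 4),
]

weight_by_source = Counter()
weight_by_onoma = Counter()
for source, target, edge_id, onoma, weight in edges:
    weight_by_source[source] += weight
    weight_by_onoma[onoma] += weight

print("By source country:", weight_by_source.most_common(3))
print("By onomastic class:", weight_by_onoma.most_common(3))
```

Grouping by Onoma rather than by Source separates, for example, flows from the United Kingdom into British, Irish, French and Italian name classes, in line with the deck's earlier point that onoma, residence and nationality are different things.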
Mining 150k names in patents to see where the Turkish ‘brain juice’ flows
Mining names: a word of caution
Can ‘Big Data’ answer any question?
• Trash in, gold out? Yes, to some extent
• Beware of biases induced by the data source itself
• Data access limitations / privacy issues
• Open Data vs. Free APIs vs. Commercial Databases
Still, tools make the impossible possible
Originating FDI leads
• NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.
• The idea behind it: “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”.
• A successful approach which Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.
• Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high-profile contacts in different countries / industrial sectors and reach out to them. Another method to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.
• NamSor™ filters data from millions of meaningless elements down to a few dozen actionable names.
• Domas Girtavicius, a senior consultant at Invest Lithuania, said "we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists".
Conclusions
• We recognize names in any language, any place, any database; we can classify and we can sort
• Onomastic class is no ‘hard fact’ like a place of birth, a nationality, etc., but it’s accurate and fine-grained
• As a statistics tool, it might be debatable. But as a data-mining tool, it’s sharp, simple and efficient: it can help find research directions and discover trends
• We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation
Thank you!
• http://fdimagnet.com/
• http://namsor.com/
July 2013, Embassy of Lithuania in Paris
• elian.carsenat@namsor.com
• +33 6 52 77 99 07
• Twitter @NamsSor_com
