Text mining names in ‘Big Data’ to recognize migration trends

Founder NamSor.com at NamSor™ Applied Onomastics
1 Jun 2014


  1. TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS. NamSor Applied Onomastics, 2014-05-30
  2. Names Data Mining is just a Tool
     Zeynep Değirmencioğlu, Şükrü Kaya, Şükrü Saracoğlu, Elian Carsenat, Hüseyin Yıldız, Mahmut Yıldırım, Fatih Öztürk, Mehmet Bölükbaşı, Mehmet Yılmaz, Elif Yıldırım, Ahmet Yıldırım, Mustafa Yücedağ, Mustafa Uzunyılmaz, Fatih Kılıç, Fatih Yılmaz, Murat Yıldırım, Hüseyin Kılıç, Oğuzhan Yıldız, Mevlüt Çavuşoğlu, … (Source: Freebase)
  3. What’s in a name? What’s a name?
     - Elian Carsenat
     - @ElianCarsenat (Twitter)
     - elian.carsenat@namsor.com
     - elian.carsenat@sfr.fr
     - tioulpanov (Skype)
     - NamSor.com
     - Onomastics = the science of proper names
  4. Onoma != Residence != Nationality (Source: OECD)
  5. NamSor sorts names: functions, use cases
     Functions: (1) Name Ling. Classification; (2) Name Transliteration & Matching; (3) Named Entity Extraction, Parsing
     Use cases: Multilingual Text Mining; Control Watch Lists; Social Networks Analytics; Geodemographics
  6. NamSor supervised learning
     Step 1 – Learn stereotypes. Training set: Athletes

       FN        LN
       Mette     Andersen
       Lene      Andersson
       Eva       Arndt-Riise
       Heidi     Astrup
       Mie       Augustesen
       Margot    Bærentzen
       Louise    Bager Nørgaard
       Marie     Bagger Rasmussen
       Yutta     Barding
       Ulla      Barding-Poulsen

       FN        LN
       Xian      Dongmei
       Zheng     Dongmei
       Jin       Dongxiang
       Xu        Dongxiang
       Li        Dongxiao
       Qin       Dongya
       Li        Dongying
       Han       Duan
       Li        Duihong
       Jiang     Fan

     Step 2 – Classify. Data set: Inventors

       Names to classify: birgitta agerberth, birgitte l. eriksen, bitao gong, bitten thorengaard, biwang Jiang
       Sorted into one group: bitao gong, biwang jiang
       Sorted into another group: birgitta agerberth, birgitte l. eriksen, bitten thorengaard
  7. Accuracy is measurable: ~80%
     The very first backtesting, on the onomastics of 150,000 Olympic Games athletes.

       Country       Total   Perf.   Top five onomastic classes (count)
       Japan          3794   97%     Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
       Mongolia        260   93%     Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
       Greece         1576   92%     Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
       Lithuania       262   89%     Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
       Italy          4150   89%     Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
       Poland         2818   88%     Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
       South Korea    2180   87%     South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5

     Euro athletes (excl. Anglo & Latin): breakdown accuracy 84%
     Ex-Yugoslavia athletes: breakdown accuracy 75%
  8. Decrypting identity across space/time: India Geodemographics (1914) (Source: Commonwealth WWI Casualties)
  9. Unsupervised learning is fine-grained: Country/Region, … Ex. Russian Federation
  10. In progress: Syrian names (backtesting)

        Onoma                   Count
        Syria                     201
        Saudi Arabia               20
        Iraq                        8
        Kuwait                      4
        United Arab Emirates        3
        Egypt                       3
        Qatar                       2
        Bahrain                     2
        Sudan                       2
        Lebanon                     2
        Algeria                     1
        Oman                        1
        Grand Total               249

      Sample names: طاهر الحريري عبدالغفار العيدة سليمان عبدالغفار شحادة قاسم الأسعد مؤمن حموده مفلح محمد الجراد نزار الحروب نزار العيدة سليمان أسامة الحراكي أنس الصغير خالد الهبول وفيق الواحد عبد إسراء يونس رشا نزهة زكريا محمد وهبة كمال بركات عيد محمد اللو […]

      Syrian names are recognized at ~80%. The other names may in fact be non-Syrian or generic to the Arab world.
  11. What can you dig with this tool?
  12. Mining 5M names to recognize Gender, breakdown by nationality/likely origin
  13. Mining 1M names to map Diasporas (Source: Twitter)
  14. Mining 3M Geo-Tweets: Population flows on Twitter

        Source            Target   Type       Id    Onoma           Weight
        United Kingdom    France   Directed    16   Great Britain       37
        Spain             France   Directed    55   Spain               14
        United States     France   Directed    75   Great Britain       12
        Turkey            France   Directed    79   Turkey              11
        Brazil            France   Directed    87   Portugal            10
        United Kingdom    France   Directed   112   Ireland              9
        Italy             France   Directed   152   Italy                7
        Switzerland       France   Directed   226   France               5
        Belgium           France   Directed   247   France               5
        United Kingdom    France   Directed   258   France               5
        Mexico            France   Directed   287   Spain                4
        Ireland           France   Directed   317   Great Britain        4
        United Kingdom    France   Directed   333   Italy                4
        United States     France   Directed   375   France               4

      Source: Twitter
  15. Mining 150k names in Patents to see where the Turkish ‘brain juice’ flows
  16. Mining names: a word of caution
  17. Can ‘Big Data’ answer any question?
      - Trash in, gold out? Yes, to some extent
      - Beware of biases induced by the data source itself
      - Data access limitations / privacy issues
      - Open Data vs. Free APIs vs. Commercial Databases
  18. Still, tools make possible the impossible
  19. Originating FDI leads
      - NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.
      - What is the idea behind it? “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories.”
      - Milda Darguzaite, the Managing Director of Invest Lithuania, considers this successful approach relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.
      - Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high-profile contacts in different countries and industrial sectors, and to reach out to them. Another method to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.
      - NamSor™ filters data from millions of meaningless elements down to a few dozen actionable names.
      - Domas Girtavicius, a Senior Consultant at Invest Lithuania, said "we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable".
      - Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists".
  20. Conclusions
      - We recognize names in any language, any place, any database; we can classify and we can sort
      - An onomastic class is not a ‘hard fact’ like a place of birth or a nationality, but it is accurate and fine-grained
      - As a statistics tool, it might be debatable. But as a data mining tool, it is sharp, simple and efficient: it can help find research directions and discover trends
      - We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation
  21. Thank you!
      - http://fdimagnet.com/
      - http://namsor.com/
      - July 2013, Embassy of Lithuania in Paris
      - elian.carsenat@namsor.com
      - +33 6 52 77 99 07
      - Twitter @NamsSor_com
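
The per-country performance figures on slide 7 can be checked from the counts printed on the same slide. The sketch below assumes that "Perf." is simply the number of athletes whose name was assigned to their own country divided by the number of athletes tested for that country; the slide does not state the formula, and the names `BACKTEST`, `accuracy` and `report` are illustrative only.

```python
# Counts copied from the slide-7 backtest table.
# country -> (athletes tested, athletes assigned to that same country)
BACKTEST = {
    "Japan": (3794, 3686),
    "Mongolia": (260, 243),
    "Greece": (1576, 1444),
    "Lithuania": (262, 234),
    "Italy": (4150, 3675),
    "Poland": (2818, 2486),
    "South Korea": (2180, 1901),
}


def accuracy(total, correct):
    """Share of names assigned to the expected country (assumed definition)."""
    return correct / total


def report(table):
    """Return country -> accuracy as a whole-number percentage."""
    return {
        country: round(100 * accuracy(total, correct))
        for country, (total, correct) in table.items()
    }


if __name__ == "__main__":
    for country, pct in report(BACKTEST).items():
        print(f"{country}: {pct}%")
```

Under that assumption the rounded percentages match the slide for all seven countries, which suggests the "Perf." column is a plain diagonal-over-row-total ratio.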
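
The edge table on slide 14 can be summarised by country of tweeting or by onomastic class. The following is a minimal sketch, assuming the Weight column is an additive count (the slide does not define it); the tuple layout and the helper `weight_by` are hypothetical and not part of any NamSor tool.

```python
from collections import Counter

# Rows copied from the slide-14 table: (source, target, onoma, weight).
EDGES = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]

# Tuple positions, named for readability.
SOURCE, TARGET, ONOMA, WEIGHT = range(4)


def weight_by(edges, column):
    """Sum edge weights grouped by the value found at position `column`."""
    totals = Counter()
    for edge in edges:
        totals[edge[column]] += edge[WEIGHT]
    return totals


if __name__ == "__main__":
    print("By source country:")
    for name, weight in weight_by(EDGES, SOURCE).most_common():
        print(f"  {name}: {weight}")
    print("By onomastic class:")
    for name, weight in weight_by(EDGES, ONOMA).most_common():
        print(f"  {name}: {weight}")
```

Grouping by source shows where the flow towards France comes from, while grouping by onoma shows which name classes travel, regardless of where the tweets were posted.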