Getting Things
Gregor William Stewart

Director of Product Management, Text Analytics

Basis Technology Corporation
Introduction
 Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight

 Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions
Overview
[Diagram with the headings Source, Tasks, Technologies, Adaptations. Labels, in extraction order: Properties; Description; Action (Input/Output); “Out of Box”; Comparison; Problem(s); Components; Suggested Adaptations; Challenge(s); Process; Potential Benefits; Approach(es); Adaptation Opportunities; Costs; Solution; Signal(s)]

 Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT
 Reaching “state of the art” in practice means adapting to source, task and user.
Source: SOCOM-2012-0000011-HT
 An Arabic language source document
 Letters/emails from one colleague to others, regarding policy
 Written years before it was acquired, processed
 Perhaps imperfectly transcribed, or OCRed into our forensics platform
 Of uncertain provenance, content, value
 Not a current web news article for wide consumption, with metadata

[Figure: page image of SOCOM-2012-0000011 (Arabic text, not recoverable from the extraction), annotated with the labels Vocabulary, Form, Domain and “Grammar”]
Task: Triage
 Triage: should we process further and/or urgently?
 Too few trained, trusted linguists to review all the documents in time
 Enable non-linguist to do linguist’s job
 Gisting: MT All vs. MT Names alone
 Combine Entity Extraction with Specialized Machine Translation
 Integrate into Triage workflow
 Signal: Documents Selected (How are guidelines interpreted?)
Technology: Entity Extraction (1)

Technology: Entity Extraction (2)
Technology: Entity Extraction (3)
[Diagram: entity extraction pipeline from Input Text to Output Text, with numbered callouts 1-5. Labels:
– Deterministic Extractor: Pattern Match (Regex), Exact Match (Gazetteer); User Defined Lists, User Defined Patterns
– Probabilistic Extractor: Supervised Model (Tagged Text), Unsupervised Model (Domain Text)
– Entity Redactor, Overlap Adjudication, Entity Joining, Filtering]
Adaptation: Entity Extraction to Triage
 Out of the box:
– False +/- because contextual cues are fewer/different.
– Weapon in this document missed, because not a default entity type.

 Adaptation:
– Add custom entity type(s) via deterministic extractor, e.g. weapons list

 Benefit:
– Highlights important documents that might otherwise be missed.
– Fast and unlikely to affect performance of other components

 Difficulties:
– Requires forethought, maintenance of lists and patterns in many
languages, but much less work than developing a new model
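
The adaptation above (a custom entity type added through the deterministic extractor) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Rosette API: the weapons list, the pattern, the `Mention` type and the "longest span wins" overlap rule are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    entity_type: str
    source: str  # "gazetteer" or "regex"


# Hypothetical user-defined list (exact match / gazetteer).
WEAPON_GAZETTEER = ["AK-47", "rocket-propelled grenade", "RPG", "mortar"]

# Hypothetical user-defined pattern (regex), e.g. "82 mm shells".
WEAPON_PATTERNS = [
    re.compile(r"\b\d+(?:\.\d+)?\s?mm\s+(?:round|shell)s?\b", re.IGNORECASE),
]


def gazetteer_matches(text: str, entries: List[str], entity_type: str) -> List[Mention]:
    """Find whole-word, case-insensitive occurrences of each list entry."""
    mentions = []
    for entry in entries:
        pattern = re.compile(r"(?<!\w)" + re.escape(entry) + r"(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            mentions.append(Mention(m.start(), m.end(), m.group(), entity_type, "gazetteer"))
    return mentions


def pattern_matches(text: str, patterns: List[re.Pattern], entity_type: str) -> List[Mention]:
    """Find every match of every user-defined pattern."""
    return [
        Mention(m.start(), m.end(), m.group(), entity_type, "regex")
        for p in patterns
        for m in p.finditer(text)
    ]


def adjudicate_overlaps(mentions: List[Mention]) -> List[Mention]:
    """Keep the longest mention when spans overlap (a simple rule assumed here)."""
    kept: List[Mention] = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def extract_weapons(text: str) -> List[Mention]:
    found = gazetteer_matches(text, WEAPON_GAZETTEER, "WEAPON")
    found += pattern_matches(text, WEAPON_PATTERNS, "WEAPON")
    return adjudicate_overlaps(found)
```

Because the lists and patterns live outside any statistical model, they can be edited per language without retraining, which is the trade-off the slide describes: maintenance work on lists instead of work on a new model.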

Task: Translation
 Produce standardized, “user language” versions of the source document
 Too few translators; name standardization particularly labor intensive
 Speed up translation without compromising quality
 MT All reduces translation productivity
 NER, Coref and Name Translation/Standardization
 Signals: Resource Selections, Corrections, Resolutions
Adaptation: Extraction to Translation (1)
 Out of the box:
– Same problems as in Gisting case, only now they matter more.

 Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain specific resource sets, and allow users to select them.

 Benefit:
– Fewer errors in highlighting should mean translation actually speeds up

 Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral

Adaptation: Extraction to Translation (2)
 Unsupervised algorithm clusters words with distributional similarities together
 Word cluster ID is one feature used in learning the sequence model
 Based on Collins & Singer (1999)
 Part of REX Field Training Kit
 Shown: random sample of words clustered with “Aleppo” in a ~10GB English model
 Note they’re almost all LOCs
 Would an annotated training corpus ever cover so many remote entities?

Thanks:~ Itai_Rolnick$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head

Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- ? Probably also the name of a village.
Atiak -- Town in Uganda
Waltha -- typo for Waltham? Town in Mass.
BASILICA -- type of church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- ? A pub in Belgium ??
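
To make "word cluster ID is one feature" concrete, here is a small sketch of how a cluster lookup might feed per-token features for a sequence model. The cluster table, the IDs and the feature names are invented for illustration; the REX Field Training Kit's actual feature set is not described in these slides.

```python
from typing import Dict, List, Union

Feature = Union[str, int, bool]

# Hypothetical word -> cluster ID table, as produced by unsupervised clustering
# over a large unannotated corpus. The IDs are made up for this example.
WORD_CLUSTERS: Dict[str, int] = {
    "aleppo": 417,
    "svetogorsk": 417,
    "yeisk": 417,
    "visited": 52,
    "left": 52,
}
UNKNOWN_CLUSTER = -1


def token_features(tokens: List[str], index: int, clusters: Dict[str, int]) -> Dict[str, Feature]:
    """Build a feature dict for one token, including its word cluster ID."""
    word = tokens[index]
    feats: Dict[str, Feature] = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "cluster": clusters.get(word.lower(), UNKNOWN_CLUSTER),
    }
    if index > 0:
        # The previous token's cluster gives the model some left context.
        feats["prev_cluster"] = clusters.get(tokens[index - 1].lower(), UNKNOWN_CLUSTER)
    return feats


def sentence_features(tokens: List[str], clusters: Dict[str, int]) -> List[Dict[str, Feature]]:
    return [token_features(tokens, i, clusters) for i in range(len(tokens))]
```

A town such as Yeisk may never appear in the annotated training corpus, but if it shares a cluster with Aleppo the model sees the same `cluster` value for both, which is how unannotated domain text can help with rare locations.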
Task: Cataloging
 Distill content into an index, to facilitate search and further refinement at scale
 Impossible to annotate more than a tiny fraction of documents by hand
 High quality automated enrichment that makes efficient use of knowledge resources and structure in data
 Many approaches, e.g. LSI, topic modeling, document classification
 Entity resolution is robust extension of NER; data and knowledge driven.
 Signals: mentions/aliases, shallow relationships between entities
Technology: Entity Resolution (1)
[Diagram: many mentions of “Alberto” resolved to distinct people]
 Mention variants: Alberto; Alberto Fernandez…; Alberto Amos Fernandez…; Alberto M. Fernandez…; Alberto Fernandez de la Puebla…; Albert Fernandez…; Alberto Fernandiz…; Alburto Fernandez…
 Context:
– … born in Cuba … US Ambassador
– … Chief of Cabinet … Argentina … Prof of Criminal Law…
– … born Sept 7, 1984 … cycling … Madrid
 Questions:
– Sportsmen? YES
– Ratio of Politicians to Sportsmen? 2:1
– Nickname “El Galleta?” ?
Technology: Entity Resolution (2)

Technology: Entity Resolution (3)

Technology: Entity Resolution (4)

Technology: Entity Resolution (5)
[Diagram: Resolution Engine, with numbered callouts 1-4. Labels: Entity Mention; Candidate Selection; Ranking; Link or Ghost; Entity Index (Learned, Seeded); Knowledge Base]
Adaptation: EntRes to Cataloging (1)
 Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain

 Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs

 Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer
confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents

 Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless re-indexing

Adaptation: EntRes to Cataloging (2)
 In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable

 In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated

 Scaling with learning introduces the need to factor evidence.
 Evidence factoring schemes need to be customized to use cases.
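
The difference between the two modes can be shown with a toy resolver. This is only a sketch under stated assumptions: the `Resolver` class, the Jaccard overlap used for ranking and the 0.2 threshold are invented for the example and do not describe the Basis Entity Resolution API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Entity:
    entity_id: str
    context: Set[str] = field(default_factory=set)


class Resolver:
    """Toy resolver: rank entities by context overlap, then link or not."""

    def __init__(self, knowledge_base: Iterable[Entity], learning: bool = False,
                 threshold: float = 0.2) -> None:
        # Copy the seed KB so each resolver owns its own state.
        self.index: Dict[str, Entity] = {
            e.entity_id: Entity(e.entity_id, set(e.context)) for e in knowledge_base
        }
        self.learning = learning
        self.threshold = threshold  # assumed cut-off, would need tuning
        self._new_count = 0

    @staticmethod
    def _score(a: Set[str], b: Set[str]) -> float:
        """Jaccard overlap between two context sets."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def resolve(self, mention_context: Set[str]) -> Optional[str]:
        best_id, best_score = None, 0.0
        for entity in self.index.values():
            score = self._score(mention_context, entity.context)
            if score > best_score:
                best_id, best_score = entity.entity_id, score
        if best_id is not None and best_score >= self.threshold:
            if self.learning:
                # Learning mode stores context, so state grows over time.
                self.index[best_id].context |= mention_context
            return best_id
        if self.learning:
            # Learning mode creates a new entity instead of a ghost.
            self._new_count += 1
            new_id = f"NEW-{self._new_count}"
            self.index[new_id] = Entity(new_id, set(mention_context))
            return new_id
        return None  # Linking mode: declare unknown, discard context
```

In this toy version the cost of Learning mode is visible directly: every stored context set enlarges the index that each later mention must be ranked against, and merged contexts can pull an entity's profile away from its seed, which is the drift the slide warns about.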

Task: Retrieval
 Find relevant information for further analysis
 String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.
 Deliver search modalities that are more productive but still interpretable and correctable
 Search using entity-driven facets, as well as keywords
 Signals: query log, click through, curation, corrections
Adaptation: EntRes to Retrieval
 Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean query, cf. aliases
– Complex, potentially misleading measure of confidence

 Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully

 Benefits:
– User spends less time confused, search is more productive

 Difficulties:
– Users still want to do things like exclude certain mentions.
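
One way to read "cues to expansion in string terms" is sketched below: an entity facet is shown together with the mention strings it expands to, and the user may exclude individual mentions. The index layout, entity ID and function names are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical catalog: entity ID -> mention strings resolved to it.
ENTITY_MENTIONS: Dict[str, Set[str]] = {
    "E1": {"Alberto Fernandez", "Alberto M. Fernandez", "Alburto Fernandez"},
}

# Hypothetical document index: document ID -> (entity ID, mention string) pairs.
DOCUMENTS: Dict[str, List[Tuple[str, str]]] = {
    "doc1": [("E1", "Alberto Fernandez")],
    "doc2": [("E1", "Alburto Fernandez")],
    "doc3": [("E1", "Alberto M. Fernandez")],
}


def expansion_cues(entity_id: str) -> List[str]:
    """Mention strings an entity facet expands to, shown to the user as a cue."""
    return sorted(ENTITY_MENTIONS.get(entity_id, set()))


def search_by_entity(entity_id: str,
                     exclude_mentions: FrozenSet[str] = frozenset()) -> List[str]:
    """Documents containing the entity via at least one non-excluded mention."""
    hits = []
    for doc_id, pairs in DOCUMENTS.items():
        if any(eid == entity_id and mention not in exclude_mentions
               for eid, mention in pairs):
            hits.append(doc_id)
    return sorted(hits)
```

Showing the expansion list next to the results keeps the search interpretable, and the `exclude_mentions` argument is one simple answer to the difficulty noted above, where users want to drop particular mentions from an entity's results.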

Summary
 News-trained NER is OK for Triage, but adding entity types via lists and patterns could improve results considerably.
 Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.
 Cataloging by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.
 Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.

Remaining Challenges
 Current reality: even “simple” adaptation can be difficult:
– Too much knowledge, experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually offline

 Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.
 Integration of explicit signals, e.g. corrections, and implicit signals, e.g. selections, is ongoing.

Q&A
gregor@basistech.com

Director of Product Management, Text Analytics

Basis Technology Corporation

HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

  • 1. Getting Things
      Gregor William Stewart, Director of Product Management, Text Analytics, Basis Technology Corporation
  • 2. Introduction
      Product Manager for Text Analytics, including:
        – Rosette Linguistics Platform
        – Entity Analytics
        – Name Indexing and Translation
        – Chat Translator
        – Highlight
      Questing for:
        – Quality: accuracy, performance
        – Coverage: languages, domains, genres
        – Integration: tasks, workflows, UX
        – Innovation: new aggregates, functions
  • 3. Overview
      – Source (1): Properties
      – Tasks (4): Description, Problem(s), Challenge(s), Approach(es), Solution, Signal(s)
      – Technologies (2): Action (Input/Output), Components, Process, Adaptation Opportunities
      – Adaptations (+): “Out of Box” Comparison, Suggested Adaptations, Potential Benefits, Costs
      – Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT
      – Reaching “state of the art” in practice means adapting to source, task and user.
  • 4. Source: SOCOM-2012-0000011-HT
      – An Arabic language source document
      – Letters/emails from one colleague to others, regarding policy
      – Written years before it was acquired, processed
      – Perhaps imperfectly transcribed, or OCRed into our forensics platform
      – Of uncertain provenance, content, value
      – Not a current web news article for wide consumption, with metadata
      [Image: a page of the Arabic source text of SOCOM-2012-0000011-HT, labeled Vocabulary, Form, Domain, “Grammar”]
  • 5. Task: Triage
      – Triage: should we process further and/or urgently?
      – Too few trained, trusted linguists to review all the documents in time
      – Enable non-linguist to do linguist’s job
      – Gisting: MT All vs. MT Names alone
      – Combine Entity Extraction with Specialized Machine Translation
      – Integrate into Triage workflow
      – Signal: Documents Selected (How are guidelines interpreted?)
  • 8. Technology: Entity Extraction (3)
      [Pipeline diagram, labels only:]
      – Input Text
      – Deterministic Extractor: Exact Match (Gazetteer), Pattern Match (Regex); User Defined Lists, User Defined Patterns
      – Probabilistic Extractor: Supervised Model, Unsupervised Model; Tagged Text, Domain Text
      – Entity Redactor: Overlap Adjudication, Entity Joining, Filtering
      – Output Text
  • 9. Adaptation: Entity Extraction to Triage
      Out of the box:
        – False positives and negatives, because contextual cues are fewer or different.
        – Weapon in this document missed, because not a default entity type.
      Adaptation:
        – Add custom entity type(s) via deterministic extractor, e.g. weapons list
      Benefit:
        – Highlights important documents that might otherwise be missed.
        – Fast and unlikely to affect performance of other components
      Difficulties:
        – Requires forethought, maintenance of lists and patterns in many languages, but much less work than developing a new model
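The adaptation on slide 9 (a custom entity type added through the deterministic extractor) can be sketched roughly as follows. This is a minimal illustration, not the Rosette implementation: the list entries, the calibre pattern, the `WEAPON` label and all function names are assumptions made for the example, and the "longest span wins" rule is only one possible overlap adjudication policy.

```python
import re

# Hypothetical user-defined list (gazetteer) for a custom WEAPON entity type.
WEAPON_GAZETTEER = {"ak-47", "rpg-7", "mortar"}

# Hypothetical user-defined pattern: a calibre such as "7.62mm" or "9 mm".
CALIBRE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?mm\b", re.IGNORECASE)


def gazetteer_matches(text, entries, label):
    """Exact, case-insensitive match of each list entry, not inside a longer word."""
    spans = []
    for entry in entries:
        pattern = re.compile(r"(?<!\w)" + re.escape(entry) + r"(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return spans


def regex_matches(text, pattern, label):
    """Every match of a user-defined pattern becomes a candidate span."""
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


def adjudicate(spans):
    """Resolve overlaps by preferring longer spans, then earlier ones."""
    kept = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def extract(text):
    """Run both deterministic components and return (mention, type) pairs."""
    spans = gazetteer_matches(text, WEAPON_GAZETTEER, "WEAPON")
    spans += regex_matches(text, CALIBRE_PATTERN, "WEAPON")
    return [(text[start:end], label) for start, end, label in adjudicate(spans)]
```

For example, `extract("They moved two AK-47 rifles and 7.62mm rounds.")` yields the mentions `AK-47` and `7.62mm`, both typed `WEAPON`. Keeping lists and patterns outside the statistical model is what makes this kind of change cheap compared with retraining.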
  • 10. Task: Translation
      – Produce standardized, “user language” versions of the source document
      – Too few translators; name standardization particularly labor intensive
      – Speed up translation without compromising quality
      – MT All reduces translation productivity
      – NER, Coref and Name Translation/Standardization
      – Signals: Resource Selections, Corrections, Resolutions
  • 11. Adaptation: Extraction to Translation (1)
      Out of the box:
        – Same problems as in Gisting case, only now they matter more.
      Adaptation:
        – Train unsupervised model to help with form and domain differences
        – Tune co-reference algorithm to most important entity types
        – Develop form/domain specific resource sets, and allow users to select them.
      Benefit:
        – Fewer errors in highlighting should mean translation actually speeds up
      Difficulties:
        – Often hard to amass a big enough corpus of like material for model building.
        – Form/Domain may be ephemeral
  • 12. Adaptation: Extraction to Translation (2)
      Thanks:~ Itai_Rolnick$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head
        Loveland -- City in Colorado
        Svetogorsk -- Town in Russia
        MASSOUD -- ? Probably also of a village.
        Atiak -- Town in Uganda
        Waltha -- typo for Waltham? - town in Mass
        BASILICA -- type of Church?
        Sapukai -- Town in Paraguay
        Yeisk -- Town in Russia
        Descoberto -- Town in Brazil
        SINKHOLE -- ? A pub in Belgium ??
      – Unsupervised algorithm clusters words with distributional similarities together
      – Word cluster ID is one feature used in learning the sequence model
      – Based on Collins & Singer (1999)
      – Part of REX Field Training Kit
      – Shown: random sample of words clustered with “Aleppo” in a ~10GB English model
      – Note they’re almost all LOCs
      – Would an annotated training corpus ever cover so many remote entities?
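How a word cluster ID might enter the sequence model as a feature can be sketched as below. The cluster table, its IDs and the function names are invented for illustration; only the feature name `prior_word_class` echoes the speaker notes. A real table would come from the unsupervised clustering step run over raw domain text.

```python
# Hypothetical word -> cluster ID table; the IDs are made up for this sketch.
WORD_CLASS = {"aleppo": 543, "yeisk": 543, "loveland": 543, "in": 12, "near": 12}
UNKNOWN_CLASS = -1


def word_class(token):
    """Look up the cluster ID of a token, ignoring case."""
    return WORD_CLASS.get(token.lower(), UNKNOWN_CLASS)


def token_features(tokens, i):
    """Features for token i: its own cluster ID and that of the previous token."""
    features = {
        "word_class": word_class(tokens[i]),
        "is_capitalized": tokens[i][:1].isupper(),
    }
    if i > 0:
        features["prior_word_class"] = word_class(tokens[i - 1])
    return features
```

With this table, `token_features(["Fighting", "near", "Yeisk"], 2)` gives "Yeisk" the same `word_class` as "Aleppo", so a model that learned from annotated mentions of one can generalize to the other even if "Yeisk" never appeared in the annotated corpus.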
  • 13. Task: Cataloging
      – Distill content into an index, to facilitate search and further refinement at scale
      – Impossible to annotate more than a tiny fraction of documents by hand
      – High quality automated enrichment that makes efficient use of knowledge resources and structure in data
      – Many approaches, e.g. LSI, topic modeling, document classification
      – Entity resolution is a robust extension of NER; data and knowledge driven.
      – Signals: mentions/aliases, shallow relationships between entities
  • 14. Technology: Entity Resolution (1)
      [Diagram: many mentions of “Alberto” and “Alberto Fernandez…”, including the variants Alberto Amos, Alberto Fernandiz, Albert Fernandez, Alburto Fernandez, Alberto M. Fernandez and Alberto Fernandez de la Puebla, shown with these context snippets:]
      – … born in Cuba … US Ambassador
      – … Chief of Cabinet … Argentina …
      – … Prof of Criminal Law …
      – … born Sept 7, 1984 … cycling … Madrid
      Questions posed: Sportsmen? YES. Ratio of Politicians to Sportsmen? 2:1. Nickname “El Galleta?” ?
  • 18. Technology: Entity Resolution (5)
      [Architecture diagram, labels only: Knowledge Base; Entity Index (Seeded, Learned); Resolution Engine with Candidate Selection and Ranking; Entity Mention; Link or Ghost]
  • 19. Adaptation: EntRes to Cataloging (1)
      Out of the box:
        – Quality dependent on output of extraction and order of input
        – Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
        – Seeding context selection may not be suited to domain
      Adaptations:
        – Custom KB, sized and suited to the domain and languages
        – Seeding using context most likely to match in your domain
        – Choose Linking or Learning mode
        – Choose evidence factoring scheme that meets your operational needs
      Benefits:
        – Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
        – System can maintain low latency after ingestion of many documents
        – Linking accuracy can remain high after ingestion of many documents
      Difficulties:
        – Each element requires experimentation and thought
        – Changes likely to cause discontinuities unless re-indexing
  • 20. Adaptation: EntRes to Cataloging (2)
      In Linking mode:
        – Link to existing KB or declare unknown, discarding context
        – State size is constant, latency stable
      In Learning mode:
        – Link to existing KB or create New, storing context
        – State size increases, increasing latency
        – Semantic drift
        – Confidence measure gets complicated
      Scaling with learning introduces the need to factor evidence.
      Evidence factoring schemes need to be customized to use cases.
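The difference between the two modes on slide 20 can be made concrete with a toy resolver. Everything here is an assumption for illustration: the class name, the Jaccard score, the threshold and the `NEW-n` identifiers are not taken from the Entity Resolution API. The point is only to show where state stays constant and where it grows.

```python
class Resolver:
    """Toy resolver: each entity is a set of context words keyed by an ID."""

    def __init__(self, knowledge_base, learning=False, threshold=0.2):
        self.index = {eid: set(words) for eid, words in knowledge_base.items()}
        self.learning = learning
        self.threshold = threshold
        self.new_count = 0

    def resolve(self, context_words):
        context = set(context_words)
        best_id, best_score = None, 0.0
        for eid, words in self.index.items():
            union = context | words
            if not union:
                continue
            score = len(context & words) / len(union)  # Jaccard overlap
            if score > best_score:
                best_id, best_score = eid, score
        if best_id is not None and best_score >= self.threshold:
            if self.learning:
                # Learning mode stores the context, so the entity's state grows.
                self.index[best_id] |= context
            return best_id
        if self.learning:
            # Learning mode creates a new entity from the unmatched context.
            self.new_count += 1
            new_id = f"NEW-{self.new_count}"
            self.index[new_id] = context
            return new_id
        # Linking mode declares the mention unknown and discards its context.
        return None
```

In linking mode the index never changes, which is why state size and latency stay stable. In learning mode every resolved mention can add words to an entity, which is also the mechanism by which an entity's profile can drift away from its original meaning.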
  • 21. Task: Retrieval
      – Find relevant information for further analysis
      – String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.
      – Deliver search modalities that are more productive but still interpretable and correctable
      – Search using entity-driven facets, as well as keywords
      – Signals: query log, click through, curation, corrections
  • 22. Adaptation: EntRes to Retrieval
      Out of the box:
        – Entity labels not in user’s language are confusing
        – Returns results that can’t be easily summarized as a Boolean, cf. aliases
        – Complex, potentially misleading measure of confidence
      Adaptations:
        – Use name translation for non user-language labels, e.g. from KB
        – Present users with cues to expansion in string terms, e.g. mentions
        – Present confidence measure carefully
      Benefits:
        – User spends less time confused, search is more productive
      Difficulties:
        – Users still want to do things like exclude certain mentions.
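One way to read "cues to expansion in string terms" is that an entity-facet search should report which mention strings it actually matched, and let the user exclude some of them. The sketch below assumes a hypothetical catalog layout (document ID to resolved entity ID to mention strings) and invented entity IDs; it is not a description of any Basis product.

```python
# Hypothetical catalog: document ID -> {entity ID: mention strings found there}.
CATALOG = {
    "doc1": {"E1": ["Alberto Fernandez"]},
    "doc2": {"E1": ["Alberto M. Fernandez"], "E2": ["Madrid"]},
    "doc3": {"E2": ["Madrid"]},
}


def search_by_entity(catalog, entity_id, exclude_mentions=()):
    """Return matching documents with their mentions, plus the expansion cue.

    The second return value lists every mention string the entity facet
    expanded to, so the user can inspect and correct the expansion.
    """
    excluded = set(exclude_mentions)
    hits = []
    expansion = set()
    for doc_id in sorted(catalog):
        mentions = [m for m in catalog[doc_id].get(entity_id, []) if m not in excluded]
        if mentions:
            hits.append((doc_id, mentions))
            expansion.update(mentions)
    return hits, sorted(expansion)
```

Calling `search_by_entity(CATALOG, "E1")` returns both documents along with the two surface forms that were matched; passing `exclude_mentions=["Alberto M. Fernandez"]` narrows the result to `doc1`, which addresses the difficulty noted on the slide.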
  • 23. Summary
      – News-trained NER OK for Triage, but adding entity types via lists and patterns could improve results considerably.
      – Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved or wasted.
      – Cataloging by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.
      – Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.
  • 24. Remaining Challenges
      Current reality: even “simple” adaptation can be difficult:
        – Too much knowledge, experience required
        – Too much data required, e.g. 10GB for unsupervised
        – Mostly “out of band”
        – Usually Offline
      Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.
      Integration of explicit signals, e.g. corrections, and implicit signals, e.g. selections, is ongoing.
  • 25. Q&A gregor@basistech.com Director of Product Management, Text Analytics Basis Technology Corporation

Editor's notes

  1. Getting “high quality” entities from text. Doing it quickly and accurately. Guiding people in their use.
  2. Adapt to your needs. A system that adapts to data. Could use click-through data to factor evidence.
  3. Note: time and place are missing from the diagram; they affect both vocabulary and grammar.
  4. Operational priorities change too quickly to merit the development of a model for interest, and the learned model would probably miss many things that we wanted to see. Task; Problem(s); Challenge(s); Approach(es); Solution.
  5. Traditionally, finding (putative) entity mentions in text:
      – Mark spans that we think refer to something “in the world”
      – For each, make a guess about the kind of thing it refers to, e.g. PERSON, PLACE, ORGANISATION
      – Optionally, group the mentions that you think co-refer into chains
     Most often called Named Entity Recognition (NER). An embarrassingly good method combines a statistical B-I-O sequence model with lists and known patterns. The statistical model is typically trained using local features over annotated newswire text: abundance, quality.
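The B-I-O scheme mentioned above tags each token as Beginning an entity, Inside one, or Outside any. A minimal sketch of turning such tags back into typed mentions follows; the function name and the lenient handling of a stray `I-` tag are choices made for this example, not something the talk specifies.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (mention text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current_type
        if current_tokens and not continues:
            # The open span ends here: on "O", on "B-", or on a type change.
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix == "B" or (prefix == "I" and not current_tokens):
            # An I- tag with no open span is treated as starting a new one.
            current_tokens, current_type = [token], etype
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans
```

For the tokens `Gregor Stewart works at Basis Technology` tagged `B-PERSON I-PERSON O O B-ORGANISATION I-ORGANISATION`, this produces one PERSON mention and one ORGANISATION mention.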
  6. Deterministic or Explicit Components:
      – Gazetteers: lists, e.g. company names, product names
      – Regular Expressions: patterns, e.g. …
     Probabilistic or Implicit Components:
      – Training and testing data, e.g. annotated newswire, raw domain text
      – Features, e.g. metadata_subject=markets, prior_word_class=543
      – Learners
      – Model(s): what the learner outputs
     Combiner/Redactor:
      – Adjudication between component outputs
      – Entity Joining/In-document Co-reference Resolution
      – Modify joining rules
      – Set confidence thresholds
      – Identify entity types consistently
      – Set weight or length preferences
     Easier:
      – Novel entities with small number of forms
      – Novel, highly productive but structured entities: regular expressions
      – Forms we know aren’t entities: blacklists
      – Broad vocabulary and style shift: using unsupervised word class models
     Harder:
      – New Entity Types: additional annotated data, and feature engineering
      – Structure change: additional annotated data if within bounds set by features
      – Fine Grained Entities (lots of data and annotation)
  7. Extraction and Co-Reference:
      – Performance varies greatly by entity type and language
      – Brittle to changes in domain and genre: distribution of entity types; vocabulary differences; “grammar” or structure, inc. document length, abbreviation
     Data sparsity means:
      – Fine grained, rarer entities can be very difficult to extract
      – Performance on very short texts is typically very low
      – Entity types decided up front/embedded in models
  8. The first step is to build a representation of the entity-base structured to make feature evaluation easy, so we can learn to link. Our system begins by building an index of the information in the knowledge base. The entity-base can be anything from a list, to a database, to a graph, to a rich, semi-structured text resource like Wikipedia. For each entity in the knowledge base, we create an entry in the index containing information that is known to be useful for efficiently differentiating it from other entities (called features), e.g. the non-stop words in a canonical mention sentence, like “president” and “USA” in the opening line of Barack Obama’s Wikipedia page.
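A rough sketch of the indexing step described in this note: each knowledge base entry is reduced to the non-stop words of its canonical sentence. The stop-word list, the knowledge base layout and the function names are assumptions for illustration, and the example sentence in the usage note is a paraphrase, not a quotation from Wikipedia.

```python
import re

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "is", "and", "in", "who", "as"}


def features(sentence):
    """Lower-cased non-stop words of a sentence, used as index features."""
    words = re.findall(r"\w+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_index(knowledge_base):
    """Map each entity ID to its label and the features of its canonical sentence."""
    return {
        entity_id: {"label": entry["label"], "features": features(entry["sentence"])}
        for entity_id, entry in knowledge_base.items()
    }
```

Given a hypothetical entry with label "Barack Obama" and sentence "Barack Obama is the president of the USA", the index entry keeps `barack`, `obama`, `president` and `usa` as features and drops the stop words.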
  9. (AT LEFT) Let’s focus on four of these coreference chains: Hyon Song-wol, Ri Sol-ju, Wangjeasan Light Music Band and Chosun. In a first pass, we compare the surface form of the mentions in each chain with the labels of the entities in the index; this generates a small number of candidates for closer consideration. In a second pass, we score the degree of similarity of these mentions and their surrounding context, like the non-stop words in the sentences they appear in, with the contents of the candidate entity entries. We can think of this as trying to find which entity or entities the mentions are closest to in some space, called the feature space. Here we can see that the Hyon Song-wol and Wangjeasan groups are quite closely associated with the entities we would expect them to be; that Chosun is equally well associated with two entities; and that Ri Sol-ju is not particularly closely associated with any of the known entities.
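The two passes described in this note can be sketched as below. The index contents, the token-overlap candidate test, the Jaccard score and the threshold are all stand-ins chosen to keep the example small; the note does not say which similarity measure the real system uses.

```python
# Hypothetical index: entity ID -> label and context features.
INDEX = {
    "E1": {"label": "Chosun Ilbo", "features": {"newspaper", "seoul", "daily"}},
    "E2": {"label": "Chosun", "features": {"korea", "dynasty", "kingdom"}},
    "E3": {"label": "Hyon Song-wol", "features": {"singer", "band", "pyongyang"}},
}


def select_candidates(mention, index):
    """First pass: entities whose label shares a token with the mention."""
    mention_tokens = set(mention.lower().split())
    return [
        eid for eid, entry in index.items()
        if mention_tokens & set(entry["label"].lower().split())
    ]


def rank(context, candidates, index):
    """Second pass: score each candidate by Jaccard overlap with the context."""
    scored = []
    for eid in candidates:
        feats = index[eid]["features"]
        union = context | feats
        score = len(context & feats) / len(union) if union else 0.0
        scored.append((score, eid))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def resolve(mention, context, index, threshold=0.2):
    """Return the best entity ID, or None when nothing is close (a "ghost")."""
    ranked = rank(context, select_candidates(mention, index), index)
    if ranked and ranked[0][0] >= threshold:
        return ranked[0][1]
    return None
```

Here a mention of "Chosun" in a context containing `newspaper` and `seoul` has two candidates after the first pass and is separated only by the second, while "Ri Sol-ju" has no candidates at all and comes back as a ghost.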
  10. (AT LEFT) In this example, our scoring resolves Hyon Song-wol and Wangjeasan correctly to the respective entries in Wikipedia; correctly identifies that Ri Sol-ju is a genuinely new name or “ghost” entity which we may wish to create a knowledge base entry for; but incorrectly associates Chosun with the Wikipedia page for Korea, rather than the news agency Chosun Ilbo. Had Chosun been correctly tagged as an ORG at the NER stage, it would almost certainly have been resolved correctly. This example emphasizes how important high quality foundational linguistic components are for higher level tasks, and how flexibility must be built into downstream algorithms to prevent the errors that do occur from being unrecoverable.
  11. Deterministic or Explicit Components:
      – Gazetteers: lists, e.g. company names, product names
      – Regular Expressions: patterns, e.g. …
     Probabilistic or Implicit Components:
      – Training and testing data, e.g. annotated newswire, raw domain text
      – Features, e.g. metadata_subject=markets, prior_word_class=543
      – Learners
      – Model(s): what the learner outputs
     Combiner/Redactor:
      – Adjudication between component outputs
      – Entity Joining/In-document Co-reference Resolution
      – Modify joining rules
      – Set confidence thresholds
      – Identify entity types consistently
      – Set weight or length preferences
     Easier:
      – Novel entities with small number of forms
      – Novel, highly productive but structured entities: regular expressions
      – Forms we know aren’t entities: blacklists
      – Broad vocabulary and style shift: using unsupervised word class models
     Harder:
      – New Entity Types: additional annotated data, and feature engineering
      – Structure change: additional annotated data if within bounds set by features
      – Fine Grained Entities (lots of data and annotation)
  12. If recent TAC data is anything to go by, entity linking is expected to associate very different strings.
  13. REX Field Training Kit: a package of tools and processes for English, Pashto.
      Provides guidelines for:
       – Effective use of gazetteer and regular expression components
       – Annotation of data and training of supervised models
      Clustering tool allows adaptation to domain vocabulary for languages that have word class data.
      Slated for 2.0:
       – Coverage for all languages REX supports, inc. Korean, Arabic
       – Seed resources for specific domains
  14. Less data: better balance between stability of models and volume of data available for adaptation.
      Less effort: automated adaptation, tools and UIs for annotation projects.
      Inline, e.g. task performance as feedback, in addition to correction.
      Online, e.g. dynamic knowledge sources without discontinuities.