Many of the most robust Human Language Technologies, including statistical part-of-speech taggers and entity extractors, are developed primarily using high-quality newswire data sources. The performance of these technologies on texts in other genres, including short texts like tweets and even sub-genres of news like market summaries, is typically poor. Adapting such technologies to these increasingly important genres remains difficult and is an active area of commercial and academic research. In this presentation, Mr. Stewart will highlight the ways in which newswire-trained modules typically fail on the most important emerging text genres, outline the most effective and lowest-cost adaptation methods that researchers and practitioners have discovered, and offer guidance on what degree of improvement users can expect to see in the short to medium term.
2. Introduction
Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight
Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions
3. Overview
[Diagram: four processing stages, each pairing a Source with Tasks, Technologies and Adaptations]
– Tasks: Description, Problem(s), Challenge(s), Approach(es), Solution, Signal(s)
– Technologies: Action (Input/Output), Components, Process, Adaptation Opportunities
– Adaptations: “Out of Box” Comparison, Suggested Adaptations, Potential Benefits, Costs
Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT.
Reaching “state of the art” in practice means adapting to source, task and user.
5. Task: Triage
Triage: should we process further and/or urgently?
Too few trained, trusted linguists to review all the documents in time
Enable non-linguists to do a linguist’s job
Gisting: MT All vs. MT Names alone
Combine Entity Extraction with Specialized Machine Translation
Integrate into Triage workflow
Signal: Documents Selected (How are guidelines interpreted?)
8. Technology: Entity Extraction (3)
[Diagram: extraction pipeline]
Input Text →
– Deterministic Extractor: Pattern Match (Regex) with User Defined Patterns; Exact Match (Gazetteer) with User Defined Lists
– Probabilistic Extractor: Supervised Model; Unsupervised Model (built from Domain Text and Tagged Text)
– Entity Redactor: Overlap Adjudication, Entity Joining, Filtering
→ Output Text
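The redaction stage of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the product's algorithm: the `Span` type and the "prefer longer, then heavier" adjudication policy are assumptions for the example.

```python
# Sketch of overlap adjudication: keep a non-overlapping subset of the
# candidate spans produced by the deterministic and probabilistic
# extractors, preferring longer spans, then higher-weight ones.
from dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int
    etype: str
    weight: float  # e.g. 1.0 for gazetteer hits, model probability otherwise

def redact(spans):
    """Adjudicate overlaps: longer spans win, then heavier ones."""
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), -s.weight))
    kept = []
    for s in ordered:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)

spans = [
    Span(0, 14, "PERSON", 0.9),   # probabilistic: "Alberto Fernandez"
    Span(8, 14, "PERSON", 1.0),   # gazetteer: "Fernandez" (subsumed, dropped)
    Span(20, 26, "WEAPON", 1.0),  # user-defined list hit (kept)
]
print(redact(spans))
```

A real redactor would also apply the confidence thresholds and entity-joining rules the later notes mention; this shows only the overlap step.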
9. Adaptation: Entity Extraction to Triage
Out of the box:
– False positives/negatives because contextual cues are fewer/different.
– Weapon in this document missed, because not a default entity type.
Adaptation:
– Add custom entity type(s) via deterministic extractor, e.g. weapons list
Benefit:
– Highlights important documents that might otherwise be missed.
– Fast and unlikely to affect performance of other components
Difficulties:
– Requires forethought and maintenance of lists and patterns in many languages, but much less work than developing a new model
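The suggested adaptation, adding a custom entity type via an exact-match gazetteer, can be sketched as below. The term list and the longest-match-first scan are illustrative assumptions, not the product's implementation.

```python
# Minimal gazetteer extractor for a custom entity type (e.g. a weapons
# list). Longer terms are tried first so "rocket launcher" beats "rocket".
import re

def gazetteer_extractor(terms, etype):
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(terms, key=len, reverse=True))) + r")\b",
        re.IGNORECASE,
    )
    def extract(text):
        return [(m.start(), m.end(), etype, m.group(0)) for m in pattern.finditer(text)]
    return extract

extract_weapons = gazetteer_extractor(["RPG", "rocket launcher", "AK-47"], "WEAPON")
print(extract_weapons("Cache contained an AK-47 and a rocket launcher."))
```

Because the matcher is deterministic and runs beside the statistical model, it adds coverage without retraining, which is the "fast and unlikely to affect other components" benefit above.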
10. Task: Translation
Produce standardized, “user language” versions of the source document
Too few translators; name standardization particularly labor intensive
Speed up translation without compromising quality
MT All reduces translation productivity
NER, Coref and Name Translation/Standardization
Signals: Resource Selections, Corrections, Resolutions
11. Adaptation: Extraction to Translation (1)
Out of the box:
– Same problems as in the Gisting case, only now they matter more.
Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain specific resource sets, and allow users to select them.
Benefit:
– Fewer errors in highlighting should mean translation actually speeds up
Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral
12. Adaptation: Extraction to Translation (2)
Unsupervised algorithm clusters words with distributional similarities together
Word cluster ID is one feature used in learning the sequence model
Based on Collins & Singer (1999)
Part of REX Field Training Kit
Shown: random sample of words clustered with “Aleppo” in a ~10GB English model
Note they’re almost all LOCs
Would an annotated training corpus ever cover so many remote entities?
Thanks: Itai Rolnick
~$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head
Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- probably also a village?
Atiak -- Town in Uganda
Waltha -- typo for Waltham? - town in Mass.
BASILICA -- type of church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- a pub in Belgium??
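The point of the cluster is that the cluster ID, not the rare word itself, feeds the sequence model, so an unseen place name inherits the behavior of known ones. A toy sketch, in which the cluster table and feature names are illustrative assumptions:

```python
# Word-cluster IDs as features: unseen LOC-like words share a cluster
# with known places, so the sequence model can generalize to them.
CLUSTERS = {
    "aleppo": 543, "loveland": 543, "yeisk": 543, "atiak": 543,  # LOC-like cluster
    "president": 12, "minister": 12,                              # title-like cluster
}

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word": word.lower(),
        "word_class": CLUSTERS.get(word.lower(), -1),  # -1 = unclustered
        "prior_word_class": CLUSTERS.get(tokens[i - 1].lower(), -1) if i else -1,
    }

tokens = ["Fighting", "near", "Atiak", "continued"]
print(token_features(tokens, 2))
```

A feature like `prior_word_class=543` is exactly the kind of implicit feature the later notes list for the probabilistic components.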
13. Task: Cataloging
Distill content into an index, to facilitate search and further refinement at scale
Impossible to annotate more than a tiny fraction of documents by hand
High quality automated enrichment that makes efficient use of knowledge resources and structure in data
Many approaches, e.g. LSI, topic modeling, document classification
Entity resolution is a robust extension of NER; data and knowledge driven.
Signals: mentions/aliases, shallow relationships between entities
14. Technology: Entity Resolution (1)
[Diagram: mentions of “Alberto Fernandez” and variants (“Alberto M. Fernandez”, “Albert Fernandez”, “Alberto Fernandiz”, “Alburto Fernandez”, “Alberto Amos Fernandez”, “Alberto Fernandez de la Puebla”) grouped into three candidate entities:]
– … born in Cuba … US Ambassador
– … Chief of Cabinet … Argentina … Prof of Criminal Law …
– … born Sept 7, 1984 … cycling … Madrid
Example questions over the resolved entities: Sportsmen? YES. Ratio of Politicians to Sportsmen? 2:1. Nickname “El Galleta”? ?
18. Technology: Entity Resolution (5)
[Diagram: Resolution Engine]
1. Knowledge Base → 2. Entity Index (Learned or Seeded)
Entity Mention → 3. Candidate Selection → 4. Ranking → Link or Ghost
19. Adaptation: EntRes to Cataloging (1)
Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain
Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs
Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents
Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless you re-index
20. Adaptation: Ent Res to Cataloging (2)
In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable
In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated
Scaling with learning introduces the need to factor evidence.
Evidence factoring schemes need to be customized to use cases.
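The Linking/Learning trade-off above can be made concrete with a toy resolver. The in-memory index, Jaccard scoring and threshold are assumptions for illustration; the real engine's scoring is not shown here.

```python
# Linking mode discards context for unknowns (constant state); Learning
# mode stores it as a new entity (state grows, and latency with it).
class Resolver:
    def __init__(self, kb, mode="linking", threshold=0.5):
        self.index = dict(kb)   # entity -> set of context feature words
        self.mode = mode
        self.threshold = threshold

    def resolve(self, mention, context):
        best, best_score = None, 0.0
        for entity, feats in self.index.items():
            s = len(context & feats) / len(context | feats)  # Jaccard
            if s > best_score:
                best, best_score = entity, s
        if best_score >= self.threshold:
            return best
        if self.mode == "learning":
            self.index[mention] = set(context)  # state grows
            return mention
        return None  # "ghost": unknown, context discarded

kb = {"Aleppo": {"syria", "city"}}
linker = Resolver(kb, mode="linking")
learner = Resolver(kb, mode="learning")
linker.resolve("Atiak", {"uganda", "town"})
learner.resolve("Atiak", {"uganda", "town"})
print(len(linker.index), len(learner.index))  # 1 2
```

Even in this toy, the Learning-mode index grows with every unresolved mention, which is why factoring (and eventually discarding) evidence becomes necessary at scale.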
21. Task: Retrieval
Find relevant information for further analysis
String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.
Deliver search modalities that are more productive but still interpretable and correctable
Search using entity-driven facets, as well as keywords
Signals: query log, click through, curation, corrections
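Entity-driven faceting can be illustrated with a tiny document index. The documents and entity IDs below are invented for the example; the point is that filtering by a resolved entity ID disambiguates where the surface string cannot.

```python
# Keyword search vs. entity facet: the ambiguous string "Fernandez"
# matches all three documents, but the entity facet isolates the cyclist.
docs = [
    {"id": 1, "text": "Fernandez speaks on Cuba", "entities": {"Q:AlbertoFernandez(diplomat)"}},
    {"id": 2, "text": "Fernandez wins in Madrid", "entities": {"Q:AlbertoFernandez(cyclist)"}},
    {"id": 3, "text": "Fernandez chairs cabinet", "entities": {"Q:AlbertoFernandez(politician)"}},
]

def facet_search(keyword, entity=None):
    hits = [d for d in docs if keyword.lower() in d["text"].lower()]
    if entity:
        hits = [d for d in hits if entity in d["entities"]]
    return [d["id"] for d in hits]

print(facet_search("fernandez"))                                        # [1, 2, 3]
print(facet_search("fernandez", entity="Q:AlbertoFernandez(cyclist)"))  # [2]
```

This is the productivity gain the slide claims: the user selects a facet once instead of hand-building Boolean queries over every alias.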
22. Adaptation: EntRes to Retrieval
Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean, cf. aliases
– Complex, potentially misleading measure of confidence
Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully
Benefits:
– User spends less time confused, search is more productive
Difficulties:
– Users still want to do things like exclude certain mentions.
23. Summary
News-trained NER is adequate for Triage, but adding entity types via lists and patterns could improve results considerably.
Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.
Cataloguing by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.
Entity-based search is far more productive than Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.
24. Remaining Challenges
Current reality: even “simple” adaptation can be difficult:
– Too much knowledge and experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually offline
Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.
Integration of explicit signals, e.g. corrections, and implicit signals, e.g. selections, is ongoing.
Getting “high quality” entities from text. Doing it quickly and accurately. Guiding people in their use.
Adapt to your needs: a system that adapts to data. Could use click-through data to factor evidence.
Note: time and place are missing from the diagram – they affect both vocabulary and grammar.
Operational priorities change too quickly to merit the development of a model for interest, and the learned model would probably miss many things that we wanted to see. (Slide fields: Task, Problem(s), Challenge(s), Approach(es), Solution)
Traditionally, finding (putative) entity mentions in text:
– Mark spans that we think refer to something “in the world”
– For each, make a guess about the kind of thing it refers to, e.g. PERSON, PLACE, ORGANISATION
– Optionally, group the mentions that you think co-refer into chains
Most often called Named Entity Recognition (NER). An embarrassingly good method combines a statistical B-I-O sequence model with lists and known patterns. The statistical model is typically trained using local features over annotated newswire text: abundance, quality.
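The B-I-O scheme mentioned in the notes can be shown in miniature: each token receives B-TYPE (begins an entity), I-TYPE (inside one), or O (outside). The example sentence and span indices are invented for illustration.

```python
# Convert character-free, token-level entity spans into B-I-O tags,
# the label scheme the statistical sequence model is trained on.
def spans_to_bio(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Hyon", "Song-wol", "visited", "Chosun", "Ilbo"]
print(spans_to_bio(tokens, [(0, 2, "PERSON"), (3, 5, "ORGANIZATION")]))
# ['B-PERSON', 'I-PERSON', 'O', 'B-ORGANIZATION', 'I-ORGANIZATION']
```

Training data in this form is what makes annotated newswire so valuable, and what makes new genres expensive: every genre shift can require re-annotation.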
Deterministic or Explicit Components:
– Gazetteers: Lists, e.g. company names, product names
– Regular Expressions: Patterns, e.g. …
Probabilistic or Implicit Components:
– Training and testing data, e.g. annotated newswire, raw domain text
– Features, e.g. metadata_subject=markets, prior_word_class=543
– Learners
– Model(s) – what the learner outputs
Combiner/Redactor:
– Adjudication between component outputs
– Entity Joining/In-document Co-reference Resolution
– Modify joining rules, set confidence thresholds, identify entity types consistently, set weight or length preferences
Easier:
– Novel entities with a small number of forms – gazetteer lists
– Novel, highly productive but structured entities – regular expressions
– Forms we know aren’t entities – blacklists
– Broad vocabulary and style shift – using unsupervised word class models
Harder:
– New Entity Types – additional annotated data, and feature engineering
– Structure change – additional annotated data if within bounds set by features
– Fine Grained Entities – lots of data and annotation
Extraction and Co-Reference performance varies greatly by entity type and language.
Brittle to changes in domain and genre:
– Distribution of Entity Types
– Vocabulary Differences
– “Grammar” or Structure, inc. document length, abbreviation
Data sparsity means:
– Fine grained, rarer entities can be very difficult to extract
– Performance on very short texts is typically very low
Entity types decided up front/embedded in models.
First step is to build a representation of the entity-base structured to make feature evaluation easy, so we can learn to link. Our system begins by building an index of the information in the knowledge base – the entity-base can be anything from a list, to a database, to a graph, to a rich, semi-structured text resource like Wikipedia. For each entity in the knowledge base, we create an entry in the index containing information that is known to be useful for efficiently differentiating it from other entities (called features), e.g. the non-stop words in a canonical mention sentence, like “president” and “USA” in the opening line of Barack Obama’s Wikipedia page.
(AT LEFT) Let’s focus on four of these coreference chains: Hyon Song-wol, Ri Sol-ju, Wangjeasan Light Music Band and Chosun. In a first pass, we compare the surface form of the mentions in each chain with the labels of the entities in the index; this generates a small number of candidates for closer consideration. In a second pass, we score the degree of similarity of these mentions and their surrounding context, like the non-stop words in the sentences they appear in, with the contents of the candidate entity entries. We can think of this as trying to find which entity or entities the mentions are closest to in some space, called the feature space. Here we can see that the Hyon Song-wol and Wangjeasan groups are quite closely associated with the entities we would expect them to be; that Chosun is equally well associated with two entities; and that Ri Sol-ju is not particularly closely associated with any of the known entities.
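The two-pass scheme described here can be sketched with a toy knowledge base. The KB entries, alias sets and Jaccard scoring below are illustrative assumptions standing in for the real index and feature space.

```python
# Pass 1: candidate selection by surface-form/alias match.
# Pass 2: rank candidates by context overlap with indexed feature words.
KB = {
    "Chosun Ilbo": {"aliases": {"chosun", "chosun ilbo"},
                    "context": {"newspaper", "daily", "seoul"}},
    "Korea": {"aliases": {"korea", "chosun", "joseon"},
              "context": {"peninsula", "seoul", "pyongyang"}},
    "Hyon Song-wol": {"aliases": {"hyon song-wol"},
                      "context": {"singer", "band"}},
}

def candidates(mention):
    return [e for e, rec in KB.items() if mention.lower() in rec["aliases"]]

def score(context_words, entity):
    feats = KB[entity]["context"]
    return len(context_words & feats) / len(context_words | feats)  # Jaccard

context = {"newspaper", "reported", "daily"}
ranked = sorted(candidates("Chosun"), key=lambda e: score(context, e), reverse=True)
print(ranked)  # both candidates survive pass 1; context ranks "Chosun Ilbo" first
```

A mention with no alias match anywhere, like Ri Sol-ju in the example, yields an empty candidate list and becomes a ghost.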
(AT LEFT) In this example, our scoring resolves Hyon Song-wol and Wangjeasan correctly to the respective entries in Wikipedia; correctly identifies that Ri Sol-ju is a genuinely new name or “ghost” entity which we may wish to create a knowledge base entry for; but incorrectly associates Chosun with the Wikipedia page for Korea, rather than the news agency Chosun Ilbo. Had Chosun been correctly tagged as an ORG at the NER stage, it would almost certainly have been resolved correctly. This example emphasizes how important high quality foundational linguistic components are for higher level tasks, and how flexibility must be built into downstream algorithms to prevent the errors that do occur from being unrecoverable.
If recent TAC data is anything to go by, entity linking is expected to associate very different strings.
REX Field Training Kit: a package of tools and processes for English, Pashto.
Provides guidelines for:
– Effective use of gazetteer and regular expression components
– Annotation of data and training of supervised models
Clustering tool allows adaptation to domain vocabulary for languages that have word class data.
Slated for 2.0:
– Coverage for all languages REX supports, inc. Korean, Arabic
– Seed resources for specific domains
Less data: better balance between stability of models and volume of data available for adaptation.
Less effort: automated adaptation, tools and UIs for annotation projects.
Inline, e.g. task performance as feedback, in addition to correction.
Online, e.g. dynamic knowledge sources without discontinuities.