A Methodology and Tool Suite for
Evaluating the Accuracy of
Interoperating Statistical Natural
Language Processing Engines
Uma Murthy
Virginia Tech

John Pitrelli, Ganesh Ramaswamy,
Martin Franz, and Burn Lewis
IBM T.J. Watson Research Center


Interspeech
22-26 September 2008
Brisbane, Australia
Outline
•    Motivation
•    Context
•    Issues
•    Evaluation methodology
•    Example evaluation modules
•    Future directions


Motivation
•  Combining Natural Language Processing
   (NLP) engines for information processing in
   complex tasks
•  Evaluation of accuracy of output of individual
   NLP engines exists
   –  sliding window, BLEU score, word-error rate, etc.
•  No work on evaluation methods for large
   combinations, or aggregates, of NLP engines
   –  Foreign language videos → transcription →
      translation → story segmentation → topic
      clustering


Project Goal

To develop a methodology and tool suite for
   evaluating the accuracy (of output) of
 interoperating statistical natural language
            processing engines


           in the context of IOD


Interoperability Demonstration
System (IOD)




                       Built upon UIMA

Issues
1.  How is the accuracy of one engine or a set
    of engines evaluated, in the context of being
    present in an aggregate?
2.  What is the measure of accuracy of an
    aggregate and how can it be computed?
3.  How can the mechanics of this evaluation
    methodology be validated and tested?




“Evaluation Space”
•  Core of the evaluation methodology
•  Lays out the various comparison options and
   ground-truth options, based on
   human-generated and machine-generated
   outputs at every stage in the pipeline



1.  Comparison between M-M-M… and
    H-H-H… evaluates the accuracy of
    the entire aggregate


2.  Emerging pattern

3.  Comparison of adjacent
    evaluations determines
    how much one engine
    (TC) degrades accuracy
    of the aggregate
4.  Do not consider H-M
    sequences

5.  Comparing two engines of
    the same function

6.  Assembling ground truths
    is the most expensive
    task
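
As an illustration only, the human/machine labelling behind the evaluation space can be enumerated in a few lines of Python. The stage names are taken from the example pipeline on the Motivation slide; the helper name `evaluation_space` is hypothetical, and no attempt is made to filter out excluded sequences (such as the H-M case in point 4), since the slides do not define that rule precisely.

```python
from itertools import product

# Stage names from the example pipeline; H = human-generated, M = machine-generated.
STAGES = ["transcription", "translation", "story segmentation", "topic clustering"]

def evaluation_space(stages):
    """Return every assignment of H or M to each pipeline stage (hypothetical helper)."""
    return ["-".join(labels) for labels in product("HM", repeat=len(stages))]

space = evaluation_space(STAGES)
print(len(space))           # 2 ** 4 = 16 labelings
print(space[0], space[-1])  # H-H-H-H (all human) and M-M-M-M (all machine)
```

Since every stage needs its own human-generated output, the number of ground truths to assemble grows with the pipeline length, which is in line with point 6.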

Evaluation Modules
•  Use the evaluation space as a template to automatically
   evaluate the performance of an aggregate
•  Development
    –  Explore methods that are used to evaluate the last
       engine in the aggregate
    –  If required, modify these methods, considering
       •  Preceding engines and their input and output
       •  Different ground truth formats
•  Testing
    –  Focus on validating the mechanics of evaluation and
       not the engines in question


Example Evaluation Modules
•  STTSBD
 – Sliding-window scheme
 – Automatically generated comparable
   ROC curves
   •  Validated module with six 30-minute Arabic
      news shows
•  STTMT
 – BLEU metric
 – Automatically generated BLEU scores
   •  Validated module with two Arabic-English MT
      engines on 38 minutes of audio
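
For readers unfamiliar with the BLEU metric, a simplified single-reference, sentence-level version is sketched below. It is not the scorer used in the module above: the function name `simple_bleu`, whitespace tokenization, uniform n-gram weights up to 4-grams, and the absence of smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU against one reference (illustrative sketch only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        total = sum(cand_ngrams.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # 1.0
print(simple_bleu("the cat sat on a mat", reference))    # strictly between 0 and 1
```

Corpus-level BLEU, as normally reported for MT engines, pools the n-gram counts over all segments before taking the geometric mean rather than averaging per-sentence scores.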
Future Directions
•  Develop more evaluation modules and
   validate them
    –  Test with actual ground truths
    –  Test with more data-sets
    –  Test on different engines (of the same
       kind)
•  Methodology
    –  Identify points of error
    –  How much does an engine impact the
       performance of the aggregate?


Summary
•  Presented a methodology for automatic
   evaluation of accuracy of aggregates of
   interoperating statistical NLP engines
   –  Evaluation space and evaluation modules
•  Developed and validated evaluation modules
   for two aggregates

•  Miles to go!
   –  Small portion of a vast research area

Thank You



Questions?
Back-up Slides




Evaluation Module Implementation
•  Each module was implemented as a
   UIMA CAS consumer
•  Ground truth and other evaluation
   parameters were input as CAS
   Consumer parameters




Measuring the performance of
story boundary detection
TDT-style sliding window approach:
       partial credit for slightly misplaced segment boundaries




• True and system boundaries agree within the window → Correct
• No system boundary in a window containing a true boundary → Miss
• System boundary in a window containing no true boundary → False Alarm

• Window length: 15 seconds
                                     Source: Franz, et al. “Breaking Translation Symmetry”
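
The three window rules on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' module: the function name, the 1-second step between window positions, and the half-open window convention are assumptions; only the 15-second window length comes from the slide.

```python
def sliding_window_counts(true_bounds, sys_bounds, duration, window=15.0, step=1.0):
    """Count Correct / Miss / False Alarm windows over a show of `duration` seconds.

    true_bounds, sys_bounds: boundary times in seconds.
    step: spacing between successive window starts (an assumed value).
    """
    counts = {"correct": 0, "miss": 0, "false_alarm": 0}
    n_windows = int((duration - window) // step) + 1
    for i in range(max(n_windows, 0)):
        start = i * step
        end = start + window
        has_true = any(start <= t < end for t in true_bounds)
        has_sys = any(start <= s < end for s in sys_bounds)
        if has_true and has_sys:
            counts["correct"] += 1
        elif has_true:
            counts["miss"] += 1
        elif has_sys:
            counts["false_alarm"] += 1
        # Windows with neither kind of boundary fall outside the three rules.
    return counts

# One true boundary at 30 s, system boundary 2 s late, 60-second show.
counts = sliding_window_counts([30.0], [32.0], duration=60.0)
print(counts)  # {'correct': 13, 'miss': 2, 'false_alarm': 2}
```

Because the system boundary is only 2 seconds off, most windows that contain the true boundary also contain the system one, which is how the scheme gives partial credit for slightly misplaced boundaries.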


STTSBD Test Constraints
•  Ground truth availability: word-position-based
   story boundaries on ASR transcripts
  –  Transcripts were already segmented into
     sentences
•  For the pipeline (STT→SBD) output, we
   needed to compare time-based story
   boundaries on Arabic speech
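
One way such a mismatch could be bridged, assuming the ASR transcript carries a start time for every word, is to look up the time of the word at each boundary position. The slides do not say how the conversion was done, so the function name and the data below are purely illustrative.

```python
def boundaries_to_times(boundary_positions, word_start_times):
    """Map word-position story boundaries to times (hypothetical helper).

    boundary_positions: index of the first word of each story.
    word_start_times:   start time in seconds of every ASR word.
    """
    times = []
    for pos in boundary_positions:
        if not 0 <= pos < len(word_start_times):
            raise IndexError(f"boundary position {pos} is outside the transcript")
        times.append(word_start_times[pos])
    return times

word_start_times = [0.0, 0.4, 0.9, 1.5, 2.2, 3.0]     # made-up timings
print(boundaries_to_times([0, 3], word_start_times))  # [0.0, 1.5]
```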

