Extracting Complex Biological Events
with Rich Graph-Based Feature Sets


 Jari Björne, Juho Heimonen, Filip Ginter, Antti
 Airola, Tapio Pahikkala, Tapio Salakoski
 BioNLP 2009 Workshop

Farzaneh Sarafraz
18 June 2009
                           
BioNLP'09 Task 1
       Events in abstracts
       Given: gene and gene products (proteins)
       Wanted: events
        −   type
        −   trigger
        −   participant(s)
        −   cause (if applicable)

                                     
Example
    "I kappa B/MAD-3 masks the nuclear localization 
      signal of NF-kappa B p65 and requires the 
      transactivation domain to inhibit NF-kappa B 
      p65 DNA binding."


    Event: negative regulation
    Trigger: masks
    Theme1: the first p65
    Cause: MAD-3
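
    A minimal sketch of how one such event could be held in memory. The
    class and field names are illustrative assumptions, not the shared
    task's file format or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One extracted event; field names are hypothetical."""
    event_type: str
    trigger: str
    themes: List[str] = field(default_factory=list)
    cause: Optional[str] = None  # only some events have a cause


# The annotation shown on the example slide.
example = Event(
    event_type="Negative regulation",
    trigger="masks",
    themes=["NF-kappa B p65 (first mention)"],
    cause="MAD-3",
)
```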


                             
Event Types
       Gene expression             Binding
       Transcription               Regulation
       Protein Catabolism          Positive regulation
       Localisation                Negative regulation
       Phosphorylation




                              
Training and Test Data
       Training data: 800 abstracts
       Development data: 150 abstracts
       Test data: 260 abstracts




                               
The System
       Trigger recognition
        −   Methods similar to NER
        −   Classification
       Argument detection
        −   Graph edge selection
        −   Classification
       Semantic post-processing
        −   Rule-based
                                    
Trigger Detection
       Token labelling (one class for each event type and one negative class)
       92% of triggers are single token
        −   Adjacent tokens form a trigger if they appear in the 
            training data
       Triggers that share a token:
        −   Combined class: gene expression/pos regulation
       A graph node for each trigger
        −   Not duplicated just yet
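
    The combined-class idea can be sketched as follows. The label
    spellings, the "neg" class name, and the toy sentence are assumptions
    made for illustration only.

```python
from typing import Dict, List, Set

NEGATIVE = "neg"  # hypothetical name for the "not a trigger" class


def token_labels(tokens: List[str], gold: Dict[int, Set[str]]) -> List[str]:
    """Return one class label per token.

    gold maps a token index to the set of event types triggered there.
    Several types on the same token are merged into one combined class.
    """
    labels = []
    for i in range(len(tokens)):
        types = gold.get(i, set())
        labels.append("/".join(sorted(types)) if types else NEGATIVE)
    return labels


tokens = ["overexpression", "of", "p65"]
gold = {0: {"Gene_expression", "Positive_regulation"}}
labels = token_labels(tokens, gold)
```

    Here the first token gets a single merged label, so the classifier
    still makes exactly one decision per token.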
                                       
Classification - SVM
       Token features
        −   Binary: capitalisation, presence of punctuation or 
            numeric characters
        −   Stem
        −   Character bigrams and trigrams
        −   Whether the token is a known trigger in the training data
        −   All the above for linear and dependency 
            “neighbours”
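
    A rough sketch of the token features listed above, for a single
    token. Feature names are made up for this example, and the stem
    feature is left out because it would need an external stemmer.

```python
import string
from typing import Dict, Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """All character n-grams of the given length."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def token_features(token: str, known_triggers: Set[str]) -> Dict[str, int]:
    """Binary features for one token (feature names are hypothetical)."""
    feats = {
        "has_capital": int(any(c.isupper() for c in token)),
        "has_punct": int(any(c in string.punctuation for c in token)),
        "has_digit": int(any(c.isdigit() for c in token)),
        "known_trigger": int(token.lower() in known_triggers),
    }
    for n in (2, 3):  # character bigrams and trigrams
        for gram in char_ngrams(token.lower(), n):
            feats[f"char{n}={gram}"] = 1
    return feats


feats = token_features("masks", {"masks", "inhibit"})
```

    The same function could be applied to the linear and dependency
    neighbours of the token, with a prefix on each feature name.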

                                     
Classification - SVM
       Frequency features
        −   # of named entities
                In sentence
                In a linear window around the token
                Bag-of-words count of token texts in the sentence (?)
       Dependency chains
        −   Up to depth of 3 from the token are constructed
        −   At each depth both token and frequency features
        −   Plus dep type and sequence of dep types in chain
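
    The named-entity counts could look roughly like this. The window
    size and the feature names are assumptions; the slides do not state
    them.

```python
from typing import Dict, List


def frequency_features(is_entity: List[bool], index: int, window: int = 3) -> Dict[str, int]:
    """Count named entities in the sentence and near the token at index.

    is_entity[i] is True when token i belongs to a given named entity.
    """
    lo = max(0, index - window)
    hi = min(len(is_entity), index + window + 1)
    return {
        "entities_in_sentence": sum(is_entity),
        "entities_in_window": sum(is_entity[lo:hi]),
    }


is_entity = [True, False, False, False, False, False, True, False]
freq = frequency_features(is_entity, index=1, window=2)
```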
                                         
Two SVMs
       “Somewhat”  different feature sets
       Combined weighted results



    “This design should be considered an artifact of 
      the time-constrained, experiment-driven 
      development of the system rather than a 
      principled design”
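
    The slides do not say how the two classifiers' outputs were
    weighted. One simple possibility, shown purely as an assumption, is
    a per-class weighted sum of their scores.

```python
from typing import Dict


def combine_scores(a: Dict[str, float], b: Dict[str, float], weight: float = 0.5) -> Dict[str, float]:
    """Weighted sum of two classifiers' per-class scores (hypothetical scheme)."""
    classes = set(a) | set(b)
    return {c: weight * a.get(c, 0.0) + (1.0 - weight) * b.get(c, 0.0) for c in classes}


combined = combine_scores(
    {"neg": 0.2, "Binding": 0.6},
    {"neg": 0.4, "Binding": 0.2},
    weight=0.75,
)
best = max(combined, key=combined.get)
```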

                               
Precision/Recall trade-off
       Undetected trigger --> undetected event
       All triggers have events in the training data --> 
        bias towards reporting an event for all detected 
        triggers
       Adjust P/R explicitly 
        −   multiply the negative class score by β
        −   find β experimentally
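
    A small sketch of the β adjustment, assuming positive per-class
    scores and a negative class named "neg" (both assumptions). A β
    below 1 makes the negative class less likely to win, which favours
    recall; a β above 1 favours precision.

```python
from typing import Dict


def predict_with_beta(scores: Dict[str, float], beta: float, negative: str = "neg") -> str:
    """Scale the negative-class score by beta, then pick the best class."""
    adjusted = dict(scores)
    adjusted[negative] = scores[negative] * beta
    return max(adjusted, key=adjusted.get)


scores = {"neg": 0.5, "Binding": 0.4}
unadjusted = predict_with_beta(scores, beta=1.0)
recall_oriented = predict_with_beta(scores, beta=0.7)
```

    In practice β would be picked by trying several values on the
    development data and keeping the one with the best score.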


                                     
Edge Detection
       Multi-class SVM
       All potential directed edges
        −   Event node to named entity
        −   Event node to event node (nested event)
        −   Labelled as theme, cause, or negative
       Each edge is predicted independently
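
    Candidate generation for the edge classifier might be sketched like
    this. Node names are toy values; each returned pair would then be
    classified on its own as theme, cause, or negative.

```python
from itertools import permutations
from typing import List, Tuple


def candidate_edges(event_nodes: List[str], entities: List[str]) -> List[Tuple[str, str]]:
    """All directed candidate edges that start at an event node."""
    edges = [(ev, ent) for ev in event_nodes for ent in entities]
    # Event-to-event edges in both directions allow nested events.
    edges += list(permutations(event_nodes, 2))
    return edges


edges = candidate_edges(["masks", "inhibit"], ["MAD-3", "p65"])
```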



                                   
Feature Set – Central Concept

    Shortest undirected 
     path of syntactic 
     dependencies in the 
     Stanford scheme 
     parse of the 
     sentence.
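
    The shortest undirected path can be found with a breadth-first
    search over the dependency graph. The dependency triples below are a
    hand-written simplification of the example sentence, not real parser
    output.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def shortest_path(deps: List[Tuple[str, str, str]], start: str, goal: str) -> Optional[List[str]]:
    """Shortest path between two tokens, ignoring dependency direction.

    deps holds (governor, dependent, type) triples.
    """
    neighbours: Dict[str, List[str]] = {}
    for gov, dep, _ in deps:
        neighbours.setdefault(gov, []).append(dep)
        neighbours.setdefault(dep, []).append(gov)
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


deps = [
    ("masks", "MAD-3", "nsubj"),
    ("masks", "signal", "dobj"),
    ("signal", "p65", "prep_of"),
]
path = shortest_path(deps, "MAD-3", "p65")
```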




                             
Feature Set
       Token text, POS, entity/event class, 
        dependency (subject)
       N-grams: merging the attributes of 2-4
        −   Consecutive tokens
        −   Consecutive dependencies
        −   Each token and two neighbouring dependencies
        −   Each dependency and two neighbouring tokens
        −   One bigram showing direction
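
    The n-gram construction over consecutive path items can be sketched
    as below. Joining with an underscore and the example path are
    assumptions; the same helper would work on a list of dependency
    types instead of token texts.

```python
from typing import List


def path_ngrams(items: List[str], min_n: int = 2, max_n: int = 4) -> List[str]:
    """Merge the attributes of 2 to 4 consecutive items on the path."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(items) - n + 1):
            grams.append("_".join(items[i:i + n]))
    return grams


tokens_on_path = ["MAD-3", "masks", "signal", "p65"]
grams = path_ngrams(tokens_on_path)
```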
                                  
Other Features
       Individual component features
       Semantic node features
       Frequency features




                              
Semantic Post-Processing
       Duplicate nodes
        −   Same class and same trigger
        −   Combined trigger
       Remove improper arguments
       Remove directed cycles by removing the 
        weakest link
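
    One way to read "removing the weakest link" is to delete the
    lowest-scoring edge of each directed cycle until none remain. That
    reading, the edge scores, and the node names are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def find_cycle(edges: Dict[Edge, float]) -> Optional[List[Edge]]:
    """Return the edges of one directed cycle, or None if there is none."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state: Dict[str, int] = {}  # 1 = on the current DFS stack, 2 = finished

    def visit(node: str, stack: List[str]) -> Optional[List[Edge]]:
        state[node] = 1
        stack.append(node)
        for nxt in graph.get(node, []):
            if state.get(nxt) == 1:  # back edge closes a cycle
                cycle = stack[stack.index(nxt):] + [nxt]
                return list(zip(cycle, cycle[1:]))
            if nxt not in state:
                found = visit(nxt, stack)
                if found:
                    return found
        stack.pop()
        state[node] = 2
        return None

    for start in list(graph):
        if start not in state:
            found = visit(start, [])
            if found:
                return found
    return None


def break_cycles(edges: Dict[Edge, float]) -> Dict[Edge, float]:
    """Drop the lowest-scoring edge of each cycle until the graph is acyclic."""
    kept = dict(edges)
    cycle = find_cycle(kept)
    while cycle:
        weakest = min(cycle, key=lambda e: kept[e])
        del kept[weakest]
        cycle = find_cycle(kept)
    return kept


edges = {("e1", "e2"): 0.9, ("e2", "e3"): 0.4, ("e3", "e1"): 0.7, ("e1", "p65"): 0.8}
acyclic = break_cycles(edges)
```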



                                  
Duplicating Event Nodes
       Task restrictions
        −   Two causes,
        −   must have theme,
        −   etc.
       Several heuristics
       x-th first dependency 
        in shortest path from 
        the event for binding
                                  
Results




           
Compared to Us




                  
What Didn't Work/Wasn't Tried
       CRF
       HMM
       Removing strong independence assumption
       Co-reference resolution (4.8%)




                               
End.




        
