BioNLP09 Winners

Extracting Complex Biological Events
with Rich Graph-Based Feature Sets

Jari Björne, Juho Heimonen, Filip Ginter, Antti
Airola, Tapio Pahikkala, Tapio Salakoski
BioNLP 2009 Workshop

Farzaneh Sarafraz
18 June 2009

BioNLP'09 Task 1
 Events in abstracts
 Given: gene and gene products (proteins)
 Wanted: events
− type
− trigger
− participant(s)
− cause (if applicable)

Example
"I kappa B/MAD3 masks the nuclear localization
signal of NF-kappa B p65 and requires the
transactivation domain to inhibit NF-kappa B
p65 DNA binding."

Event: negative regulation
Trigger: masks
Theme1: the first p65
Cause: MAD3
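
To make the target output concrete, the annotation above can be written as a small record. This is only an illustrative sketch: the class name `Event` and its field names are assumptions made for this example, not the task's official file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical container for one extracted event."""
    event_type: str                 # one of the nine task event types
    trigger: str                    # text span that signals the event
    themes: Tuple[str, ...]         # participant(s); proteins or other events
    cause: Optional[str] = None     # only present for some event types


# The example from the slide, written as a record.
example = Event(
    event_type="Negative regulation",
    trigger="masks",
    themes=("NF-kappa B p65 (first mention)",),
    cause="MAD3",
)
```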

Event Types
 Gene expression
 Transcription
 Protein Catabolism
 Localisation
 Phosphorylation
 Binding
 Regulation
 Positive regulation
 Negative regulation

Training and Test Data
 Training data: 800 abstracts
 Development data: 150 abstracts
 Test data: 260 abstracts

The System
 Trigger recognition
− Methods similar to NER
− Classification
 Argument detection
− Graph edge selection
− Classification
 Semantic post-processing
− Rule-based

Trigger Detection
 Token labelling (one class for each event type and one negative class)
 92% of triggers are single token
− Adjacent tokens form a trigger if they appear in the
training data
 Triggers that share a token:
− Combined class: gene expression/pos regulation
 A graph node for each trigger
− Not duplicated just yet
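
A minimal sketch of the combined-class idea: when a token triggers more than one event type, the types are merged into a single label so that each token still receives exactly one class. The function name, the `"neg"` label, and the `/` separator are assumptions for illustration.

```python
def token_label(event_types):
    """Return one class label for a token.

    event_types: collection of event types the token triggers (may be empty).
    """
    if not event_types:
        return "neg"  # assumed name for the negative class
    # Sort so the same set of types always yields the same combined label.
    return "/".join(sorted(event_types))


def split_label(label):
    """Recover the individual event types from a (possibly combined) label."""
    return [] if label == "neg" else label.split("/")
```

For example, `token_label({"Positive regulation", "Gene expression"})` gives `"Gene expression/Positive regulation"`, which `split_label` turns back into the two separate types.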

Classification: SVM
 Token features
− Binary: capitalisation, presence of punctuation or
numeric characters
− Stem
− Character bigrams and trigrams
− Whether the token is a known trigger in the training data
− All the above for linear and dependency
“neighbours”
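
The surface-level token features listed above can be sketched as follows. This is a simplified illustration under assumed feature names; the stem and known-trigger features are left out because they need a stemmer and the training data.

```python
import string


def token_features(token):
    """Binary surface features plus character bigrams and trigrams."""
    feats = {
        "has_capital": any(c.isupper() for c in token),
        "has_punct": any(c in string.punctuation for c in token),
        "has_digit": any(c.isdigit() for c in token),
    }
    # Character n-grams for n = 2 and n = 3.
    for n in (2, 3):
        for i in range(len(token) - n + 1):
            feats[f"char{n}={token[i:i + n]}"] = True
    return feats
```

The same function would be applied to the linear and dependency neighbours of the token, with a prefix on the feature names to keep them apart.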

Classification: SVM
 Frequency features
− # of named entities
 In sentence
 In a linear window around the token
 Bag-of-words count of token texts in the sentence (?)
 Dependency chains
− Up to depth of 3 from the token are constructed
− At each depth both token and frequency features
− Plus dep type and sequence of dep types in chain
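
One way to enumerate dependency chains up to depth 3 is a depth-limited walk from the token. The sketch below assumes the parse is given as an adjacency dictionary mapping a token to `(dependency type, neighbour)` pairs; that representation, the function name, and the dependency labels in the usage example are illustrative only.

```python
def dependency_chains(graph, start, max_depth=3):
    """Collect (sequence of dependency types, end token) for every chain
    of length 1..max_depth that starts at `start`."""
    chains = []

    def walk(node, visited, types):
        if len(types) == max_depth:
            return
        for dep_type, nxt in graph.get(node, []):
            if nxt in visited:
                continue  # do not walk back over the same token
            new_types = types + [dep_type]
            chains.append((tuple(new_types), nxt))
            walk(nxt, visited | {nxt}, new_types)

    walk(start, {start}, [])
    return chains


# Toy graph with made-up dependency labels.
toy_graph = {
    "masks": [("nsubj", "MAD3"), ("dobj", "signal")],
    "signal": [("prep_of", "p65")],
}
```

Token and frequency features would then be computed for the end token of each chain, alongside the sequence of dependency types.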

Two SVMs
 “Somewhat” different feature sets
 Combined weighted results

“This design should be considered an artifact of
the time-constrained, experiment-driven
development of the system rather than a
principled design”

Precision/Recall tradeoff
 Undetected trigger → undetected event
 All triggers have events in the training data →
bias towards reporting an event for all detected
triggers
 Adjust P/R explicitly
− multiply the negative class by β
− find β experimentally
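
The β adjustment can be sketched as a rescaling of the negative-class score followed by a grid search. Several things here are assumptions for illustration: per-class scores are taken to be positive, the negative class is called `"neg"`, and β is tuned on token-level F-score as a stand-in for whatever measure was actually used.

```python
def predict(scores, beta):
    """Pick the best class after multiplying the negative-class score by beta."""
    adjusted = dict(scores)
    adjusted["neg"] = adjusted["neg"] * beta
    return max(adjusted, key=adjusted.get)


def f_score(gold, predicted):
    """F-score over non-negative labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if p != "neg" and p == g)
    n_pred = sum(1 for p in predicted if p != "neg")
    n_gold = sum(1 for g in gold if g != "neg")
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def best_beta(all_scores, gold, candidates):
    """Return the candidate beta with the highest F-score on held-out data."""
    return max(
        candidates,
        key=lambda b: f_score(gold, [predict(s, b) for s in all_scores]),
    )
```

With positive scores, β below 1 makes the negative class less likely to win, trading precision for recall; β above 1 does the opposite.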

Edge Detection
 Multiclass SVM
 All potential directed edges
− Event node to named entity
− Event node to event node (nested event)
− Labelled as theme, cause, or negative
 Each edge is predicted independently
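
The candidate set described above can be enumerated directly. In this sketch the function name and the use of plain string identifiers for nodes are assumptions; each returned pair would then be classified independently as theme, cause, or negative.

```python
def candidate_edges(event_nodes, entities):
    """List every potential directed edge (source, target).

    Sources are always event nodes; targets are named entities or
    other event nodes (the latter give nested events).
    """
    edges = []
    for src in event_nodes:
        for ent in entities:
            edges.append((src, ent))
        for dst in event_nodes:
            if dst != src:
                edges.append((src, dst))
    return edges
```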

Feature Set – Central Concept

Shortest undirected
path of syntactic
dependencies in the
Stanford scheme
parse of the
sentence.
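
A shortest undirected path between two tokens can be found with breadth-first search once edge direction is ignored. The sketch assumes the parse is a list of `(head, dependency type, dependent)` triples; the labels in the usage example are made up and do not come from a real Stanford parse.

```python
from collections import deque


def shortest_path(dependencies, start, goal):
    """Return the token sequence of a shortest undirected path, or None."""
    # Build an undirected adjacency map from the directed dependencies.
    neighbours = {}
    for head, _dep_type, dependent in dependencies:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


toy_parse = [
    ("masks", "nsubj", "MAD3"),
    ("masks", "dobj", "signal"),
    ("signal", "prep_of", "p65"),
]
```

Here `shortest_path(toy_parse, "MAD3", "p65")` walks up from `MAD3` to `masks` and back down through `signal` to `p65`.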

Feature Set
 Token text, POS, entity/event class,
dependency (subject)
 N-grams: merging the attributes of 2-4
− Consecutive tokens
− Consecutive dependencies
− Each token and two neighbouring dependencies
− Each dependency and two neighbouring tokens
− One bigram showing direction
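
The n-gram features over consecutive path elements can be sketched as below. The function name and the `_` separator are assumptions; `attributes` stands for one attribute (for example the POS tag) of each token along the shortest path, in order.

```python
def path_ngrams(attributes, min_n=2, max_n=4):
    """Merge the attributes of min_n..max_n consecutive path elements."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(attributes) - n + 1):
            grams.append("_".join(attributes[i:i + n]))
    return grams
```

The same function applies unchanged to the sequence of dependency types along the path.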

Other Features
 Individual component features
 Semantic node features
 Frequency features

Semantic Post-Processing
 Duplicate nodes
− Same class and same trigger
− Combined trigger
 Remove improper arguments
 Remove directed cycles by removing the
weakest link
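
Breaking directed cycles at the weakest link can be sketched as: find a cycle, delete its lowest-scoring edge, repeat until none remain. The sketch assumes edges are given as a dictionary from `(source, target)` to a classifier confidence score; the function names and this representation are illustrative.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def dfs(node, path):
        state[node] = 1
        path.append(node)
        for nxt in graph.get(node, []):
            if state.get(nxt) == 1:
                # Back edge: the cycle is the path from nxt to node, plus node -> nxt.
                cyc = path[path.index(nxt):]
                return list(zip(cyc, cyc[1:] + cyc[:1]))
            if nxt not in state:
                found = dfs(nxt, path)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in list(graph):
        if node not in state:
            found = dfs(node, [])
            if found:
                return found
    return None


def break_cycles(scored_edges):
    """Remove the lowest-scoring edge of each cycle until no cycle is left."""
    edges = dict(scored_edges)
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        weakest = min(cycle, key=edges.get)
        del edges[weakest]
```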

Duplicating Event Nodes
 Task restrictions
− Two causes,
− must have theme,
− etc.
 Several heuristics
 xth first dependency
in shortest path from
the event for binding

What Didn't Work/Wasn't Tried
 CRF
 HMM
 Removing strong independence assumption
 Coreference resolution (4.8%)
