Text Mining of
Transcription Factor to Protein Interactions
Ashish Baghudana
1
Supervisor: Prof. Dr. Burkhard Rost
Advisor: Juan Miguel Cejuela
Transcription Factor Interactions
• Transcription factors bind to specific DNA sequences and control transcription
• Two types of interactions
• Transcription factor transcribes gene
• Protein modifies or interacts with transcription factor
• Estimated 45,000+ such interactions
2
Related Work
• TRANSFAC
• Manually curated DB
• Eukaryotic TFs and genomic binding sites
• Commercial version contains reports for 21,000 transcription factors[1]
• Public version contains reports for 7,000 transcription factors[1]
• TcoF – Transcription co-Factor Database
• 1365 transcription factors[2]
• Manually curated from BioGrid, MINT and EBI
• Integrated Transcription Factor Platform
• Predicted interactions using sequence data[3]
• SVMs
3
[1]: https://portal.biobase-international.com/archive/documents/transfac_comparison.pdf
[2]: http://cbrc.kaust.edu.sa/tcof/
[3]: http://itfp.biosino.org/itfp/
Problem Motivation
4
MEDLINE currently has more than 24M articles and adds over 1M articles per year
Corpus
5
6-12
Corpus creation pipeline: Scan Uniprot for GO:0003700 & Homo sapiens → List publications cited for “Interaction with” → Run GNormPlus → Normalize Gene ID to Uniprot ID → If Uniprot ID contains GO:0003700, label transcription factor → Offset and Boundary Corrections → Manual Annotation of Relations on tagtog

• Filter Swissprot for transcription-factor activity, sequence-specific DNA-binding (GO Term GO:0003700) and Homo sapiens
• For each protein, obtain the list of publications cited for “INTERACTION WITH”
• Run GNormPlus, a gene tagger, on each of these abstracts, giving Gene or Gene Product (GGP) annotations
• Normalize Entrez Gene IDs to Uniprot IDs using priority selection (first Swissprot, then TrEMBL)
• Cross-reference all GGPs with GO Term GO:0003700 and its descendants; if the Uniprot ID contains an annotation for transcription-factor activity, label the entity as a transcription factor
• Correct entity boundaries and offsets, and add abbreviations
• Manually annotate relations on tagtog
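The GO-based labelling step of the corpus pipeline can be sketched as follows. This is a minimal illustration with hypothetical names: the real pipeline also expands GO:0003700 to all of its descendant terms before matching.

```python
# Hypothetical sketch of the labelling step: an entity whose UniProt entry
# carries GO:0003700 (transcription factor activity) is labelled a
# transcription factor; all other GGPs stay plain gene/gene products.
TF_GO_TERMS = {"GO:0003700"}  # in practice, plus all descendant GO terms

def label_entity(uniprot_go_annotations):
    """Return the entity label for a set of GO annotations."""
    if TF_GO_TERMS & set(uniprot_go_annotations):
        return "transcription_factor"
    return "gene_product"
```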
relna Annotation
13
Relation Extraction
14
Relation Extraction
• Binary Classification
15
Entity 1   Entity 2                  Feature 1  …  Feature n  Class
RanBP10    Glucocorticoid Receptor   1          …  0          True
RanBP10    Estrogen receptor alpha   0          …  1          False
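The binary-classification framing can be illustrated with a small sketch (the function name and pair layout are assumptions, not the relna API): each transcription factor in a sentence is paired with every other GGP, and each pair becomes one instance to classify.

```python
from itertools import product

def candidate_pairs(tf_entities, ggp_entities):
    """Enumerate candidate relation instances: one per (TF, GGP) pair."""
    return [(tf, ggp) for tf, ggp in product(tf_entities, ggp_entities)
            if tf != ggp]

# Each pair is then mapped to a feature vector and a True/False class label.
pairs = candidate_pairs(["RanBP10"],
                        ["Glucocorticoid Receptor", "Estrogen receptor alpha"])
```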
Method Development
• Pipeline (based on nalaf)
16
A Dataset object gets created and passed around from one module to the next in the pipeline:
• Data Reader: reads in textual data from some input (PMID, HTML) to create a Dataset object
• Annotation Recognition: reads in annotations from PubTator or ann.json and augments the Dataset
• Splitter: splits the text of each Document in the dataset into sentences
• Tokenizer: creates Tokens representing the smallest processing unit, usually words
• Parser: parses each sentence and stores syntactic and dependency parse trees
• Feature Generator: generates features for learning a model
• Edge Generator: generates potential relations and reduces relation extraction to binary classification
• Learning: handles learning and prediction using SVMLight
• Evaluator: evaluates performance of prediction at the edge and document level
• Writer: writes the predicted annotations in tagtog-compatible ann.json format
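The pass-the-Dataset-along design can be sketched as a toy illustration (these are not the actual nalaf classes; real modules are far richer, but each one takes the shared Dataset and augments it):

```python
class Dataset:
    """Toy stand-in for the shared object passed through the pipeline."""
    def __init__(self, text):
        self.text = text
        self.sentences = []
        self.tokens = []

class Splitter:
    def process(self, dataset):
        # Naive sentence split; real splitters are far more careful.
        dataset.sentences = [s.strip() for s in dataset.text.split(".") if s.strip()]
        return dataset

class Tokenizer:
    def process(self, dataset):
        dataset.tokens = [t for s in dataset.sentences for t in s.split()]
        return dataset

def run_pipeline(dataset, modules):
    # Each module receives the Dataset and hands it to the next one.
    for module in modules:
        dataset = module.process(dataset)
    return dataset

ds = run_pipeline(Dataset("AR interacts with SMRT. SMRT represses AR."),
                  [Splitter(), Tokenizer()])
```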
Method Development
• Pipeline (based on nalaf)
16
Data
Reader
Annotation
Recognition
Feature
Generator
ParserTokenizerSplitter Learning Evaluator
Dataset object gets created and passed around from one module to the next in the pipeline
Read in textual
data from
some input
(PMID, HTML)
to create a
Dataset object
Read in
Annotations
from PubTator
or ann.json
and augment it
to Dataset
Splits the text
of each
Document in
the dataset
into sentences
Creates Tokens
representing
the smallest
processing
unit, usually
words
Parses each
sentence and
store syntactic
and
dependency
parse trees
Generate
features for
learning a
model
Handles
learning and
prediction
using
SVMLight
Evaluates
performance
of prediction
at the edge
and document
level
Edge
Generator
Generates
potential
relations and
reduces relation
extraction to
binary
classification
Writer
Writes the
predicted
annotations in
tagtog
compatible
ann.json
format
Dependency Parsing
17
Dependency Parsing
18
(ROOT
(NP
(NP (DT The) (NN interaction))
(PP (IN between)
(NP (NNP Androgen) (NNP Receptor)
(CC and)
(NNP SMRT)))))
19
Constituency Parsing
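The bracketed constituency output above can be read back into a tree structure with a few lines of parsing; a minimal sketch, using nested Python lists as a stand-in for a proper tree class:

```python
import re

def parse_tree(s):
    """Parse a bracketed (Penn Treebank style) constituency tree into
    nested lists, e.g. '(NP (DT The))' -> ['NP', ['DT', 'The']]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)

    def read(it):
        # First token after '(' is the constituent label.
        label = next(it)
        children = []
        for tok in it:
            if tok == "(":
                children.append(read(it))   # nested constituent
            elif tok == ")":
                break                       # end of this constituent
            else:
                children.append(tok)        # leaf word
        return [label] + children

    it = iter(tokens)
    assert next(it) == "("
    return read(it)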
Feature Generation
• Sentence Features (e.g. “Named Entity Count”: 2)
• Token Features (e.g. “interacts_stem”: “interact”, “interacts_pos”: “VBZ”, “interacts_lem”: “interact”)
• N-gram Features (e.g. “androgen receptor”, “receptor interacts”, “interacts with”, “with SMRT”)
• Linear Context and Distance between Entities (e.g. “linear_distance”: 2)
• Dependency Features
• Shortest path between the two entities (e.g. receptor -> interacts -> with -> SMRT)
• Path constituents (e.g. [“interacts”, “with”])
• Root word (e.g. “root_word”: “interacts”)
• Path to root word (e.g. receptor -> interacts; SMRT -> with -> interacts)
20
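The shortest-path feature can be computed with a breadth-first search over the dependency tree viewed as an undirected graph. A sketch, where the edge list is hand-written for the example sentence “receptor interacts with SMRT” rather than produced by a real parser:

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS shortest path between two tokens in an undirected
    dependency graph given as a list of (head, dependent) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path (entities in disconnected fragments)
```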
Evaluation
21
Evaluation Modes
• Non-Unique
• SVM (or Edge) Performance
• Repeated relations counted
• Match only if both offsets AND text of entities match
• Unique
• Document Performance
• No repetitions
• Match if texts of entities match
22
Non-Unique:
1. CDK9, STAT3 (2nd line)
2. CDK9, STAT3 (4th line)
Unique:
1. CDK9, STAT3
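The two counting modes can be sketched as follows; the tuple layout `(entity1_text, entity2_text, offsets)` is an assumption for illustration:

```python
def non_unique_count(relations):
    """Edge-level: every predicted mention pair counts separately,
    distinguished by its character offsets."""
    return len(set(relations))

def unique_count(relations):
    """Document-level: relations collapse when the entity texts match,
    regardless of where in the abstract they occur."""
    return len({(e1, e2) for e1, e2, _offsets in relations})

preds = [("CDK9", "STAT3", (10, 14)),   # 2nd line
         ("CDK9", "STAT3", (80, 84))]   # 4th line
```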
Training and Cross Validation
• Two methods
• 80:20 (Train : Test)
• 60:20:20 (Train : Test : Development)
23
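A 60:20:20 split can be sketched as below; the shuffling and fixed seed are illustrative assumptions, since the slides fix only the ratios:

```python
import random

def split_dataset(docs, seed=0):
    """60:20:20 train/test/development split over a list of documents."""
    docs = docs[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(docs)   # deterministic shuffle for the sketch
    n = len(docs)
    train_end = int(0.6 * n)
    test_end = int(0.8 * n)
    return docs[:train_end], docs[train_end:test_end], docs[test_end:]
```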
Results
24
Results – Comparison of Methods
25
Method                  Precision  Recall  F-Measure
Baseline                0.396      0.980   0.562
Initial                 0.806      0.478   0.599
With Tree Kernels       0.918      0.214   0.340
Undersampling           0.719      0.653   0.683
Hyperparameter Tuning   0.691      0.699   0.693
Results
26
[Chart: Time Taken per Fold (in seconds), lower is better. Without Feature Selection vs With Feature Selection, Folds 1-5]
[Chart: Performance per Fold (F-Measure), higher is better. Without Feature Selection vs With Feature Selection, Folds 1-5]
Demo
27
Conclusion
28
Conclusion
29
[Chart: Baseline vs Final. Baseline: Precision 0.396, Recall 0.980, F-Measure 0.562. Final: Precision 0.691, Recall 0.699, F-Measure 0.693]

Performance     Percentage
Precision (P)   69.1 ± 0.61%
Recall (R)      69.9 ± 0.56%
F-Measure (F)   69.3 ± 0.59%
Conclusion
• Development of new corpus with semi-automatic annotation
• Published at PubAnnotation: http://pubannotation.org/projects/relna
• Development of new method for extracting relations of transcription
factors
• Registered with Elixir Tools: https://bio.tools/tool/RostLab/relna/0.1.0
• Available on GitHub: https://github.com/Rostlab/relna
• Integration into nalaf and building a generalized relation extraction
tool
30
Future Work
• Coreference resolution techniques
• Generalizing method for spanning multiple sentences
• Further testing with neural networks
31
Thank you
32
Corpus Statistics
46
Corpus Statistics
Sentence Distance for Relations
Sentence Distance   Number of Relations
0                   1196
1                   30
2                   3
3                   1
4                   2
48
Performance on LocText
[Chart: Comparison with LocText. Precision, Recall and F-Measure for Original vs Relna]
49
Parsing
• Dependency Parsing
• Identify the relations between words
• O(n³) algorithm, with n as the number of words in the sentence
• Constituency Parsing
• Identify phrases (noun chunks, verb chunks, etc.) and their relative structure and hierarchy in the sentence
• O(n⁵) algorithm, with n as the number of words in the sentence
36
Exhaustive List of Features
• Sentence Features
• BOW, Stem
• #entities, #BOW count
• Token Features
• Token Text, Masked Text
• Stem, POS
• Capitalization, Digits, Hyphens and other Punctuation
• Char bigrams and trigrams
• Dependency Features for Shortest Paths
• Path direction (e.g. FFRFR)
• Dependency types in path
• Path length
• Intermediate Tokens
• Path Constituents (e.g. “interact”, “bind”, etc.)
• Root word of the sentence
• Linear Context
33
Other Features
• Linear distance between entities
• Presence of specific words in the sentence
• Prior tokens
• Intermediate tokens
• Post tokens
• N-gram features
• Bigram
• Trigram
• Relative Entity Order
• Conjoint Entity Text
34
N-gram Features
• The cow jumped over the moon
• “the”, “cow”, “jumped”, “over”, “moon”
• “the cow”, “cow jumped”, “jumped over”, “over the”, “the moon”
• “the cow jumped”, “cow jumped over”, “jumped over the”, “over the moon”
• …
35
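The n-gram construction illustrated above is a one-liner:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumped over the moon".split()
bigrams = ngrams(tokens, 2)
# -> ["the cow", "cow jumped", "jumped over", "over the", "the moon"]
```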

