Text Mining of
Transcription Factor to Protein Interactions
Ashish Baghudana
1
Supervisor: Prof. Dr. Burkhard Rost
Advisor: Juan Miguel Cejuela
Transcription Factor Interactions
• Transcription factors bind to specific DNA sequences and control transcription
• Two types of interactions
• Transcription factor transcribes gene
• Protein modifies or interacts with transcription factor
• Estimated 45,000+ such interactions
2
Related Work
• TRANSFAC
• Manually curated DB
• Eukaryotic TFs and genomic binding sites
• Commercial version contains reports for 21,000 transcription factors[1]
• Public version contains reports for 7,000 transcription factors[1]
• TcoF – Transcription co-Factor Database
• 1365 transcription factors[2]
• Manually curated from BioGrid, MINT and EBI
• Integrated Transcription Factor Platform
• Predicted interactions using sequence data[3]
• SVMs
3
[1]: https://portal.biobase-international.com/archive/documents/transfac_comparison.pdf
[2]: http://cbrc.kaust.edu.sa/tcof/
[3]: http://itfp.biosino.org/itfp/
Problem Motivation
4
MEDLINE currently has more than 24M articles and adds over 1M articles per year
Corpus
5
6-12
Corpus creation pipeline: Scan Uniprot for GO:0003700 & Homo sapiens → List publications cited for “Interaction with” → Run GNormPlus → Normalize Gene ID to Uniprot ID → If Uniprot ID contains GO:0003700, label transcription factor → Offset and Boundary Corrections → Manual Annotation of Relations on tagtog

• Filter Swissprot for transcription-factor activity, sequence-specific DNA-binding (GO Term GO:0003700) and Homo sapiens
• For each protein, obtain the list of publications cited for “INTERACTION WITH”
• Run GNormPlus, a gene tagger, on each of these abstracts, giving Gene or Gene Product (GGP) annotations
• Normalize Entrez Gene IDs to Uniprot IDs using priority selection (first Swissprot, then TrEMBL)
• Cross-reference all GGPs with GO Term GO:0003700 and its descendants; if the Uniprot ID contains an annotation for transcription-factor activity, label the entity as a transcription factor
• Correct entity boundaries and offsets, and add abbreviations
• Manually annotate relations on tagtog
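The GO-based labelling step of the corpus pipeline can be sketched as follows. This is a minimal illustration with hypothetical names: the real pipeline also expands GO:0003700 to all of its descendant terms before matching.

```python
# Hypothetical sketch of the labelling step: an entity whose UniProt entry
# carries GO:0003700 (transcription factor activity) is labelled a
# transcription factor; all other GGPs stay plain gene/gene products.
TF_GO_TERMS = {"GO:0003700"}  # in practice, plus all descendant GO terms

def label_entity(uniprot_go_annotations):
    """Return the entity label for a set of GO annotations."""
    if TF_GO_TERMS & set(uniprot_go_annotations):
        return "transcription_factor"
    return "gene_product"
```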
relna Annotation
13
Relation Extraction
14
Relation Extraction
• Binary Classification
15
Entity 1   Entity 2                  Feature 1  …  Feature n  Class
RanBP10    Glucocorticoid Receptor   1          …  0          True
RanBP10    Estrogen receptor alpha   0          …  1          False
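The binary-classification framing can be illustrated with a small sketch (the function name and pair layout are assumptions, not the relna API): each transcription factor in a sentence is paired with every other GGP, and each pair becomes one instance to classify.

```python
from itertools import product

def candidate_pairs(tf_entities, ggp_entities):
    """Enumerate candidate relation instances: one per (TF, GGP) pair."""
    return [(tf, ggp) for tf, ggp in product(tf_entities, ggp_entities)
            if tf != ggp]

# Each pair is then mapped to a feature vector and a True/False class label.
pairs = candidate_pairs(["RanBP10"],
                        ["Glucocorticoid Receptor", "Estrogen receptor alpha"])
```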
Method Development
• Pipeline (based on nalaf)
16
A Dataset object gets created and passed around from one module to the next in the pipeline:
• Data Reader: reads in textual data from some input (PMID, HTML) to create a Dataset object
• Annotation Recognition: reads in annotations from PubTator or ann.json and augments the Dataset
• Splitter: splits the text of each Document in the dataset into sentences
• Tokenizer: creates Tokens representing the smallest processing unit, usually words
• Parser: parses each sentence and stores syntactic and dependency parse trees
• Feature Generator: generates features for learning a model
• Edge Generator: generates potential relations and reduces relation extraction to binary classification
• Learning: handles learning and prediction using SVMLight
• Evaluator: evaluates performance of prediction at the edge and document level
• Writer: writes the predicted annotations in tagtog-compatible ann.json format
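The pass-the-Dataset-along design can be sketched as a toy illustration (these are not the actual nalaf classes; real modules are far richer, but each one takes the shared Dataset and augments it):

```python
class Dataset:
    """Toy stand-in for the shared object passed through the pipeline."""
    def __init__(self, text):
        self.text = text
        self.sentences = []
        self.tokens = []

class Splitter:
    def process(self, dataset):
        # Naive sentence split; real splitters are far more careful.
        dataset.sentences = [s.strip() for s in dataset.text.split(".") if s.strip()]
        return dataset

class Tokenizer:
    def process(self, dataset):
        dataset.tokens = [t for s in dataset.sentences for t in s.split()]
        return dataset

def run_pipeline(dataset, modules):
    # Each module receives the Dataset and hands it to the next one.
    for module in modules:
        dataset = module.process(dataset)
    return dataset

ds = run_pipeline(Dataset("AR interacts with SMRT. SMRT represses AR."),
                  [Splitter(), Tokenizer()])
```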
Method Development
• Pipeline (based on nalaf)
16
Data
Reader
Annotation
Recognition
Feature
Generator
ParserTokenizerSplitter Learning Evaluator
Dataset object gets created and passed around from one module to the next in the pipeline
Read in textual
data from
some input
(PMID, HTML)
to create a
Dataset object
Read in
Annotations
from PubTator
or ann.json
and augment it
to Dataset
Splits the text
of each
Document in
the dataset
into sentences
Creates Tokens
representing
the smallest
processing
unit, usually
words
Parses each
sentence and
store syntactic
and
dependency
parse trees
Generate
features for
learning a
model
Handles
learning and
prediction
using
SVMLight
Evaluates
performance
of prediction
at the edge
and document
level
Edge
Generator
Generates
potential
relations and
reduces relation
extraction to
binary
classification
Writer
Writes the
predicted
annotations in
tagtog
compatible
ann.json
format
Dependency Parsing
17
Dependency Parsing
18
(ROOT
(NP
(NP (DT The) (NN interaction))
(PP (IN between)
(NP (NNP Androgen) (NNP Receptor)
(CC and)
(NNP SMRT)))))
19
Constituency Parsing
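The bracketed constituency output above can be read back into a tree structure with a few lines of parsing; a minimal sketch, using nested Python lists as a stand-in for a proper tree class:

```python
import re

def parse_tree(s):
    """Parse a bracketed (Penn Treebank style) constituency tree into
    nested lists, e.g. '(NP (DT The))' -> ['NP', ['DT', 'The']]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)

    def read(it):
        # First token after '(' is the constituent label.
        label = next(it)
        children = []
        for tok in it:
            if tok == "(":
                children.append(read(it))   # nested constituent
            elif tok == ")":
                break                       # end of this constituent
            else:
                children.append(tok)        # leaf word
        return [label] + children

    it = iter(tokens)
    assert next(it) == "("
    return read(it)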
Feature Generation
• Sentence Features (e.g. “Named Entity Count”: 2)
• Token Features (e.g. “interacts_stem”: “interact”, “interacts_pos”: “VBZ”, “interacts_lem”: “interact”)
• N-gram Features (e.g. “androgen receptor”, “receptor interacts”, “interacts with”, “with SMRT”)
• Linear Context and Distance between Entities (e.g. “linear_distance”: 2)
• Dependency Features
• Shortest path between the two entities (e.g. receptor -> interacts -> with -> SMRT)
• Path constituents (e.g. [“interacts”, “with”])
• Root word (e.g. “root_word”: “interacts”)
• Path to root word (e.g. receptor -> interacts; SMRT -> with -> interacts)
20
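The shortest-path feature can be computed with a breadth-first search over the dependency tree viewed as an undirected graph. A sketch, where the edge list is hand-written for the example sentence “receptor interacts with SMRT” rather than produced by a real parser:

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS shortest path between two tokens in an undirected
    dependency graph given as a list of (head, dependent) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path (entities in disconnected fragments)
```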
Evaluation
21
Evaluation Modes
• Non-Unique
• SVM (or Edge) Performance
• Repeated relations counted
• Match only if both offsets AND text of entities match
• Unique
• Document Performance
• No repetitions
• Match if texts of entities match
22
Non-Unique:
1. CDK9, STAT3 (2nd line)
2. CDK9, STAT3 (4th line)
Unique:
1. CDK9, STAT3
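The two counting modes can be sketched as follows; the tuple layout `(entity1_text, entity2_text, offsets)` is an assumption for illustration:

```python
def non_unique_count(relations):
    """Edge-level: every predicted mention pair counts separately,
    distinguished by its character offsets."""
    return len(set(relations))

def unique_count(relations):
    """Document-level: relations collapse when the entity texts match,
    regardless of where in the abstract they occur."""
    return len({(e1, e2) for e1, e2, _offsets in relations})

preds = [("CDK9", "STAT3", (10, 14)),   # 2nd line
         ("CDK9", "STAT3", (80, 84))]   # 4th line
```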
Training and Cross Validation
• Two methods
• 80:20 (Train : Test)
• 60:20:20 (Train : Test : Development)
23
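A 60:20:20 split can be sketched as below; the shuffling and fixed seed are illustrative assumptions, since the slides fix only the ratios:

```python
import random

def split_dataset(docs, seed=0):
    """60:20:20 train/test/development split over a list of documents."""
    docs = docs[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(docs)   # deterministic shuffle for the sketch
    n = len(docs)
    train_end = int(0.6 * n)
    test_end = int(0.8 * n)
    return docs[:train_end], docs[train_end:test_end], docs[test_end:]
```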
Results
24
Results – Comparison of Methods
25
Method                  Precision  Recall  F-Measure
Baseline                0.396      0.980   0.562
Initial                 0.806      0.478   0.599
With Tree Kernels       0.918      0.214   0.340
Undersampling           0.719      0.653   0.683
Hyperparameter Tuning   0.691      0.699   0.693
Results
26
[Chart: Time Taken per Fold (in seconds), lower is better. Without Feature Selection vs With Feature Selection, Folds 1-5]
[Chart: Performance per Fold (F-Measure), higher is better. Without Feature Selection vs With Feature Selection, Folds 1-5]
Demo
27
Conclusion
28
Conclusion
29
[Chart: Baseline vs Final. Baseline: Precision 0.396, Recall 0.980, F-Measure 0.562. Final: Precision 0.691, Recall 0.699, F-Measure 0.693]

Performance     Percentage
Precision (P)   69.1 ± 0.61%
Recall (R)      69.9 ± 0.56%
F-Measure (F)   69.3 ± 0.59%
Conclusion
• Development of new corpus with semi-automatic annotation
• Published at PubAnnotation: http://pubannotation.org/projects/relna
• Development of new method for extracting relations of transcription
factors
• Registered with Elixir Tools: https://bio.tools/tool/RostLab/relna/0.1.0
• Available on GitHub: https://github.com/Rostlab/relna
• Integration into nalaf and building a generalized relation extraction
tool
30
Future Work
• Coreference resolution techniques
• Generalizing method for spanning multiple sentences
• Further testing with neural networks
31
Thank you
32
Corpus Statistics
46
Corpus Statistics
Sentence Distance for Relations
Sentence Distance   Number of Relations
0                   1196
1                   30
2                   3
3                   1
4                   2
48
Performance on LocText
[Chart: Comparison with LocText. Precision, Recall and F-Measure for Original vs Relna]
49
Parsing
• Dependency Parsing
• Identify the relations between words
• O(n³) algorithm, with n as the number of words in the sentence
• Constituency Parsing
• Identify phrases (noun chunks, verb chunks, etc.) and their relative structure and hierarchy in the sentence
• O(n⁵) algorithm, with n as the number of words in the sentence
36
Exhaustive List of Features
• Sentence Features
• BOW, Stem
• #entities, #BOW count
• Token Features
• Token Text, Masked Text
• Stem, POS
• Capitalization, Digits, Hyphens and other Punctuation
• Char bigrams and trigrams
• Dependency Features for Shortest Paths
• Path direction (e.g. FFRFR)
• Dependency types in path
• Path length
• Intermediate Tokens
• Path Constituents (e.g. “interact”, “bind”, etc.)
• Root word of the sentence
• Linear Context
33
Other Features
• Linear distance between entities
• Presence of specific words in the sentence
• Prior tokens
• Intermediate tokens
• Post tokens
• N-gram features
• Bigram
• Trigram
• Relative Entity Order
• Conjoint Entity Text
34
N-gram Features
• The cow jumped over the moon
• “the”, “cow”, “jumped”, “over”, “moon”
• “the cow”, “cow jumped”, “jumped over”, “over the”, “the moon”
• “the cow jumped”, “cow jumped over”, “jumped over the”, “over the moon”
• …
35
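The n-gram construction illustrated above is a one-liner:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumped over the moon".split()
bigrams = ngrams(tokens, 2)
# -> ["the cow", "cow jumped", "jumped over", "over the", "the moon"]
```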

