5. Transcription Factor Interactions
• A transcription factor binds to specific DNA sequences and controls transcription
• Two types of interactions
  • Transcription factor regulates the transcription of a gene
  • Protein modifies or interacts with a transcription factor
• Estimated 45,000+ such interactions
8. Related Work
• TRANSFAC
  • Manually curated DB
  • Eukaryotic TFs and genomic binding sites
  • Commercial version contains reports for 21,000 transcription factors[1]
  • Public version contains reports for 7,000 transcription factors[1]
• TcoF – Transcription co-Factor Database
  • 1365 transcription factors[2]
  • Manually curated from BioGrid, MINT and EBI
• Integrated Transcription Factor Platform
  • Predicted interactions using sequence data[3]
  • SVMs
[1]: https://portal.biobase-international.com/archive/documents/transfac_comparison.pdf
[2]: http://cbrc.kaust.edu.sa/tcof/
[3]: http://itfp.biosino.org/itfp/
11. Scan Uniprot for GO:0003700 & Homo sapiens
• Filter Swissprot for transcription factor activity, sequence-specific DNA binding (GO term GO:0003700) and Homo sapiens
• Workflow (in order):
  • Scan Uniprot for GO:0003700 & Homo sapiens
  • List publications cited for “Interaction with”
  • Run GNormPlus
  • Normalize Gene ID to Uniprot ID
  • If Uniprot ID contains GO:0003700, label transcription factor
  • Offset and Boundary Corrections
  • Manual Annotation of Relations on tagtog
12. List publications cited for “Interaction with”
• For each protein, obtain the list of publications cited for “INTERACTION WITH”
13. Run GNormPlus
• Run GNormPlus, a gene tagger, on each of these abstracts, yielding Gene or Gene Product (GGP) annotations
14. Normalize Gene ID to Uniprot ID
• Entrez Gene IDs are normalized to Uniprot IDs using priority selection (first Swissprot, then TrEMBL)
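The priority selection can be sketched as below. This is a minimal illustration, assuming a hypothetical in-memory mapping from Entrez Gene IDs to candidate Uniprot entries, each flagged as reviewed (Swissprot) or unreviewed (TrEMBL). The identifiers and the mapping layout are placeholders, not the format used in the actual workflow.

```python
from typing import Dict, List, NamedTuple, Optional


class UniprotEntry(NamedTuple):
    accession: str
    reviewed: bool  # True for Swissprot, False for TrEMBL


def normalize_gene_id(entrez_id: str,
                      mapping: Dict[str, List[UniprotEntry]]) -> Optional[str]:
    """Return one Uniprot accession for an Entrez Gene ID.

    Swissprot (reviewed) entries take priority; a TrEMBL entry is used
    only when no Swissprot entry exists. Returns None if unmapped.
    """
    candidates = mapping.get(entrez_id, [])
    # sorted() is stable, so ties keep their original order;
    # reviewed entries sort first because False < True.
    ranked = sorted(candidates, key=lambda entry: not entry.reviewed)
    return ranked[0].accession if ranked else None


# Hypothetical mapping with placeholder identifiers.
example_mapping = {
    "gene_a": [UniprotEntry("TREMBL_1", False), UniprotEntry("SWISS_1", True)],
    "gene_b": [UniprotEntry("TREMBL_2", False)],
}
```

With this mapping, `gene_a` resolves to its Swissprot entry even though a TrEMBL entry is listed first, and `gene_b` falls back to TrEMBL.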
15. If Uniprot ID contains GO:0003700, label transcription factor
• All GGPs are cross-referenced with GO term GO:0003700 and its descendants
• If the Uniprot ID carries an annotation for transcription factor activity, the GGP is labeled as a transcription factor
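The descendant check can be sketched as follows. Only GO:0003700 comes from the slide; the child-term table uses placeholder IDs and is an assumption for illustration, standing in for whatever ontology lookup the real workflow uses.

```python
from typing import Dict, Iterable, List, Set

TF_ACTIVITY = "GO:0003700"  # GO term named on the slide


def descendants(term: str, children: Dict[str, List[str]]) -> Set[str]:
    """Collect a term together with all of its descendants."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def is_transcription_factor(annotations: Iterable[str],
                            tf_terms: Set[str]) -> bool:
    """True if any GO annotation is GO:0003700 or one of its descendants."""
    return any(term in tf_terms for term in annotations)


# Placeholder ontology fragment (hypothetical child IDs).
toy_children = {
    TF_ACTIVITY: ["GO:CHILD_A", "GO:CHILD_B"],
    "GO:CHILD_A": ["GO:GRANDCHILD"],
}
tf_terms = descendants(TF_ACTIVITY, toy_children)
```

Precomputing the descendant set once keeps the per-protein check a simple set lookup.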
16. Offset and Boundary Corrections
• Correct entity boundaries and offsets, and add abbreviations
17. Manual Annotation of Relations on tagtog
• Relations are annotated manually on tagtog
21. Method Development
• Pipeline (based on nalaf)
• A Dataset object is created and passed from one module to the next in the pipeline
• Modules:
  • Data Reader: reads textual data from some input (PMID, HTML) to create a Dataset object
  • Annotation Recognition: reads annotations from PubTator or ann.json and adds them to the Dataset
  • Splitter: splits the text of each Document in the dataset into sentences
  • Tokenizer: creates Tokens representing the smallest processing unit, usually words
  • Parser: parses each sentence and stores syntactic and dependency parse trees
  • Feature Generator: generates features for learning a model
  • Learning: handles learning and prediction using SVMLight
  • Evaluator: evaluates prediction performance at the edge and document level
  • Edge Generator: generates potential relations and reduces relation extraction to binary classification
  • Writer: writes the predicted annotations in tagtog-compatible ann.json format
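The idea of a Dataset object handed from module to module, with an edge generator that turns relation extraction into binary classification, can be sketched as below. This is a toy illustration only: the `Dataset` class, the module functions, and the naive splitting rules are assumptions made for this sketch and do not reproduce the nalaf or relna API.

```python
import re
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class Dataset:
    """Toy container passed from one module to the next."""
    text: str
    entities: List[str] = field(default_factory=list)
    sentences: List[str] = field(default_factory=list)
    tokens: List[List[str]] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)


def split_sentences(dataset: Dataset) -> Dataset:
    # Naive splitter: break on whitespace after sentence-final punctuation.
    parts = re.split(r"(?<=[.!?])\s+", dataset.text)
    dataset.sentences = [part for part in parts if part]
    return dataset


def tokenize(dataset: Dataset) -> Dataset:
    # Naive tokenizer: runs of word characters or single punctuation marks.
    dataset.tokens = [re.findall(r"\w+|[^\w\s]", s) for s in dataset.sentences]
    return dataset


def generate_edges(dataset: Dataset) -> Dataset:
    # Every pair of entities in the same sentence becomes a candidate
    # relation; a classifier would later label each candidate yes or no.
    for sentence in dataset.sentences:
        present = [e for e in dataset.entities if e in sentence]
        dataset.edges.extend(combinations(present, 2))
    return dataset


def run_pipeline(dataset: Dataset,
                 modules: List[Callable[[Dataset], Dataset]]) -> Dataset:
    for module in modules:
        dataset = module(dataset)
    return dataset


data = Dataset(
    text="Androgen receptor interacts with SMRT. No other pair is mentioned here.",
    entities=["Androgen receptor", "SMRT"],
)
result = run_pipeline(data, [split_sentences, tokenize, generate_edges])
```

Because every module takes and returns the same object, modules can be reordered or swapped without changing the others.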
28. Feature Generation
• Sentence Features
• Token Features
• N-gram Features
Feature example:
“androgen receptor”, “receptor interacts”, “interacts with”, “with SMRT”
29. Feature Generation
• Linear Context and Distance between Entities
Feature example:
“linear_distance” : 2
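A minimal sketch of the distance feature, assuming it counts the tokens that lie strictly between the two entity mentions. The example sentence is inferred from the bigram example on the earlier slide, and the span representation is an assumption of this sketch.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def linear_distance(first: Span, second: Span) -> int:
    """Number of tokens strictly between two non-overlapping entity spans."""
    left, right = sorted([first, second])
    return max(0, right[0] - left[1])


tokens = ["androgen", "receptor", "interacts", "with", "SMRT"]
androgen_receptor = (0, 2)  # covers "androgen receptor"
smrt = (4, 5)               # covers "SMRT"
features = {"linear_distance": linear_distance(androgen_receptor, smrt)}
```

Under this assumption the two tokens "interacts" and "with" separate the entities, which matches the value 2 shown in the feature example.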
30. Feature Generation
• Dependency Features
  • Shortest path between the two entities
  • Path constituents
  • Root word
  • Path to root word
Feature example (shortest path between the two entities):
receptor -> interacts -> with -> SMRT
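The shortest-path feature can be sketched with a breadth-first search over the dependency tree treated as an undirected graph. The head assignments below are written by hand to match the example path and are an assumption, not parser output.

```python
from collections import deque
from typing import Dict, List, Optional

# Hand-written head of each token (None marks the root); an assumption
# chosen to be consistent with the example path on the slide.
heads: Dict[str, Optional[str]] = {
    "androgen": "receptor",
    "receptor": "interacts",
    "interacts": None,
    "with": "interacts",
    "SMRT": "with",
}


def build_graph(heads: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Turn head pointers into an undirected adjacency list."""
    graph: Dict[str, List[str]] = {token: [] for token in heads}
    for token, head in heads.items():
        if head is not None:
            graph[token].append(head)
            graph[head].append(token)
    return graph


def shortest_path(graph: Dict[str, List[str]],
                  start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the token path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(build_graph(heads), "receptor", "SMRT")
# Path constituents: the tokens between the two entity tokens.
constituents = path[1:-1] if path else []
```

The inner tokens of the path give the path constituents, here "interacts" and "with".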
31. Feature Generation
• Dependency Features
  • Path constituents
Feature example:
[“interacts”, “with”]
32. Feature Generation
• Dependency Features
  • Root word
Feature example:
“root_word”: “interacts”
33. Feature Generation
• Dependency Features
  • Path to root word
Feature example:
receptor -> interacts
SMRT -> with -> interacts
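The root word and the path from each entity to the root can be sketched by following head pointers upward. The head map is hand-written to agree with the example and is an assumption, not the output of the parser used in the talk.

```python
from typing import Dict, List, Optional

# Hand-written head of each token (None marks the root).
heads: Dict[str, Optional[str]] = {
    "androgen": "receptor",
    "receptor": "interacts",
    "interacts": None,
    "with": "interacts",
    "SMRT": "with",
}


def root_word(heads: Dict[str, Optional[str]]) -> str:
    """The single token that has no head."""
    return next(token for token, head in heads.items() if head is None)


def path_to_root(token: str, heads: Dict[str, Optional[str]]) -> List[str]:
    """Follow head pointers from a token up to the root."""
    path = [token]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


features = {
    "root_word": root_word(heads),
    "path_to_root_1": " -> ".join(path_to_root("receptor", heads)),
    "path_to_root_2": " -> ".join(path_to_root("SMRT", heads)),
}
```

The feature keys are illustrative names only.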
46. Conclusion
• Development of a new corpus with semi-automatic annotation
  • Published at PubAnnotation: http://pubannotation.org/projects/relna
• Development of a new method for extracting relations of transcription factors
  • Registered with Elixir Tools: https://bio.tools/tool/RostLab/relna/0.1.0
  • Available on GitHub: https://github.com/Rostlab/relna
• Integration into nalaf and building a generalized relation extraction tool
47. Future Work
• Coreference resolution techniques
• Generalizing the method to relations spanning multiple sentences
• Further testing with neural networks
53. Parsing
• Dependency Parsing
  • Identifies the relations between words
  • O(n^3) algorithm, with n the number of words in the sentence
• Constituency Parsing
  • Identifies phrases (noun chunks, verb chunks, etc.) and their relative structure and hierarchy in the sentence
  • O(n^5) algorithm, with n the number of words in the sentence
54. Exhaustive List of Features
• Sentence Features
  • BOW, Stem
  • #entities, #BOW count
• Token Features
  • Token Text, Masked Text
  • Stem, POS
  • Capitalization, Digits, Hyphens and other Punctuation
  • Char bigrams and trigrams
• Dependency Features for Shortest Paths
  • Path direction (e.g. FFRFR)
  • Dependency types in path
  • Path length
  • Intermediate Tokens
  • Path Constituents (e.g. “interact”, “bind”, etc.)
• Root word of the sentence
• Linear Context
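A few of the token features in the list above can be sketched as follows. The function names and feature keys are illustrative assumptions, not the names used in relna.

```python
from typing import Dict, List, Union


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of a token."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_features(token: str) -> Dict[str, Union[str, bool, List[str]]]:
    """Surface features for one token."""
    return {
        "text": token,
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(char.isdigit() for char in token),
        "has_hyphen": "-" in token,
        "char_bigrams": char_ngrams(token, 2),
        "char_trigrams": char_ngrams(token, 3),
    }


example = token_features("SMRT")
```

Tokens shorter than n simply yield an empty n-gram list, so no special casing is needed.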
55. Other Features
• Linear distance between entities
• Presence of specific words in the sentence
  • Prior tokens
  • Intermediate tokens
  • Post tokens
• N-gram features
  • Bigram
  • Trigram
• Relative Entity Order
• Conjoint Entity Text
56. N-gram Features
• The cow jumped over the moon
  • Unigrams: “the”, “cow”, “jumped”, “over”, “moon”
  • Bigrams: “the cow”, “cow jumped”, “jumped over”, “over the”, “the moon”
  • Trigrams: “the cow jumped”, “cow jumped over”, “jumped over the”, “over the moon”
  • …
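The n-gram lists above can be reproduced with a short sliding-window function. Lowercasing, whitespace tokenization, and removing repeated unigrams are assumptions of this sketch, chosen to match the lists on the slide.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Slide a window of n tokens over the sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The cow jumped over the moon".lower().split()

# dict.fromkeys drops the repeated "the" while keeping first-seen order.
unigrams = list(dict.fromkeys(ngrams(tokens, 1)))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sentence of k tokens yields k - n + 1 n-grams before any duplicates are removed.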