1. DISCOVERY AND EXPLORATION
OF CONSERVED MODULES OF
CHEMICAL TRANSFORMATIONS
IN METABOLISM
MARIA SOROKINA
3 FEBRUARY 2016
Doctoral school
Structure et Dynamique des Systèmes Vivants
7. What is metabolism?
Metabolism is the set of biochemical processes by which
living organisms stay alive, grow, reproduce and
interact with their environment
« Μεταβολή » (metabôlé), Greek: change, transformation
Chemical transformations mainly concern small molecules
(metabolites), which are modified by (bio)chemical reactions
Biochemical reactions are often catalysed by enzymes: proteins encoded in the
organism's genome that facilitate specific reactions
Successive reactions aimed at producing or degrading a target metabolite are
described as metabolic pathways
8. Biochemical reactions are often catalysed by enzymes: proteins encoded
in the organism's genome that facilitate specific reactions
[Figure: genome, enzymes, reactions, transforming metabolites]
9. Presentation Outline
Introduction: From Genome to Metabolism
Part I: Orphan Enzymes
Part II: Reaction Molecular Signature Network and Conserved Modules
Part III: Combining Genomic and Metabolic Contexts
Conclusions & Perspectives
13. From Genome To Metabolism
Sequencing
Finding CDS (protein-coding genes)
Functional annotation
[Figure: genome, enzymes, reactions, transforming metabolites]
14. Functional Annotation
Assigning a biological function to a protein
« Through experimentation (high confidence)
« Homology detection through sequence similarity (BLAST, …)
« Genomic context
« Protein structural analysis
« Rules-based annotation systems
« Community annotation systems
19. Metabolic Networks
Bipartite network of metabolites and reactions
o Nodes = metabolites and reactions
[Figure: bipartite network linking pyruvate, formate, acetyl-CoA, acetaldehyde, ethanol, coenzyme A, NADH and NAD+ to reactions 2.3.1.54, 1.2.1.10 and 1.1.1.1]
20. Metabolic Networks
Metabolite network
o Nodes = metabolites
o Edge between two nodes if there is a reaction where one of the metabolites is the substrate and the other is the product
[Figure: metabolite network of pyruvate, formate, acetyl-CoA, acetaldehyde, ethanol, coenzyme A, NADH and NAD+]
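As a minimal sketch of this definition, the snippet below derives metabolite-network edges from the three reactions shown in the figure. The substrate and product lists follow the usual stoichiometry of these EC activities, written in the pyruvate-to-ethanol direction with H+ omitted; that direction and the helper name `metabolite_edges` are assumptions made for illustration.

```python
from itertools import product

# Toy reactions keyed by EC number: (substrates, products).
# The pyruvate-to-ethanol direction is an assumption; H+ is omitted.
REACTIONS = {
    "2.3.1.54": (["Pyruvate", "Coenzyme A"], ["Acetyl-CoA", "Formate"]),
    "1.2.1.10": (["Acetyl-CoA", "NADH"], ["Acetaldehyde", "Coenzyme A", "NAD+"]),
    "1.1.1.1": (["Acetaldehyde", "NADH"], ["Ethanol", "NAD+"]),
}


def metabolite_edges(reactions):
    """Return undirected substrate-product pairs found in at least one reaction."""
    edges = set()
    for substrates, products in reactions.values():
        for substrate, prod in product(substrates, products):
            edges.add(frozenset((substrate, prod)))
    return edges


edges = metabolite_edges(REACTIONS)
print(frozenset(("Pyruvate", "Formate")) in edges)  # linked by 2.3.1.54
print(frozenset(("Pyruvate", "Ethanol")) in edges)  # never in the same reaction
```

Storing each edge as a `frozenset` keeps the network undirected and removes pairs that several reactions share.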
21. Metabolic Networks
Reaction network
o Nodes = reactions
o Edge between two nodes if a metabolite produced by one reaction is a substrate of the other reaction
[Figure: reaction network of reactions 2.3.1.54, 1.2.1.10 and 1.1.1.1]
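The edge rule can be sketched on the same three reactions. This is only an illustration: the reaction direction (pyruvate towards ethanol), the omission of H+ and the function name `reaction_edges` are assumptions.

```python
# Toy reactions keyed by EC number: (substrates, products).
# The pyruvate-to-ethanol direction is an assumption; H+ is omitted.
REACTIONS = {
    "2.3.1.54": (["Pyruvate", "Coenzyme A"], ["Acetyl-CoA", "Formate"]),
    "1.2.1.10": (["Acetyl-CoA", "NADH"], ["Acetaldehyde", "Coenzyme A", "NAD+"]),
    "1.1.1.1": (["Acetaldehyde", "NADH"], ["Ethanol", "NAD+"]),
}


def reaction_edges(reactions):
    """Directed edge (a, b) when some product of a is a substrate of b."""
    edges = set()
    for a, (_, products_a) in reactions.items():
        for b, (substrates_b, _) in reactions.items():
            if a != b and set(products_a) & set(substrates_b):
                edges.add((a, b))
    return edges


edges = reaction_edges(REACTIONS)
for source, target in sorted(edges):
    print(source, "->", target)
```

Besides the expected chain, the sketch also links 1.2.1.10 back to 2.3.1.54 through coenzyme A, which shows how shared cofactors add edges that do not reflect the main carbon flow.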
22. Metabolic Networks
Enzyme network
o Nodes = enzymes
o Edge between two nodes if a metabolite produced by one enzyme is a substrate of the other enzyme
o Limitations:
  o An enzyme can catalyse several reactions
  o A reaction can be catalysed by several enzymes
  o Incomplete knowledge of enzymes (orphan enzymes)
[Figure: enzyme network of pyruvate formate-lyase, acetaldehyde dehydrogenase and alcohol dehydrogenase]
23. Metabolic Networks
Ubiquitous compounds problem
CO2, ATP/ADP, H2O, H+, NAD(P)+/NAD(P)H, …
They create large hubs in the metabolic network
→ they need to be taken into account!
“Primary” and “secondary” metabolites in reactions in pathways
Ubiquitous: existing or being everywhere, especially at the same time; omnipresent
Hub: a highly connected node in a graph
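One possible way to take ubiquitous compounds into account is to count how many reactions each metabolite takes part in and to ignore a chosen set of compounds when linking reactions. The sketch below assumes a toy three-reaction network (pyruvate-to-ethanol direction, H+ omitted); the ignored set, which includes coenzyme A, is an illustrative choice and not a list given on the slide.

```python
from collections import Counter

# Toy reactions keyed by EC number: (substrates, products); direction is assumed.
REACTIONS = {
    "2.3.1.54": (["Pyruvate", "Coenzyme A"], ["Acetyl-CoA", "Formate"]),
    "1.2.1.10": (["Acetyl-CoA", "NADH"], ["Acetaldehyde", "Coenzyme A", "NAD+"]),
    "1.1.1.1": (["Acetaldehyde", "NADH"], ["Ethanol", "NAD+"]),
}


def metabolite_degree(reactions):
    """Count in how many reactions each metabolite takes part."""
    degree = Counter()
    for substrates, products in reactions.values():
        for metabolite in set(substrates) | set(products):
            degree[metabolite] += 1
    return degree


def reaction_edges(reactions, ignored=frozenset()):
    """Directed reaction-reaction edges, skipping links made only by ignored compounds."""
    edges = set()
    for a, (_, products_a) in reactions.items():
        for b, (substrates_b, _) in reactions.items():
            shared = (set(products_a) & set(substrates_b)) - set(ignored)
            if a != b and shared:
                edges.add((a, b))
    return edges


degree = metabolite_degree(REACTIONS)
all_edges = reaction_edges(REACTIONS)
# Treating coenzyme A as ubiquitous is an illustrative choice, not from the slide.
filtered_edges = reaction_edges(REACTIONS, ignored={"Coenzyme A", "NADH", "NAD+"})
print(degree.most_common(3))
print(sorted(all_edges - filtered_edges))
```

On real networks the ignored set would typically be chosen from the highest-degree metabolites, or defined per reaction using primary/secondary annotations.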
25. Main difficulties in metabolic network reconstruction
from whole genomes:
« Gene functional annotation issues
>60% of functional annotations in UniProt may be erroneous (2009)
26. Main difficulties in metabolic network reconstruction
from whole genomes:
« Gene functional annotation issues
« Orphan enzymes
28. What is an orphan enzyme?
An “orphan enzyme activity” (or “orphan enzyme” for short) is a known
biochemical activity for which there is no associated sequence (yet)
29. Orphan enzymes
2004: Karp: Call for an enzyme genomics initiative. (38% of orphan enzymes)
2005: Lespinet & Labedan: Orphan enzymes? (42% of orphan enzymes)
2006: Lespinet & Labedan: ORENZA database. (36% of orphan enzymes)
2007: Chen & Vitkup: Distribution of orphan metabolic activities. (34% of orphan enzymes)
2007: Pouliot & Karp: A survey of orphan enzyme activities. (34% of orphan enzymes)
30. Orphan enzymes
[Pie chart: >5,000 enzymatic activities (IUBMB EC numbers), split 22% / 78%; labelled slice: orphan enzymes]
Enzyme Commission (EC) number: official classification of enzyme activities
Four levels: reaction class, metabolite type, reaction nature, serial number
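A small sketch of how the four levels of an EC number can be split, using the level names from the slide as field names. The class `ECNumber` and the function `parse_ec` are hypothetical helpers; partial numbers such as `1.1.1.-` are rejected here for simplicity.

```python
from typing import NamedTuple

# Top-level EC classes as defined by the IUBMB (class 7 was added in 2018).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


class ECNumber(NamedTuple):
    reaction_class: int
    metabolite_type: int
    reaction_nature: int
    serial_number: int


def parse_ec(ec):
    """Split a complete EC number such as '2.3.1.54' into its four levels."""
    fields = ec.strip().split(".")
    if len(fields) != 4 or not all(field.isdigit() for field in fields):
        raise ValueError(f"not a complete EC number: {ec!r}")
    return ECNumber(*map(int, fields))


ec = parse_ec("2.3.1.54")
print(ec, EC_CLASSES[ec.reaction_class])
```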
31. Enzyme activities and annotated proteins over years
Limited number of recently discovered activities
[Figure: enzyme activities and annotated proteins over time, with periods labelled protein sequencing, DNA sequencing, expression cloning and genomics]
32. Enzyme discovery and protein families
[Pie chart: >14,000 protein families (Pfam), split 23% / 77%; labelled slice: unknown function]
[Pie chart: >5,000 enzymatic activities (IUBMB EC numbers), split 22% / 78%; labelled slice: orphan enzymes]
33. Enzyme discovery and protein families
Newly discovered enzymatic activities are mostly associated with already
known enzyme families
34. Local Orphan Enzymes
Enzymatic activities that have been observed
in at least one organism of a given clade and
have an associated sequence in another clade
but not in this one

Local orphan EC numbers | Archaea | Bacteria | Eukaryotes
Total number of concerned EC numbers | 79 | 133 | 299
% of EC retrieved with PRIAM (significant hit with a detected protein) | 30% | 30% | 59%
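The definition above can be read as a set operation. The following sketch uses placeholder activity labels (`EC_a`, `EC_b`, `EC_c`) rather than real data, and `local_orphans` is a hypothetical helper name.

```python
def local_orphans(observed, sequenced, clade):
    """Activities observed in `clade`, sequenced in another clade but not in this one."""
    here = sequenced.get(clade, set())
    elsewhere = set().union(
        *(ecs for other, ecs in sequenced.items() if other != clade)
    )
    return (observed.get(clade, set()) - here) & elsewhere


# Placeholder data: which activities were observed or sequenced in which clade.
observed = {"Archaea": {"EC_a", "EC_b", "EC_c"}}
sequenced = {"Archaea": {"EC_a"}, "Bacteria": {"EC_b"}, "Eukaryotes": set()}

result = local_orphans(observed, sequenced, "Archaea")
print(result)
```

In this toy case `EC_c` is not a local orphan: it has no sequence in any clade, so it is an orphan everywhere.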
35. Main difficulties in metabolic network reconstruction
from whole genomes:
« Gene functional annotation issues
« Orphan enzymes
« Lack of knowledge on organism metabolic diversity
43. Lack of knowledge about metabolic diversity in non-model organisms
What strategy can be adopted to counter this lack of knowledge?
44. All main hypotheses on metabolic pathway evolution agree on the
importance of enzyme promiscuity, i.e. the capacity of enzymes to catalyse
one or several reactions on more or less different substrates…
…so we should look at the conservation of chemical transformations in
pathways, and not only at the conservation of enzymatic reactions
50. Molecular signature
Set of sub-graphs of a given diameter (height) centred on each atom of the
molecule
Carbonell, P., Carlsson, L., Faulon, J.-L.: Stereo signature molecular descriptor. Journal of Chemical Information and Modeling 53(4), 887–97 (2013)
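To make the idea concrete, here is a deliberately simplified sketch. Each atom is described by a string built from its neighbourhood up to a given height, and the molecule is the multiset of these strings. It is not the canonical (stereo) signature of the cited paper: it only distinguishes single and double bonds, ignores hydrogens and stereochemistry, and all function names are assumptions.

```python
from collections import Counter


def atom_signature(atoms, neighbours, root, height, parent=None):
    """String describing the tree of atoms reachable from `root` within `height` bonds."""
    label = atoms[root]
    if height == 0:
        return label
    branches = sorted(
        bond + atom_signature(atoms, neighbours, nxt, height - 1, parent=root)
        for nxt, bond in neighbours[root]
        if nxt != parent
    )
    return f"{label}({''.join(branches)})" if branches else label


def molecular_signature(atoms, bonds, height):
    """Multiset of atom signatures; bonds are (atom_a, atom_b, order) triples."""
    neighbours = {index: [] for index in atoms}
    for a, b, order in bonds:
        symbol = "=" if order == 2 else ""  # only single/double bonds are handled
        neighbours[a].append((b, symbol))
        neighbours[b].append((a, symbol))
    return Counter(
        atom_signature(atoms, neighbours, index, height) for index in atoms
    )


# Acetaldehyde, heavy atoms only: CH3-CH=O
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 2)]
signature = molecular_signature(atoms, bonds, height=1)
print(signature)
```

Increasing `height` makes each atom's description more specific, at the cost of fewer shared fragments between related molecules.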
55. How To Represent Reactions And Their Chemical
Transformation Type?
57. Reaction molecular signature (RMS)
Difference between the molecular signatures of the products and of the substrates of the
reaction
… specifically, only the substructures that change are kept, which gives a way to encode the chemical
transformation
Carbonell, P., Carlsson, L., Faulon, J.-L.: Stereo signature molecular descriptor. Journal of Chemical Information and Modeling 53(4), 887–97 (2013)
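A minimal sketch of this difference, assuming that each molecular signature is available as a multiset (`Counter`) of fragment strings. The fragment strings below are toy height-1 descriptions of the heavy atoms of acetaldehyde and ethanol, cofactors are left out for brevity, and `reaction_signature` is a hypothetical helper name.

```python
from collections import Counter


def reaction_signature(substrates, products):
    """Product fragment counts minus substrate fragment counts, zero entries dropped."""
    diff = Counter()
    for signature in products:
        diff.update(signature)
    for signature in substrates:
        diff.subtract(signature)
    return {fragment: n for fragment, n in diff.items() if n != 0}


# Toy signatures (assumed notation): "=" marks a double bond to the neighbour.
acetaldehyde = Counter({"C(C)": 1, "C(=OC)": 1, "O(=C)": 1})
ethanol = Counter({"C(C)": 1, "C(CO)": 1, "O(C)": 1})

rms = reaction_signature([acetaldehyde], [ethanol])
print(rms)
```

The unchanged methyl fragment cancels out, so only the carbonyl-to-alcohol change remains. Two reactions acting on different molecules but performing the same change would share this signature.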
66. « Nodes represent reactions
« Two nodes are linked by a directed edge if a metabolite produced by the first reaction is consumed by the second reaction
« 5,830 nodes
« 11,197 edges
71. Transformation of a reaction network into an RMS network
Markov chain transition probabilities of order 1 between connected RMS (between RMSi and RMSj)
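The slide does not say how these probabilities are estimated. One plausible reading, sketched below as an assumption, is a maximum-likelihood estimate: every reaction-level edge is mapped to its pair of RMS, and P(RMSj | RMSi) is the fraction of edges leaving RMSi that reach RMSj. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(rms_edges):
    """Estimate first-order P(RMS_j | RMS_i) from a list of (RMS_i, RMS_j) edges."""
    counts = defaultdict(Counter)
    for source, target in rms_edges:
        counts[source][target] += 1
    return {
        source: {
            target: n / sum(outgoing.values()) for target, n in outgoing.items()
        }
        for source, outgoing in counts.items()
    }


# Toy input: one entry per reaction-level edge, already mapped to RMS labels.
rms_edges = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")]
probabilities = transition_probabilities(rms_edges)
print(probabilities)
```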
75. Pathway conservation index (PCI)
✴ Computed for each RMS path present in at least one known metabolic pathway
✴ Represents the number of corresponding reaction paths that are present in at least one MetaCyc pathway
… captures the chemical redundancy across known metabolism
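As a sketch of this counting, assume linear pathways given as ordered lists of reactions and a mapping from each reaction to a single RMS. The PCI of an RMS path is then the number of distinct reaction paths that translate to it. The identifiers below are placeholders, not MetaCyc entries, and the function name is hypothetical.

```python
from collections import defaultdict


def pathway_conservation_index(pathways, reaction_to_rms, length):
    """Map each RMS path of `length` steps to its number of distinct reaction paths."""
    reaction_paths = defaultdict(set)
    for reactions in pathways.values():
        for start in range(len(reactions) - length + 1):
            window = tuple(reactions[start:start + length])
            rms_path = tuple(reaction_to_rms[r] for r in window)
            reaction_paths[rms_path].add(window)
    return {rms_path: len(paths) for rms_path, paths in reaction_paths.items()}


# Placeholder pathways and reaction-to-RMS mapping.
pathways = {"P1": ["r1", "r2", "r3"], "P2": ["r4", "r5"], "P3": ["r1", "r2"]}
reaction_to_rms = {"r1": "A", "r2": "B", "r3": "C", "r4": "A", "r5": "B"}

pci = pathway_conservation_index(pathways, reaction_to_rms, length=2)
print(pci)
```

The reaction path `r1, r2` occurs in two pathways but is counted once, so the index measures chemical redundancy rather than pathway overlap.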
86. RMS path scores
wRea: number of reactions described by an RMS
scoreRea: diversity of reactions performing the same chemical transformation
87. RMS path scores
wPageRank: feedback centrality. The more neighbours a node has, the more central it is; the more central a node is, the more central its neighbours are
scorePageRank: topological importance of the module in the network, highlighting chemical hubs
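For reference, a textbook PageRank can be computed by power iteration as sketched below. Whether wPageRank uses exactly this variant is an assumption; the damping factor 0.85, the iteration count and the toy edges are illustrative.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is shared among all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Toy RMS network: two RMS feed into "C", which feeds into "D".
ranks = pagerank([("A", "C"), ("B", "C"), ("C", "D")])
print(ranks)
```

Nodes that receive links from many nodes, or from highly ranked nodes, obtain a higher rank, which matches the feedback idea described above.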
89. RMS path scores
wProt: estimate of the number of proteins associated with a given RMS
scoreProt: diversity of enzymes performing the same chemical transformation
30% of RMS have weightProt = 0
→ Link with orphan enzymes?
90. RMS path scores
Significant difference between the score distributions of known metabolic pathways and of all/random
paths in the RMS network
(Kruskal-Wallis & Tukey HSD tests for validation: p-value << 0.05)
91. RMS path scores
Learning pathway types from known metabolic pathways using rules combining
scoreProt, scoreRea and scorePageRank
NNge algorithm
Pathway type prediction with an accuracy of 89% for RMS paths
5 metabolic pathway types:
✴ biosynthesis
✴ degradation
✴ detoxification
✴ energy creation
✴ other
96. Gene Clusters: Operons
Operon: genomic unit containing a group of genes:
« co-localised on the same strand
« controlled by the same promoter
« co-transcribed into a polycistronic mRNA
« often associated with the same cellular function
99. Linking directon genes to RMS
[Figure: directon genes linked to RMS1, RMS2, RMS3 and RMS4]
A Pfam family is often associated with several RMS
→ A gene is therefore often associated with several RMS
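The gene-to-RMS link can be sketched as a two-step lookup, gene to Pfam families and Pfam family to RMS. The identifiers below are placeholders and `genes_to_rms` is a hypothetical helper name.

```python
def genes_to_rms(gene_to_pfams, pfam_to_rms):
    """Collect, for each gene, the RMS reachable through any of its Pfam families."""
    links = {}
    for gene, pfams in gene_to_pfams.items():
        links[gene] = set()
        for pfam in pfams:
            links[gene] |= pfam_to_rms.get(pfam, set())
    return links


# Placeholder annotations for three genes of one directon.
gene_to_pfams = {"gene1": ["PF_a"], "gene2": ["PF_a", "PF_b"], "gene3": ["PF_c"]}
pfam_to_rms = {"PF_a": {"RMS1", "RMS2"}, "PF_b": {"RMS3"}}

links = genes_to_rms(gene_to_pfams, pfam_to_rms)
print(links)
```

A gene whose families have no known RMS (here `gene3`) simply maps to an empty set.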
103. Best paths selection for the directon
« Max number of gene colours
« High path scores (scoreRea, scorePageRank, scoreProt)
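The slide lists the two criteria without saying how they are combined. The sketch below assumes one possible rule: rank candidate paths first by the number of distinct directon genes ("colours") they cover, then by a single combined path score. All names and values are illustrative.

```python
def gene_colours(path, rms_to_genes):
    """Number of distinct directon genes associated with the RMS of a path."""
    genes = set()
    for rms in path:
        genes |= rms_to_genes.get(rms, set())
    return len(genes)


def select_best_paths(paths, rms_to_genes, scores, top=1):
    """Sort paths by (gene colours, score), highest first, and keep the `top` ones."""
    ranked = sorted(
        paths,
        key=lambda path: (gene_colours(path, rms_to_genes), scores[path]),
        reverse=True,
    )
    return ranked[:top]


# Placeholder data: RMS-to-gene links and one combined score per candidate path.
rms_to_genes = {"RMS1": {"gene1"}, "RMS2": {"gene1", "gene2"}, "RMS3": {"gene3"}}
paths = [("RMS1", "RMS2"), ("RMS2", "RMS3"), ("RMS1", "RMS3")]
scores = {paths[0]: 0.9, paths[1]: 0.4, paths[2]: 0.7}

best = select_best_paths(paths, rms_to_genes, scores)
print(best)
```

With this ordering a path covering more genes wins even with a lower score; a weighted combination of both criteria would be an equally reasonable choice.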
104. Protein Family Case
Case study for the Baeyer-Villiger monooxygenase (BVMO) protein family
A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure.
106. ★ All RMS catalysed by the protein family
★ All directons containing a member of the protein family
Directon clustering based on their RMS content
Network projection of common RMS from each directon cluster
Path selection: max colours, high scores
109. Directons containing a BVMO
« 814 BVMO sequences
« 812 directons
« 468 organisms – only bacteria
111. Clustering of BVMO-containing directons according to
their RMS content
Cluster 1
• 251 directons
• 0 common RMS
Cluster 2
• 308 directons
• 32 common RMS
Cluster 3
• 125 directons
• 10 common RMS
Cluster 4
• 69 directons
• 86 common RMS
Cluster 5
• 59 directons
• 5 common RMS
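The slide does not define "common RMS" or the clustering distance. The sketch below assumes that common RMS are those present in every directon of a cluster, and uses the Jaccard distance as one plausible way to compare directons by RMS content. Both choices and all identifiers are illustrative.

```python
def common_rms(cluster):
    """RMS shared by every directon of a cluster (empty set for an empty cluster)."""
    rms_sets = [set(rms) for rms in cluster.values()]
    return set.intersection(*rms_sets) if rms_sets else set()


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Placeholder cluster: directon identifier -> set of associated RMS.
cluster = {
    "d1": {"RMS1", "RMS2", "RMS3"},
    "d2": {"RMS2", "RMS3"},
    "d3": {"RMS2", "RMS4"},
}

shared = common_rms(cluster)
distance = jaccard_distance(cluster["d1"], cluster["d2"])
print(shared, distance)
```

Under this reading, a cluster with 0 common RMS groups directons that share no single RMS across all members.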
113. Cluster projection on the RMS network
& selection of maximal connected components
Cluster 2
« Pink nodes: RMS catalysed by BVMOs
« Grey nodes: RMS known to be in the metabolic context of BVMOs
« Blue nodes: RMS never seen in the metabolic context of BVMOs
« Green edges: links between RMS from known metabolic pathways in which BVMOs are involved
119. What has been done?
« Orphan enzyme survey
« Update of statistics
« Protein families and local orphan enzymes
« A new representation of metabolism using a network of chemical
transformations
« Definition and detection of conserved modules
« Rules for module type prediction
« Network exploration using genomic and metabolic contexts
« Definition of a strategy to explore the functional diversity of enzyme families
« Application to the Baeyer-Villiger Monooxygenases
120. What’s next?
Method improvements
« Detect branched and cyclic conserved modules
« Determine specific domains/profiles for RMS: using PRIAM/MKDOM-like
methods
« Improve gene cluster projection on the RMS network
Applications
« RMS to classify enzyme activities
« Assign sequences for orphan enzymes and reactions for orphan metabolites
« Application on other protein families
« A way to study biological systems from a chemical point of view
121. Acknowledgements
David Vallenet
Claudine Médigue
Systems biology team:
Karine Bastard
Mark Stam
Jonathan Mercier
Guillaume Reboul
And all LABGeM
Jean-Loup Faulon
Olivier Lespinet
124. Metabolic Networks
Metabolites hypergraph
o Nodes = metabolites
o Hyperedge linking all metabolites involved in a reaction
[Figure: metabolite hypergraph of pyruvate, formate, acetyl-CoA, acetaldehyde, ethanol, coenzyme A, NADH and NAD+]