2. In modern systems biology we have three main data domains.
1) Experimental data from genomics-type experiments, such as the microarrays
in the example (bottom right). Note that this type requires intensive
preprocessing (quality control, filtering, clustering, annotation), but that is
not enough to really understand the data. You see patterns in the data, but
you do not really know what they mean. Large-scale genomics data has
been available for the past 15 years or so, and although the technologies
used are now being replaced, that doesn’t really change this field.
2) Existing knowledge (see next slide), which can be used to better understand
the two other types of data.
3) Genetics (sequence-based) data, which rapidly becomes more important as
sequencing costs decrease. The addition of the leftmost corner to the
triangle is relatively new, and I will only discuss it in the last few slides.
3. Huge amounts of existing knowledge can be found hidden in the literature or in
the heads of people. The hard task is to collect it from there and to make it
available for analysis. (People on the slide are Ben van Ommen, NuGO
director; Hannelore Daniel, nutrigenomics chair from Munich; and a Thai
princess and institute director.)
Note that a lot of information is also available in curated databases, but that
was left out of the talk for brevity. You could say that the other knowledge
needs to be structured to fill these databases, which can then be used
for analysis.
4. A historical example of a microarray result. Again note the intensive
preprocessing done (clustering to the left, annotation to the right).
Nevertheless the data is very hard to understand, especially if you take into
account that there are about 20,000 genes on a typical array. That is about as
many as there are words in a dictionary.
5. But if you are willing to make the effort, you can actually see meaningful
groups of genes within specific coexpression clusters, like the fatty acid
degradation genes shown here. It is hard to find (and easy to miss) all
relevant pathways, though.
6. Probably not an iPad: those microarrays were at least 10 years old.
7. The problem is not only the long list of resulting genes, but also the
oversampling that occurs. In genomics experiments you typically get large
numbers of false positives at useful levels of significance. Of course false
discovery rate corrections exist, but they will usually also lose information.
Pathway or function group (ontology) analysis helps, since it is not likely that
a larger set of genes occurs as false positives within a smaller functional group.
On the other hand the meaning of pathway statistics should not be
overestimated. There are many aspects in real biology and in the way the
groups are built that influence the statistical outcome.
For instance, suppose you have two metabolic reactions where one is catalyzed
by a single enzyme and the other by four. Are all enzymes of the same
importance? Or are the four together as important as the single one? Or are
three of the four not important in reality while the other one is? All these
situations can occur and the statistics just doesn’t know.
Also, suppose you add 10 non-regulated genes to a pathway. That will change
the significance of your result, but it doesn’t change the biology behind it.
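The last point can be made concrete with a small over-representation
calculation. This is only a sketch: it assumes a one-sided hypergeometric test,
which is one common choice for pathway statistics and not necessarily the test
used in the tools shown in this talk, and all gene counts are invented for
illustration.

```python
from math import comb


def enrichment_p_value(regulated_in_pathway, pathway_size,
                       regulated_total, measured_total):
    """One-sided hypergeometric p-value.

    Probability of finding at least `regulated_in_pathway` regulated genes
    in a pathway of `pathway_size` genes, when `regulated_total` out of
    `measured_total` measured genes are regulated overall.
    """
    upper = min(pathway_size, regulated_total)
    favourable = sum(
        comb(pathway_size, i)
        * comb(measured_total - pathway_size, regulated_total - i)
        for i in range(regulated_in_pathway, upper + 1)
    )
    return favourable / comb(measured_total, regulated_total)


# Invented numbers: 20,000 measured genes, 1,000 of them called regulated.
p_before = enrichment_p_value(8, 40, 1000, 20000)
# The same 8 regulated genes, after adding 10 non-regulated genes to the pathway.
p_after = enrichment_p_value(8, 50, 1000, 20000)

print(f"pathway of 40 genes: p = {p_before:.4g}")
print(f"pathway of 50 genes: p = {p_after:.4g}")
```

The regulated genes and the measurements are identical in both calls. Only the
pathway definition changed, and the p-value gets larger with it.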
8. Example of a pathway that can be used for the purposes described.
9. A closer look at the same pathway.
Note that this uses MIM notation from the MIM PathVisio plugin.
In general the connections between different genes and metabolites describe
the network underlying the pathway. Note that this is already quite complex,
since there are different ways to show what interacts with what.
Graphical notations that capture this, like MIM and SBGN, definitely help. The
result can be captured as descriptive relationships in BioPAX.
11. PathVisio can do a combined visualization of different omics results. Here
proteomics and transcriptomics are both shown on the same gene product boxes.
It can also show effects from metabolomics.
13. This talk is not really about WikiPathways. Check out the information in the
paper or the information on the wiki itself (www.wikipathways.org). Developer
information is mainly on the www.pathvisio.org website.
14. You obtain microarray data (e.g. Affymetrix).
You can visualize microarray data.
Each color corresponds to a measured data point.
For example, green is up, red is down, grey is constant.
And now? How do you make sure the Affymetrix probeset IDs related to the
measurements can be mapped to the gene products in the pathway?
15. On WikiPathways (or in PathVisio) you can attach identifiers to each gene. A
click opens up the corresponding page in (in this specific case) the worm
database.
You can download the corresponding transcript sequence in two clicks.
This makes it really easy to design primers, for instance.
16. As soon as you have entered one (and only one) identifier to describe which
gene product or metabolite you really mean, this information is linked to many
other identifiers from other databases. Links to the respective pages are
shown in the so-called “backpage” (actually one of the pages under the tabs at
the right-hand side of the pathway).
17. BridgeDB (see www.bridgedb.org and the paper mentioned on the slide)
provides the mechanism needed for that identifier mapping.
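The idea behind such identifier mapping can be sketched with a small
cross-reference table. This is an illustration of the concept only, not the
BridgeDB API: the function names, the identifiers and the measurement values
below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cross-references; real ones would come from a mapping database.
# Each row links an identifier in one system to an identifier in another.
XREFS = [
    (("Affymetrix", "probe_1_at"), ("Ensembl", "gene_A")),
    (("Affymetrix", "probe_2_at"), ("Ensembl", "gene_A")),
    (("Affymetrix", "probe_3_at"), ("Ensembl", "gene_B")),
    (("Entrez Gene", "1001"), ("Ensembl", "gene_A")),
]


def build_index(xrefs):
    """Index the cross-references in both directions."""
    index = defaultdict(set)
    for left, right in xrefs:
        index[left].add(right)
        index[right].add(left)
    return index


def map_id(index, source, target_system):
    """Identifiers in `target_system` reachable from `source` in one or two
    hops, that is, directly or through one shared identifier."""
    direct = index.get(source, set())
    reachable = set(direct)
    for neighbour in direct:
        reachable |= index.get(neighbour, set())
    reachable.discard(source)
    return sorted(ident for system, ident in reachable if system == target_system)


index = build_index(XREFS)

# The pathway box carries one identifier; the data file uses probeset IDs.
pathway_gene = ("Entrez Gene", "1001")
probes = map_id(index, pathway_gene, "Affymetrix")

# Hypothetical log ratios per probeset, averaged for display on the gene box.
measurements = {"probe_1_at": 1.2, "probe_2_at": 0.8, "probe_3_at": -0.5}
mean_ratio = sum(measurements[p] for p in probes) / len(probes)
print(probes, mean_ratio)
```

The pathway only needs one identifier per gene product. Everything else, here
the two probesets that measure the same gene, follows from the mapping table.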
18. Pathways can be downloaded to be used in different tools.
There is also a WikiPathways web service. See:
http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice
Thomas Kelder, Alexander R Pico, Kristina Hanspers, Chris Evelo & Bruce R
Conklin. Mining biological pathways using WikiPathways web services.
PLoS One (2009) 4(7): e6447. http://dx.doi.org/10.1371/journal.pone.0006447
We also have semantic output in RDF, which can be queried through a
SPARQL endpoint described at semantics.bigcat.unimaas.nl.
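As a rough sketch of what querying such an endpoint could look like, the
snippet below builds a SPARQL request URL. The endpoint address is a
placeholder and the vocabulary terms (wp:Pathway, dc:title) are assumptions
made for illustration; check the RDF documentation for the terms and the
address that are actually published. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Assumed vocabulary, for illustration only: list pathways with their titles.
QUERY = """
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?pathway ?title
WHERE {
  ?pathway a wp:Pathway ;
           dc:title ?title .
}
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL.

    The SPARQL protocol passes the query text in the `query` parameter.
    """
    return endpoint + "?" + urlencode({"query": query.strip()})


# Placeholder endpoint; substitute the real SPARQL endpoint address.
url = build_query_url("http://example.org/sparql", QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; the result format
depends on what the endpoint supports.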
20. And a solution that isn’t really a solution. There are just too many things you
could add.
21. The PathVisio Regulatory Interaction plugin (author Stefan van Helden) has a
new approach where information is not really added to a pathway, but shown
in a separate page upon request.
22. The GPML plugin for Cytoscape can be found here:
http://chianti.ucsd.edu/cyto_web/plugins/displayplugininfo.php?name=GPML-Plugin
It can be used in Cytoscape to read and write the GPML pathway files used by
WikiPathways and PathVisio.
23. Example showing some more advanced usage of the GPML plugin.
Data from the NuGO proof of principle study with dietary challenged mice.
Three tissues were sampled. In the two tissues other than liver relatively many
genes showed expression changes on Affymetrix arrays, but not many
pathways were found.
For liver the number of genes affected was lower, but the number of pathways
found to be affected was higher. How come?
The pathway-based network analysis showed that there was a set of more strongly
affected pathways (more regulated genes, large blue circles) that share
regulated genes (the red diamonds). When looking at the highlighted group of
pathways it became clear that these all belong to the same superset of
biologically relevant pathways (fatty acid metabolism and inflammation).
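The core of such a pathway-based network can be sketched in a few lines:
pathways become nodes, and two pathways are connected when they share regulated
genes. The pathway contents and gene names below are hypothetical, and this is
a simplification of the analysis shown on the slide, not a reproduction of it.

```python
from itertools import combinations

# Hypothetical pathway contents and regulated genes, for illustration only.
PATHWAYS = {
    "Fatty acid beta oxidation": {"g1", "g2", "g3", "g4"},
    "Fatty acid biosynthesis": {"g3", "g4", "g5"},
    "Inflammatory response": {"g4", "g6", "g7"},
    "Unrelated pathway": {"g8", "g9"},
}
REGULATED = {"g1", "g3", "g4", "g6", "g8"}


def regulated_per_pathway(pathways, regulated):
    """Regulated genes in each pathway (the size of the pathway circle)."""
    return {name: genes & regulated for name, genes in pathways.items()}


def shared_gene_edges(hits):
    """Connect every pair of pathways that shares at least one regulated gene."""
    edges = {}
    for first, second in combinations(sorted(hits), 2):
        shared = hits[first] & hits[second]
        if shared:
            edges[(first, second)] = shared
    return edges


hits = regulated_per_pathway(PATHWAYS, REGULATED)
edges = shared_gene_edges(hits)

for (first, second), shared in edges.items():
    print(f"{first} <-> {second}: {sorted(shared)}")
```

Pathways that end up connected form the kind of group highlighted on the
slide; a pathway with regulated genes of its own but no shared ones stays
isolated.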
24. A paper that we published with a more extensive pathway relationship
approach. It takes into account relations between pathways through affected
genes not necessarily showing up in either pathway.
26. The approach takes into account all data used (pathways, interactions and
experimentally determined weight). Check out the original paper for details.
28. And you can do the same for relatively large sets of pathways “driving” a
process like apoptosis.
29. CyTargetLinker is a Cytoscape plugin that can be used to extend a network
with information about whatever targets the entities in that network. This
information comes from databases that are themselves provided as networks. It
already provides a number of target relation databases, as mentioned on the
slide.
30. Example of a target network. (You will normally see this, it contains the
information that is used to extend your source network).
32. You can drive it from a gene set that isn’t even a network at the start.
When miRNAs are found to target more than one gene in the group, the
network is created on the fly.
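The gene-set case can be sketched as a simple filter over regulator-target
pairs. This is not CyTargetLinker code; the function name, the `min_targets`
parameter and all identifiers below are hypothetical and only illustrate the
idea.

```python
from collections import defaultdict

# Hypothetical regulator -> target pairs standing in for a target relation
# database.
TARGET_RELATIONS = [
    ("miR-a", "gene1"), ("miR-a", "gene2"), ("miR-a", "gene9"),
    ("miR-b", "gene2"),
    ("miR-c", "gene1"), ("miR-c", "gene3"),
]
GENE_SET = {"gene1", "gene2", "gene3"}


def extend_gene_set(gene_set, relations, min_targets=2):
    """Edges from regulators that target at least `min_targets` genes in the set."""
    targets_in_set = defaultdict(set)
    for regulator, target in relations:
        if target in gene_set:
            targets_in_set[regulator].add(target)
    return sorted(
        (regulator, target)
        for regulator, targets in targets_in_set.items()
        if len(targets) >= min_targets
        for target in targets
    )


network_edges = extend_gene_set(GENE_SET, TARGET_RELATIONS)
for regulator, target in network_edges:
    print(f"{regulator} -> {target}")
```

The unconnected gene list turns into a network because regulators with several
targets in the set link those genes together. Regulators with a single target
in the set are left out.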
33. Or you can bootstrap the approach from an existing network, which can be a
pathway-based one imported with the GPML plugin, as shown here.
34. An overview of the Open Phacts project, which pulls lots of information into
a semantic web triple store (including information from WikiPathways RDF) and
then provides that for use in other tools. In WikiPathways we use that to
suggest possible pathway extensions to curators.
35. This shows the PathVisio Loom plugin in action. A gene or metabolite in a
pathway under development (left side) is right-clicked and Loom is
activated to pull related genes or metabolites from another resource
(database, text mining result or Open Phacts API). The suggested interactions
are shown in the window on the right and the entities are added to the pathway
(two already shown on the left).
36. The talk so far focused on the genomics-knowledge relationship shown on the
right. So what about genetics?
38. This is the image that came to us from Jim Kaput (at that time NTCR, now
Nestle): “Look, people grouped those SNPs in gene groups, made sense of the
directions and showed them in a pathway. Can you do something like that?”
41. So it would really look like a bunch of jellies if we show these all on the genes
in a pathway, and you would not know what they mean.
42. There are loads of bioinformatics tools out there (like SIFT and PolyPhen)
that allow us to estimate functional effects of SNPs on the coded protein
(activity or protein-protein interactions), on binding sites for transcription
factors in the DNA, or on binding sites for miRNA in RNA. Doing that, we can
decide which edges SNPs would affect (and how much, in what direction). As soon
as you do that you can use the result to strengthen SNP statistics (i.e. create
groups that can be used for supervised types of group-based GWAS analysis) or
to build predictive models to estimate what specific (personal or tissue/tumor
based) sets of variations would do. That creates a need to use the pathways to
link experimental (genomics) data not only to the genetic variations occurring
in there, but also to modeling results.
43. Showing the concept: integrating flux predictions from modelling (of course
that could also be real fluxomics data).
44. And showing “real” results from the new flux data representation plugin.
The plugin is functional, but we still need better mapping databases for
reaction identifiers.
45. Many people are involved in this work. (Really many if you count associated
groups like the plugin developers, pathway curators etc.)
Most important:
The SF group (Kristina Hanspers, Bruce Conklin and Alex Pico), collaborating on
many things but primarily WikiPathways.
Martijn van Iersel, top left (PathVisio, BridgeDB).
Thomas Kelder, top middle (WikiPathways including webservices, pathway
integration networks for nutrigenomics).
Martina Kutmon, top right (CyTargetLinker, PathVisio further development).
Andra Waagmeester, second row, right (WikiPathways RDF).
Anwesha Dutta, bottom, 2nd from the left (flux visualization).
Stefan van Helden, not on the picture (RI PathVisio plugin).