I present cutting-edge concepts and tools drawn from algorithmic information theory (AIT) for new-generation genetic sequencing, network biology and bioinformatics in general. AIT is the mathematical theory of information that formally characterises simplicity, randomness and structure, and the differences between them. Measures from AIT can equip computational medicine and systems biology to deal with big data and sophisticated analytics, and offer a powerful new framework for understanding.
Algorithmic Information Theory and Computational Biology
1. Algorithmic Information Theory and Computational Biology
Hector Zenil
Unit of Computational Medicine
Karolinska Institutet
Sweden
Hector Zenil AIT Tools for Biology and Medicine
3. Complexity is hard to quantify in biology
Mapping quantitative stimuli to qualitative behaviour
4. Information Theory in Biology
Sequence alignment
Pattern recognition
Sequence logos
Binding site detection
Motif detection
Consensus sequences
Biological significance
[based on Claude Shannon’s Information Theory, 1948]
5. Algorithmic Information Theory
Which sequence looks more random?
(a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
(b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC
(c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC
Classical probability theory vs. Kolmogorov Complexity
Definition
$K_U(s) = \min\{|p| : U(p) = s\}$ (1)
where p is a program and U is a universal Turing machine.
Compressibility
A sequence s is c-compressible if some program p with U(p) = s satisfies
$|p| + c \le |s|$. A sequence is random if $K(s) \approx |s|$.
[Kolmogorov (1965); Chaitin (1966)]
6. Examples
Example 1
Sequences like (a) have low algorithmic complexity because they
allow a short description, for example “k times A”. No matter
how long (a) grows, the length of the description grows only as
about $\log_2 k$ (the cost of writing down k).
Example 2
The sequence (b) is algorithmically random because it doesn’t seem to
allow a description (much) shorter than (b)
itself.
For sequence (a), a proof of non-randomness amounts to
exhibiting a short program. Compressibility is therefore a
sufficient test of non-randomness.
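As a rough illustration of compressibility as a test of non-randomness, the sketch below uses Python's zlib (Deflate) in place of the search for a shortest program. The compressed length is only an upper bound on K, up to an additive constant. The longer test strings, the fixed seed and the helper name `compressed_bits` are assumptions made for this example. The pseudo-random string only plays the role of (b): anything produced by a short generator is not algorithmically random in the strict sense.

```python
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Length in bits of the Deflate-compressed sequence (an upper bound on K, up to a constant)."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


n = 3000                  # longer than the slide's strings so compressor overhead does not dominate
rng = random.Random(0)    # fixed seed so the run is repeatable

seq_a = "A" * n                                         # same pattern as (a)
seq_b = "".join(rng.choice("ACGT") for _ in range(n))   # pseudo-random stand-in for (b)
seq_c = "GC" * (n // 2)                                 # same pattern as (c)

for name, seq in (("a", seq_a), ("b", seq_b), ("c", seq_c)):
    print(name, len(seq), compressed_bits(seq))
```

On strings as short as (a) to (c) the compressor's fixed overhead would swamp the difference, which is why the example lengthens them.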
7. Example of an evaluation of K
The sequence (c) GCGCGC...GC is not algorithmically random (or has
low K complexity) because it can be produced by the following
program (take G=0 and C=1):
Program A(i):
1: n:= 0
2: Print n mod 2
3: n:= n+1
4: If n=i Goto 6
5: Goto 2
6: End
The length of A (in bits) is an upper bound on K(GCGCGC...GC).
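Program A can be transcribed into a runnable form. This is a minimal Python sketch under the slide's encoding (0 for G, 1 for C); the function name `program_a` is an assumption, and unlike the goto version it returns an empty string for i = 0 instead of looping.

```python
def program_a(i: int) -> str:
    """Emit i symbols of the alternating sequence: n mod 2 = 0 gives G, 1 gives C."""
    symbols = "GC"
    return "".join(symbols[n % 2] for n in range(i))


print(program_a(28))
```

The source text of the function stays the same size however large i is; only the argument, which costs about log2(i) bits to write down, grows.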
8. The ultimate measure of pattern detection and optimal prediction
Kolmogorov and Chaitin, Schnorr, and Martin-Löf
independently provided three different approaches to randomness
(compression, predictability and typicality).
They proved (for infinite sequences):
incompressibility ⇐⇒ unpredictability ⇐⇒ typicality
When this happens in mathematics, a concept (here, randomness) has been
captured objectively.
This is why prediction in biology is hard. AIT tells us that no effective
statistical test will succeed in recognising all patterns and no
computable technique can fully predict all outcomes. The problem
is deeply connected to computability and algorithmic information
theory.
[Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]
9. Information distances and similarity metrics
Measures waiting to be introduced in bioinformatics
Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$
Universal Similarity Metric:
$USM(x, y) = \max\{K(x|y), K(y|x)\} / \max\{K(x), K(y)\}$
Normalised Information Distance and its compression-based counterpart, the Normalised Compression Distance (NCD):
$NCD(x, y) = (K(xy) - \min\{K(x), K(y)\}) / \max\{K(x), K(y)\}$
Normalised Compression Measure (NCM): $NC(s) = K(s)/|s|$
(asymptotic behaviour)
Bennett’s Logical Depth:
$LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$
(for an example application, see Zenil, Complexity 2011)
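In practice K is not computable, so the NCD on this slide is usually evaluated with a real compressor in its place. The sketch below does that with zlib. The helper names `c` and `ncd`, the synthetic sequences and the mutation scheme are assumptions for illustration only, not data from the talk.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as a computable stand-in for K."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """(C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, with C in place of K."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def synthetic_dna(seed: int, length: int = 2000) -> str:
    """Pseudo-random ACGT string (synthetic, for demonstration only)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
base = synthetic_dna(1)
mutated = list(base)
for pos in range(0, len(mutated), 200):   # substitute every 200th symbol
    mutated[pos] = swap[mutated[pos]]

x = base.encode("ascii")
y = "".join(mutated).encode("ascii")      # close relative of x
z = synthetic_dna(2).encode("ascii")      # unrelated sequence

print("NCD(x, x) =", round(ncd(x, x), 3))
print("NCD(x, y) =", round(ncd(x, y), 3))
print("NCD(x, z) =", round(ncd(x, z), 3))
```

Values near 0 indicate that one sequence is almost fully described by the other; values near 1 indicate that they share little compressible structure. Real compressors make the measure approximate, so NCD(x, x) is small but typically not exactly 0.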
10. Non-systematic but successful attempts in biology
GenCompress is a compression algorithm for DNA
sequences, with distance $d(x, y) = 1 - (K(x) - K(x|y))/K(xy)$
NCD applied to genetic similarity.
AIT looks at the genome as information, not as data (letters).
Counting: traditional Shannon-entropy style sequencing.
Interpreting: AIT. The full power of the theory hasn’t yet been
unleashed.
11. To be or not to be...
Borel’s “Infinite Monkey” theorem
[Figure: labels in the image read “Input”, 1, 0, 1024, π, “Syntax error”, √2, ∞, CH3, ∞ and “To be or not to be, that is the question.”]
13. Producing π
This C-language code produces the first 1000 digits of π (Gjerrit
Meinsma):
long k = 4e3, p, a[337], q, t = 1e3;
main(j) { for (; a[j = q = 0] += 2, --k; )
for (p = 1 + 2 * k; j < 337; q = a[j] * k + q % p * t, a[j++] = q / p)
k != j > 2 ? : printf("%.3d", a[j - 2] % t + q / p / t); }
Producing non-random sequences:
If an object has low Kolmogorov complexity, then it has a short description
and a greater probability of being produced by a random program. The less
random a string, the more likely it is to be produced by a short program.
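The link between short descriptions and a high probability of being produced is usually made precise through algorithmic probability and the Coding Theorem (Solomonoff; Levin, both in the reference list). As a sketch in standard notation, assuming U is a prefix-free universal Turing machine (an assumption not stated on the slide):

```latex
m(s) = \sum_{p \,:\, U(p) = s} 2^{-|p|},
\qquad
K(s) = -\log_2 m(s) + O(1)
```

Here m(s) is the probability that a program drawn bit by bit at random outputs s, so a string with small K(s) has a correspondingly large m(s).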
14. Biological Big Data Analysis
The information bottleneck:
Small data matters: local measurements of information content
are a good indication of the global information content of an
object. Evidence: BDM image classification. Compression works at
large scales, looking for long regularities, while the Block Decomposition method (BDM) is very local.
Yet both yield astonishingly similar results for these object sizes.
15. Complementary methods for different sequence lengths
The methods to approximate K coexist and complement each
other for different sequence lengths.

Method                        short strings (< 100 bits)   long strings (> 100 bits)   scalability
Lossless compression method   ×                            √                           √
Coding Theorem method         √                            ×                           ×
Block Decomposition method    √                            √                           √

[Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov
Complexity and Validation of the Coding Theorem Method by
Compressibility (2012)]
16. Coding Theorem method and lossless compression
The transition between one method and the other. What is complex for
the Coding Theorem method is less compressible.
[Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of
Numerical Evaluations of Algorithmic Information Measures (2012)]
17. Online Algorithmic Complexity Calculator
Provides: Shannon’s entropy, lossless compression (Deflate) values,
Kolmogorov complexity approximations and relative frequency order
(algorithmic probability).
A Mathematica API and an R module.
Datasets available online at the Dataverse Network.
Basic data analysis tool for short sequence comparison.
[http://www.complexitycalculator.com]
18. Online Algorithmic Complexity Calculator 2
[http://www.complexitycalculator.com]
19. Simulation of natural systems w/complex symbolic systems
An elementary cellular automaton (ECA) is defined by a local
function $f : \{0, 1\}^3 \to \{0, 1\}$.
f maps the state of a cell and its two immediate neighbours (range
= 1) to a new cell state: $f_t : (r_{-1}, r_0, r_{+1}) \to r_0$. Cells are updated
synchronously according to f over all cells in a row.
[Wolfram, (1994)]
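The local rule and the synchronous update can be sketched in a few lines of Python. The sketch uses Wolfram's standard rule numbering, in which bit (4·left + 2·centre + right) of the rule number gives the new cell state. The periodic boundary, the single-cell initial condition and the function names are assumptions of this example.

```python
def eca_step(cells: list[int], rule: int) -> list[int]:
    """One synchronous update of an ECA row (periodic boundary assumed)."""
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def eca_run(rule: int, width: int, steps: int) -> list[list[int]]:
    """Evolve from a single live cell in the middle of the row; returns steps + 1 rows."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = eca_step(row, rule)
        rows.append(row)
    return rows


for row in eca_run(110, 31, 15):
    print("".join("#" if cell else "." for cell in row))
```

Changing the first argument of `eca_run` to any value from 0 to 255 selects a different elementary rule.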
20. Behavioural classes of CA
Wolfram’s classes of behaviour:
Class I: Systems evolve into a stable state.
Class II: Systems evolve into a periodic (e.g. fractal) state.
Class III: Systems evolve into random-looking states.
Class IV: Systems evolve into localised complex structures.
e.g. Rule 110 or the Game of Life.
[Wolfram, (1994)]
21. Block Decomposition method (BDM)
The Block Decomposition method uses the Coding Theorem
method. Formally, we will say that an object c has complexity:
$K_{\log m,2D_{d \times d}}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} \left[ (n_u - 1) \log_2(K_{m,2D}(r_u)) + K_{m,2D}(r_u) \right]$ (2)
where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained
from decomposing the object into blocks of d × d with boundary
conditions. In each $(r_u, n_u)$ pair, $r_u$ is one such square and $n_u$
its multiplicity.
[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
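The decomposition step can be sketched for the one-dimensional case (strings cut into length-d blocks instead of d × d squares). This is a simplified illustration of the formula as written on the slide, not the authors' implementation. The per-block complexity is passed in as a function because real Coding Theorem values are not reproduced here; `toy_block_k` is a hypothetical placeholder with no connection to actual CTM tables.

```python
from collections import Counter
from math import log2
from typing import Callable


def bdm_1d(seq: str, d: int, block_k: Callable[[str], float]) -> float:
    """Sum (n_u - 1) * log2(K(r_u)) + K(r_u) over distinct length-d blocks r_u."""
    # Non-overlapping blocks; a trailing remainder shorter than d is dropped.
    blocks = [seq[i:i + d] for i in range(0, len(seq) - d + 1, d)]
    counts = Counter(blocks)   # maps each block r_u to its multiplicity n_u
    return sum((n - 1) * log2(block_k(r)) + block_k(r) for r, n in counts.items())


def toy_block_k(block: str) -> float:
    """Placeholder only, NOT a Coding Theorem value: 2 plus the number of symbol changes."""
    return 2.0 + sum(a != b for a, b in zip(block, block[1:]))


print(bdm_1d("00000000", 4, toy_block_k))
print(bdm_1d("01010101", 4, toy_block_k))
print(bdm_1d("00010110", 4, toy_block_k))
```

Repeated blocks contribute only a logarithmic term each, so an object made of many copies of one block scores far below an object made of as many distinct blocks. To use the method for real, `block_k` would look up precomputed Coding Theorem values for each block.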
22. Classification of ECA by BDM versus lossless compression
Compressors have limitations (small sequences, time
complexity)
Applications to machine learning
Problems of classification and clustering
BDM is computationally efficient (runs in $O(n^d)$ time, hence
linear (d = 1) time for sequences)
[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
23. Asymptotic behaviour of complex systems
[Zenil, Complex Systems (2010)]
24. Rule space of 3-symbol 1D CA
[Zenil, Complex Systems (2011)]
25. Phase transition detection
Definition
$c_n^t = \dfrac{|C(M_t(i_1)) - C(M_t(i_2))| + \dots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$
[Zenil, Complex Systems (2011)]
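The definition can be evaluated numerically once C and M_t are fixed. The slide does not define them, so the sketch below assumes that C is the compressed length (zlib) of an output and that M_t(i) is the output of system M after t steps on input i. The function name and the two toy systems are hypothetical and stand in for real cellular automata or molecular models.

```python
import zlib
from typing import Callable, Sequence


def transition_coefficient(run: Callable[[bytes, int], bytes],
                           inputs: Sequence[bytes], t: int) -> float:
    """Sum of |C(M_t(i_j)) - C(M_t(i_{j+1}))| over consecutive inputs, divided by t(n - 1)."""
    if len(inputs) < 2 or t <= 0:
        raise ValueError("need at least two inputs and a positive runtime t")
    sizes = [len(zlib.compress(run(i, t), 9)) for i in inputs]
    total = sum(abs(a - b) for a, b in zip(sizes, sizes[1:]))
    return total / (t * (len(sizes) - 1))


def constant_system(i: bytes, t: int) -> bytes:
    """Toy system that ignores its input entirely."""
    return b"0" * t


def echo_system(i: bytes, t: int) -> bytes:
    """Toy system whose output depends on its input."""
    return i * t


inputs = [b"A", b"AC", b"ACGT"]
print(transition_coefficient(constant_system, inputs, 50))
print(transition_coefficient(echo_system, inputs, 50))
```

A system that ignores its input yields a coefficient of exactly zero, since every output compresses to the same length; any nonzero value reflects sensitivity of the compressed output size to the choice of input.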
26. A measure of programmability
$C_n^t(M) = \dfrac{\partial f(c_n^t)}{\partial t}$ (3)
[Zenil, Complex Systems (2011)]
27. Examples
Figure: ECA Rule 4 has a low $C_n^t$ for randomly chosen n and t (it doesn’t
react much to external stimuli). $\lim_{n,t \to \infty} C_n^t(R4) = 0$
[H. Zenil, Philosophy & Technology, (2013)]
28. Examples (cont.)
Figure: ECA R110 has a large coefficient $C_n^t$ for sensible choices of t
and n, which is compatible with the fact that it has been proven to be
capable of universal computation (for particular semi-periodic initial
configurations). $\lim_{n,t \to \infty} C_n^t(R110) = 1$
29. Classification of graphs
[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex
Network Topological Characterization by Algorithmic Randomness]
30. Characterisation of complex networks
Complex networks built with preferential attachment algorithms preserve
properties (connectedness, robustness) that are invariant under network size,
and do so at a low cost (unlike random networks, which are costly in the number of links).
[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex
Network Topological Characterization by Algorithmic Randomness]
31. Biological case study: Programmable Porphyrin molecules
Much is known about the dynamics of these molecules, so one can perform
Monte Carlo simulations based on these mathematical models and
establish a correspondence between Wang tiles and simple molecules.
[joint work with ICOS, U. of Nottingham] [G. Terrazas, H. Zenil and N.
Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based
Molecular Computing]
32. Quantitative dynamics of living systems
Aggregations with similar Kolmogorov complexity cluster in similar
configurations.
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable
Self-Assembly in Non DNA-based Molecular Computing]
33. Mapping output behaviour to external stimuli: Parameter discovery
Parameter space P → Target space T
Target space T: set a configuration from P that triggers the
desired behaviour in T.
To investigate:
Reduction of the parameter space
Characterisation of the target space
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable
Self-Assembly in Non DNA-based Molecular Computing]
34. Robustness and pervasiveness
Concentration changes preserving behaviour:
Output parameters that have the highest impact can be tested in
silico before experiments in materio.
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable
Self-Assembly in Non DNA-based Molecular Computing]
35. Orthogonality
Specific concentrations producing certain behaviour using the
mathematical model to be tested against empirical data.
36. Highlights and goals
Ultimate goal (in a few years’ time): An information-theoretical
toolbox for systems and synthetic biology
[Complex3D Proteins Database (graph representation) &
Z Chen et al. Lung cancer pathways in response to treatments.]
Pushing boundaries.
A cutting-edge mathematical approach
Tools from Complexity theory.
37. New Generation Sequence data analysis
Heavily driven by:
Explosion of experimental data
Difficulties in data interpretation
New paradigms for knowledge extraction
Data mining the behaviour of natural systems
Towards an AIT tool-kit for systems biology, a functional
library of programmable biological modules with an SBML
interface.
38. References
J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity
for short sequences, in Cristian Calude (ed.), Complexity and
Randomness: From Leibniz to Chaitin, World Scientific, 2007.
J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity
of Short Strings, Applied Mathematics and Computation, 2011.
H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit,
Two-Dimensional Kolmogorov Complexity and Validation of the
Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC].
F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit,
Correspondence and Independence of Numerical Evaluations of
Algorithmic Information Measures, Numerical Algorithms (in 2nd
revision).
F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit,
Calculating Kolmogorov Complexity from the Frequency Output
Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT].
H. Zenil, Compression-based Investigation of the Dynamical
Properties of Cellular Automata and Other Systems, Complex
Systems, Vol. 19, No. 1, pp. 1-28, 2010.
39. References (cont.)
H. Zenil and J.A.R. Marshall, Some Aspects of Computation
Essential to Evolution and Life, Ubiquity, 2012.
H. Zenil, What is Nature-like Computation? A Behavioural Approach
and a Notion of Programmability, Philosophy & Technology (special
issue on History and Philosophy of Computing), 2013.
H. Zenil, On the Dynamic Qualitative Behavior of Universal
Computation, Complex Systems, Vol. 20, No. 3, pp. 265-278, 2012.
H. Zenil, A Turing Test-Inspired Approach to Natural Computation,
in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels,
10-12 October 2012), Historical and Contemporary Research in
Logic, Computing Machinery and Artificial Intelligence, Proceedings
published by the Royal Flemish Academy of Belgium for Science and
Arts, 2013.
G.J. Chaitin, A Theory of Program Size Formally Identical to
Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.
A. N. Kolmogorov, Three approaches to the quantitative definition
of information, Problems of Information Transmission, 1(1):1–7,
1965.
40. References (cont.)
L. Levin, Laws of information conservation (non-growth) and aspects
of the foundation of probability theory, Problems of Information
Transmission, 10(3):206–210, 1974.
M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its
Applications, Springer, 3rd ed., 2008.
R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and
2, Information and Control, 7:1–22 and 224–254, 1964.
S. Wolfram, A New Kind of Science, Wolfram Media, 2002.