The document presents a general DNA base-nucleotide substitution model and discusses three special cases: the three-substitution-type (3ST) model, two-substitution-type (2ST) model, and the Jukes-Cantor model. The 3ST model considers transitions and two types of transversions, while the 2ST model and Jukes-Cantor model further simplify the substitution rates. Differential equations are derived to model the change in base probabilities over time under each model.
1. DNA nucleotide substitution models 1
Running head: DNA NUCLEOTIDE SUBSTITUTION MODELS
On Some Measures of Genetic Distance Based on Rates of Nucleotide Substitution
Justine Leon A. Uro
Ph. D. Graduate Student
Department of Biostatistics, University of Michigan
Ann Arbor, MI
Abstract
We present a general DNA base-nucleotide substitution model and discuss three special
cases: three-substitution-type (3ST), two-substitution-type (2ST), and the Jukes-Cantor
models.
Introduction
The genetic distance between two populations is defined as a concept related to the
time since the two populations diverged from a common ancestral population (Weir,
1990). A number of methods have been proposed to estimate the genetic distance between
two populations and they are either based on the allele frequencies in the two populations,
the rate of amino acid substitution in protein sequence data from the two populations, or
the rates of base nucleotide substitution in DNA sequence data from the two populations.
Measures of genetic distance that utilize the allele frequencies are estimates based
on some geometric transformation of the allele frequencies (Cavalli-Sforza and Edwards,
1967; Cavalli-Sforza and Bodmer, 1971; Edwards, 1971; Nei, 1977, 1978; Li and Nei, 1977;
Smith, 1977). Some of these measures are purely geometric and do not involve any genetic
concept at all, e.g., the measure proposed by Cavalli-Sforza and Bodmer (Weir, 1990). On
the other hand, the ones proposed by Edwards (1971) and by Nei (1977) can be shown to
be related to the concept of the fixation index (Hartl and Clark, 1989).
A measure of genetic distance based on amino acid substitution from protein
sequence data was proposed by Jukes and Cantor in 1969. This method was partly due to
the abundance of amino acid sequence data available then. Some geneticists argue that
this measure should be preferred since proteins are the subject of mutations.
The development of DNA sequencing methods by Maxam and Gilbert and by Sanger et al. in 1977
brought about more methods for measuring genetic distance. The estimates from these
methods are based on the rates of nucleotide substitution in DNA sequence data. These
are the methods which we will consider in this paper. We will formulate the general
model, examine some special cases, give some numerical examples, and finally, examine
the validity of these models based on their assumptions.
The General Model
We now start by formulating the general model. Let S1 and S2 be two nucleotide
sequences with a common ancestral sequence. We consider a pair of homologous sites from
S1 and S2 and examine how much they have diverged from each other during their descent
from the ancestral sequence T years back (Figure 1).
The evolutionary base substitution model we are going to use is shown in Figure 2.
We have used RNA codes for the nucleotides so that the pyrimidines are uracil (U) and
cytosine (C), and the purines are adenine (A) and guanine (G). The types and rates of
base substitution are summarized in Table 1. A substitution of a purine by a purine or a
pyrimidine by a pyrimidine is called a transition (TS). If a pyrimidine is substituted by a
purine or vice-versa then the substitution is called a transversion (TV). We distinguish
between two types of transversion, TV1 and TV2, and each type is shown in Table 1. The
classification of the TV as to type becomes easier if we look at Figure 2. The TV which go
either vertically up or down are TV1 and those which go diagonally are TV2.
When comparing the homologous sites of S1 and S2 at any time t > 0, there are 16
possible nucleotide base pairings, 12 of which involve mismatched base pairs. If the
mismatch looks like a transition pair in Table 1, we call the mismatch a TS-type
mismatch. We have a TV1-type mismatch if the mismatch looks like a Type 1 transversion
listed in Table 1. The TV2-type mismatch is defined in the same manner. We summarize
these in Table 2. In Table 2, for t > 0,
\[ S(t) = \sum_{i=1}^{4} S_i(t) = \text{probability of no difference at a site} \tag{1} \]
\[ P(t) = \sum_{i=1}^{4} P_i(t) = \text{probability of a TS-type difference at a site} \tag{2} \]
\[ Q(t) = \sum_{i=1}^{4} Q_i(t) = \text{probability of a TV1-type difference at a site} \tag{3} \]
\[ R(t) = \sum_{i=1}^{4} R_i(t) = \text{probability of a TV2-type difference at a site} \tag{4} \]
Hence,
\[ Q(t) + R(t) = \sum_{i=1}^{4} \left( Q_i(t) + R_i(t) \right) = \text{probability of a TV-type difference at a site.} \tag{5} \]
We sometimes refer to the probabilities above as the match probabilities.
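The classification of site pairs into the four types can be sketched in code. The following is a minimal illustration, not part of the original analysis: the function names `classify_site` and `match_proportions` are hypothetical, and the sequences are assumed to be already aligned, of equal length, and written in the RNA alphabet U, C, A, G used here.

```python
# Hypothetical helpers: classify aligned site pairs using the pairing
# scheme of Table 2 (RNA alphabet U, C, A, G).
TS_PAIRS = {frozenset("UC"), frozenset("AG")}    # pyrimidine-pyrimidine, purine-purine
TV1_PAIRS = {frozenset("UA"), frozenset("CG")}   # Type 1 transversion pairs
TV2_PAIRS = {frozenset("UG"), frozenset("CA")}   # Type 2 transversion pairs


def classify_site(b1, b2):
    """Return 'S', 'TS', 'TV1' or 'TV2' for one pair of homologous bases."""
    if b1 == b2:
        return "S"
    pair = frozenset((b1, b2))
    if pair in TS_PAIRS:
        return "TS"
    if pair in TV1_PAIRS:
        return "TV1"
    if pair in TV2_PAIRS:
        return "TV2"
    raise ValueError(f"unexpected bases: {b1!r}, {b2!r}")


def match_proportions(seq1, seq2):
    """Observed proportions (S, P, Q, R) over all aligned sites."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    counts = {"S": 0, "TS": 0, "TV1": 0, "TV2": 0}
    for b1, b2 in zip(seq1, seq2):
        counts[classify_site(b1, b2)] += 1
    n = len(seq1)
    return tuple(counts[key] / n for key in ("S", "TS", "TV1", "TV2"))
```

The observed proportions returned by `match_proportions` play the role of S(t), P(t), Q(t), and R(t) when the distance estimates are computed from data.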
We also define the following probabilities which we sometimes refer to as the base
probabilities.
\[ U(t) = \text{relative frequency of uracil,} \tag{6} \]
\[ C(t) = \text{relative frequency of cytosine,} \tag{7} \]
\[ A(t) = \text{relative frequency of adenine,} \tag{8} \]
\[ G(t) = \text{relative frequency of guanine in a strand} \tag{9} \]
so that
\[ U(t) + C(t) + A(t) + G(t) = 1. \tag{10} \]
Note that the probabilities in (1) - (4) and (6) - (9) are all time-dependent. We also have
the following relations:
\[ S(t) = U^2(t) + C^2(t) + A^2(t) + G^2(t) \tag{11} \]
\[ P(t) = 2U(t)C(t) + 2A(t)G(t) \tag{12} \]
\[ Q(t) = 2U(t)A(t) + 2C(t)G(t) \tag{13} \]
\[ R(t) = 2U(t)G(t) + 2C(t)A(t) \tag{14} \]
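As a quick numerical illustration of relations (11) – (14), the sketch below computes the match probabilities from a set of base probabilities. The function name `match_from_base` and the example frequencies are illustrative assumptions only.

```python
def match_from_base(u, c, a, g):
    """Relations (11)-(14): match probabilities from base probabilities.

    The arguments are the base probabilities U, C, A, G and are assumed
    to sum to 1.
    """
    s = u * u + c * c + a * a + g * g   # same base in both sequences
    p = 2 * u * c + 2 * a * g           # TS-type difference
    q = 2 * u * a + 2 * c * g           # TV1-type difference
    r = 2 * u * g + 2 * c * a           # TV2-type difference
    return s, p, q, r


# Illustrative (made-up) base frequencies.
s, p, q, r = match_from_base(0.1, 0.2, 0.3, 0.4)
total = s + p + q + r   # equals (U + C + A + G)^2, i.e. 1 up to rounding
```

Because S + P + Q + R expands to (U + C + A + G)^2, the four match probabilities always sum to 1 whenever the base probabilities do.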
Using the rates of substitution and the base probabilities, the mean rate of substitution
at a specific site over the time interval (0, T] is given by
\[ k = \sum_{i=1}^{4} \frac{\alpha_i + \beta_i + \gamma_i}{T} \int_0^T B_i(t)\,dt \tag{15} \]
where B1(t) = U(t), B2(t) = C(t), B3(t) = A(t) and B4(t) = G(t), and the integrals,
divided by T, are the average probabilities of finding a given base at a given site during
the time interval (0, T].
A measure of genetic distance is therefore given by
\[ K = 2Tk \tag{16} \]
where k is as defined in (15), T is the time since the two sequences started diverging from
the ancestral sequence and the factor of 2 is due to the fact that we are considering two
branches that diverged.
We now formulate the general model and proceed in a manner similar to that of
Takahata and Kimura (1981). At any time t ∈ [0, T ], consider a short time interval ∆t,
short enough so that if the mutation rate is small then higher order terms of ∆t and the
occurrence of a double substitution at a specific site may be neglected. We have
\[ U(t + \Delta t) = U(t) - \alpha_1 \Delta t\, U(t) + \alpha_2 \Delta t\, C(t) + \beta_2 \Delta t\, A(t) + \gamma_2 \Delta t\, G(t) - \gamma_1 \Delta t\, U(t) - \beta_1 \Delta t\, U(t) \tag{17} \]
which we can rewrite as
\[ \frac{U(t + \Delta t) - U(t)}{\Delta t} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{18} \]
Taking the limit as ∆t approaches zero, (18) gives
\[ \frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{19} \]
Doing this for the other three probabilities we get the following system of differential
equations:
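\[ \frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t) \tag{20} \]
\[ \frac{dC(t)}{dt} = \alpha_1 U(t) - (\alpha_2 + \beta_3 + \gamma_3) C(t) + \gamma_4 A(t) + \beta_4 G(t) \tag{21} \]
\[ \frac{dA(t)}{dt} = \beta_1 U(t) + \gamma_3 C(t) - (\alpha_3 + \beta_2 + \gamma_4) A(t) + \alpha_4 G(t) \tag{22} \]
\[ \frac{dG(t)}{dt} = \gamma_1 U(t) + \beta_3 C(t) + \alpha_3 A(t) - (\alpha_4 + \beta_4 + \gamma_2) G(t). \tag{23} \]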
Writing (20) – (23) in matrix form gives
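\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha_1 + \beta_1 + \gamma_1) & \alpha_2 & \beta_2 & \gamma_2 \\
\alpha_1 & -(\alpha_2 + \beta_3 + \gamma_3) & \gamma_4 & \beta_4 \\
\beta_1 & \gamma_3 & -(\alpha_3 + \beta_2 + \gamma_4) & \alpha_4 \\
\gamma_1 & \beta_3 & \alpha_3 & -(\alpha_4 + \beta_4 + \gamma_2)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}.
\tag{24}
\]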
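To make the structure of the general system concrete, the sketch below builds the 4 × 4 rate matrix from the twelve rates of Table 1 and integrates the base probabilities forward with a simple Euler scheme. This is only an illustration: the function names, the Euler scheme, the step count, and the rate values are assumptions, and any standard ODE solver could be used instead.

```python
def rate_matrix(alpha, beta, gamma):
    """4x4 rate matrix acting on (U, C, A, G).

    alpha, beta, gamma are 4-tuples holding the rates numbered 1..4 in
    Table 1 (index 0 holds rate 1, and so on).
    """
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    g1, g2, g3, g4 = gamma
    return [
        [-(a1 + b1 + g1), a2, b2, g2],
        [a1, -(a2 + b3 + g3), g4, b4],
        [b1, g3, -(a3 + b2 + g4), a4],
        [g1, b3, a3, -(a4 + b4 + g2)],
    ]


def euler_step(m, x, dt):
    """One explicit Euler step of dx/dt = m x."""
    return [x[i] + dt * sum(m[i][j] * x[j] for j in range(4)) for i in range(4)]


def integrate(m, x0, t, steps=10000):
    """Integrate the base probabilities from time 0 to time t."""
    x = list(x0)
    dt = t / steps
    for _ in range(steps):
        x = euler_step(m, x, dt)
    return x


# Illustrative equal rates; the ancestral site is assumed to be U.
m = rate_matrix((0.01,) * 4, (0.01,) * 4, (0.01,) * 4)
x = integrate(m, [1.0, 0.0, 0.0, 0.0], 1000.0)
```

Each column of the matrix sums to zero, so the total probability is conserved along the trajectory. The sketch needs an initial state, which is exactly the quantity that is unknown in practice.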
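Using the fact that the sum of the base probabilities is equal to 1, the matrix equation
reduces to
\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha_1 + \beta_1 + \gamma_1 + \gamma_2) & \alpha_2 - \gamma_2 & \beta_2 - \gamma_2 \\
\alpha_1 - \beta_4 & -(\alpha_2 + \beta_3 + \gamma_3 + \beta_4) & \gamma_4 - \beta_4 \\
\beta_1 - \alpha_4 & \gamma_3 - \alpha_4 & -(\alpha_3 + \beta_2 + \gamma_4 + \alpha_4)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
+
\begin{pmatrix} \gamma_2 \\ \beta_4 \\ \alpha_4 \end{pmatrix},
\tag{25}
\]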
which can be written as
\[ \frac{d}{dt} \mathbf{B}_1(t) = \mathbf{Q}_1 \mathbf{B}_1(t) + \mathbf{C}_1. \tag{26} \]
Solving this system of differential equations entails solving for the eigenvalues of Q1.
Although it is easy to get the eigenvalues of the 3 × 3 matrix Q1, the matrix equation in
(26) is still difficult to solve since only the final conditions of the base probabilities can be
approximated and the initial conditions are unknown. One way to avoid this problem is to
express the base probabilities in terms of the match probabilities. The matrix equation
involving the match probabilities is easier to solve since the initial conditions for the
match probabilities are Pi(0) = Qi(0) = Ri(0) = 0, i = 1, . . . , 4 and S(0) = 1. After the
expressions for the match probabilities have been solved, we can solve for the mean rate of
base substitution k and hence the estimate of genetic distance K.
Inherent in these models of evolutionary base nucleotide substitutions are the
following four assumptions:
(1) The two sequences diverged from a common ancestor, that is, Pi (0) = Qi (0) =
Ri (0) = 0, i = 1, . . . , 4 and S(0) = 1.
(2) The two sequences are stochastically identical and independent, and within each
sequence, a substitution at one site in no way affects a substitution at any other site.
(3) The homologous sites chosen from the two sequences are of the same fixed length
during their descent from the common ancestor.
(4) (The fourth assumption reduces the number of parameters in the model by
assuming that some of the rates are equal. Since this differs among the three models that
we are going to consider, rather than stating it here, it will be stated as each model is
being considered.)
The 3ST Model
The first special case that we are going to consider is the three-substitution-type
(3ST) model. This model is due to Kimura (1981) and is the most general of the three
models we are going to consider in detail in this paper. The two other models we
consider later are special cases of this model. The fourth assumption in the 3ST model is
that the TS-type substitutions all have rates α, and that the TV-type substitutions have
rates β and γ depending on the specific type as shown in Figure 3. Under the 3ST model,
Tables 1 and 2 can be simplified and their simplified forms are given below as Tables 3
and 4, respectively.
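The system of differential equations in (20) – (23) simplifies to
\[ \frac{dU(t)}{dt} = -(\alpha + \beta + \gamma) U(t) + \alpha C(t) + \beta A(t) + \gamma G(t) \tag{27} \]
\[ \frac{dC(t)}{dt} = \alpha U(t) - (\alpha + \beta + \gamma) C(t) + \gamma A(t) + \beta G(t) \tag{28} \]
\[ \frac{dA(t)}{dt} = \beta U(t) + \gamma C(t) - (\alpha + \beta + \gamma) A(t) + \alpha G(t) \tag{29} \]
\[ \frac{dG(t)}{dt} = \gamma U(t) + \beta C(t) + \alpha A(t) - (\alpha + \beta + \gamma) G(t) \tag{30} \]
and its corresponding matrix form is
\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + \beta + \gamma) & \alpha & \beta & \gamma \\
\alpha & -(\alpha + \beta + \gamma) & \gamma & \beta \\
\beta & \gamma & -(\alpha + \beta + \gamma) & \alpha \\
\gamma & \beta & \alpha & -(\alpha + \beta + \gamma)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix},
\tag{31}
\]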
which again can be written in the form of (25). Considering the fact that the sum of the
base probabilities is 1, we can simplify (31) to
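\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + \beta + 2\gamma) & \alpha - \gamma & \beta - \gamma \\
\alpha - \beta & -(\alpha + 2\beta + \gamma) & \gamma - \beta \\
\beta - \alpha & \gamma - \alpha & -(2\alpha + \beta + \gamma)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
+
\begin{pmatrix} \gamma \\ \beta \\ \alpha \end{pmatrix}.
\tag{32}
\]
We can also rewrite (32) in the form of (26). The matrix equation in (32) is not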
difficult to solve since the eigenvalues are easily obtainable. The problem here is that we
do not know the initial conditions for the base probabilities since we do not know the base
frequencies of the ancestral sequence. As we have mentioned before, a way to avoid this
problem is to consider the match probabilities instead. It is easier to use the match
probabilities since we have the initial conditions for this set of probabilities given by the
first assumption (A1) of our model.
Using the relationships between the base probabilities and the match probabilities
given in (11) – (14) it can be shown that
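\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + \beta + \gamma) & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma)
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\beta \\ 2\gamma \end{pmatrix},
\tag{33}
\]
which in matrix form is
\[ \frac{d}{dt} \mathbf{T}(t) = \mathbf{Q}_2 \mathbf{T}(t) + \mathbf{C}_2. \tag{34} \]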
We now derive the expression for P (t) in (33). The expressions for Q(t) and R(t) can be
obtained in very much the same manner.
Recall that in (11) – (14) we have
\[ P(t) = \text{probability of a TS-type difference at a homologous site} \tag{35} \]
\[ \phantom{P(t)} = 2C(t)U(t) + 2A(t)G(t). \tag{36} \]
Using the product rule for derivatives,
\[ \frac{dP(t)}{dt} = 2\left[ C(t)\frac{dU(t)}{dt} + U(t)\frac{dC(t)}{dt} \right] + 2\left[ A(t)\frac{dG(t)}{dt} + G(t)\frac{dA(t)}{dt} \right]. \tag{37} \]
If we substitute the expressions for the derivatives of the base probabilities given in
(27) – (30) we have
\[ \frac{dP(t)}{dt} = 2\left\{ -2\left( C(t)U(t) + A(t)G(t) \right)(\alpha + \beta + \gamma) + 2\beta\left( A(t)C(t) + G(t)U(t) \right) + 2\gamma\left( A(t)U(t) + G(t)C(t) \right) + \alpha\left( A^2(t) + C^2(t) + U^2(t) + G^2(t) \right) \right\}. \tag{38} \]
Using the fact that A^2(t) + C^2(t) + U^2(t) + G^2(t) = 1 − P(t) − Q(t) − R(t), together
with (12) – (14), we can simplify (38) to obtain
\[ \frac{dP(t)}{dt} = 2\left\{ -(2\alpha + \beta + \gamma)P(t) + (\beta - \alpha)R(t) + (\gamma - \alpha)Q(t) + \alpha \right\} \tag{39} \]
which is what we want.
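For completeness, the analogous computation for Q(t) can be sketched as follows. This is a worked step added for illustration, under the same 3ST assumptions and using Q(t) = 2U(t)A(t) + 2C(t)G(t) from (13); the intermediate grouping is one possible arrangement and is not taken from the original derivation.

```latex
\begin{align*}
\frac{dQ(t)}{dt}
  &= 2\left[ A\frac{dU}{dt} + U\frac{dA}{dt} \right]
   + 2\left[ G\frac{dC}{dt} + C\frac{dG}{dt} \right] \\
  &= 2\left\{ -2(\alpha+\beta+\gamma)(UA + CG)
   + 2\alpha(AC + GU) + 2\gamma(UC + AG)
   + \beta\left( U^2 + C^2 + A^2 + G^2 \right) \right\} \\
  &= 2\left\{ -(\alpha + 2\beta + \gamma)Q(t)
   + (\gamma - \beta)P(t) + (\alpha - \beta)R(t) + \beta \right\}.
\end{align*}
```

The last line uses 2(UA + CG) = Q, 2(AC + GU) = R, 2(UC + AG) = P, and U^2 + C^2 + A^2 + G^2 = 1 − P − Q − R. The expression for R(t) follows in the same way with the roles of α and γ interchanged.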
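We now solve the matrix equation in (34). Define the following Laplace transform:
\[
\mathcal{L}[\mathbf{T}(t)]
= \mathcal{L}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
= \begin{pmatrix} p(s) \\ q(s) \\ r(s) \end{pmatrix}
= \mathcal{T}(s).
\tag{40}
\]
Applying the Laplace transform to (34), we get
\[ s\mathcal{T}(s) - \mathbf{T}(0) = \mathbf{Q}_2 \mathcal{T}(s) + \frac{1}{s}\mathbf{C}_2 \tag{41} \]
which we can rewrite as
\[ -\frac{1}{s}\mathbf{C}_2 = (\mathbf{Q}_2 - s\mathbf{I}_3)\mathcal{T}(s), \tag{42} \]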
where we have used the fact that T(0)= 0 and I3 is the 3 × 3 identity matrix. The
problem of solving the system of differential equations in (34) is now reduced to solving a
system of algebraic equations in the three unknowns p(s), q(s), and r(s). We now solve for
these three unknowns and then apply the inverse Laplace transform to get the solutions
for P (t), Q(t), and R(t). Using Cramer’s rule, we get
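\[
p(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2\alpha/s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2\beta/s & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\
-2\gamma/s & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix}
\tag{43}
\]
\[
q(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2\alpha/s & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2\beta/s & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2\gamma/s & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix}
\tag{44}
\]
\[
r(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2\alpha/s \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2\beta/s \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2\gamma/s
\end{vmatrix}
\tag{45}
\]
where
\[
\Delta =
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix}.
\tag{46}
\]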
Upon simplifying and expressing the results in partial fractions we get,
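\[ p(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} + \frac{1/4}{s + 4(\beta + \gamma)} \tag{47} \]
\[ q(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} + \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)} \tag{48} \]
\[ r(s) = \frac{1}{4s} + \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)}. \tag{49} \]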
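The partial-fraction expressions can be checked numerically against the algebraic system obtained from the Laplace transform. The sketch below is illustrative only: the function names and the parameter values are assumptions, and the match-probability matrix is written with −2(2α + β + γ), −2(α + 2β + γ), and −2(α + β + 2γ) on its diagonal.

```python
def q2_matrix(al, be, ga):
    """Coefficient matrix of the match-probability system under 3ST."""
    return [
        [-2 * (2 * al + be + ga), -2 * (al - ga), -2 * (al - be)],
        [-2 * (be - ga), -2 * (al + 2 * be + ga), -2 * (be - al)],
        [-2 * (ga - be), -2 * (ga - al), -2 * (al + be + 2 * ga)],
    ]


def transforms(s, al, be, ga):
    """Partial-fraction forms of p(s), q(s), r(s)."""
    base = 0.25 / s
    f1 = 0.25 / (s + 4 * (al + be))
    f2 = 0.25 / (s + 4 * (al + ga))
    f3 = 0.25 / (s + 4 * (be + ga))
    return (base - f1 - f2 + f3, base - f1 + f2 - f3, base + f1 - f2 - f3)


def residual(s, al, be, ga):
    """(Q2 - s I) T(s) + C2 / s, which should vanish for any s > 0."""
    m = q2_matrix(al, be, ga)
    t = transforms(s, al, be, ga)
    c = (2 * al, 2 * be, 2 * ga)
    return [
        sum(m[i][j] * t[j] for j in range(3)) - s * t[i] + c[i] / s
        for i in range(3)
    ]


# Illustrative values of s and of the three rates.
res = residual(0.7, 0.3, 0.2, 0.1)
```

A residual that is zero up to rounding error indicates that the three transforms solve the linear system for the chosen values.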
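Applying the inverse Laplace transform to (47) – (49), we get the following as solutions to
the system in (34),
\[ P(t) = \mathcal{L}^{-1}\{p(s)\} = \frac{1}{4}\left( 1 - e^{\lambda_1 t} - e^{\lambda_2 t} + e^{\lambda_3 t} \right) \tag{50} \]
\[ Q(t) = \mathcal{L}^{-1}\{q(s)\} = \frac{1}{4}\left( 1 - e^{\lambda_1 t} + e^{\lambda_2 t} - e^{\lambda_3 t} \right) \tag{51} \]
\[ R(t) = \mathcal{L}^{-1}\{r(s)\} = \frac{1}{4}\left( 1 + e^{\lambda_1 t} - e^{\lambda_2 t} - e^{\lambda_3 t} \right), \tag{52} \]
where λ1 = −4(α+β), λ2 = −4(α+γ), λ3 = −4(β+γ).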
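The closed-form solutions are straightforward to evaluate. The following sketch, with a hypothetical function name and made-up rates, computes P(t), Q(t), and R(t) and can be used to confirm two expected properties: the probabilities start at zero, and each approaches 1/4 as t grows.

```python
import math


def match_probs_3st(t, al, be, ga):
    """Closed-form P(t), Q(t), R(t) under the 3ST model."""
    e1 = math.exp(-4 * (al + be) * t)   # exp(lambda_1 t)
    e2 = math.exp(-4 * (al + ga) * t)   # exp(lambda_2 t)
    e3 = math.exp(-4 * (be + ga) * t)   # exp(lambda_3 t)
    p = 0.25 * (1 - e1 - e2 + e3)
    q = 0.25 * (1 - e1 + e2 - e3)
    r = 0.25 * (1 + e1 - e2 - e3)
    return p, q, r


# Illustrative rates per site per unit time.
al, be, ga = 0.02, 0.01, 0.005
start = match_probs_3st(0.0, al, be, ga)     # all three are zero
late = match_probs_3st(5000.0, al, be, ga)   # all three are close to 1/4
```

Near t = 0 the slopes of P, Q, and R are 2α, 2β, and 2γ, reflecting the two diverging branches.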
Under the 3ST model, the equation for k in (15) can be expressed as
\[ k = \sum_{i=1}^{4} \frac{\alpha + \beta + \gamma}{T} \int_0^T B_i(t)\,dt = \alpha + \beta + \gamma, \tag{53} \]
where we have used the fact that the sum of the base probabilities is equal to 1. Note that
the assumption on some of the rates being equal played a crucial role in being able to
factor α+β+γ out of the summation to get a simple expression for k. For K, we obtain
\[ K = 2T(\alpha + \beta + \gamma). \tag{54} \]
We can solve (50) – (52) for λ1 t, λ2 t, and λ3 t to get
\[ 4(\alpha + \beta)t = -\ln\left( 1 - 2P(t) - 2Q(t) \right) \tag{55} \]
\[ 4(\alpha + \gamma)t = -\ln\left( 1 - 2P(t) - 2R(t) \right) \tag{56} \]
\[ 4(\beta + \gamma)t = -\ln\left( 1 - 2Q(t) - 2R(t) \right), \tag{57} \]
and hence, for any time t ∈ [0, T],
\[ 8(\alpha + \beta + \gamma)t = -\ln\left\{ [1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)] \right\} \tag{58} \]
\[ K = 2kt \tag{59} \]
\[ \phantom{K} = -\frac{1}{4}\ln\left\{ [1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)] \right\}. \tag{60} \]
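The estimator can be written as a short function of the three observed proportions. This is a minimal sketch with a hypothetical function name; it assumes each of the three bracketed factors is positive, which fails when the sequences are close to saturation.

```python
import math


def k_3st(p, q, r):
    """Distance K = -(1/4) ln{(1-2P-2Q)(1-2P-2R)(1-2Q-2R)} under 3ST."""
    f1 = 1 - 2 * p - 2 * q
    f2 = 1 - 2 * p - 2 * r
    f3 = 1 - 2 * q - 2 * r
    if min(f1, f2, f3) <= 0:
        raise ValueError("a logarithm argument is not positive; K is undefined")
    return -0.25 * math.log(f1 * f2 * f3)


# Round trip with made-up rates: generate P, Q, R from the closed-form
# solutions and recover K = 2 (alpha + beta + gamma) t.
al, be, ga, t = 0.02, 0.01, 0.005, 3.0
e1 = math.exp(-4 * (al + be) * t)
e2 = math.exp(-4 * (al + ga) * t)
e3 = math.exp(-4 * (be + ga) * t)
p = 0.25 * (1 - e1 - e2 + e3)
q = 0.25 * (1 - e1 + e2 - e3)
r = 0.25 * (1 + e1 - e2 - e3)
k_hat = k_3st(p, q, r)   # close to 2 * 0.035 * 3 = 0.21
```

In practice P, Q, and R are replaced by the observed proportions of TS-type, TV1-type, and TV2-type mismatches.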
The variance for this estimate of K is also given in the paper of Kimura (1981). We
have,
\[ \sigma_K^2 = \frac{1}{n}\left[ a^2 P(t) + b^2 Q(t) + c^2 R(t) - \left( aP(t) + bQ(t) + cR(t) \right)^2 \right] \tag{61} \]
where n is the number of sites compared and
\[ a = \frac{1}{2}\left[ \frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2P(t) - 2R(t)} \right] \tag{62} \]
\[ b = \frac{1}{2}\left[ \frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2Q(t) - 2R(t)} \right] \tag{63} \]
\[ c = \frac{1}{2}\left[ \frac{1}{1 - 2P(t) - 2R(t)} + \frac{1}{1 - 2Q(t) - 2R(t)} \right]. \tag{64} \]
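The variance formula translates directly into code. In the sketch below the function name is hypothetical and the coefficients a, b, and c are taken as the partial derivatives of K with respect to P, Q, and R; n is the number of sites compared.

```python
def var_k_3st(p, q, r, n):
    """Large-sample variance of the 3ST distance estimate K."""
    d1 = 1 - 2 * p - 2 * q
    d2 = 1 - 2 * p - 2 * r
    d3 = 1 - 2 * q - 2 * r
    a = 0.5 * (1 / d1 + 1 / d2)   # dK/dP
    b = 0.5 * (1 / d1 + 1 / d3)   # dK/dQ
    c = 0.5 * (1 / d2 + 1 / d3)   # dK/dR
    mean = a * p + b * q + c * r
    return (a * a * p + b * b * q + c * c * r - mean * mean) / n


# Illustrative values: equal proportions of the three mismatch types.
variance = var_k_3st(0.05, 0.05, 0.05, 200)
standard_error = variance ** 0.5
```

When P = Q = R the three coefficients coincide and the expression collapses to x(1 − x) / [n(1 − 4x/3)^2] with x = P + Q + R, which is a convenient sanity check.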
The 2ST Model
We now proceed to a special case of the 3ST model, which is again due to Kimura
(1980). We will call it the two-substitution-type (2ST) model; the model is unnamed in
the original paper, which was published a year before the paper on the 3ST model. Since
the 2ST model is a special case of the 3ST model, we just give the results and do not go
into the details. The fourth assumption here is that the transition rate is α and the
transversion rate is β. Under this assumption the diagram in Figure 3 simplifies further
to the diagram in Figure 4.
The tables for the base substitution and the match probabilities are given as Tables
5 and 6 below. The probability of a TS-type mismatch is given by P (t) and the
probability of a TV-type mismatch is given by QR(t) = Q(t)+ R(t). That is, we have
lumped together the TV1-type and TV2-type mismatches.
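The matrix equation in (24) under the 2ST model is
\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + 2\beta) & \alpha & \beta & \beta \\
\alpha & -(\alpha + 2\beta) & \beta & \beta \\
\beta & \beta & -(\alpha + 2\beta) & \alpha \\
\beta & \beta & \alpha & -(\alpha + 2\beta)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
\tag{65}
\]
and the corresponding matrix equation involving the match probabilities is
\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + 2\beta) & -2(\alpha - \beta) & -2(\alpha - \beta) \\
0 & -2(\alpha + 3\beta) & -2(\beta - \alpha) \\
0 & -2(\beta - \alpha) & -2(\alpha + 3\beta)
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\beta \\ 2\beta \end{pmatrix}.
\tag{66}
\]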
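If we now lump Q(t) and R(t) together as QR(t) we have the matrix equation in (67)
which only involves a 2 × 2 matrix instead of the previous 3 × 3 matrix.
\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + 2\beta) & -2(\alpha - \beta) \\
0 & -8\beta
\end{pmatrix}
\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 4\beta \end{pmatrix}
\tag{67}
\]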
To solve (67), we use the initial conditions: P(0) = QR(0) = 0. As solutions we have
\[ P(t) = \frac{1}{4} - \frac{1}{2} e^{\lambda_1 t} + \frac{1}{4} e^{\lambda_2 t} \tag{68} \]
\[ QR(t) = \frac{1}{2} - \frac{1}{2} e^{\lambda_2 t} \tag{69} \]
where λ1 = −4(α+β) and λ2 = −8β.
Under the 2ST model k = α + 2β. We can solve (68) and (69) for αt and βt and therefore
obtain our estimate K. We have
\[ K = 2kt = 2(\alpha + 2\beta)t \tag{70} \]
\[ \phantom{K} = -\frac{1}{4}\ln\left\{ [1 - 2P(t) - QR(t)]^2 [1 - 2QR(t)] \right\}. \tag{71} \]
4
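The variance of this estimate is given by
\[ \sigma_K^2 = \frac{1}{n}\left[ a^2 P(t) + b^2 QR(t) - \left( aP(t) + bQR(t) \right)^2 \right] \tag{72} \]
where
\[ a = \frac{1}{1 - 2P(t) - QR(t)} \tag{73} \]
\[ b = \frac{1}{2}\left[ \frac{1}{1 - 2P(t) - QR(t)} + \frac{1}{1 - 2QR(t)} \right]. \tag{74} \]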
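A compact sketch of the 2ST estimate and its variance is given below. The function names and the numerical values are illustrative assumptions; the coefficients a and b are computed as the partial derivatives of K = −(1/2) ln(1 − 2P − QR) − (1/4) ln(1 − 2QR) with respect to P and QR.

```python
import math


def k_2st(p, qr):
    """Distance K under the 2ST model; qr is the total transversion proportion."""
    d1 = 1 - 2 * p - qr
    d2 = 1 - 2 * qr
    if d1 <= 0 or d2 <= 0:
        raise ValueError("a logarithm argument is not positive; K is undefined")
    return -0.5 * math.log(d1) - 0.25 * math.log(d2)


def var_k_2st(p, qr, n):
    """Large-sample variance of the 2ST distance estimate K."""
    d1 = 1 - 2 * p - qr
    d2 = 1 - 2 * qr
    a = 1 / d1                    # dK/dP
    b = 0.5 * (1 / d1 + 1 / d2)   # dK/dQR
    mean = a * p + b * qr
    return (a * a * p + b * b * qr - mean * mean) / n


# Round trip with made-up rates: K should equal 2 (alpha + 2 beta) t.
al, be, t = 0.02, 0.01, 2.0
e1 = math.exp(-4 * (al + be) * t)
e2 = math.exp(-8 * be * t)
p = 0.25 - 0.5 * e1 + 0.25 * e2
qr = 0.5 - 0.5 * e2
k_hat = k_2st(p, qr)   # close to 2 * 0.04 * 2 = 0.16
```

Setting p to one third and qr to two thirds of a total mismatch proportion x reduces the variance to x(1 − x) / [n(1 − 4x/3)^2].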
The Jukes-Cantor Model
The simplest possible model is due to Jukes and Cantor (1969). The model was
primarily formulated to describe protein evolution by looking at the rate of amino acid
substitution. It turns out that this model can also be used to describe base substitution.
The fourth assumption here is that all the rates of substitution are equal, i.e., α =
αi = βi = γi , i = 1, . . ., 4. Figure 2 then becomes Figure 5 below. Under the
Jukes-Cantor model, Tables 1 and 2 can be simplified to Tables 7 and 8, respectively.
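The matrix equation in (24) under the Jukes-Cantor model is
\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
\tag{75}
\]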
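and the matrix equation involving the match probabilities is
\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-8\alpha & 0 & 0 \\
0 & -8\alpha & 0 \\
0 & 0 & -8\alpha
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\alpha \\ 2\alpha \end{pmatrix}.
\tag{76}
\]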
If we define PQR(t) = P(t) + Q(t) + R(t) we have the differential equation
\[ \frac{d}{dt} PQR(t) = -8\alpha\, PQR(t) + 6\alpha \tag{77} \]
which has as a solution
\[ PQR(t) = \frac{3}{4}\left( 1 - e^{-8\alpha t} \right). \tag{78} \]
Under the Jukes-Cantor model, k = 3α and the estimate K is
\[ K = 2kt = 6\alpha t = -\frac{3}{4}\ln\left( 1 - \frac{4}{3} PQR(t) \right) \tag{79} \]
which can be obtained by solving for αt in (78).
The variance for K under the Jukes-Cantor model was derived by Kimura and Ohta
(1972) and is given by
\[ \sigma_{JC}^2 = \frac{1}{n} \cdot \frac{(1 - PQR(t))\,PQR(t)}{(1 - 4PQR(t)/3)^2} = \frac{(1 - PQR(t))\,PQR(t)}{n\,(1 - 4PQR(t)/3)^2}. \tag{80} \]
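The Jukes-Cantor estimate and its variance can be sketched in a few lines. The function names and the numbers are illustrative assumptions; the variance is written in the usual large-sample form, with the square of 1 − 4PQR/3 in the denominator, obtained by treating the mismatch count as binomial.

```python
import math


def k_jc(pqr):
    """Jukes-Cantor distance K = -(3/4) ln(1 - 4 PQR / 3)."""
    arg = 1 - 4 * pqr / 3
    if arg <= 0:
        raise ValueError("mismatch proportion too large; K is undefined")
    return -0.75 * math.log(arg)


def var_k_jc(pqr, n):
    """Large-sample variance of the Jukes-Cantor distance estimate."""
    return pqr * (1 - pqr) / (n * (1 - 4 * pqr / 3) ** 2)


# Round trip with a made-up rate: K should equal 6 alpha t.
al, t = 0.01, 5.0
pqr = 0.75 * (1 - math.exp(-8 * al * t))
k_hat = k_jc(pqr)   # close to 6 * 0.01 * 5 = 0.3
standard_error = math.sqrt(var_k_jc(pqr, 500))
```

Here n is the number of sites compared and pqr is the observed proportion of mismatched sites of any type.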
We illustrate the three models by comparing the human and mouse protein kinase
inhibitor nucleotide sequences, which were recently sequenced by Olsen and Uhler
(1991a, 1991b). The sequences are more than a thousand base pairs long but only 231 of
these are part of the coding region. Our analysis is limited to these 231 base pairs. The
sequences are shown in Figure 6. Of the 231 bp, only 15 show mismatches. These are
summarized in Table 9. Usually, the estimate K is computed separately for each codon
position, since the models assume that substitutions at different sites are independent of
each other while there is evidence that adjacent substitutions are in fact not independent.
This will not be done here since we have only a small number of base pairs and the
mismatches are quite far apart (except for the ones occurring at positions 200 and 201).
The estimate under each model is shown in Table 10. It is seen here that the
estimates do not differ so much from one model to the other. The variances are also not
that different from each other.
Estimates of genetic distance using some other nucleotide sequences are also
available. Tavaré (1986) obtained estimates using human and mouse α-fetoprotein and
serum albumin nucleotide sequences. The results he obtained for the human-mouse
α-fetoprotein nucleotide sequences are reproduced below as Table 11. The data consist of
1824 base pairs and hence it was possible for him to compute the estimates by codon
position.
Note that the estimates tend to be largest for the third codon position and smallest
for the second codon position. Tavaré showed in his paper that the estimates are not
homogeneous if we consider the codon positions as strata. Unfortunately, we cannot do
the same thing in our analysis here since we have just 231 bp and 15 mismatches.
All three models of evolutionary base substitution that we have discussed here are
far from perfect, and their weaknesses lie in the second and third assumptions made to
formulate the models.
The second assumption states that the nucleotide sequences are stochastically
identical and independent of each other. It is quite possibly true that nucleotide sequences
evolve in a manner stochastically independent of each other, but there is evidence
that they are in fact not stochastically identical. For example, Wu and Li (1985) noticed
that the substitution rates in rodents are much higher than those in humans. Even within a
sequence, there is evidence that the rates are much higher in some spots (“hot spots”)
than in others (Miyata and Yasunaga, 1981; Brown and Clegg, 1983) and that the rates
differ between the sense and antisense strands (Wu and Maeda, 1987). There is also
evidence that a substitution at one site does affect the rate of substitution at
an adjacent site in phage T4 (Koch, 1971). It would be interesting to know if the same
holds for higher organisms. This last fact is also one of the reasons why substitution rates
are computed by codon position if the data allow.
The third assumption is that the diverging nucleotide sequences are both of a
fixed length, and hence it does not take into account mutations resulting from deletions and
insertions. This assumption also does not take into account the possibility of concerted
evolution, which brings about the presence of multigene families, and the duplication and
divergence in multigene families.
There have been efforts to formulate models which address these shortcomings
while still remaining mathematically tractable. Needleman and
Wunsch (1970), for example, proposed a model which assigns weights to substitutions,
insertions and deletions. Unfortunately, the weights assigned were arbitrary and had no
genetic basis.
The main problem that these models of evolutionary base nucleotide substitution
face is that when all of the mechanisms of evolution are included in the model, the model
becomes mathematically intractable with the present computer technology. Considering
the fact that computer technology is still advancing, it is hoped that a model incorporating
most, if not all, of the mechanisms discussed can be formulated in the near future.
References
Brown, A., & Clegg, M. (1983). Analysis of variation in related DNA sequences. In
B. Weir (Ed.), Statistical data analysis (pp. 107–132). New York: Marcel-Dekker.
Cavalli-Sforza, L., & Bodmer, W. (1971). The genetics of human populations. San
Francisco: W. H. Freeman.
Cavalli-Sforza, L., & Edwards, A. (1967). Phylogenetic analysis: models and estimation
procedures. American Journal of Human Genetics, 19, 233–257.
Edwards, A. (1971). The distance between populations on the basis of gene frequencies.
Biometrics, 27, 873–881.
Jukes, T., & Cantor, C. (1969). Evolution of protein molecules. In H. N. Munro (Ed.),
Mammalian protein metabolism (pp. 21–123). New York: Academic Press.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. Journal of
Molecular Evolution, 16, 111–120.
Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide
sequences. Proceedings of the National Academy of Sciences USA, 78, 454–458.
Kimura, M., & Ohta, T. (1972). On the stochastic model for estimation of mutational
distance between homologous proteins. Journal of Molecular Evolution, 2, 87–90.
Koch, R. (1971). The influence of neighbouring base pairs upon base-pair substitution
mutation rates. Proceedings of the National Academy of Sciences USA, 68, 773–776.
Maxam, A., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the
National Academy of Sciences USA, 74, 560–564.
Miura, R. (Ed.). (1986). Lectures on mathematics in the life sciences. Rhode Island:
American Mathematical Society.
Miyata, T., & Yasunaga, T. (1981). Rapidly evolving mouse α-globin-related
pseudogenes. Proceedings of the National Academy of Sciences USA, 78, 450–453.
Munro, H. N. (Ed.). (1969). Mammalian protein metabolism. New York: Academic Press.
Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of Molecular
Biology, 48, 443–453.
Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations.
Annals of Human Genetics, 41, 225–233.
Olsen, S., & Uhler, M. (1991a). (nucleotide sequence of the human protein kinase
inhibitor). Molecular Endocrinology. (manuscript submitted)
Olsen, S., & Uhler, M. (1991b). (nucleotide sequence of the mouse protein kinase
inhibitor). Journal of Biological Chemistry. (in press)
Sanger, F., Nicklen, S., & Coulson, A. (1977). DNA sequencing with chain-terminating
inhibitors. Proceedings of the National Academy of Sciences USA, 74, 5463–5467.
Takahata, N., & Kimura, M. (1981). A model of evolutionary base substitutions and its
application with special reference to rapid change in pseudogenes. Genetics, 98,
641–657.
Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA
sequences. In R. Miura (Ed.), Lectures on mathematics in the life sciences (pp.
57–86). Rhode Island: American Mathematical Society.
Weir, B. (Ed.). (1983). Statistical data analysis. New York: Marcel-Dekker.
Weir, B. (1990). Genetic data analysis: methods for discrete population data. Sunderland,
Massachusetts: Sinauer Associates.
Wu, C., & Li, W. (1985). Evidence for higher rates of nucleotide substitution in rodents
than in man. Proceedings of the National Academy of Sciences USA, 82, 1741–1745.
Wu, C., & Maeda, N. (1987). Inequality in mutation rates of the two strands of DNA.
Nature, 327, 169–170.
Table 1
Types and rates of nucleotide substitution.
Type            Transition (TS)    Transversion (TV1)   Transversion (TV2)
Initial base    U   C   A   G      U   A   C   G        U   G   C   A
New base        C   U   G   A      A   U   G   C        G   U   A   C
Rate            α1  α2  α3  α4     β1  β2  β3  β4       γ1  γ2  γ3  γ4
Table 2
Possible nucleotide base pairings at a specific homologous site for t > 0.
Type            Same               TS-type              TV1-type             TV2-type
Sequence 1      U   C   A   G      U   C   A   G        U   A   C   G        U   G   C   A
Sequence 2      U   C   A   G      C   U   G   A        A   U   G   C        G   U   A   C
Probability     S1  S2  S3  S4     P1  P2  P3  P4       Q1  Q2  Q3  Q4       R1  R2  R3  R4
Table 3
Types and rates of nucleotide substitution under the 3ST model.
Type            Transition (TS)    Transversion (TV1)   Transversion (TV2)
Initial base    U   C   A   G      U   A   C   G        U   G   C   A
New base        C   U   G   A      A   U   G   C        G   U   A   C
Rate            α   α   α   α      β   β   β   β        γ   γ   γ   γ
Table 4
Possible nucleotide base pairings at a specific homologous site for t > 0 under the 3ST model.
Type            Same               TS-type              TV1-type             TV2-type
Sequence 1      U   C   A   G      U   C   A   G        U   A   C   G        U   G   C   A
Sequence 2      U   C   A   G      C   U   G   A        A   U   G   C        G   U   A   C
Probability     S                  P                    Q                    R
Table 5
Types and rates of nucleotide substitution under the 2ST model.
Type            Transition (TS)    Transversion (TV1)   Transversion (TV2)
Initial base    U   C   A   G      U   A   C   G        U   G   C   A
New base        C   U   G   A      A   U   G   C        G   U   A   C
Rate            α   α   α   α      β   β   β   β        β   β   β   β
Table 6
Possible nucleotide base pairings at a specific homologous site for t > 0 under the 2ST model.
Type            Same               TS-type              TV1-type             TV2-type
Sequence 1      U   C   A   G      U   C   A   G        U   A   C   G        U   G   C   A
Sequence 2      U   C   A   G      C   U   G   A        A   U   G   C        G   U   A   C
Probability     S                  P                    QR (TV1-type and TV2-type)
Table 7
Types and rates of nucleotide substitution under the Jukes-Cantor model.
Type            Transition (TS)    Transversion (TV1)   Transversion (TV2)
Initial base    U   C   A   G      U   A   C   G        U   G   C   A
New base        C   U   G   A      A   U   G   C        G   U   A   C
Rate            α   α   α   α      α   α   α   α        α   α   α   α
Table 8
Possible nucleotide base pairings at a specific homologous site for t > 0 under the
Jukes-Cantor model.
Type            Same               TS-type              TV1-type             TV2-type
Sequence 1      U   C   A   G      U   C   A   G        U   A   C   G        U   G   C   A
Sequence 2      U   C   A   G      C   U   G   A        A   U   G   C        G   U   A   C
Probability     S                  PQR (TS-type, TV1-type, and TV2-type)
Table 9
Nucleotide mismatches observed after time T since divergence between human and mouse
protein kinase inhibitor (pki).
Type              Transition (TS)    Transversion (TV1)   Transversion (TV2)
Human pki         U   C   A   G      U   A   C   G        U   G   C   A
Mouse pki         C   U   G   A      A   U   G   C        G   U   A   C
Number observed   5   0   3   2      0   1   1   6        0   1   1   2
Table 10
Estimates of the genetic distance K under the different models being considered.
Model           K            Standard error
Jukes-Cantor    0.0682288    0.0178312
2ST             0.0686475    0.0180611
3ST             0.0686535    0.0180644
Table 11
Estimates of the genetic distance Ki, where i = 1, 2, or 3 is the ith codon position, under
the different models considered in Tavaré (1986). The sequence data are those of human and
mouse α-fetoprotein.
Model           K1                K2                K3
Jukes-Cantor    0.1752 (.0186)    0.1387 (.0162)    .6566 (.0483)
3ST             0.1760 (.0188)    0.1389 (.0163)    .7230 (.0642)
(The parenthesized quantities are standard errors.)
Figure Captions
Figure 1. Divergence of sequences S1 and S2 from some common ancestor.
Figure 2. Types and rates of nucleotide substitutions.
Figure 3. Types and rates of nucleotide substitutions: 3ST Model.
Figure 4. Types and rates of nucleotide substitutions: 2ST Model.
Figure 5. Types and rates of nucleotide substitutions: Jukes-Cantor Model.
Figure 6. The nucleotide sequences of the coding region of the mouse protein kinase
inhibitor (Mpki.M) and the human protein kinase inhibitor (Hpki.2) are shown above.
The 15 mismatches are indicated with bars (Olsen and Uhler, 1991a, 1991b).