DNA Nucleotide Substitution Models Explained


Running head: DNA NUCLEOTIDE SUBSTITUTION MODELS

On Some Measures of Genetic Distance Based on Rates of Nucleotide Substitution

Justine Leon A. Uro

Ph. D. Graduate Student

Department of Biostatistics, University of Michigan

Ann Arbor, MI


Abstract

We present a general DNA base-nucleotide substitution model and discuss three special cases: the three-substitution-type (3ST), two-substitution-type (2ST), and Jukes-Cantor models.



Introduction

The genetic distance between two populations is a concept related to the time since the two populations diverged from a common ancestral population (Weir, 1990). A number of methods have been proposed to estimate it. They are based on the allele frequencies in the two populations, on the rate of amino acid substitution in protein sequence data from the two populations, or on the rates of nucleotide substitution in DNA sequence data from the two populations.

Measures of genetic distance that use allele frequencies are estimates based on some geometric transformation of those frequencies (Cavalli-Sforza and Edwards, 1967; Cavalli-Sforza and Bodmer, 1971; Edwards, 1971; Nei, 1977, 1978; Li and Nei, 1977; Smith, 1977). Some of these measures are purely geometric and do not involve any genetic concept at all, e.g., the measure proposed by Cavalli-Sforza and Bodmer (Weir, 1990). On the other hand, the ones proposed by Edwards (1971) and by Nei (1977) can be shown to be related to the concept of the fixation index (Hartl and Clark, 1989).

A measure of genetic distance based on amino acid substitution in protein sequence data was proposed by Jukes and Cantor in 1969. The method arose partly because amino acid sequence data were abundant at the time. Some geneticists argue that this measure should be preferred since proteins are the subject of mutations.

The development of DNA sequencing methods by Maxam and Gilbert and by Sanger et al. in 1977 brought about more methods for measuring genetic distance. The estimates from these methods are based on the rates of nucleotide substitution in DNA sequence data. These are the methods we consider in this paper. We formulate the general model, examine some special cases, give some numerical examples, and finally examine the validity of these models based on their assumptions.

The General Model

We now start by formulating the general model. Let S1 and S2 be two nucleotide

sequences with a common ancestral sequence. We consider a pair of homologous sites from

S1 and S2 and examine how much they have diverged from each other during their descent

from the ancestral sequence T years back (Figure 1).

The evolutionary base substitution model we use is shown in Figure 2. We have used RNA codes for the nucleotides, so the pyrimidines are uracil (U) and cytosine (C), and the purines are adenine (A) and guanine (G). The types and rates of base substitution are summarized in Table 1. A substitution of a purine by a purine or of a pyrimidine by a pyrimidine is called a transition (TS). If a pyrimidine is substituted by a purine or vice versa, the substitution is called a transversion (TV). We distinguish between two types of transversion, TV1 and TV2, each of which is shown in Table 1. The classification of a transversion by type becomes easier if we look at Figure 2: the transversions that go vertically up or down are TV1 and those that go diagonally are TV2.
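The classification can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` is hypothetical, and the TV1 pairs (U/A and C/G) are read off Table 1, with every other transversion treated as TV2.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"U", "C"}

# TV1 pairs as listed in Table 1: U<->A and C<->G. All other transversions are TV2.
TV1_PAIRS = {frozenset("UA"), frozenset("CG")}


def classify_substitution(old: str, new: str) -> str:
    """Return 'none', 'TS', 'TV1' or 'TV2' for a change from base `old` to base `new`."""
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError("bases must be one of U, C, A, G")
    if old == new:
        return "none"
    # A transition stays within the purines or within the pyrimidines.
    if {old, new} <= PURINES or {old, new} <= PYRIMIDINES:
        return "TS"
    return "TV1" if frozenset((old, new)) in TV1_PAIRS else "TV2"


if __name__ == "__main__":
    for pair in ["UC", "AG", "UA", "CG", "UG", "CA"]:
        print(pair, classify_substitution(pair[0], pair[1]))
```

The same function can be applied to a pair of homologous sites to decide which type of mismatch they show.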

When comparing the homologous sites of S1 and S2 at any time t > 0, there are 16 possible nucleotide base pairings, 12 of which involve mismatched base pairs. If the mismatch looks like a transition pair in Table 1, we call it a TS-type mismatch. We have a TV1-type mismatch if the mismatch looks like a Type 1 transversion listed in Table 1. The TV2-type mismatch is defined in the same manner. We summarize these in Table 2. In Table 2, for t > 0,


$$S(t) = \sum_{i=1}^{4} S_i(t) = \text{probability of no difference at a site} \tag{1}$$

$$P(t) = \sum_{i=1}^{4} P_i(t) = \text{probability of a TS-type difference at a site} \tag{2}$$

$$Q(t) = \sum_{i=1}^{4} Q_i(t) = \text{probability of a TV1-type difference at a site} \tag{3}$$

$$R(t) = \sum_{i=1}^{4} R_i(t) = \text{probability of a TV2-type difference at a site} \tag{4}$$

Hence,

$$Q(t) + R(t) = \sum_{i=1}^{4} \bigl(Q_i(t) + R_i(t)\bigr) = \text{probability of a TV-type difference at a site.} \tag{5}$$

We sometimes refer to the probabilities above as the match probabilities.

We also define the following probabilities which we sometimes refer to as the base

probabilities.

$$U(t) = \text{relative frequency of uracil,} \tag{6}$$

$$C(t) = \text{relative frequency of cytosine,} \tag{7}$$

$$A(t) = \text{relative frequency of adenine,} \tag{8}$$

$$G(t) = \text{relative frequency of guanine in a strand} \tag{9}$$

so that

$$U(t) + C(t) + A(t) + G(t) = 1. \tag{10}$$

Note that the probabilities in (1) - (4) and (6) - (9) are all time-dependent. We also have


the following relations:

$$S(t) = U^2(t) + C^2(t) + A^2(t) + G^2(t) \tag{11}$$

$$P(t) = 2U(t)C(t) + 2A(t)G(t) \tag{12}$$

$$Q(t) = 2U(t)A(t) + 2C(t)G(t) \tag{13}$$

$$R(t) = 2U(t)G(t) + 2C(t)A(t) \tag{14}$$
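As a quick check of these relations, the sketch below computes the four match probabilities from a set of base probabilities. The function name and the example frequencies are hypothetical. Because the four quantities are exactly the terms of the expansion of (U + C + A + G)^2, they must sum to 1.

```python
def match_probabilities(u, c, a, g):
    """Return (S, P, Q, R) from the base probabilities, following (11)-(14)."""
    if abs(u + c + a + g - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    s = u * u + c * c + a * a + g * g   # identical bases
    p = 2 * u * c + 2 * a * g           # TS-type pairs: U/C and A/G
    q = 2 * u * a + 2 * c * g           # TV1-type pairs: U/A and C/G
    r = 2 * u * g + 2 * c * a           # TV2-type pairs: U/G and C/A
    return s, p, q, r


if __name__ == "__main__":
    # Hypothetical base frequencies, chosen only for illustration.
    s, p, q, r = match_probabilities(0.1, 0.2, 0.3, 0.4)
    print(s, p, q, r, s + p + q + r)
```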

Using the rates of substitution and the base probabilities, the mean rate of substitution at a specific site over the time interval (0, T] is given by

$$k = \sum_{i=1}^{4} \frac{\alpha_i + \beta_i + \gamma_i}{T} \int_0^T B_i(t)\,dt \tag{15}$$

where B1(t) = U(t), B2(t) = C(t), B3(t) = A(t), and B4(t) = G(t). Each integral, divided by T, is the average probability of finding the given base at the given site during the time interval (0, T].

A measure of genetic distance is therefore given by

K = 2T k (16)

where k is as defined in (15), T is the time since the two sequences started diverging from

the ancestral sequence and the factor of 2 is due to the fact that we are considering two

branches that diverged.

We now formulate the general model and proceed in a manner similar to that of

Takahata and Kimura (1981). At any time t ∈ [0, T ], consider a short time interval ∆t,

short enough so that if the mutation rate is small then higher order terms of ∆t and the

occurrence of a double substitution at a specific site may be neglected. We have

$$U(t + \Delta t) = U(t) - \alpha_1 \Delta t\, U(t) + \alpha_2 \Delta t\, C(t) + \beta_2 \Delta t\, A(t) + \gamma_2 \Delta t\, G(t) - \gamma_1 \Delta t\, U(t) - \beta_1 \Delta t\, U(t) \tag{17}$$


which we can rewrite as

$$\frac{U(t + \Delta t) - U(t)}{\Delta t} = -(\alpha_1 + \beta_1 + \gamma_1)U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{18}$$

Taking the limit as ∆t approaches zero, (18) gives

$$\frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1)U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{19}$$

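Doing this for the other three probabilities we get the following system of differential equations:

$$\frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1)U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t) \tag{20}$$

$$\frac{dC(t)}{dt} = \alpha_1 U(t) - (\alpha_2 + \beta_3 + \gamma_3)C(t) + \gamma_4 A(t) + \beta_4 G(t) \tag{21}$$

$$\frac{dA(t)}{dt} = \beta_1 U(t) + \gamma_3 C(t) - (\alpha_3 + \beta_2 + \gamma_4)A(t) + \alpha_4 G(t) \tag{22}$$

$$\frac{dG(t)}{dt} = \gamma_1 U(t) + \beta_3 C(t) + \alpha_3 A(t) - (\alpha_4 + \beta_4 + \gamma_2)G(t). \tag{23}$$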
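Writing (20) – (23) in matrix form gives

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha_1 + \beta_1 + \gamma_1) & \alpha_2 & \beta_2 & \gamma_2 \\ \alpha_1 & -(\alpha_2 + \beta_3 + \gamma_3) & \gamma_4 & \beta_4 \\ \beta_1 & \gamma_3 & -(\alpha_3 + \beta_2 + \gamma_4) & \alpha_4 \\ \gamma_1 & \beta_3 & \alpha_3 & -(\alpha_4 + \beta_4 + \gamma_2) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}. \tag{24}$$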
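The general system can be explored numerically. The sketch below is a minimal illustration, not the method used in this paper: the helper names and the rate values are hypothetical, the matrix entries are read off Table 1 (each off-diagonal entry is the rate into the row base from the column base, in the order U, C, A, G), and a plain forward-Euler step is assumed. Every column of the matrix sums to zero, which is why the base probabilities keep summing to 1.

```python
def rate_matrix(alpha, beta, gamma):
    """4x4 rate matrix for the bases (U, C, A, G).

    alpha, beta, gamma are 4-tuples (x1, x2, x3, x4) indexed as in Table 1.
    """
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    g1, g2, g3, g4 = gamma
    return [
        [-(a1 + b1 + g1), a2, b2, g2],
        [a1, -(a2 + b3 + g3), g4, b4],
        [b1, g3, -(a3 + b2 + g4), a4],
        [g1, b3, a3, -(a4 + b4 + g2)],
    ]


def derivative(m, x):
    """Right-hand side of the system: the matrix-vector product m x."""
    return [sum(m[i][j] * x[j] for j in range(4)) for i in range(4)]


def euler_path(m, x0, t_end, steps):
    """Integrate dx/dt = m x from 0 to t_end with forward-Euler steps."""
    x = list(x0)
    dt = t_end / steps
    for _ in range(steps):
        dx = derivative(m, x)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


if __name__ == "__main__":
    # Hypothetical rates and starting frequencies, for illustration only.
    m = rate_matrix((0.3, 0.2, 0.1, 0.4), (0.05, 0.06, 0.07, 0.08), (0.01, 0.02, 0.03, 0.04))
    x = euler_path(m, [0.1, 0.2, 0.3, 0.4], t_end=1.0, steps=1000)
    print(x, sum(x))
```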

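Using the fact that the sum of the base probabilities is equal to 1, so that G(t) = 1 − U(t) − C(t) − A(t), the matrix equation reduces to

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} = \begin{pmatrix} -(\alpha_1 + \beta_1 + \gamma_1 + \gamma_2) & \alpha_2 - \gamma_2 & \beta_2 - \gamma_2 \\ \alpha_1 - \beta_4 & -(\alpha_2 + \beta_3 + \gamma_3 + \beta_4) & \gamma_4 - \beta_4 \\ \beta_1 - \alpha_4 & \gamma_3 - \alpha_4 & -(\alpha_3 + \beta_2 + \gamma_4 + \alpha_4) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} + \begin{pmatrix} \gamma_2 \\ \beta_4 \\ \alpha_4 \end{pmatrix} \tag{25}$$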

which can be written as

$$\frac{d}{dt}\mathbf{B}_1(t) = \mathbf{Q}_1 \mathbf{B}_1(t) + \mathbf{C}_1. \tag{26}$$

Solving this system of differential equations entails solving for the eigenvalues of Q1. Although it is easy to get the eigenvalues of the 3 × 3 matrix Q1, the matrix equation in (26) is still difficult to solve, since only the final conditions of the base probabilities can be approximated and the initial conditions are unknown. One way to avoid this problem is to express the base probabilities in terms of the match probabilities. The matrix equation involving the match probabilities is easier to solve since the initial conditions for the match probabilities are Pi(0) = Qi(0) = Ri(0) = 0, i = 1, . . . , 4 and S(0) = 1. After the expressions for the match probabilities have been solved, we can solve for the mean rate of base substitution k and hence the estimate of genetic distance K.
Inherent in these models of evolutionary base nucleotide substitution are the following four assumptions:
(1) The two sequences diverged from a common ancestor, that is, Pi(0) = Qi(0) = Ri(0) = 0, i = 1, . . . , 4 and S(0) = 1.
(2) The two sequences are stochastically identical and independent, and within each sequence, a substitution at one site in no way affects a substitution at any other site.
(3) The homologous sites chosen from the two sequences are of the same fixed length during their descent from the common ancestor.
(4) (The fourth assumption reduces the number of parameters in the model by assuming that some of the rates are equal. Since this differs among the three models that we are going to consider, rather than stating it here, it will be stated as each model is being considered.)

The 3ST Model

The first special case that we consider is the three-substitution-type (3ST) model. This model is due to Kimura (1981) and is the most general of the three models we consider in detail in this paper. The two other models we consider later are special cases of this model. The fourth assumption in the 3ST model is that the TS-type substitutions all have rate α, and that the TV-type substitutions have rate β or γ depending on the specific type, as shown in Figure 3. Under the 3ST model, Tables 1 and 2 can be simplified and their simplified forms are given below as Tables 3 and 4, respectively.


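The system of differential equations in (20) – (23) simplifies to

$$\frac{dU(t)}{dt} = -(\alpha + \beta + \gamma)U(t) + \alpha C(t) + \beta A(t) + \gamma G(t) \tag{27}$$

$$\frac{dC(t)}{dt} = \alpha U(t) - (\alpha + \beta + \gamma)C(t) + \gamma A(t) + \beta G(t) \tag{28}$$

$$\frac{dA(t)}{dt} = \beta U(t) + \gamma C(t) - (\alpha + \beta + \gamma)A(t) + \alpha G(t) \tag{29}$$

$$\frac{dG(t)}{dt} = \gamma U(t) + \beta C(t) + \alpha A(t) - (\alpha + \beta + \gamma)G(t) \tag{30}$$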
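and its corresponding matrix form is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + \beta + \gamma) & \alpha & \beta & \gamma \\ \alpha & -(\alpha + \beta + \gamma) & \gamma & \beta \\ \beta & \gamma & -(\alpha + \beta + \gamma) & \alpha \\ \gamma & \beta & \alpha & -(\alpha + \beta + \gamma) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}, \tag{31}$$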

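which again can be written in the form of (25). Considering the fact that the sum of the base probabilities is 1, we can simplify (31) to

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + \beta + 2\gamma) & \alpha - \gamma & \beta - \gamma \\ \alpha - \beta & -(\alpha + 2\beta + \gamma) & \gamma - \beta \\ \beta - \alpha & \gamma - \alpha & -(2\alpha + \beta + \gamma) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} + \begin{pmatrix} \gamma \\ \beta \\ \alpha \end{pmatrix}. \tag{32}$$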

We can also rewrite (32) in the form of (25). The matrix equation in (32) is not
difficult to solve since the eigenvalues are easily obtainable. The problem here is that we
do not know the initial conditions for the base probabilities since we do not know the base
frequencies of the ancestral sequence. As we have mentioned before, a way to avoid this
problem is to consider the match probabilities instead. It is easier to use the match
probabilities since we have the initial conditions for this set of probabilities given by the
first assumption (A1) of our model.
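Using the relationships between the base probabilities and the match probabilities given in (11) – (14) it can be shown that

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + \beta + \gamma) & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\beta \\ 2\gamma \end{pmatrix} \tag{33}$$

which in matrix form is

$$\frac{d}{dt}\mathbf{T}(t) = \mathbf{Q}_2 \mathbf{T}(t) + \mathbf{C}_2. \tag{34}$$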


We now derive the equation for dP(t)/dt in (33). The equations for Q(t) and R(t) can be obtained in very much the same manner.
Recall that in (11) – (14) we have

$$P(t) = \text{probability of a TS-type difference at a homologous site} \tag{35}$$

$$\phantom{P(t)} = 2C(t)U(t) + 2A(t)G(t). \tag{36}$$

Using the product rule,

$$\frac{dP(t)}{dt} = 2\left[\frac{dU(t)}{dt}C(t) + \frac{dC(t)}{dt}U(t)\right] + 2\left[\frac{dG(t)}{dt}A(t) + \frac{dA(t)}{dt}G(t)\right]. \tag{37}$$

If we substitute the expressions for the derivatives of the base probabilities given in (27) – (30), we have

$$\frac{dP(t)}{dt} = 2\Bigl\{-2\bigl(C(t)U(t) + A(t)G(t)\bigr)(\alpha + \beta + \gamma) + 2\beta\bigl(A(t)C(t) + G(t)U(t)\bigr) + 2\gamma\bigl(A(t)U(t) + G(t)C(t)\bigr) + \alpha\bigl(A^2(t) + C^2(t) + U^2(t) + G^2(t)\bigr)\Bigr\}. \tag{38}$$

Using the fact that A²(t) + C²(t) + U²(t) + G²(t) = 1 − P(t) − Q(t) − R(t), together with (12) – (14), we can simplify (38) to obtain

$$\frac{dP(t)}{dt} = 2\bigl\{-(2\alpha + \beta + \gamma)P(t) + (\beta - \alpha)R(t) + (\gamma - \alpha)Q(t) + \alpha\bigr\} \tag{39}$$

which is what we want.
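The derivation can be checked numerically. The sketch below, with hypothetical function names and arbitrary illustrative values, computes dP/dt once by the product rule from the 3ST base equations and once from the simplified form with the factor 2 outside the braces and the constant α inside (a total constant of 2α); the two should agree for any base probabilities that sum to 1.

```python
def base_derivatives(u, c, a, g, alpha, beta, gamma):
    """Derivatives of the base probabilities under the 3ST rates."""
    s = alpha + beta + gamma
    du = -s * u + alpha * c + beta * a + gamma * g
    dc = alpha * u - s * c + gamma * a + beta * g
    da = beta * u + gamma * c - s * a + alpha * g
    dg = gamma * u + beta * c + alpha * a - s * g
    return du, dc, da, dg


def dp_product_rule(u, c, a, g, alpha, beta, gamma):
    """dP/dt obtained by differentiating P = 2UC + 2AG with the product rule."""
    du, dc, da, dg = base_derivatives(u, c, a, g, alpha, beta, gamma)
    return 2 * (du * c + u * dc) + 2 * (da * g + a * dg)


def dp_simplified(u, c, a, g, alpha, beta, gamma):
    """dP/dt written in terms of the match probabilities P, Q and R."""
    p = 2 * u * c + 2 * a * g
    q = 2 * u * a + 2 * c * g
    r = 2 * u * g + 2 * c * a
    return 2 * (-(2 * alpha + beta + gamma) * p + (beta - alpha) * r + (gamma - alpha) * q + alpha)


if __name__ == "__main__":
    # Hypothetical base frequencies (summing to 1) and rates.
    args = (0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1)
    print(dp_product_rule(*args), dp_simplified(*args))
```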
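We now solve the matrix equation in (34). Define the following Laplace transform:

$$\mathcal{L}[\mathbf{T}(t)] = \mathcal{L}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} p(s) \\ q(s) \\ r(s) \end{pmatrix} = \mathcal{T}(s). \tag{40}$$

Applying the Laplace transform to (34), we get

$$s\,\mathcal{T}(s) - \mathbf{T}(0) = \mathbf{Q}_2\,\mathcal{T}(s) + \frac{1}{s}\mathbf{C}_2 \tag{41}$$

which we can rewrite as

$$-\frac{1}{s}\mathbf{C}_2 = (\mathbf{Q}_2 - s\mathbf{I}_3)\,\mathcal{T}(s), \tag{42}$$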


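where we have used the fact that T(0) = 0 and I3 is the 3 × 3 identity matrix. The problem of solving the system of differential equations in (34) is now reduced to solving a system of algebraic equations in the three unknowns p(s), q(s), and r(s). We now solve for these three unknowns and then apply the inverse Laplace transform to get the solutions for P(t), Q(t), and R(t). Using Cramer's rule, we get

$$p(s) = \frac{1}{\Delta}\begin{vmatrix} -2\alpha/s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2\beta/s & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\ -2\gamma/s & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix} \tag{43}$$

$$q(s) = \frac{1}{\Delta}\begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2\alpha/s & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2\beta/s & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2\gamma/s & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix} \tag{44}$$

$$r(s) = \frac{1}{\Delta}\begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2\alpha/s \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2\beta/s \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2\gamma/s \end{vmatrix} \tag{45}$$

where

$$\Delta = \begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix}. \tag{46}$$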

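Upon simplifying and expressing the results in partial fractions we get

$$p(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} + \frac{1/4}{s + 4(\beta + \gamma)} \tag{47}$$

$$q(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} + \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)} \tag{48}$$

$$r(s) = \frac{1}{4s} + \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)}. \tag{49}$$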


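Applying the inverse Laplace transform to (47) – (49), we get the following as solutions to the system in (34):

$$P(t) = \mathcal{L}^{-1}\{p(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} - e^{\lambda_2 t} + e^{\lambda_3 t}\right) \tag{50}$$

$$Q(t) = \mathcal{L}^{-1}\{q(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} + e^{\lambda_2 t} - e^{\lambda_3 t}\right) \tag{51}$$

$$R(t) = \mathcal{L}^{-1}\{r(s)\} = \frac{1}{4}\left(1 + e^{\lambda_1 t} - e^{\lambda_2 t} - e^{\lambda_3 t}\right), \tag{52}$$

where λ1 = −4(α+β), λ2 = −4(α+γ), λ3 = −4(β+γ).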
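A quick numerical check of these solutions is sketched below. The function names and the rate values are hypothetical. The code evaluates the closed forms and compares a central finite difference of P(t) with the first row of the match-probability system, written here as dP/dt = −2(2α+β+γ)P − 2(α−γ)Q − 2(α−β)R + 2α, which is the form obtained in the derivation of dP/dt above.

```python
import math


def match_probs_3st(alpha, beta, gamma, t):
    """Closed-form P(t), Q(t), R(t) under the 3ST model."""
    e1 = math.exp(-4 * (alpha + beta) * t)
    e2 = math.exp(-4 * (alpha + gamma) * t)
    e3 = math.exp(-4 * (beta + gamma) * t)
    p = 0.25 * (1 - e1 - e2 + e3)
    q = 0.25 * (1 - e1 + e2 - e3)
    r = 0.25 * (1 + e1 - e2 - e3)
    return p, q, r


def dp_rhs(alpha, beta, gamma, p, q, r):
    """Right-hand side of the differential equation for P(t)."""
    return (-2 * (2 * alpha + beta + gamma) * p
            - 2 * (alpha - gamma) * q
            - 2 * (alpha - beta) * r
            + 2 * alpha)


if __name__ == "__main__":
    alpha, beta, gamma, t, h = 0.3, 0.2, 0.1, 0.7, 1e-6  # hypothetical values
    p, q, r = match_probs_3st(alpha, beta, gamma, t)
    p_plus = match_probs_3st(alpha, beta, gamma, t + h)[0]
    p_minus = match_probs_3st(alpha, beta, gamma, t - h)[0]
    print((p_plus - p_minus) / (2 * h), dp_rhs(alpha, beta, gamma, p, q, r))
```

At t = 0 the three probabilities are zero, and for large t each tends to 1/4, as expected when the two sequences have become effectively unrelated.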
Under the 3ST model, the equation for k in (15) can be expressed as

$$k = \sum_{i=1}^{4} \frac{\alpha + \beta + \gamma}{T} \int_0^T B_i(t)\,dt = \alpha + \beta + \gamma, \tag{53}$$

where we have used the fact that the sum of the base probabilities is equal to 1. Note that the assumption on some of the rates being equal played a crucial role in being able to factor α+β+γ out of the summation to get a simple expression for k. For K, we obtain

$$K = 2T(\alpha + \beta + \gamma). \tag{54}$$

We can solve (50) – (52) for λ1t, λ2t, and λ3t to get

$$4(\alpha + \beta)t = -\ln\bigl(1 - 2P(t) - 2Q(t)\bigr) \tag{55}$$

$$4(\alpha + \gamma)t = -\ln\bigl(1 - 2P(t) - 2R(t)\bigr) \tag{56}$$

$$4(\beta + \gamma)t = -\ln\bigl(1 - 2Q(t) - 2R(t)\bigr), \tag{57}$$

and hence, for any time t ∈ [0, T],

$$8(\alpha + \beta + \gamma)t = -\ln\bigl\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\bigr\} \tag{58}$$

$$K = 2kt \tag{59}$$

$$\phantom{K} = -\frac{1}{4}\ln\bigl\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\bigr\}. \tag{60}$$
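The estimator can be sketched as a small function. The name `k3st_distance` is hypothetical, and in practice P, Q and R would be replaced by the observed proportions of TS-, TV1- and TV2-type mismatches. Each bracketed factor must be positive for the logarithm to exist, so the sketch checks them one by one.

```python
import math


def k3st_distance(p, q, r):
    """3ST distance K = -(1/4) ln{[1-2P-2Q][1-2P-2R][1-2Q-2R]}."""
    factors = (1 - 2 * p - 2 * q, 1 - 2 * p - 2 * r, 1 - 2 * q - 2 * r)
    if any(f <= 0 for f in factors):
        raise ValueError("mismatch proportions too large for the 3ST estimator")
    return -0.25 * sum(math.log(f) for f in factors)


if __name__ == "__main__":
    # Hypothetical mismatch proportions, for illustration only.
    print(k3st_distance(0.04, 0.02, 0.01))
```

Feeding the function the model probabilities P(t), Q(t), R(t) for known rates returns 2(α+β+γ)t, which is a convenient round-trip check.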

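The variance for this estimate of K is also given in the paper of Kimura (1981). We have

$$\sigma_K^2 = \frac{1}{n}\left[a^2 P(t) + b^2 Q(t) + c^2 R(t) - \bigl(aP(t) + bQ(t) + cR(t)\bigr)^2\right] \tag{61}$$

where n is the number of sites compared and

$$a = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2P(t) - 2R(t)}\right] \tag{62}$$

$$b = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2Q(t) - 2R(t)}\right] \tag{63}$$

$$c = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2R(t)} + \frac{1}{1 - 2Q(t) - 2R(t)}\right]. \tag{64}$$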
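A minimal sketch of the variance computation follows. The function name is hypothetical, n is taken to be the number of sites compared, and the coefficients a, b and c are computed as the partial derivatives of the 3ST estimate of K with respect to P, Q and R.

```python
import math


def k3st_variance(p, q, r, n):
    """Large-sample variance of the 3ST distance estimate for n compared sites."""
    d1 = 1 - 2 * p - 2 * q
    d2 = 1 - 2 * p - 2 * r
    d3 = 1 - 2 * q - 2 * r
    a = 0.5 * (1 / d1 + 1 / d2)   # dK/dP
    b = 0.5 * (1 / d1 + 1 / d3)   # dK/dQ
    c = 0.5 * (1 / d2 + 1 / d3)   # dK/dR
    return (a * a * p + b * b * q + c * c * r - (a * p + b * q + c * r) ** 2) / n


if __name__ == "__main__":
    # Hypothetical mismatch proportions and sequence length.
    var = k3st_variance(0.04, 0.02, 0.01, 300)
    print(var, math.sqrt(var))
```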

The 2ST Model

We now proceed to a special case of the 3ST model, which is again due to Kimura (1980) and was published a year before the 3ST model. We call it the two-substitution-type (2ST) model for convenience; in the original paper the model is actually nameless. Since the 2ST model is a special case of the 3ST model, we just give the results and do not go into the details. The fourth assumption here is that the transition rate is α and the transversion rate is β. Under this assumption the diagram in Figure 3 simplifies further to the diagram in Figure 4.
The tables for the base substitution and the match probabilities are given as Tables 5 and 6 below. The probability of a TS-type mismatch is given by P(t) and the probability of a TV-type mismatch is given by QR(t) = Q(t) + R(t). That is, we have lumped together the TV1-type and TV2-type mismatches.
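The matrix equation in (24) under the 2ST model is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + 2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha + 2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha + 2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha + 2\beta) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{65}$$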

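and the corresponding matrix equation involving the match probabilities, obtained by setting γ = β in (33), is

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + 2\beta) & -2(\alpha - \beta) & -2(\alpha - \beta) \\ 0 & -2(\alpha + 3\beta) & -2(\beta - \alpha) \\ 0 & -2(\beta - \alpha) & -2(\alpha + 3\beta) \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\beta \\ 2\beta \end{pmatrix}. \tag{66}$$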


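If we now lump Q(t) and R(t) together as QR(t), adding the equations for Q(t) and R(t), we have the matrix equation in (67), which only involves a 2 × 2 matrix instead of the previous 3 × 3 matrix.

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + 2\beta) & -2(\alpha - \beta) \\ 0 & -8\beta \end{pmatrix} \begin{pmatrix} P(t) \\ QR(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 4\beta \end{pmatrix} \tag{67}$$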

To solve (67), we use the initial conditions P(0) = QR(0) = 0. As solutions we have

$$P(t) = \frac{1}{4} - \frac{1}{2}e^{\lambda_1 t} + \frac{1}{4}e^{\lambda_2 t} \tag{68}$$

$$QR(t) = \frac{1}{2} - \frac{1}{2}e^{\lambda_2 t} \tag{69}$$

where λ1 = −4(α+β) and λ2 = −8β.
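The 2ST solutions can be checked in the same spirit. In the sketch below (hypothetical names and rate values), the right-hand sides are the ones obtained by setting γ = β in the 3ST system and adding the Q and R rows: dP/dt = −4(α+β)P − 2(α−β)QR + 2α and dQR/dt = −8β·QR + 4β. Central finite differences of the closed forms are compared against them.

```python
import math


def match_probs_2st(alpha, beta, t):
    """Closed-form P(t) and QR(t) under the 2ST model."""
    e1 = math.exp(-4 * (alpha + beta) * t)
    e2 = math.exp(-8 * beta * t)
    p = 0.25 - 0.5 * e1 + 0.25 * e2
    qr = 0.5 - 0.5 * e2
    return p, qr


def rhs_2st(alpha, beta, p, qr):
    """Right-hand sides of the lumped 2ST system for (P, QR)."""
    dp = -4 * (alpha + beta) * p - 2 * (alpha - beta) * qr + 2 * alpha
    dqr = -8 * beta * qr + 4 * beta
    return dp, dqr


if __name__ == "__main__":
    alpha, beta, t, h = 0.3, 0.1, 0.5, 1e-6  # hypothetical values
    p, qr = match_probs_2st(alpha, beta, t)
    plus = match_probs_2st(alpha, beta, t + h)
    minus = match_probs_2st(alpha, beta, t - h)
    numeric = tuple((x - y) / (2 * h) for x, y in zip(plus, minus))
    print(numeric, rhs_2st(alpha, beta, p, qr))
```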
Under the 2ST model k = α + 2β. We can solve (68) and (69) for αt and βt and therefore obtain our estimate K. We have

$$K = 2kt = 2(\alpha + 2\beta)t \tag{70}$$

$$\phantom{K} = -\frac{1}{4}\ln\bigl\{[1 - 2P(t) - QR(t)]^2\,[1 - 2QR(t)]\bigr\}. \tag{71}$$
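A minimal sketch of the 2ST estimator, with a hypothetical function name: P is the proportion of TS-type mismatches and QR the proportion of TV-type mismatches. The logarithm of the squared factor is written as twice the logarithm of the factor.

```python
import math


def k2st_distance(p, qr):
    """2ST distance K = -(1/4) ln{[1 - 2P - QR]^2 [1 - 2QR]}."""
    f1 = 1 - 2 * p - qr
    f2 = 1 - 2 * qr
    if f1 <= 0 or f2 <= 0:
        raise ValueError("mismatch proportions too large for the 2ST estimator")
    return -0.5 * math.log(f1) - 0.25 * math.log(f2)


if __name__ == "__main__":
    # Hypothetical mismatch proportions, for illustration only.
    print(k2st_distance(0.04, 0.03))
```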

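The variance of this estimate is given by

$$\sigma_K^2 = \frac{1}{n}\left[a^2 P(t) + b^2 QR(t) - \bigl(aP(t) + bQR(t)\bigr)^2\right] \tag{72}$$

where

$$a = \frac{1}{1 - 2P(t) - QR(t)} \tag{73}$$

$$b = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - QR(t)} + \frac{1}{1 - 2QR(t)}\right]. \tag{74}$$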
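The sketch below computes this variance with a and b taken as the partial derivatives of the 2ST estimate K with respect to P and QR, which gives the denominators 1 − 2P − QR and 1 − 2QR. The function names are hypothetical and n is taken to be the number of sites compared.

```python
import math


def k2st_coefficients(p, qr):
    """Return (a, b) = (dK/dP, dK/dQR) for the 2ST distance estimate."""
    f1 = 1 - 2 * p - qr
    f2 = 1 - 2 * qr
    a = 1 / f1
    b = 0.5 * (1 / f1 + 1 / f2)
    return a, b


def k2st_variance(p, qr, n):
    """Large-sample variance of the 2ST distance estimate for n compared sites."""
    a, b = k2st_coefficients(p, qr)
    return (a * a * p + b * b * qr - (a * p + b * qr) ** 2) / n


if __name__ == "__main__":
    # Hypothetical mismatch proportions and sequence length.
    var = k2st_variance(0.04, 0.03, 300)
    print(var, math.sqrt(var))
```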

The Jukes-Cantor Model

The simplest possible model is due to Jukes and Cantor (1969). The model was primarily formulated to describe protein evolution by looking at the rate of amino acid substitution. It turns out that this model can also be used to describe base substitution. The fourth assumption here is that all the rates of substitution are equal, i.e., α = αi = βi = γi, i = 1, . . ., 4. Figure 2 then becomes Figure 5 below. Under the Jukes-Cantor model, Tables 1 and 2 can be simplified to Tables 7 and 8, respectively.


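The matrix equation in (24) under the Jukes-Cantor model is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{75}$$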

and the matrix equation involving the match probabilities is

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -8\alpha & 0 & 0 \\ 0 & -8\alpha & 0 \\ 0 & 0 & -8\alpha \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\alpha \\ 2\alpha \end{pmatrix}. \tag{76}$$

If we define PQR(t) = P(t) + Q(t) + R(t) we have the differential equation

$$\frac{d}{dt}PQR(t) = -8\alpha\,PQR(t) + 6\alpha \tag{77}$$

which has as a solution

$$PQR(t) = \frac{3}{4}\left(1 - e^{-8\alpha t}\right). \tag{78}$$
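As a small check, the sketch below (hypothetical function name and rate value) evaluates this solution and compares a central finite difference with the right-hand side −8α·PQR + 6α of the differential equation.

```python
import math


def jc_mismatch_probability(alpha, t):
    """PQR(t) = (3/4)(1 - exp(-8 alpha t)) under the Jukes-Cantor model."""
    return 0.75 * (1 - math.exp(-8 * alpha * t))


if __name__ == "__main__":
    alpha, t, h = 0.05, 2.0, 1e-6  # hypothetical values
    pqr = jc_mismatch_probability(alpha, t)
    numeric = (jc_mismatch_probability(alpha, t + h) - jc_mismatch_probability(alpha, t - h)) / (2 * h)
    print(numeric, -8 * alpha * pqr + 6 * alpha)
```

The probability starts at 0 and saturates at 3/4, the mismatch probability of two unrelated sequences with equal base frequencies.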

Under the Jukes-Cantor model, k = 3α and the estimate K is

$$K = 2kt = 6\alpha t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}PQR(t)\right) \tag{79}$$

which can be obtained by solving for αt in (78).
The variance for K under the Jukes-Cantor model was derived by Kimura and Ohta (1972) and is given by

$$\sigma_{JC}^2 = \frac{\bigl(1 - PQR(t)\bigr)\,PQR(t)}{n\bigl(1 - 4PQR(t)/3\bigr)^2}. \tag{80}$$
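The Jukes-Cantor estimate and its variance can be sketched as follows. The function names and the example counts are hypothetical. The variance is written in its delta-method form, the binomial variance p(1 − p)/n of the observed mismatch proportion multiplied by (dK/dp)² = 1/(1 − 4p/3)².

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance K = -(3/4) ln(1 - 4p/3) for mismatch proportion p."""
    x = 1 - 4 * p / 3
    if x <= 0:
        raise ValueError("mismatch proportion must be below 3/4")
    return -0.75 * math.log(x)


def jc_variance(p, n):
    """Large-sample variance of the Jukes-Cantor distance for n compared sites."""
    return p * (1 - p) / (n * (1 - 4 * p / 3) ** 2)


if __name__ == "__main__":
    # Hypothetical data: 20 mismatches among 400 compared sites.
    p = 20 / 400
    print(jc_distance(p), math.sqrt(jc_variance(p, 400)))
```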

We illustrate the three models by comparing the human and mouse protein kinase inhibitor sequences. These two nucleotide sequences were recently sequenced by Olsen and Uhler (1991a, 1991b). The sequences are more than a thousand base pairs long but only 231 of these are part of the coding region. Our analysis is limited to these 231 base pairs. The sequences are shown in Figure 6. Of the 231 bp, only 15 show mismatches. These are summarized in Table 9. Usually, the estimate K is computed separately for each codon position: the models assume that substitutions are independent of each other, but there is evidence that adjacent substitutions are actually not independent. This will not be done here since we have quite a small number of base pairs and the mismatches are quite far apart (except for the ones occurring at positions 200 and 201).
The estimate under each model is shown in Table 10. The estimates differ little from one model to the other, and the standard errors are also similar.
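The way the three estimates are obtained from mismatch counts can be sketched as below. The function name and the counts in the example are hypothetical and are not the Table 9 data. The observed proportions of TS-, TV1- and TV2-type mismatches are simply substituted for P, Q and R in the three distance formulas.

```python
import math


def distances_from_counts(n_ts, n_tv1, n_tv2, n_sites):
    """Return the Jukes-Cantor, 2ST and 3ST distance estimates from mismatch counts."""
    p = n_ts / n_sites    # proportion of TS-type mismatches
    q = n_tv1 / n_sites   # proportion of TV1-type mismatches
    r = n_tv2 / n_sites   # proportion of TV2-type mismatches
    k_jc = -0.75 * math.log(1 - 4 * (p + q + r) / 3)
    k_2st = -0.5 * math.log(1 - 2 * p - (q + r)) - 0.25 * math.log(1 - 2 * (q + r))
    k_3st = -0.25 * math.log((1 - 2 * p - 2 * q) * (1 - 2 * p - 2 * r) * (1 - 2 * q - 2 * r))
    return {"JC": k_jc, "2ST": k_2st, "3ST": k_3st}


if __name__ == "__main__":
    # Hypothetical counts: 12 TS, 5 TV1 and 3 TV2 mismatches among 300 sites.
    for model, k in distances_from_counts(12, 5, 3, 300).items():
        print(model, k)
```

When the three mismatch types are equally frequent the three formulas give the same value, which is one way to see why the estimates stay close when the observed proportions are small.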
Estimates of genetic distance using some other nucleotide sequences are also available. Tavaré (1986) obtained estimates using human and mouse α-fetoprotein and serum albumin nucleotide sequences. The results he obtained for the human-mouse α-fetoprotein nucleotide sequences are reproduced below as Table 11. The data consist of 1824 base pairs and hence it was possible for him to compute the estimates by codon position.
Note that the estimates tend to be largest for the third codon position and smallest for the second codon position. Tavaré showed in his paper that the estimates are not homogeneous if we consider the codon positions as strata. Unfortunately, we cannot do the same thing in our analysis here since we have just 231 bp and 15 mismatches.
All three models of evolutionary base substitution that we have discussed here are far from perfect, and their weaknesses lie in the second and third assumptions made to formulate the models.
The second assumption states that the nucleotide sequences are stochastically identical and independent of each other. It is quite possibly true that nucleotide sequences evolve independently of each other, but there is evidence that they are in fact not stochastically identical. For example, Wu and Li (1985) noticed that the substitution rate in rodents is much higher than that in humans. Even within a sequence, there is evidence that the rates are much higher in some spots (“hot spots”) than in others (Miyata and Yasunaga, 1981; Brown and Clegg, 1983) and that the rates differ between the sense and antisense strands (Wu and Maeda, 1987). There is also evidence that a substitution at one site does affect the rate of substitution at an adjacent site in phage T4 (Koch, 1971). It would be interesting to know if the same holds for higher organisms. This last fact is also one of the reasons why substitution rates are computed by codon position if the data allow.
The third assumption is that the diverging nucleotide sequences are both of a fixed length, and hence it does not take into account mutations resulting from deletions and insertions. This assumption also does not take into account the possibility of concerted evolution, which brings about the presence of multigene families, and the duplication and divergence in multigene families.
There have been efforts to develop models that address these shortcomings while remaining mathematically tractable. Needleman and Wunsch (1970), for example, proposed a model which assigns weights to substitutions, insertions and deletions. Unfortunately, the weights assigned were arbitrary and had no genetic basis.
The main problem that these models of evolutionary base nucleotide substitution
face is that when all of the mechanisms of evolution are included in the model, the model
becomes mathematically intractable with the present computer technology. Considering
the fact that computer technology is still advancing, it is hoped that a model incorporating
most, if not all, of the mechanisms discussed can be formulated in the near future.


References

Brown, A., & Clegg, M. (1983). Analysis of variation in related DNA sequences. In B. Weir (Ed.), Statistical data analysis (pp. 107–132). New York: Marcel-Dekker.

Cavalli-Sforza, L., & Bodmer, W. (1971). The genetics of human populations. San Francisco: W. H. Freeman.

Cavalli-Sforza, L., & Edwards, A. (1967). Phylogenetic analysis: models and estimation procedures. American Journal of Human Genetics, 19, 233–257.

Edwards, A. (1971). The distance between populations on the basis of gene frequencies. Biometrics, 27, 873–881.

Jukes, T., & Cantor, C. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian protein metabolism (pp. 21–132). New York: Academic Press.

Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.

Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences USA, 78, 454–458.

Kimura, M., & Ohta, T. (1972). On the stochastic model for estimation of mutational distance between homologous proteins. Journal of Molecular Evolution, 2, 87–90.

Koch, R. (1971). The influence of neighbouring base pairs upon base-pair substitution mutation rates. Proceedings of the National Academy of Sciences USA, 68, 773–776.

Maxam, A., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences USA, 74, 560–564.

Miura, R. (Ed.). (1986). Lectures on mathematics in the life sciences. Rhode Island: American Mathematical Society.

Miyata, T., & Yasunaga, T. (1981). Rapidly evolving mouse α-globin-related pseudogenes. Proceedings of the National Academy of Sciences USA, 78, 450–453.

Munro, H. N. (Ed.). (1969). Mammalian protein metabolism. New York: Academic Press.

Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Annals of Human Genetics, 41, 225–233.

Olsen, S., & Uhler, M. (1991a). (nucleotide sequence of the human protein kinase inhibitor). Molecular Endocrinology. (manuscript submitted)

Olsen, S., & Uhler, M. (1991b). (nucleotide sequence of the mouse protein kinase inhibitor). Journal of Biological Chemistry. (in press)

Sanger, F., Nicklen, S., & Coulson, A. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA, 74, 5463–5467.

Takahata, N., & Kimura, M. (1981). A model of evolutionary base substitutions and its application with special reference to rapid change in pseudogenes. Genetics, 98, 641–657.

Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In R. Miura (Ed.), Lectures on mathematics in the life sciences (pp. 57–86). Rhode Island: American Mathematical Society.

Weir, B. (Ed.). (1983). Statistical data analysis. New York: Marcel-Dekker.

Weir, B. (1990). Genetic data analysis: methods for discrete population data. Sunderland, Massachusetts: Sinauer Associates.

Wu, C., & Li, W. (1985). Evidence for higher rates of nucleotide substitution in rodents than in man. Proceedings of the National Academy of Sciences USA, 82, 1741–1745.

Wu, C., & Maeda, N. (1987). Inequality in mutation rates of the two strands of DNA. Nature, 327, 169–170.


Table 1

Types and rates of nucleotide substitution.

Type           Transition (TS)    Transversion (TV1)    Transversion (TV2)
Initial base   U   C   A   G      U   A   C   G         U   G   C   A
New base       C   U   G   A      A   U   G   C         G   U   A   C
Rate           α1  α2  α3  α4     β1  β2  β3  β4        γ1  γ2  γ3  γ4


Table 2

Possible nucleotide base pairings at a specific homologous site for t > 0.

Type           Same           TS-type        TV1-type       TV2-type
Sequence 1     U   C   A   G  U   C   A   G  U   A   C   G  U   G   C   A
Sequence 2     U   C   A   G  C   U   G   A  A   U   G   C  G   U   A   C
Probability    S1  S2  S3  S4 P1  P2  P3  P4 Q1  Q2  Q3  Q4 R1  R2  R3  R4


Table 3

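Types and rates of nucleotide substitution under the 3ST model.

Type           Transition (TS)    Transversion (TV1)    Transversion (TV2)
Initial base   U   C   A   G      U   A   C   G         U   G   C   A
New base       C   U   G   A      A   U   G   C         G   U   A   C
Rate           α   α   α   α      β   β   β   β         γ   γ   γ   γ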


Table 4

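Possible nucleotide base pairings at a specific homologous site for t > 0 under the 3ST model.

Type           Same           TS-type        TV1-type       TV2-type
Sequence 1     U   C   A   G  U   C   A   G  U   A   C   G  U   G   C   A
Sequence 2     U   C   A   G  C   U   G   A  A   U   G   C  G   U   A   C
Probability    S              P              Q              R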


Table 5

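Types and rates of nucleotide substitution under the 2ST model.

Type           Transition (TS)    Transversion (TV1)    Transversion (TV2)
Initial base   U   C   A   G      U   A   C   G         U   G   C   A
New base       C   U   G   A      A   U   G   C         G   U   A   C
Rate           α   α   α   α      β   β   β   β         β   β   β   β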


Table 6

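Possible nucleotide base pairings at a specific homologous site for t > 0 under the 2ST model.

Type           Same           TS-type        TV-type (TV1 and TV2)
Sequence 1     U   C   A   G  U   C   A   G  U   A   C   G   U   G   C   A
Sequence 2     U   C   A   G  C   U   G   A  A   U   G   C   G   U   A   C
Probability    S              P              QR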


Table 7

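Types and rates of nucleotide substitution under the Jukes-Cantor model.

Type           Transition (TS)    Transversion (TV1)    Transversion (TV2)
Initial base   U   C   A   G      U   A   C   G         U   G   C   A
New base       C   U   G   A      A   U   G   C         G   U   A   C
Rate           α   α   α   α      α   α   α   α         α   α   α   α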


Table 8

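Possible nucleotide base pairings at a specific homologous site for t > 0 under the Jukes-Cantor model.

Type           Same           Mismatch (TS-, TV1- and TV2-type)
Sequence 1     U   C   A   G  U   C   A   G   U   A   C   G   U   G   C   A
Sequence 2     U   C   A   G  C   U   G   A   A   U   G   C   G   U   A   C
Probability    S              PQR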


Table 9

Nucleotide mismatches observed after time T since divergence between human and mouse protein kinase inhibitor (pki).

Type               Transition (TS)    Transversion (TV1)    Transversion (TV2)
Human pki          U   C   A   G      U   A   C   G         U   G   C   A
Mouse pki          C   U   G   A      A   U   G   C         G   U   A   C
Numbers observed   5   0   3   2      0   1   1   6         0   1   1   2


Table 10

Estimates of the genetic distance K under the different models being considered.

Model          K           Standard error
Jukes-Cantor   0.0682288   0.0178312
2ST            0.0686475   0.0180611
3ST            0.0686535   0.0180644


Table 11

Estimates of the genetic distance Ki, where i = 1, 2, or 3 is the codon position, under the different models considered in Tavaré (1986). The sequence data are those of human and mouse α-fetoprotein.

Model          K1               K2               K3
Jukes-Cantor   0.1752 (.0186)   0.1387 (.0162)   .6566 (.0483)
3ST            0.1760 (.0188)   0.1389 (.0163)   .7230 (.0642)

(The parenthesized quantities are standard errors.)


Figure Captions

Figure 1. Divergence of sequences S1 and S2 from some common ancestor.

Figure 2. Types and rates of nucleotide substitutions.

Figure 3. Types and rates of nucleotide substitutions: 3ST Model.

Figure 4. Types and rates of nucleotide substitutions: 2ST Model.

Figure 5. Types and rates of nucleotide substitutions: Jukes-Cantor Model.

Figure 6. The nucleotide sequences of the coding region of the mouse protein kinase

inhibitor (Mpki.M) and the human protein kinase inhibitor (Hpki.2) are shown above.

The 15 mismatches are indicated with bars (Olsen and Uhler, 1991a, 1991b).

[Figure 1: the ancestral sequence at the top, with two branches, each of length T, leading down to S1 and S2.]
