Proof of Kraft-McMillan theorem∗
Nguyen Hung Vu†
February 24, 2004
1 Kraft-McMillan theorem

Let C be a code with N codewords with lengths l_1, l_2, ..., l_N. If C is uniquely decodable, then

    K(C) = Σ_{i=1}^{N} 2^{−l_i} ≤ 1.    (1)

2 Proof

The proof works by looking at the nth power of K(C). If K(C) is greater than one, then K(C)^n grows exponentially with n. If it does not grow exponentially with n, then this is proof that Σ_{i=1}^{N} 2^{−l_i} ≤ 1.

Let n be an arbitrary integer. Then

    [Σ_{i=1}^{N} 2^{−l_i}]^n = (Σ_{i_1=1}^{N} 2^{−l_{i_1}}) (Σ_{i_2=1}^{N} 2^{−l_{i_2}}) ··· (Σ_{i_n=1}^{N} 2^{−l_{i_n}}),

and therefore

    [Σ_{i=1}^{N} 2^{−l_i}]^n = Σ_{i_1=1}^{N} Σ_{i_2=1}^{N} ··· Σ_{i_n=1}^{N} 2^{−(l_{i_1} + l_{i_2} + ··· + l_{i_n})}.    (2)

The exponent l_{i_1} + l_{i_2} + ··· + l_{i_n} is simply the length of n codewords from the code C. The smallest value this exponent can take is n, which would be the case if all codewords were 1 bit long. If

    l = max{l_1, l_2, ..., l_N},

then the largest value that the exponent can take is nl. Therefore, we can write this summation as

    K(C)^n = Σ_{k=n}^{nl} A_k 2^{−k},

where A_k is the number of combinations of n codewords that have a combined length of k. Let's take a look at the size of this coefficient. The number of possible distinct binary sequences of length k is 2^k. If this code is uniquely decodable, then each sequence can represent one and only one sequence of codewords. Therefore, the number of possible combinations of codewords whose combined length is k cannot be greater than 2^k. In other words,

    A_k ≤ 2^k.

This means that

    K(C)^n = Σ_{k=n}^{nl} A_k 2^{−k} ≤ Σ_{k=n}^{nl} 2^k 2^{−k} = nl − n + 1.    (3)

But if K(C) is greater than one, K(C)^n will grow exponentially with n, while n(l − 1) + 1 can only grow linearly. So if K(C) is greater than one, we can always find an n large enough that inequality (3) is violated. Therefore, for a uniquely decodable code C, K(C) is less than or equal to one.

This part of the Kraft-McMillan inequality provides a necessary condition for uniquely decodable codes. That is, if a code is uniquely decodable, the codeword lengths have to satisfy the inequality.

3 Construction of prefix code

Given a set of integers l_1, l_2, ..., l_N that satisfy the inequality

    Σ_{i=1}^{N} 2^{−l_i} ≤ 1,    (4)

we can always find a prefix code with codeword lengths l_1, l_2, ..., l_N.

4 Proof of prefix code construction theorem

We will prove this assertion by developing a procedure for constructing a prefix code with codeword lengths l_1, l_2, ..., l_N that satisfy the given inequality. Without loss of generality, we can assume that

    l_1 ≤ l_2 ≤ ··· ≤ l_N.    (5)

Define a sequence of numbers w_1, w_2, ..., w_N as follows:

    w_1 = 0
    w_j = Σ_{i=1}^{j−1} 2^{l_j − l_i},   j > 1.

The binary representation of w_j for j > 1 would take up ⌈log_2 w_j⌉ bits. We will use this binary representation to construct a prefix code. We first note that the number of bits in the binary representation of w_j is less than or equal to l_j. This is obviously true for w_1. For j > 1,

    ⌈log_2 w_j⌉ = ⌈log_2 [Σ_{i=1}^{j−1} 2^{l_j − l_i}]⌉
                = ⌈log_2 [2^{l_j} Σ_{i=1}^{j−1} 2^{−l_i}]⌉
                = l_j + ⌈log_2 [Σ_{i=1}^{j−1} 2^{−l_i}]⌉
                ≤ l_j.

∗ Introduction to Data Compression, Khalid Sayood
† email: vuhung@fedu.uec.ac.jp
The last inequality results from the hypothesis of the theorem that Σ_{i=1}^{N} 2^{−l_i} ≤ 1, which implies that Σ_{i=1}^{j−1} 2^{−l_i} ≤ 1. As the logarithm of a number less than one is negative, l_j + ⌈log_2 Σ_{i=1}^{j−1} 2^{−l_i}⌉ has to be less than or equal to l_j.

Using the binary representation of w_j, we can devise a binary code in the following manner: If ⌈log_2 w_j⌉ = l_j, then the jth codeword c_j is the binary representation of w_j. If ⌈log_2 w_j⌉ < l_j, then c_j is the binary representation of w_j with l_j − ⌈log_2 w_j⌉ zeros appended to the right. This is certainly a code, but is it a prefix code? If we can show that the code C = {c_1, c_2, ..., c_N} is a prefix code, then we will have proved the theorem by construction.

Suppose that our claim is not true. Then for some j < k, c_j is a prefix of c_k. This means that the l_j most significant bits of w_k form the binary representation of w_j. Therefore, if we right-shift the binary representation of w_k by l_k − l_j bits, we should get the binary representation of w_j. We can write this as

    w_j = ⌊w_k / 2^{l_k − l_j}⌋.

However,

    w_k = Σ_{i=1}^{k−1} 2^{l_k − l_i}.

Therefore,

    w_k / 2^{l_k − l_j} = Σ_{i=1}^{k−1} 2^{l_j − l_i}
                        = w_j + Σ_{i=j}^{k−1} 2^{l_j − l_i}
                        = w_j + 2^0 + Σ_{i=j+1}^{k−1} 2^{l_j − l_i}    (6)
                        ≥ w_j + 1.

That is, the smallest value for ⌊w_k / 2^{l_k − l_j}⌋ is w_j + 1. This contradicts the requirement for c_j being a prefix of c_k. Therefore, c_j cannot be a prefix of c_k. As j and k were arbitrary, this means that no codeword is a prefix of another codeword, and the code C is a prefix code.

5 Projects and Problems

1. Suppose X is a random variable that takes on values from an M-letter alphabet. Show that 0 ≤ H(X) ≤ log_2 M.

   Hint: write the alphabet as M = {α_1 × m_1, α_2 × m_2, ..., α_n × m_n} with α_i ≠ α_j for all i ≠ j and Σ_{i=1}^{n} m_i = M, so that H(X) = −Σ_{i=1}^{n} P(α_i) log P(α_i) = −Σ_{i=1}^{n} (m_i/M) log (m_i/M) = −(1/M) Σ m_i (log m_i − log M) = −(1/M) Σ m_i log m_i + log M.

2. Show that for the case where the elements of an observed sequence are iid¹, the entropy is equal to the first-order entropy.

   Hint: the entropy is defined by H(S) = lim_{n→∞} (1/n) G_n, where

       G_n = −Σ_{i_1=1}^{m} Σ_{i_2=1}^{m} ··· Σ_{i_n=1}^{m} P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n) log P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n),

   and the first-order entropy is defined by H_1(S) = −Σ P(X_1) log P(X_1). In the iid case, G_n = −n Σ_{i_1=1}^{m} P(X_1 = i_1) log P(X_1 = i_1). Prove it!

3. Given an alphabet A = {a_1, a_2, a_3, a_4}, find the first-order entropy in the following cases:

   (a) P(a_1) = P(a_2) = P(a_3) = P(a_4) = 1/4.
   (b) P(a_1) = 1/2, P(a_2) = 1/4, P(a_3) = P(a_4) = 1/8.
   (c) P(a_1) = 0.505, P(a_2) = 1/4, P(a_3) = 1/8, P(a_4) = 0.12.

4. Suppose we have a source with a probability model P = {p_0, p_1, ..., p_m} and entropy H_P. Suppose we have another source with probability model Q = {q_0, q_1, ..., q_m} and entropy H_Q, where

       q_i = p_i,   i = 0, 1, ..., j − 2, j + 1, ..., m,

   and

       q_j = q_{j−1} = (p_{j−1} + p_j)/2.

   How is H_Q related to H_P (greater, equal, or less)? Prove your answer.

   Hint: H_Q − H_P = Σ p_i log p_i − Σ q_i log q_i = p_{j−1} log p_{j−1} + p_j log p_j − 2 · ((p_{j−1} + p_j)/2) log ((p_{j−1} + p_j)/2) = φ(p_{j−1}) + φ(p_j) − 2φ((p_{j−1} + p_j)/2), where φ(x) = x log x and φ″(x) = 1/x > 0 for all x > 0.

5. There are several image and speech files among the accompanying data sets.

   (a) Write a program to compute the first-order entropy of some of the image and speech files.
   (b) Pick one of the image files and compute its second-order entropy. Comment on the difference between the first-order and second-order entropies.
   (c) Compute the entropy of the difference between neighboring pixels for the image you used in part (b). Comment on what you discovered.

6. Conduct an experiment to see how well a model can describe a source.

   (a) Write a program that randomly selects letters from the 26-letter alphabet {a, b, c, ..., z} and forms four-letter words. Form 100 such words and see how many of these words make sense.
   (b) Among the accompanying data sets is a file called 4letter.words, which contains a list of four-letter words. Using this file, obtain a probability model for the alphabet. Now repeat part (a), generating the words using the probability model. To pick letters according to a probability model, construct the cumulative distribution function (cdf)² F_X(x). Using a uniform pseudorandom number generator to generate a value r, where 0 ≤ r < 1, pick the letter x_k if F_X(x_{k−1}) ≤ r < F_X(x_k). Compare your results with those of part (a).
   (c) Repeat (b) using a single-letter context.
   (d) Repeat (b) using a two-letter context.

¹ iid: independent and identically distributed
² cumulative distribution function (cdf): F_X(x) = P(X ≤ x)

7. Determine whether the following codes are uniquely decodable:
   (a) {0, 01, 11, 111}: no (01 11 = 0 111).
   (b) {0, 01, 110, 111}: no (01 110 = 0 111 0).
   (c) {0, 10, 110, 111}: yes. Prove it!
   (d) {1, 10, 110, 111}: no (1 10 = 110).

8. Using a text file, compute the probabilities of each letter p_i.

   (a) Assume that we need a codeword of length log_2 (1/p_i) to encode the letter i. Determine the number of bits needed to encode the file.
   (b) Compute the conditional probability P(i/j) of a letter i given that the previous letter is j. Assume that we need log_2 (1/P(i/j)) bits to represent a letter i that follows a letter j. Determine the number of bits needed to encode the file.

6 Huffman coding

6.1 Definition

A minimal variable-length character encoding based on the frequency of each character. First, each character becomes a trivial tree, with the character as the only node. The character's frequency is the tree's frequency. The two trees with the least frequencies are joined with a new root, which is assigned the sum of their frequencies. This is repeated until all characters are in one tree. One code bit represents each level. Thus more frequent characters are near the root and are encoded with few bits, and rare characters are far from the root and are encoded with many bits.

6.2 Example of Huffman encoding design

Alphabet A = {a_1, a_2, a_3, a_4, a_5} with P(a_2) = 0.4, P(a_1) = P(a_3) = 0.2, and P(a_4) = P(a_5) = 0.1. The entropy of this source is 2.122 bits/symbol. To design the Huffman code, we first sort the letters in descending probability order as shown in the table below. Here c(a_i) denotes the codeword for a_i.

The initial five-letter alphabet
Letter   Probability   Codeword
a_2      0.4           c(a_2)
a_1      0.2           c(a_1)
a_3      0.2           c(a_3)
a_4      0.1           c(a_4)
a_5      0.1           c(a_5)

The two letters at the bottom of the sorted list are a_4 and a_5. We assign their codewords as

    c(a_4) = α_1 * 0
    c(a_5) = α_1 * 1,

where α_1 is a binary string and * denotes concatenation.

Now we define a new alphabet A′ with four letters a_1, a_2, a_3, a_4′, where a_4′ is composed of a_4 and a_5 and has a probability P(a_4′) = P(a_4) + P(a_5) = 0.2. We sort this new alphabet in descending order to obtain the following table.

The reduced four-letter alphabet
Letter   Probability   Codeword
a_2      0.4           c(a_2)
a_1      0.2           c(a_1)
a_3      0.2           c(a_3)
a_4′     0.2           α_1

In this alphabet, a_3 and a_4′ are the two letters at the bottom of the sorted list. We assign their codewords as

    c(a_3) = α_2 * 0
    c(a_4′) = α_2 * 1,

but c(a_4′) = α_1. Therefore,

    α_1 = α_2 * 1,

which means that

    c(a_4) = α_2 * 10
    c(a_5) = α_2 * 11.

At this stage, we again define a new alphabet A″ that consists of three letters a_1, a_2, a_3′, where a_3′ is composed of a_3 and a_4′ and has a probability P(a_3′) = P(a_3) + P(a_4′) = 0.4. We sort this new alphabet in descending order to obtain the following table.

The reduced three-letter alphabet
Letter   Probability   Codeword
a_2      0.4           c(a_2)
a_3′     0.4           α_2
a_1      0.2           c(a_1)

In this case, the two least probable symbols are a_1 and a_3′. Therefore,

    c(a_3′) = α_3 * 0
    c(a_1) = α_3 * 1.

But c(a_3′) = α_2. Therefore,

    α_2 = α_3 * 0,

which means that

    c(a_3) = α_3 * 00
    c(a_4) = α_3 * 010
    c(a_5) = α_3 * 011.

Again we define a new alphabet, this time with only two letters a_3″ and a_2. Here a_3″ is composed of the letters a_3′ and a_1 and has probability P(a_3″) = P(a_3′) + P(a_1) = 0.6. We now have the following table.

The reduced two-letter alphabet
Letter   Probability   Codeword
a_3″     0.6           α_3
a_2      0.4           c(a_2)

As we have only two letters, the codeword assignment is straightforward:

    c(a_3″) = 0
    c(a_2) = 1,

which means that α_3 = 0, which in turn means that

    c(a_1) = 01
    c(a_3) = 000
    c(a_4) = 0010
    c(a_5) = 0011.
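The merge-and-relabel procedure above can be sketched in code. This is a minimal sketch (the helper name `huffman_code` and the dictionary layout are ours, not from the text) that uses a heap instead of explicit re-sorting. Ties between equal probabilities may be broken differently than in the worked example, so individual bit patterns can differ from the tables, but the average codeword length is the same.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} model.

    Repeatedly merges the two least probable (possibly composite)
    letters, prepending 0/1 to every symbol inside each merged group;
    this mirrors the alpha_i relabeling used in the worked example."""
    heap = [(p, [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    code = {sym: "" for sym in probs}
    while len(heap) > 1:
        p0, group0 = heapq.heappop(heap)  # least probable letter
        p1, group1 = heapq.heappop(heap)  # next least probable letter
        for s in group0:
            code[s] = "0" + code[s]
        for s in group1:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (p0 + p1, group0 + group1))
    return code

# The five-letter source of Section 6.2:
probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)  # 2.2 bits/symbol
```

The average length of 2.2 bits/symbol matches the code derived step by step above and lies within 1 bit of the source entropy of 2.122 bits/symbol.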
The final Huffman code is given in the following table.

The final five-letter Huffman code
Letter   Probability   Codeword
a_2      0.4           1
a_1      0.2           01
a_3      0.2           000
a_4      0.1           0010
a_5      0.1           0011

6.3 Huffman code is optimal

The necessary conditions for an optimal variable-length binary code are as follows:

1. Condition 1: Given any two letters a_j and a_k, if P(a_j) ≥ P(a_k), then l_j ≤ l_k, where l_j is the number of bits in the codeword for a_j.

2. Condition 2: The two least probable letters have codewords with the same maximum length l_m.

3. Condition 3: In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node.

4. Condition 4: Suppose we change an intermediate node into a leaf node by combining all the leaves descending from it into a composite word of a reduced alphabet. Then, if the original tree was optimal for the original alphabet, the reduced tree is optimal for the reduced alphabet.

The Huffman code satisfies all the conditions above and is therefore optimal.

6.4 Average codeword length

For a source S with alphabet A = {a_1, a_2, ..., a_K} and probability model {P(a_1), P(a_2), ..., P(a_K)}, the average codeword length is given by

    l̄ = Σ_{i=1}^{K} P(a_i) l_i.

The difference between the entropy of the source H(S) and the average length is

    H(S) − l̄ = −Σ_{i=1}^{K} P(a_i) log_2 P(a_i) − Σ_{i=1}^{K} P(a_i) l_i
             = Σ P(a_i) [log_2 (1/P(a_i)) − l_i]
             = Σ P(a_i) [log_2 (1/P(a_i)) − log_2 2^{l_i}]
             = Σ P(a_i) log_2 (2^{−l_i} / P(a_i))
             ≤ log_2 [Σ 2^{−l_i}],

where the last step follows from Jensen's inequality.

6.5 Length of Huffman code

It has been proven that

    H(S) ≤ l̄ < H(S) + 1,    (7)

where H(S) is the entropy of the source S.

6.6 Extended Huffman code

Consider an alphabet A = {a_1, a_2, ..., a_m}. Each element of the sequence is generated independently of the other elements in the sequence. The entropy of this source is given by

    H(S) = −Σ_{i=1}^{m} P(a_i) log_2 P(a_i).

We know that we can generate a Huffman code for this source with rate R such that

    H(S) ≤ R < H(S) + 1.    (8)

We have used the looser bound here; the same argument can be made with the tighter bound. Notice that we have used "rate R" to denote the number of bits per symbol. This is a standard convention in the data compression literature. However, in the communication literature, the word "rate" often refers to the number of bits per second.

Suppose we now encode the sequence by generating one codeword for every n symbols. As there are m^n combinations of n symbols, we will need m^n codewords in our Huffman code. We could generate this code by viewing the m^n symbols as letters of an extended alphabet

    A^(n) = {a_1 a_1 ··· a_1 (n times), a_1 a_1 ··· a_2, ..., a_1 a_1 ··· a_m, a_1 a_1 ··· a_2 a_1, ..., a_m a_m ··· a_m}

from a source S^(n). Denote the rate for the new source as R^(n). We know that

    H(S^(n)) ≤ R^(n) < H(S^(n)) + 1.    (9)

Also,

    R = (1/n) R^(n),

and therefore

    H(S^(n))/n ≤ R < H(S^(n))/n + 1/n.

It can be proven that

    H(S^(n)) = n H(S),

and therefore

    H(S) ≤ R < H(S) + 1/n.

6.6.1 Example of Huffman Extended Code

A = {a_1, a_2, a_3} with probability model P(a_1) = 0.8, P(a_2) = 0.02, and P(a_3) = 0.18. The entropy for this source is 0.816 bits per symbol. A Huffman code for this source is shown below:

a_1   0
a_2   11
a_3   10

and the extended code is

a_1 a_1   0.64     0
a_1 a_2   0.016    10101
a_1 a_3   0.144    11
a_2 a_1   0.016    101000
a_2 a_2   0.0004   10100101
a_2 a_3   0.0036   1010011
a_3 a_1   0.1440   100
a_3 a_2   0.0036   10100100
a_3 a_3   0.0324   1011
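The benefit of blocking symbols can be checked numerically. A small sketch follows (the variable names are ours; the probabilities and codewords are transcribed from the two tables above):

```python
# Single-letter and pair (extended) Huffman codes from Section 6.6.1.
p = {"a1": 0.8, "a2": 0.02, "a3": 0.18}
single = {"a1": "0", "a2": "11", "a3": "10"}
pair = {
    ("a1", "a1"): "0",        ("a1", "a2"): "10101",
    ("a1", "a3"): "11",       ("a2", "a1"): "101000",
    ("a2", "a2"): "10100101", ("a2", "a3"): "1010011",
    ("a3", "a1"): "100",      ("a3", "a2"): "10100100",
    ("a3", "a3"): "1011",
}
# Rate of the single-letter code, in bits per symbol:
r1 = sum(p[s] * len(c) for s, c in single.items())  # 1.2
# Rate of the pair code: average pair length divided by 2 symbols:
r2 = sum(p[x] * p[y] * len(c) for (x, y), c in pair.items()) / 2
```

With the entropy at 0.816 bits/symbol, the single-letter rate is 1.2 bits/symbol while the pair code achieves about 0.861 bits/symbol, consistent with H(S) ≤ R < H(S) + 1/n for n = 2.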
6.7 Nonbinary Huffman codes

6.8 Adaptive Huffman Coding

7 Arithmetic coding

Use the mapping

    X(a_i) = i,   a_i ∈ A,    (10)

where A = {a_1, a_2, ..., a_m} is the alphabet for a discrete source and X is a random variable. This mapping means that given a probability model P for the source, we also have a probability density function for the random variable,

    P(X = i) = P(a_i),

and the cumulative distribution function can be defined as

    F_X(i) = Σ_{k=1}^{i} P(X = k).
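For a concrete model, F_X can be tabulated directly. A minimal sketch (the helper name `make_cdf` is ours; the probabilities are borrowed from the example in Section 6.6.1):

```python
def make_cdf(probs):
    """Return [F_X(0), F_X(1), ..., F_X(m)] for an ordered list of
    probabilities P(a_1), ..., P(a_m), with F_X(0) = 0 by convention."""
    cdf = [0.0]
    for pk in probs:                 # running sum of P(X = k)
        cdf.append(cdf[-1] + pk)
    return cdf

F = make_cdf([0.8, 0.02, 0.18])      # F_X over {a1, a2, a3}
```

Here F_X(1) = 0.8, F_X(2) = 0.82, and F_X(3) = 1.0.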
Define T̄_X(a_i) as

    T̄_X(a_i) = Σ_{k=1}^{i−1} P(X = k) + (1/2) P(X = i)    (11)
             = F_X(i − 1) + (1/2) P(X = i).    (12)

For each a_i, T̄_X(a_i) will have a unique value. This value can be used as a unique tag for a_i. In general,

    T̄_X^(m)(x_i) = Σ_{y < x_i} P(y) + (1/2) P(x_i).    (13)
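Equations (11) and (12) can be evaluated directly. A minimal sketch (the function name `tag` is ours; the model is again borrowed from the example in Section 6.6.1):

```python
def tag(probs, i):
    """Tag T_X(a_i) = F_X(i - 1) + P(X = i)/2 for an ordered list of
    probabilities P(a_1), ..., P(a_m); the index i is 1-based."""
    f_prev = sum(probs[:i - 1])      # F_X(i - 1)
    return f_prev + probs[i - 1] / 2

probs = [0.8, 0.02, 0.18]
tags = [tag(probs, i) for i in range(1, len(probs) + 1)]
# tags are approximately [0.4, 0.81, 0.91]: one distinct value per letter
```

Each tag falls in the middle of its letter's interval [F_X(i − 1), F_X(i)), which is why the values are distinct and can serve as unique tags.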