CS6715 Cal State Hayward 1
CS 6715 – Module1: Introduction to Data Compression
JAS 9/23//04
CS6715:Module-1: Introduction to Data
Compression
Dr. John A Serri
Data Compression and the Information
Revolution
• The information age has created a vast amount of information in the form of digital bits (groups of 1s and 0s) that represent voice, audio, text, and visual information and can be stored and sent anywhere.
• Compression, one of the enabling technologies of the information revolution, is the ability to represent digital information in efficient ways.
– It is not practical to transport data over networks without compression
– Cellular phones would not work well without methods that model voice, remove the insignificant components, and compress it
– Storage requirements are drastically reduced by compression
– Imagine sending pictures without compression
• Data compression is essentially the process of representing the information contained in a digital data set using a reduced number of bits. It is essential to storing and transporting information in today’s IT / computer science world.
Data Compression – Lossless vs
Lossy
• All compression ideas can be categorized into one of two
groups
– Lossless
– Lossy
• A Lossless compression scheme is one that represents
data in a more efficient way than the original representation
and is able to generate a perfect replica of the original data
– No information is lost
– Perfect replication is returned when “decompression” is applied
• A Lossy compression scheme is one that throws away
some information to make a more efficient representation
– Even though some information is discarded it still makes a useful
representation
– A lossy compression method can achieve greater compression than a lossless one, but it also introduces some distortion (some loss of information)
Compression / Decompression
Process
[Figure: Item to be compressed → Compressor process (“encoding”) → Compressed item → Transmit or store → Compressed item → Reconstruction process (“decoding”) → Reconstructed item]
A very simple example of Compression
• Consider a picture made up of 64 pixels
• The first pixel is black, the next 62 pixels are white, and the last pixel is black
• Uncompressed representation
– If 1 bit represents 1 pixel (1 = black, 0 = white), send a string of 64 bits {10………01}
– 64 bits need to be sent
• Could we represent this 64-bit object more efficiently?
• We recognize the sequence as 1 black + 62 white + 1 black
• Let's define an 8-bit code: 1 bit for the pixel color, followed by 7 bits for the number of consecutive pixels of that color
• 1 black = 10000001
• 62 white = 00111110
• 1 black = 10000001
• We can send this representation in 24 bits!
• This compression algorithm works well if there are long runs of the same color, but it is not efficient if the color changes after only short runs
• For this scheme to work, the sender and receiver must both understand the “encoding procedure”
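The 64-pixel example can be sketched in a few lines of Python. This is only an illustration of the idea on the slide; the function names `rle_encode` and `rle_decode` are made up for this sketch, and the 8-bit layout (1 color bit followed by a 7-bit run length) is the one implied by the three code words shown above.

```python
def rle_encode(pixels):
    """Encode a list of 0/1 pixels as 8-bit strings: 1 color bit + 7-bit run length."""
    codes = []
    i = 0
    while i < len(pixels):
        color = pixels[i]
        run = 1
        # Extend the run while the color repeats (7 bits hold at most 127).
        while i + run < len(pixels) and pixels[i + run] == color and run < 127:
            run += 1
        codes.append(f"{color}{run:07b}")
        i += run
    return codes


def rle_decode(codes):
    """Rebuild the pixel list from the 8-bit code words."""
    pixels = []
    for code in codes:
        color = int(code[0])
        run = int(code[1:], 2)
        pixels.extend([color] * run)
    return pixels


picture = [1] + [0] * 62 + [1]          # black, 62 white, black
encoded = rle_encode(picture)
print(encoded)                           # ['10000001', '00111110', '10000001']
print(sum(len(c) for c in encoded))      # 24

# Worst case: alternating colors need 8 bits per pixel instead of 1.
alternating = [1, 0] * 4
print(sum(len(c) for c in rle_encode(alternating)))  # 64
```

The last two lines show the weakness mentioned on the slide: when every run has length 1, the "compressed" form is eight times larger than the original.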
Early Practical Example of
Compression- Morse Code
• Some basic ideas are easy to understand and commonly used
• Example: letters sent by telegraph are encoded as dots (short pulses) and dashes (long pulses), called Morse code
• Morse noticed that certain letters are used more than others
• In the 1830s Morse measured what we now refer to as the statistical structure in the data of messages sent by people and came up with a variable-length code
– He assigned shorter code sequences to common letters and longer ones to uncommon letters
– Common letters
• E (*)
• A ( * _ )
– Uncommon letters
• Q ( _ _ * _ )
• In this example the measurable statistical structure of the data is used to develop an efficient representation for lossless compression.
Morse Code
• How much compression is obtained in Morse code?
– Suppose each * or _ could be represented by one bit
• (dot) * = 1 and (dash) _ = 0
– If Morse code used a fixed-length code instead of a variable-length code, 6 bits would be needed to represent an alphabet of up to 64 symbols (Morse code as tabulated on the next slide uses 40 symbols)
– Thus a fixed-length code for a 40-symbol alphabet would require 6 bits/symbol
– Using the Morse representation, the average code length across the 40 symbols is 3.9 bits/symbol (an unweighted average over the table)
– Therefore the resulting compression is 6 / 3.9 ≈ 1.54, if this statistical model holds true (note that any specific message could yield different results)
Morse Code Table
Symbols and codes (. = dot, _ = dash)                      Bits/row
A ._      B _...    C _._.    D _..     E .                14
F .._.    G __.     H ....    I ..      J .___             17
K _._     L ._..    M __      N _.      O ___              14
P .__.    Q __._    R ._.     S ...     T _                15
U .._     V ..._    W .__     X _.._    Y _.__             18
Z __..                                                     4
1 .____   2 ..___   3 ...__   4 ...._   5 .....            25
6 _....   7 __...   8 ___..   9 ____.   0 _____            25
COLON ___...                                               6
PERIOD ._._._                                              6
COMMA __..__                                               6
QUESTION ..__..                                            6
Total bits                                                 156
Bits/symbol = 156/40 = 3.9
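The totals in the table can be checked with a short Python sketch. It assumes, as the previous slide does, that each dot or dash costs one bit and that all 40 symbols are weighted equally; the dictionary name `MORSE` is just a label for this sketch.

```python
# Morse code table from the slide, written with "." for dot and "-" for dash.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
    "1": ".----", "2": "..---", "3": "...--", "4": "....-", "5": ".....",
    "6": "-....", "7": "--...", "8": "---..", "9": "----.", "0": "-----",
    ":": "---...", ".": ".-.-.-", ",": "--..--", "?": "..--..",
}

total_bits = sum(len(code) for code in MORSE.values())
average = total_bits / len(MORSE)

# Smallest fixed code length that can number all symbols (6 bits for 40 symbols).
fixed_bits = (len(MORSE) - 1).bit_length()

print(total_bits, len(MORSE), average)   # 156 40 3.9
print(round(fixed_bits / average, 2))    # 1.54
```

A frequency-weighted average (counting how often each letter actually occurs in a message) would give a different figure, which is why the slide notes that specific examples can yield different results.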
File Compression
• File compression is often used to reduce the size of e-mail attachments
• How do WinZip and other file compression programs work?
• Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data
• We will study this method in the course.
• These programs search for redundant strings and represent them as a number. Rather than sending a long string of characters, a short number is substituted and sent
Example
• "Ask not what your country can do for you -- ask what you
can do for your country."
• Notice the redundancy
– "ask" appears two times ask= 1
– "what" appears two times what = 2
– "your" appears two times your = 3
– "country" appears two times country = 4
– "can" appears two times can = 5
– "do" appears two times do = 6
– "for" appears two times for = 7
– "you" appears two times you = 8
• We can now represent this sentence as
• 1_not_2_3_4_5_6_7_8 – 1_2_8_5_6_7_3_4.
• If the receiver has the table to translate the numbers back into words, it can decompress the file
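A minimal Python sketch of this word-substitution idea follows. It is not how LZ-based programs actually work (they find repeated character strings adaptively); it only mirrors the slide. The helper names `build_table`, `compress`, and `decompress` are invented for this sketch, and the sketch folds everything to lower case, so the capital "A" is not restored.

```python
import re

sentence = ("Ask not what your country can do for you -- "
            "ask what you can do for your country.")


def build_table(text):
    """Number every word that occurs more than once, in order of first appearance."""
    words = re.findall(r"[a-z]+", text.lower())
    table = {}
    for word in words:
        if words.count(word) > 1 and word not in table:
            table[word] = len(table) + 1
    return table


def compress(text, table):
    """Replace each repeated word with its table number; keep everything else."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: str(table.get(m.group().lower(), m.group())),
        text,
    )


def decompress(text, table):
    """Translate the numbers back into words using the shared table."""
    inverse = {str(number): word for word, number in table.items()}
    return re.sub(r"\d+", lambda m: inverse[m.group()], text)


table = build_table(sentence)
packed = compress(sentence, table)
print(table)    # {'ask': 1, 'what': 2, 'your': 3, 'country': 4, 'can': 5, 'do': 6, 'for': 7, 'you': 8}
print(packed)   # 1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4.
print(decompress(packed, table))
```

The receiver needs the same `table` to run `decompress`, which is the point of the last bullet above.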
Adaptive nature of some algorithms
• In the last example we simply exploited redundancy by looking for repeated words.
• A dictionary-based compression program looks for patterns in strings of characters
• Analyzing the string "Ask_not_what_your_country_can_do_for_you --ask_what_you_can_do_for_your_country."
• The following patterns emerge
– 1 ask_ 2 what_ 3 you 4 r_country 5 _can_do_for_you
– Sentence represented as 1not__2345__--__12354.
– Assume that each ASCII character requires a byte, and each number is represented using 4 bits (we have < 16 different numbers)
– Then the original sentence is 79 bytes = 632 bits
– 1_not_2_3_4_5_6_7_8 – 1_2_8_5_6_7_3_4. = 21 bytes + 16 x 4 bits = 232 bits
– 1not__2345__--__12354. = 9 bytes + 10 x 4 bits = 112 bits
Telephony, Voice, and Compression
• Human hearing is in the range of 20 Hz to 20 kHz
• However, the early phone system was made to pass frequencies from 0 to 3 kHz (by accident rather than design)
– It was discovered through trial and error that this produced intelligible and recognizable voice
• The telephone network used for voice essentially acts as a band-pass filter
• In some sense you might view this filtering as a compression method
– Sufficient information can be passed in a span of 3 kHz to serve as human-recognizable voice, rather than using 20 kHz.
– No need to send information that appears in higher frequencies
Bandwidth Requirements for Voice
– A telephone voice channel is digitized by sampling the analog voice signal 8000 times a second. Each sample is stored as 8 bits, that is, 256 discrete voltage levels characterize a voice sample
– Nyquist theorem: R = 2H log2(V) bits/sec
• R = data rate, H = filter bandwidth, 2H = sampling rate, V = number of discrete levels per sample
– With H = 4 kHz and V = 256, this corresponds to a data rate of R = 8000 x 8 = 64 kbits/sec
[Figure: the telephone band-pass filter (3 kHz passband, filter width H = 4 kHz) applied to the analog signal, which is then sampled into discrete samples with V voltage levels]
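The Nyquist relation quoted above is easy to evaluate directly. The sketch below is only a calculator for R = 2H log2(V); the function name `nyquist_data_rate` is an assumption of this sketch.

```python
import math


def nyquist_data_rate(bandwidth_hz, levels):
    """R = 2 * H * log2(V), in bits per second."""
    return 2 * bandwidth_hz * math.log2(levels)


# Telephone channel: H = 4 kHz filter width, V = 256 levels (8 bits per sample).
print(nyquist_data_rate(4000, 256))   # 64000.0
```

Halving the number of bits per sample (V = 16) would halve the rate to 32 kbits/sec, at the cost of much coarser voltage levels.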
Physical Structure of Data -
Voice Coding
• Mechanics of Speech Production imposes a structure on
speech and we can exploit the information to build voice
coders
• When we speak, the structure of our voice box dictates the
type of sounds that we make.
– Our sounds are on the order of tenths of seconds long or longer in
duration
– The intensity of the sound is in a limited power range
– The frequency of the sound we hear is within a certain bounded
range( 20 – 20,000 Hz)
• Instead of transmitting a full digital representation of the
speech( 64 Kbits/ sec), we could send information that
represents the conformation of the voice box, which can be
represented compactly
• This lossy compression approach is the basis of the Voice
Coder abbreviated as Vocoder.
Voice Coding – Modeling The Voice
[Figure: air from the lungs passes the vocal cords and then the vocal tract, and leaves as speech]
• Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
• The amount of air coming from your lungs determines the loudness of your voice.
Compression of Voice - considerations
• The full digitization resulting in 64 kbits/sec is called Pulse Code Modulation (PCM)
• Compressing voice saves transmission resources
– Recognizable voice at << 64 kbits/sec
• However, compression algorithms must not cause too much delay
– Complex processing can result in too much delay
– A compromise needs to be struck between degree of compression and delay
• The first level of voice compression is called ADPCM (Adaptive Differential PCM)
– Based on linear prediction
– Knowledge of the previous samples provides a good basis for estimating the next sample. Using this estimate, we can simply encode the difference between the estimated signal and the actual signal.
– Since the prediction is likely to be accurate, the error will be small, and we can encode the error in just a few bits. For an 8-bit sample, send 2 - 5 bits for the error.
Requirements for Music
• Music (uncompressed)
– What data rate is required to store or send an audio signal that covers the full range of human hearing?
– When a CD is created the music is sampled 44,100 times per second (compared to 8000 times per second in telephony). The samples are 16 bits long (compared to 8 bits in telephony)
– Separate samples are taken for the left and right speakers in a stereo system.
– Therefore the data rate required for an audio CD is 44,100 x 16 x 2 = 1,411,200 bits/sec = 1.41 Mbps
– Compare to a voice call sampled 8000 times per second at 8 bits per sample = 64 kbps
• MP3 is a compressed version of music that maintains close to the full audio quality but typically requires only 160 kbps or less.
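The bit rates on this slide can be reproduced with a small helper. This is a sketch of the arithmetic only; `pcm_bit_rate` is a name chosen here, and the 160 kbps MP3 rate is simply the figure quoted above, not a property of every MP3 file.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


cd_rate = pcm_bit_rate(44_100, 16, channels=2)
phone_rate = pcm_bit_rate(8_000, 8)
mp3_rate = 160_000   # figure quoted on the slide

print(cd_rate)              # 1411200
print(phone_rate)           # 64000
print(cd_rate / mp3_rate)   # 8.82
```

At 160 kbps the reduction relative to CD audio is a little under 9 to 1; lower MP3 bit rates give correspondingly higher ratios.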
How MP3 works
• MP3 manages to get a factor of 10 or more compression compared to the number of bits on an audio-quality CD, with practically no perceived difference (in fact some say it sounds better). A very good lossy compression algorithm!
• A technique called perceptual noise shaping is used. It is "perceptual" because the MP3 format uses characteristics of the human ear in the design of the compression algorithm.
– There are certain sounds that the human ear cannot hear.
– There are certain sounds that the human ear hears much better than others.
– If two sounds are playing simultaneously, a listener hears the louder one but cannot hear the softer one.
• Therefore certain parts of the frequency spectrum can be eliminated without significantly hurting the quality of the music for the listener.
MP3: Loud band sampled, others ignored
Visualization and Compression
• “Seeing” takes up a lot more bits
than “hearing”
• To understand how a display works it
is first necessary to understand how
the human brain visualizes
• The brain is able to assemble a bunch of colored dots into a meaningful picture
• The dots in the slide's figure resemble a human face.
• Humans are also able to interpret a
set of changing dots into a
meaningful motion
• Without these two capabilities, TV
and other forms of 2D video as we
know it would not be possible.
• We can use these ideas to compress
visual images
Pixels
• A unit of an image is called a pixel
• A typical image will have millions of pixels
• The color of a pixel is determined by combining the values of red + green + blue light
• “Full color” refers to pixels made up of 8-bit red, green, and blue components, corresponding to about 16.7 million possible colors (256 values per primary color)
[Figure: a group of pixels and one example pixel with Red = 161, Green = 194, Blue = 149; combined color labeled (194,161,149)]
Example JPEG, GIF
• One form of compression you are probably familiar with is
image and video compression.
• There are standards called MPEG (Moving Picture Experts Group) and JPEG (Joint Photographic Experts Group) that are the standards for compressing video and images respectively
• Data compression algorithms are used to process digital
images comprised of pixels to reduce the number of
required bits yet produce an image that is suitable to the
Human eye
– It is part science and part art.
• Example of Benefit
– Uncompressed Video, needs about 20 Mbytes / second
– An excellent rendition can be achieved processing and sending 4
Mbytes or less per second
Video Streams and Files
• Video requires substantially more data than audio
• Video involves an image made up of many pieces of visual data that need to be constantly refreshed
– Consider a computer screen of 704 x 480 pixels, updated at 60 complete frames per second.
– Each pixel has a mix of three colors as well as intensity and hue
– Using the color model in the slide's figure, each point requires 3 x 8 bits = 24 bits
– Data rate required for exact reproduction is 704 x 480 x 24 x 60 = 486.6 million bits per second
• The point
– Some approximation needs to be made to reduce the number of bits that must be sent: COMPRESSION IS ESSENTIAL
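The raw video figure follows from a single multiplication, sketched below. The function name `raw_video_bit_rate` is made up for this sketch.

```python
def raw_video_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Bits per second needed to send every pixel of every frame uncompressed."""
    return width * height * bits_per_pixel * frames_per_second


rate = raw_video_bit_rate(704, 480, 24, 60)
print(rate)              # 486604800
print(rate / 8 / 1e6)    # the same rate expressed in megabytes per second
```

Changing any one parameter (for example 30 frames per second instead of 60) scales the rate proportionally, which is why resolution, color depth, and frame rate are the first things a lossy video scheme trades away.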
BMP/ GIF Images
• A digital image, or "bitmap", consists of a grid of dots, or "pixels", with each pixel defined by a numeric value that gives its color. .BMP files use 24 bits of information per pixel for photo quality
– .BMP colors can be quantized into coarser ranges (256 colors, 16 colors, 8 colors)
• GIF stands for Graphics Interchange Format
– Popular format for files on the internet
• GIF images are limited to a list of 256 colors (each color is a 24-bit RGB value)
– Each GIF contains a table of the colors used in the image. Every time a color needs to be specified, the GIF uses an index number to specify which color in the table to use.
• To keep file sizes small, GIF compresses the stream of color indices with LZW, a dictionary-based method from the LZ family. On runs of identical pixels its effect resembles run-length coding, the simpler idea described here.
– Images often have pixels of the same color next to each other. Instead of specifying the color of each individual pixel (as uncompressed bitmaps do), a run-length coder specifies strings of pixels of the same color.
– The pixels are scanned starting at the top left of the image, moving right across the row, then moving to the row below and repeating (left to right, top to bottom).
– A run-length coder only has to record a color when it comes across a pixel that is a different color than the one before it; otherwise, it just adds to the count of how many pixels are of the current color.
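The two ideas on this slide, a color table with index numbers and counting runs of the same color, can be sketched as follows. This is an illustration only: real GIF files compress the index stream with LZW rather than with plain run lengths, and the names `to_indexed` and `run_lengths` are invented here.

```python
def to_indexed(pixels):
    """Return (palette, indices) for a list of (r, g, b) tuples."""
    palette = []
    lookup = {}
    indices = []
    for rgb in pixels:
        if rgb not in lookup:
            # First time this color is seen: add it to the color table.
            lookup[rgb] = len(palette)
            palette.append(rgb)
        indices.append(lookup[rgb])
    return palette, indices


def run_lengths(indices):
    """Collapse consecutive equal indices into [index, count] pairs."""
    runs = []
    for idx in indices:
        if runs and runs[-1][0] == idx:
            runs[-1][1] += 1
        else:
            runs.append([idx, 1])
    return runs


WHITE, RED = (255, 255, 255), (255, 0, 0)
row = [WHITE] * 5 + [RED] * 3 + [WHITE] * 2

palette, indices = to_indexed(row)
print(palette)                 # [(255, 255, 255), (255, 0, 0)]
print(indices)                 # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(run_lengths(indices))    # [[0, 5], [1, 3], [0, 2]]
```

Ten 24-bit pixels become a two-entry color table plus three (index, count) pairs.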
Examples
[Figure: the same photo saved at four color depths, each as a BMP and as a GIF]
Original photo: size 3140 Kbytes
256-color (8-bit) map: size 1076 Kbytes
16-color map: size 538 Kbytes
1-bit map: size 137 Kbytes, as GIF = 20 Kbytes
GIF sizes shown for the three color versions: 504 KB, 396 KB, 176 KB
At the resolution in this chart you cannot see the difference, but if you enlarge the respective BMP and GIF images you will notice a substantial difference.
JPEG
• JPEG is the image compression standard developed by the
Joint Photographic Experts Group. It works best on natural
images (scenes) rather than graphics ( sharp edges )
• JPEG compresses the color information, or "chrominance",
in an image separately from the actual details of shapes, or
"luminance".
• Luminance amounts to a grayscale image, while the
chrominance amounts to colors painted on top of that
grayscale image.
• The eye is much more sensitive to the details of shapes
than color information
– chrominance information can be compressed to a greater level than
luminance information
JPEG – Image Compression
• JPEG is most often used to compress 24-bit color
or 8-bit grayscale images.
• JPEG divides up the image into 3 sets of 8 by 8
pixel blocks
– 2 sets for chrominance and 1 for luminance
– Calculates the discrete cosine transform ( DCT) of each
block. A quantizer rounds off ( wipes out less important )
DCT coefficients according to the quantization matrix.
• Uses a run-length code on these coefficients, and then writes the compressed data stream to an output file (*.jpg).
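The transform-and-quantize step can be sketched in plain Python. This is a simplified illustration, not the JPEG standard: it uses a single uniform step size (`step`, an assumption of this sketch) in place of a quantization matrix, and it leaves out color conversion, level shifting, and entropy coding.

```python
import math

N = 8  # JPEG works on 8 x 8 blocks


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (list of N lists of N numbers)."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Round each coefficient to the nearest multiple of step (the lossy part)."""
    return [[round(c / step) for c in row] for row in coeffs]


# A perfectly flat block: all of its energy lands in the single DC coefficient.
flat = [[100] * N for _ in range(N)]
q = quantize(dct_2d(flat), step=16)
print(q[0][0])                                 # 50
print(sum(abs(c) for row in q for c in row))   # 50, every other coefficient is 0
```

The long runs of zeros that quantization leaves behind are what the run-length code in the final bullet exploits.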
Example of a JPEG Image
Compressed Image File : 777 Kbytes
Uncompressed the image would be 4 Megapixels x 24 bits/pixel = 12 Mbytes
Compression Techniques
• Whenever we refer to a compression technique we refer to
2 algorithms
– 1. Compression algorithm- Takes the original data input X and
generates a representation of X, call it Z that requires fewer bits
– 2. Reconstruction algorithm – Takes the compressed
representation Z and attempts to reconstruct the original
representation X
• A lossless compression scheme is one where the
reconstruction Y is exactly the same as X
• A lossy compression scheme is one where the
reconstruction Y is different than X
[Figure: original data X → Compression → compressed data Z → Reconstruction → Y. In the lossless case Y is identical to X; in the lossy case Y differs from X in a few characters.]
Lossless Compression
• Lossless compression - no loss of information
• The reconstructed Y is a perfect replica of X
• Used in applications that cannot tolerate loss of
data
– E.g. Text compression
• For data that is to be enhanced later to yield more
info, integrity needs to be preserved
– Example: radiological medical images, where errors can impact human life, so lossless compression is needed
– Example: satellite data such as environmental measurements
Lossy Compression
• Compression where some loss of information can be tolerated. In return for relaxing the requirement of exact replication, a considerably higher compression ratio is achievable
• Speech – perfect replication is not required; in fact it is never obtained over the telephone because of the 4 kHz cutoff.
– Can reduce number of bits at expense of
• Video, Imagery – loss can be tolerated as long as
it does not produce annoying artifacts.
– Blips, Jerky Motion, Long halts
Measuring Performance
• A compression algorithm can be evaluated in a number of
ways
– Relative complexity of algorithm
• How much processing is required?
• How much memory is required?
• How hard is it to code?
– How much compression (ratio of the size of the original X to the size of the compressed representation Z)
• Compression ratio (X / Z)
– How close is the perceived resemblance to the original sample
• How stable is it to different cases?
• In lossy compression not only is the compression ratio important; the difference between the original and the reconstruction also needs to be quantified
– High fidelity means that the difference between the reconstruction and the original is small.
• Difference can be perceptual or mathematical
– We will address both
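Two of these measures are simple enough to sketch directly. The compression ratio below is original size over compressed size, matching the 6 / 3.9 and 132 : 104 figures used elsewhere in this module. Mean squared error is shown as one common mathematical difference measure; it is a choice made for this sketch, since the slide does not name a specific one.

```python
def compression_ratio(original_bits, compressed_bits):
    """Original size divided by compressed size (larger is better)."""
    return original_bits / compressed_bits


def mean_squared_error(original, reconstructed):
    """Average squared difference between original X and reconstruction Y."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


print(compression_ratio(64, 24))                        # about 2.67
print(mean_squared_error([10, 12, 14], [10, 12, 14]))   # 0.0 for a lossless scheme
print(mean_squared_error([10, 12, 14], [10, 13, 13]))   # about 0.67 for this lossy result
```

A perceptual measure would instead ask listeners or viewers to rate the reconstruction, which cannot be reduced to a formula this short.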
Modeling and Coding
• A number of different compression techniques have been
created to handle different types of data
– For example: Text is different than imagery, often need a different
model and coding scheme
• The development of Data compression algorithms for a
variety of data is divided into 2 logical phases
– Modeling
– Coding
• Modeling: Try to extract information about any redundancy
or unnecessary aspects that exists in the data and
incorporate that redundancy into the Model
• Coding : A description of the Model and a description of
how the data from the Model are encoded and decoded
– Difference between the data and the model is called the residual
Example 1: Compressing a Sequence
of Numbers
• Consider a sequence of numbers
9 11 11 14 13 15 17 16 18 18 19 21
We could store these as a binary list, using 5 bits per sample
The total number of bits required would be 12 x 5 = 60
[Figure: line chart of three series plotted for n = 1 to 12, vertical axis from -5 to 25]
We could also model the series as X = n + 8 and take the difference (the residual) between the model and the data
Example 1 – cont.
• We could send the differences between the model and the data as a more compact representation using 3 bits per value: the first bit for the sign and the other two bits for the magnitude of the difference (for example, -1 is 101)
• This 36-bit set combined with the model X = n + 8 yields a complete representation.
• This is a lossless representation

n    Data (D)   Model (n+8)   n+8-D   Encoding
1    9          9             0       000
2    11         10            -1      101
3    11         11            0       000
4    14         12            -2      110
5    13         13            0       000
6    15         14            -1      101
7    17         15            -2      110
8    16         16            0       000
9    18         17            -1      101
10   18         18            0       000
11   19         19            0       000
12   21         20            -1      101

For this to work the encoder (sender) and decoder (receiver) must share the model
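The model-plus-residual scheme can be sketched in Python using the twelve data values tabulated on this slide. The function names are chosen for this sketch; the 3-bit layout (sign bit, then two magnitude bits) is the one shown in the encoding column.

```python
data = [9, 11, 11, 14, 13, 15, 17, 16, 18, 18, 19, 21]


def model(n):
    """The shared model X = n + 8, with n counted from 1."""
    return n + 8


def encode_residual(r):
    """3-bit sign-magnitude code: sign bit (1 = negative), then 2 magnitude bits."""
    if abs(r) > 3:
        raise ValueError("residual does not fit in 3 bits")
    sign = "1" if r < 0 else "0"
    return sign + format(abs(r), "02b")


def decode_residual(bits):
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


# Residual = model - data, as in the slide's "n+8-D" column.
residuals = [model(n) - d for n, d in enumerate(data, start=1)]
codes = [encode_residual(r) for r in residuals]

# The decoder knows the model, so data = model - residual.
decoded = [model(n) - decode_residual(c) for n, c in enumerate(codes, start=1)]

print(residuals)                    # [0, -1, 0, -2, 0, -1, -2, 0, -1, 0, 0, -1]
print(codes[:4])                    # ['000', '101', '000', '110']
print(sum(len(c) for c in codes))   # 36
print(decoded == data)              # True
```

The 36 bits only suffice because both sides already agree on `model`; a poorer model would produce larger residuals and need more bits per value.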
Example-2
• Consider the Sequence
27 28 29 30 29 28 27 27 27 26 25 26 27 28 28 27 28 29 30 31 32
[Figure: line chart of the sequence, readings 1 to 21, vertical axis from 0 to 35]
• Given that each reading differs from the previous one by only +1, 0, or –1, we can represent the sequence as
27 1 1 1 –1 –1 –1 0 0 –1 –1 1 1 1 0 –1 1 1 1 1 1
Example - 2
• A lossless compression scheme would be to send the first number, then send each small subsequent difference, which in this case can be represented with 2 bits.
• To calculate the nth value, the decoder adds the value sent to the previous value
• This technique is called predictive coding: it uses past values of a sequence to predict future values
• What is encoded in these schemes is the residual
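A short Python sketch of this difference coding, using the 21 readings from the previous slide. The names `delta_encode` and `delta_decode` are made up here, and the bit budget (6 bits for the first value, 2 bits per difference) is an assumption of this sketch.

```python
readings = [27, 28, 29, 30, 29, 28, 27, 27, 27, 26, 25,
            26, 27, 28, 28, 27, 28, 29, 30, 31, 32]


def delta_encode(values):
    """Return the first value and the list of successive differences."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the sequence by adding each difference to the previous value."""
    values = [first]
    for d in deltas:
        values.append(values[-1] + d)
    return values


first, deltas = delta_encode(readings)
print(first, deltas)
print(delta_decode(first, deltas) == readings)   # True

# Assumed bit budget: 6 bits per value uncoded, versus 6 + 2 per difference.
fixed_bits = 6 * len(readings)
delta_bits = 6 + 2 * len(deltas)
print(fixed_bits, delta_bits)                    # 126 46
```

The prediction here is simply "the next value equals the previous one"; the differences are the residual of that prediction.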
Example –3 Statistical Redundancy
• Consider the following sequence built from 8 different symbols
– quwuueurytturituuriewiieuriewiieurytyqiwueyt
With 8 different symbols, we could use 3 bits per symbol to represent the 44-symbol sequence, which would take 132 bits
Let us instead use a code where the most common symbols are assigned a short code, 1 bit, and the others are assigned longer codes
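• quwuueurytturituuriewiieuriewiieurytyqiwueyt

Variable Coding
Symbol   Count   Encoding   Bits needed
u        10      1          10
i        8       01         16
e        6       11         12
r        5       001        15
t        5       011        15
w        4       111        12
y        4       0001       16
q        2       0011       8
Total    44                 104

44 symbols represented as 104 bits – 2.36 bits/symbol encoding
Compression ratio 132 : 104 corresponds to 1.269 : 1
Note: these code words are not prefix-free (for example, 1 is a prefix of 11 and 111), so the 104-bit figure assumes the decoder can tell where each code word ends. A prefix code such as a Huffman code removes that assumption at the cost of longer code words.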
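The counts and the 104-bit total can be checked in Python, and the same counts can be fed to a Huffman construction for comparison. Huffman coding is not mentioned on this slide; it is brought in here only as a standard example of a prefix-free variable-length code, and the helper names are invented for this sketch.

```python
import heapq
from collections import Counter

sequence = "quwuueurytturituuriewiieuriewiieurytyqiwueyt"
counts = Counter(sequence)

# Code table from the slide.
slide_code = {"u": "1", "i": "01", "e": "11", "r": "001",
              "t": "011", "w": "111", "y": "0001", "q": "0011"}
slide_bits = sum(len(slide_code[s]) for s in sequence)


def is_prefix_free(code):
    """True if no code word is a prefix of another code word."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def huffman_total_bits(frequencies):
    """Total encoded length of a Huffman code: the sum of all merged weights."""
    heap = list(frequencies)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


huffman_bits = huffman_total_bits(counts.values())

print(len(sequence), 3 * len(sequence))   # 44 132
print(slide_bits)                         # 104
print(is_prefix_free(slide_code))         # False
print(huffman_bits)                       # 128
```

With these counts a Huffman code needs 128 bits, still fewer than the 132 bits of the fixed 3-bit code, and unlike the slide's table it can be decoded without separators between code words.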
Text / Statistical Redundancy
• When dealing with text there can be redundancy among words. Words that repeat often ("the", "and", etc.) can be placed on a list and an efficient encoding applied to these items.
• This is called a dictionary compression scheme
– We will study these techniques
• In effect, an important part of modeling data is data characterization
– Different characterizations lead to different schemes
• We will explore adaptive schemes that assign codes to structures based on previous experience.
Compression and Standardization
• With the increasing use of compression there has been an increasing need for standards
• Standards allow different products by different
vendors to interoperate with each other
• International standard organizations have
responded to this by standardizing different
compression schemes.
• Compression is an ART more than a SCIENCE
– It requires practice and judgement
– To develop a good sense of compression we will develop
our own algorithms for it.
Summary
• Defined important terms
– Compression
– Lossless
– Lossy
• Provided some common examples of compression
– Morse Code
– Dictionary method – LZ
– Vocoder – LPC
– GIF
– MP3
– JPEG
• Examined some ways of modeling data structures and building rudimentary compression algorithms
• Discussed the need for compression standards for wide-scale implementation

Contenu connexe

Similaire à CS6715-Module1

rsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningrsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningJeff Heaton
 
Teknik Pengkodean (2).pptx
Teknik Pengkodean (2).pptxTeknik Pengkodean (2).pptx
Teknik Pengkodean (2).pptxzulhelmanz
 
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)EUDAT
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
13. case study
13. case study13. case study
13. case studykhoahuy82
 
Real World Performance - OLTP
Real World Performance - OLTPReal World Performance - OLTP
Real World Performance - OLTPConnor McDonald
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Dios Kurniawan
 
C++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassC++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassYogendra Rampuria
 
Comparision Of Various Lossless Image Compression Techniques
Comparision Of Various Lossless Image Compression TechniquesComparision Of Various Lossless Image Compression Techniques
Comparision Of Various Lossless Image Compression TechniquesIJERA Editor
 
Data Mining Un-Compressed Images from cloud with Clustering Compression techn...
Data Mining Un-Compressed Images from cloud with Clustering Compression techn...Data Mining Un-Compressed Images from cloud with Clustering Compression techn...
Data Mining Un-Compressed Images from cloud with Clustering Compression techn...ijaia
 
SQL For PHP Programmers
CS6715-Module1

  • 1. CS6715 Cal State Hayward 1 CS 6715 – Module1: Introduction to Data Compression JAS 9/23//04 CS6715:Module-1: Introduction to Data Compression Dr. John A Serri
  • 2. Data Compression and the Information Revolution
      • The information age has created vast amounts of information in the form of digital bits (groups of 1s and 0s) representing voice, audio, text, and visual information that can be stored and sent anywhere.
      • Compression, one of the enabling technologies of the information revolution, is the ability to represent digital information efficiently.
          – It is not practical to transport data over networks without compression
          – Cellular phones would not work well without methods that model voice, remove the insignificant components, and compress the result
          – Storage requirements are drastically reduced by compression
          – Imagine sending pictures without compression
      • Data compression is essentially the process of representing the information contained in a digital data set using a reduced number of bits. It is essential to storing and transporting information in today's IT / computer science world.
  • 3. Data Compression – Lossless vs Lossy
      • All compression ideas can be categorized into one of two groups
          – Lossless
          – Lossy
      • A lossless compression scheme is one that represents data in a more efficient way than the original representation and is able to generate a perfect replica of the original data
          – No information is lost
          – A perfect replica is returned when "decompression" is applied
      • A lossy compression scheme is one that throws away some information to make a more efficient representation
          – Even though some information is discarded, it still makes a useful representation
          – A lossy compression method can result in greater compression than lossless but also creates some distortion (some loss of information)
  • 4. Compression / Decompression Process
      [Diagram: item to be compressed → compressor process ("encoding") → compressed item → transmit or store → compressed item → reconstruction process ("decoding") → reconstructed item]
  • 5. A Very Simple Example of Compression
      • Consider a picture comprised of 64 pixels
      • The first pixel is black, the next 62 pixels are white, and the last pixel is black
      • Uncompressed representation
          – If 1 bit represents 1 pixel (1 = black, 0 = white), send a string of 64 bits {10………01}
          – 64 bits need to be sent
      • Could we represent this 64-bit object more efficiently?
      • We recognize the sequence as 1 black + 62 white + 1 black
      • Let's encode each run as 8 bits: one bit for the pixel color followed by a 7-bit count of same-colored pixels
          – 1 black = 10000001
          – 62 white = 00111110
          – 1 black = 10000001
      • We can send this representation in 24 bits!
      • This compression algorithm works well if there are long runs of the same color but is inefficient if the color changes after only short runs
      • For this scheme to work, the sender and receiver must both understand the "encoding procedure"
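The run-length scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming pixels are given as a string of '0'/'1' characters and that runs longer than 127 pixels are split (an assumption, since the slide does not say how to handle them). The function names are hypothetical.

```python
def rle_encode(pixels):
    """Encode a '0'/'1' pixel string as 8-bit runs: 1 color bit + 7-bit run length."""
    runs = []
    i = 0
    while i < len(pixels):
        j = i
        # Extend the run while the color repeats and the count still fits in 7 bits.
        while j < len(pixels) and pixels[j] == pixels[i] and j - i < 127:
            j += 1
        runs.append(pixels[i] + format(j - i, "07b"))
        i = j
    return "".join(runs)


def rle_decode(bits):
    """Invert rle_encode: expand each 8-bit run back into pixels."""
    out = []
    for k in range(0, len(bits), 8):
        out.append(bits[k] * int(bits[k + 1:k + 8], 2))
    return "".join(out)


image = "1" + "0" * 62 + "1"           # the 64-pixel picture from the slide
encoded = rle_encode(image)            # 3 runs x 8 bits = 24 bits
alternating = rle_encode("10" * 32)    # worst case: every run has length 1
print(len(image), "->", len(encoded), "bits;", "alternating:", len(alternating), "bits")
```

The alternating image shows the weakness noted on the slide: 64 pixels expand to 512 bits because every pixel becomes its own 8-bit run.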
  • 6. Early Practical Example of Compression – Morse Code
      • Some basic ideas are easy to understand and commonly used
      • Example: letters sent by telegraph are encoded as dots (short pulses) and dashes (long pulses), called Morse code
      • Morse noticed that certain letters are used more than others
      • In the 1830s Morse measured what we now refer to as the statistical structure in the data of messages sent by people and came up with a variable-length code
          – He assigned shorter code sequences to common letters and longer ones to uncommon letters
          – Common letters
              • E ( * )
              • A ( * _ )
          – Uncommon letters
              • Q ( _ _ * _ )
      • In this example the measurable statistical structure of the data is used to develop an efficient representation for lossless compression.
  • 7. Morse Code
      • How much compression is obtained in Morse code?
          – Suppose each dot or dash could be represented by one bit: (dot) * = 1 and (dash) _ = 0
          – If Morse code had used a fixed-length code instead of a variable-length one, 6 bits would be needed to represent a data set of up to 64 symbols (Morse code uses 40 symbols; see next slide)
          – Thus each symbol in a fixed-length code for a 40-symbol alphabet would require 6 bits / symbol
          – Using the Morse representation, the average code length across the 40 symbols is 3.9 bits / symbol
          – Therefore the resulting compression is 6 / 3.9 ≈ 1.54, if this statistical model holds true (any specific example could yield different results)
  • 8. Morse Code Table
      Symbols and codes                                                Bits/row
      A ._      B _...    C _._.    D _..     E .                        14
      F .._.    G __.     H ....    I ..      J .___                     17
      K _._     L ._..    M __      N _.      O ___                      14
      P .__.    Q __._    R ._.     S ...     T _                        15
      U .._     V ..._    W .__     X _.._    Y _.__                     18
      Z __..                                                              4
      1 .____   2 ..___   3 ...__   4 ...._   5 .....                    25
      6 _....   7 __...   8 ___..   9 ____.   0 _____                    25
      COLON ___...                                                        6
      PERIOD ._._._                                                       6
      COMMA __..__                                                        6
      QUESTION ..__..                                                     6
      Total bits                                                        156
      Bits / symbol = 156 / 40 = 3.9
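The 3.9 bits/symbol figure can be checked with a short script. This sketch follows the slides' simplification that each dot or dash costs one bit, and it takes an unweighted average over the 40 symbols (it does not weight symbols by how often they occur in real text). The variable names are illustrative only.

```python
import math

# Symbol/code pairs from the table, written with '.' for dot and '-' for dash.
tokens = """
A .- B -... C -.-. D -.. E .
F ..-. G --. H .... I .. J .---
K -.- L .-.. M -- N -. O ---
P .--. Q --.- R .-. S ... T -
U ..- V ...- W .-- X -..- Y -.--
Z --..
1 .---- 2 ..--- 3 ...-- 4 ....- 5 .....
6 -.... 7 --... 8 ---.. 9 ----. 0 -----
: ---... . .-.-.- , --..-- ? ..--..
""".split()
MORSE = dict(zip(tokens[0::2], tokens[1::2]))

total_bits = sum(len(code) for code in MORSE.values())   # one bit per dot or dash
average_bits = total_bits / len(MORSE)                    # unweighted average
fixed_bits = math.ceil(math.log2(len(MORSE)))             # fixed-length alternative
print(total_bits, average_bits, fixed_bits, round(fixed_bits / average_bits, 2))
```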
  • 9. File Compression
      • File compression is often used to reduce the size of e-mail attachments
      • How do WinZip and other file compression programs work?
      • Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data
      • We will study this method in the course.
      • These programs search for redundant strings and represent them as a number. Rather than sending a long string of characters, a short number is substituted and sent
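To see an LZ-family compressor at work, one option is Python's standard `zlib` module, whose DEFLATE format combines LZ77-style string matching with Huffman coding. This is only a demonstration on deliberately repetitive text, not a description of how any particular ZIP program is built. The repetition count of 20 is an arbitrary choice.

```python
import zlib

# Highly redundant input: the same sentence repeated many times.
text = ("Ask not what your country can do for you -- "
        "ask what you can do for your country. ") * 20
raw = text.encode("ascii")

packed = zlib.compress(raw, 9)       # level 9 = strongest compression
restored = zlib.decompress(packed)   # lossless: must equal the input

print(len(raw), "bytes ->", len(packed), "bytes")
```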
  • 10. Example
      • "Ask not what your country can do for you -- ask what you can do for your country."
      • Notice the redundancy
          – "ask" appears two times: ask = 1
          – "what" appears two times: what = 2
          – "your" appears two times: your = 3
          – "country" appears two times: country = 4
          – "can" appears two times: can = 5
          – "do" appears two times: do = 6
          – "for" appears two times: for = 7
          – "you" appears two times: you = 8
      • We can now represent this sentence as
          – 1_not_2_3_4_5_6_7_8 -- 1_2_8_5_6_7_3_4.
      • If the receiver has the table to translate the numbers back into words, it can decompress the file
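The word-substitution idea above can be sketched as follows. The helper names are hypothetical, and the sketch folds case the way the slide does ("Ask" and "ask" share entry 1), so the capital letter is not recovered on decoding.

```python
import re


def build_dictionary(sentence):
    """Number each repeated word in order of first appearance."""
    words = re.findall(r"[a-z]+", sentence.lower())
    table = {}
    for w in words:
        if words.count(w) > 1 and w not in table:
            table[w] = len(table) + 1
    return table


def encode(sentence, table):
    """Replace every word found in the table with its number."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: str(table.get(m.group().lower(), m.group())),
                  sentence)


def decode(coded, table):
    """Replace every number with the word it stands for."""
    inverse = {number: word for word, number in table.items()}
    return re.sub(r"\d+", lambda m: inverse[int(m.group())], coded)


sentence = ("Ask not what your country can do for you -- "
            "ask what you can do for your country.")
table = build_dictionary(sentence)
coded = encode(sentence, table)
print(coded)
```

Both sides need the same table, which is the point made in the last bullet: the dictionary is part of what must be shared or transmitted.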
  • 11. Adaptive Nature of Some Algorithms
      • In the last example we simply exploited redundancy by looking for repeated words.
      • A dictionary-based compression program will look for patterns in strings of characters
      • Analysing the string "Ask_not_what_your_country_can_do_for_you --ask_what_you_can_do_for_your_country."
      • The following patterns emerge
          – 1 = ask_   2 = what_   3 = you   4 = r_country   5 = _can_do_for_you
          – Sentence represented as 1not__2345__--__12354.
          – Assume that each ASCII character requires a byte, and each number is represented using 4 bits (we have < 16 different numbers)
          – Then the original sentence is 79 Bytes = 632 bits
          – 1_not_2_3_4_5_6_7_8 -- 1_2_8_5_6_7_3_4. = 21 Bytes + 16 x 4 bits = 232 bits
          – 1not__2345__--__12354. = 9 Bytes + 10 x 4 bits = 112 bits
  • 12. Telephony, Voice, and Compression
      • Human hearing is in the range of 20 Hz to 20 kHz
      • However, the early phone system was made to pass frequencies from 0 to 3 kHz (by accident rather than design)
          – It was discovered through trial and error that this produced intelligible and recognizable voice
      • The telephone network used for voice essentially acts as a band-pass filter
      • In some sense you might view this filtering as a compression method
          – Sufficient information can be passed in a span of 3 kHz to serve as human-recognizable voice, rather than using 20 kHz
          – There is no need to send information that appears in higher frequencies
  • 13. Bandwidth Requirements for Voice
      – A telephone voice channel is digitized by sampling the analog voice signal 8000 times a second. Each sample is stored as 8 bits, that is, 256 discrete voltage levels characterize a voice sample
      – Nyquist theorem: R = 2H log2(V) bits / sec
          • R = data rate, H = filter bandwidth (2H = sampling rate), V = number of discrete levels per sample
      – With H = 4 kHz and V = 256 this corresponds to a data rate of R = 8000 x 8 = 64 Kbits / sec
      [Diagram: 3 kHz telephone band-pass; analog signal and its discrete samples with V levels; H (filter width) = 4 kHz]
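Written out step by step with the slide's values (H = 4 kHz filter width, V = 256 levels), the data-rate formula gives:

```latex
\begin{align*}
R &= 2H \log_2 V \\
  &= 2 \times 4000 \times \log_2 256 \\
  &= 8000 \times 8 \\
  &= 64{,}000 \text{ bits/s} = 64 \text{ Kbits/s}
\end{align*}
```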
  • 14. Physical Structure of Data – Voice Coding
      • The mechanics of speech production impose a structure on speech, and we can exploit this information to build voice coders
      • When we speak, the structure of our voice box dictates the type of sounds that we make.
          – Our sounds are on the order of tenths of seconds long or longer in duration
          – The intensity of the sound is in a limited power range
          – The frequency of the sound we hear is within a certain bounded range (20 – 20,000 Hz)
      • Instead of transmitting a full digital representation of the speech (64 Kbits / sec), we could send information that represents the conformation of the voice box, which can be represented compactly
      • This lossy compression approach is the basis of the voice coder, abbreviated as vocoder.
  • 15. Voice Coding – Modeling the Voice
      [Diagram: air → vocal cords → vocal tract → speech]
      • Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
      • For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
      • The shape of your vocal tract determines the sound that you make.
      • As you speak, your vocal tract changes its shape, producing different sounds.
      • The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
      • The amount of air coming from your lungs determines the loudness of your voice.
  • 16. Compression of Voice – Considerations
      • The full digitization resulting in 64 Kbits / sec is called Pulse Code Modulation (PCM)
      • Compressing voice saves on usage of transmission resources
          – Recognizable voice at << 64 Kbits / sec
      • However, compression algorithms must not cause too much delay
          – Complex processing can result in too much delay
          – A compromise needs to be struck between degree of compression and delay
      • The first level of voice compression is called ADPCM (Adaptive Differential PCM)
          – Based on linear prediction
          – Knowledge of the previous samples provides a good basis for estimating the next sample. Using this estimate, we can simply encode the difference between the estimated signal and the actual signal.
          – Since the prediction is likely to be accurate, the error will be small, and we can encode the error in just a few bits. For an 8-bit sample, send 2 to 5 bits for the error.
  • 17. Requirements for Music
      • Music (uncompressed)
          – What is the bit rate required to store / send an audio signal that covers the full range of human hearing?
          – When a CD is created the music is sampled 44,100 times per second (compared to 8000 times per second in telephony). The samples are 16 bits long (compared to 8 bits in telephony)
          – Separate samples are taken for the left and right speakers in a stereo system.
          – Therefore the data rate required for an audio CD is 44,100 x 16 x 2 = 1,411,200 bits / sec = 1.41 Mbps
          – Compare to a voice call sampled 8000 times per second at 8 bits per sample = 64 Kbps
      • MP3 is a compressed version of music that maintains close to the full audio quality but typically requires only 160 Kbps or less.
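The uncompressed rates on this slide all follow the same product of sampling rate, sample size, and channel count. A small sketch (the function name is illustrative) makes the comparison explicit:

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


telephone = pcm_bit_rate(8000, 8)                # telephone voice channel
cd_audio = pcm_bit_rate(44100, 16, channels=2)   # stereo audio CD
mp3_ratio = cd_audio / 160_000                   # versus a 160 Kbps MP3 stream

print(telephone, cd_audio, round(mp3_ratio, 2))
```

At 160 Kbps the reduction relative to CD audio is a little under 9:1. Lower MP3 bit rates give correspondingly higher ratios.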
  • 18. How MP3 Works
      • MP3 achieves a factor of 10 or more compression compared to the number of bits on an audio CD with practically no perceived difference (in fact some say it sounds better). It is a very good lossy compression algorithm!
      • A technique called perceptual noise shaping is used. It is "perceptual" because the MP3 format uses characteristics of the human ear in the design of the compression algorithm.
          – There are certain sounds that the human ear cannot hear.
          – There are certain sounds that the human ear hears much better than others.
          – If two sounds are playing simultaneously, a listener hears the louder one but cannot hear the softer one.
      • Therefore certain parts of the frequency spectrum can be eliminated without significantly hurting the quality of the music for the listener.
  • 19. MP3: Loud band sampled, others ignored
  • 20. Visualization and Compression
      • "Seeing" takes up a lot more bits than "hearing"
      • To understand how a display works it is first necessary to understand how the human brain visualizes
      • The brain is able to assemble a bunch of colored dots into a meaningful image
      • The dots in the slide's figure resemble a human face.
      • Humans are also able to interpret a set of changing dots as meaningful motion
      • Without these two capabilities, TV and other forms of 2D video as we know them would not be possible.
      • We can use these ideas to compress visual images
  • 21. Pixels
      • A unit of an image is called a pixel
      • A typical image will have millions of pixels
      • The color of a pixel is determined by combining the values of red + green + blue light
      • "Full color" refers to pixels that are comprised of 8-bit red, green, and blue components, corresponding to about 16.7 million possible colors (256 values per primary color)
      [Figure: group of pixels; example labels Green = 194, Red = 161, Blue = 149, combined colors (194,161,149)]
  • 22. Example JPEG, GIF
      • One form of compression you are probably familiar with is image and video compression.
      • There are standards called MPEG (Moving Picture Experts Group) and JPEG (Joint Photographic Experts Group) that are the standards for compressing video and images respectively
      • Data compression algorithms are used to process digital images comprised of pixels to reduce the number of required bits yet produce an image that is acceptable to the human eye
          – It is part science and part art.
      • Example of benefit
          – Uncompressed video needs about 20 Mbytes / second
          – An excellent rendition can be achieved by processing and sending 4 Mbytes or less per second
  • 23. Video Streams and Files
      • Video requires substantially more data than audio
      • Video involves an image comprised of many pieces of visual data that need to be constantly refreshed
          – Consider a computer screen of 704 x 480 pixels, updated at 60 complete frames per second.
          – Each pixel has a mix of three colors as well as intensity and hue
          – Using the model shown on the slide, each point requires 3 x 8 bits = 24 bits
          – The data rate required for exact reproduction is 704 x 480 x 24 x 60 = 486.6 million bits per second
      • The point
          – Some approximation needs to be made to reduce the number of bits that must be sent. COMPRESSION IS ESSENTIAL
  • 24. BMP / GIF Images
      • A digital image, or "bitmap", consists of a grid of dots, or "pixels", with each pixel defined by a numeric value that gives its color
          – .BMP files for photo quality use 24 bits of information per pixel
          – .BMP colors can be quantized into coarser ranges (256 colors, 16 colors, 8 colors)
      • GIF stands for Graphics Interchange Format
          – A popular format for files on the internet
      • GIF images are limited to a list of 256 colors (each color is a 24-bit RGB value)
          – Each GIF contains a table of the colors used in the image. Every time a color needs to be specified, the GIF uses an index number to specify which color in the table to use.
      • To help keep the file size small, GIFs compress the resulting stream of color indexes. The format itself uses LZW dictionary coding. The simplified description below treats it as run-length coding, which captures the effect on long runs of one color.
          – Images have pixels of the same color next to each other. Instead of specifying the color of each individual pixel (like bitmaps do), strings of pixels of the same color are specified.
          – A GIF image starts at the top left pixel of the image, moving right across the row of pixels, then moving to the row below it and repeating (left to right, top to bottom).
          – In this simplified view, a color only has to be recorded when a pixel differs in color from the one before it. Otherwise the count of pixels of the current color is simply increased.
  • 25. Examples
      [Figure: the same photo saved at four color depths, each as BMP and as GIF]
      • Original photo: 3140 Kbytes
      • 256-color (8-bit) map: 1076 Kbytes
      • 16-color map: 538 Kbytes
      • 1-bit map: 137 Kbytes; as GIF = 20 Kbytes
      • GIF sizes shown for the other three versions: 176 KB, 396 KB, and 504 KB
      • At the resolution in this chart you cannot see the difference, but if you blow up the respective BMP and GIF images you will notice a substantial difference.
  • 26. JPEG
      • JPEG is the image compression standard developed by the Joint Photographic Experts Group. It works best on natural images (scenes) rather than graphics (sharp edges)
      • JPEG compresses the color information, or "chrominance", in an image separately from the actual details of shapes, or "luminance".
      • Luminance amounts to a grayscale image, while the chrominance amounts to colors painted on top of that grayscale image.
      • The eye is much more sensitive to the details of shapes than to color information, so chrominance information can be compressed to a greater degree than luminance information.
  • 27. JPEG – Image Compression
      • JPEG is most often used to compress 24-bit color or 8-bit grayscale images.
      • JPEG divides the image into 3 sets of 8 by 8 pixel blocks
          – 2 sets for chrominance and 1 for luminance
          – It calculates the discrete cosine transform (DCT) of each block. A quantizer rounds off (wipes out the less important) DCT coefficients according to the quantization matrix.
      • It applies a run-length code to these coefficients, and then writes the compressed data stream to an output file (*.jpg).
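The transform-and-quantize step can be illustrated on a single 8 x 8 block. This is a rough sketch in plain Python, not the JPEG codec: it uses one uniform quantization step (an assumption made for brevity, whereas JPEG uses a per-frequency quantization matrix), and the function names are hypothetical.

```python
import math

N = 8  # block size used by JPEG


def dct2(block):
    """2-D DCT-II of an N x N block with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * s
    return out


def quantize(coeffs, step):
    """Round each coefficient to a multiple of `step` (uniform step: an assumption)."""
    return [[round(value / step) for value in row] for row in coeffs]


flat_block = [[16] * N for _ in range(N)]   # a perfectly uniform gray block
coeffs = dct2(flat_block)
quantized = quantize(coeffs, step=10)
nonzero = sum(1 for row in quantized for q in row if q != 0)
print(quantized[0][0], nonzero)
```

For a flat block all the energy lands in the single top-left (DC) coefficient, so after quantization the other 63 values are zero. Long runs of zeros like this are what make the subsequent run-length step effective.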
  • 28. Example of a JPEG Image
      • Compressed image file: 777 Kbytes
      • Uncompressed, the image would be 4 Megapixels x 24 bits/pixel = 12 Mbytes
  • 29. Compression Techniques
      • Whenever we refer to a compression technique we refer to 2 algorithms
          – 1. Compression algorithm: takes the original data input X and generates a representation of X, call it Z, that requires fewer bits
          – 2. Reconstruction algorithm: takes the compressed representation Z and generates the reconstruction Y
      • A lossless compression scheme is one where the reconstruction Y is exactly the same as X
      • A lossy compression scheme is one where the reconstruction Y is different from X
      [Diagram: X → compression → Z → reconstruction → Y. In the lossless case Y reproduces X exactly. In the lossy case Y shows small differences from X.]
  • 30. Lossless Compression
      • Lossless compression: no loss of information
      • The reconstructed Y is a perfect replica of X
      • Used in applications that cannot tolerate loss of data
          – E.g. text compression
      • For data that is to be enhanced later to yield more information, integrity needs to be preserved
          – Example: radiological medical images, which can impact human life, need lossless compression
          – Example: satellite data (environmental measurements)
  • 31. Lossy Compression
      • Compression where some loss of information can be tolerated. In return for relaxing the requirement of exact replication, a considerably higher compression ratio is achievable
      • Speech: perfect replication is not required. In fact it is never obtained over the telephone because of the 4 kHz cutoff.
          – Can reduce the number of bits at the expense of quality
      • Video, imagery: loss can be tolerated as long as it does not produce annoying artifacts.
          – Blips, jerky motion, long halts
  • 32. Measuring Performance
      • A compression algorithm can be evaluated in a number of ways
          – Relative complexity of the algorithm
              • How much processing is required?
              • How much memory is required?
              • How hard is it to code?
          – How much compression is achieved
              • Compression ratio: the size of the original X relative to the size of the compressed representation Z
          – How close is the perceived resemblance to the original sample
              • How stable is it across different cases?
      • In lossy compression not only is the compression ratio important, but the difference between the reconstruction Y and the original X also needs to be quantified
          – High fidelity means that the difference between the reconstruction and the original is small.
      • The difference can be perceptual or mathematical
          – We will address both
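Two of these measures are easy to express in code. The sketch below computes a compression ratio and, as one possible mathematical measure of difference, the mean squared error between an original and its reconstruction. The slide does not name a specific distortion measure, so the choice of mean squared error is an assumption, and the sample values are made up for illustration.

```python
def compression_ratio(original_bits, compressed_bits):
    """Original size divided by compressed size (larger is better)."""
    return original_bits / compressed_bits


def mean_squared_error(original, reconstruction):
    """Average squared difference between corresponding samples."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / len(original)


# The 64-pixel run-length example earlier in the module: 64 bits down to 24 bits.
ratio = compression_ratio(64, 24)

# Hypothetical lossy reconstruction of three samples.
mse = mean_squared_error([9, 11, 11], [9, 10, 12])
print(round(ratio, 3), round(mse, 3))
```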
  • 33. Modeling and Coding
      • A number of different compression techniques have been created to handle different types of data
          – For example: text is different from imagery and often needs a different model and coding scheme
      • The development of data compression algorithms for a variety of data is divided into 2 logical phases
          – Modeling
          – Coding
      • Modeling: try to extract information about any redundancy or unnecessary aspects that exist in the data and incorporate that redundancy into the model
      • Coding: a description of the model and a description of how the data differ from the model are encoded and decoded
          – The difference between the data and the model is called the residual
  • 34. Example 1: Compressing a Sequence of Numbers
      • Consider a sequence of numbers: 9 11 11 11 14 13 15 17 16 18 18 19 21
      • We could store these as a binary list, using 5 bits per sample
      • The total number of bits required would be 13 x 5 = 65
      • We could also model the series as X = n + 8, and take the difference (the residual) between the model and the data
      [Chart: three series plotted against n = 1 to 12]
  • 35. Example 1 – cont.
      • We could send the differences between the model and the data as a more compact representation using 3 bits per value: the first bit for the sign and the other two bits for the magnitude (so -1 is 101)

      n    Data (D)   Model (n+8)   n+8-D   Encoding
      1       9           9           0       000
      2      11          10          -1       101
      3      11          11           0       000
      4      14          12          -2       110
      5      13          13           0       000
      6      15          14          -1       101
      7      17          15          -2       110
      8      16          16           0       000
      9      18          17          -1       101
      10     18          18           0       000
      11     19          19           0       000
      12     21          20          -1       101

      • This 36-bit set combined with the model x = n + 8 yields a complete representation.
      • This is a lossless representation
      • For this to work, the encoder (sender) and decoder (receiver) must share the model
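The model-plus-residual scheme can be sketched as below, using the twelve data values listed in the table and the slide's convention that the residual is model minus data. The 3-bit layout (sign bit followed by a 2-bit magnitude) follows the encodings shown on the slide. The function names are illustrative.

```python
def encode_residuals(data, model):
    """Encode (model - data) per sample as a sign bit plus a 2-bit magnitude."""
    codes = []
    for n, d in enumerate(data, start=1):
        r = model(n) - d
        if abs(r) > 3:
            raise ValueError("residual does not fit in 2 magnitude bits")
        codes.append(("1" if r < 0 else "0") + format(abs(r), "02b"))
    return codes


def decode_residuals(codes, model):
    """Rebuild the data from the shared model and the 3-bit residual codes."""
    data = []
    for n, code in enumerate(codes, start=1):
        magnitude = int(code[1:], 2)
        r = -magnitude if code[0] == "1" else magnitude
        data.append(model(n) - r)
    return data


def model(n):
    return n + 8   # the model shared by sender and receiver


data = [9, 11, 11, 14, 13, 15, 17, 16, 18, 18, 19, 21]
codes = encode_residuals(data, model)
total_bits = sum(len(c) for c in codes)
print(" ".join(codes), total_bits)
```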
  • 36. Example 2
      • Consider the sequence 27 28 29 30 29 28 27 27 27 26 25 26 27 28 28 27 28 29 30 31 32
      [Chart: the sequence plotted against sample number 1 to 21]
      • Given that each reading differs from the previous one by only +1, 0, or –1, we can represent this as 27 1 1 1 –1 –1 –1 0 0 –1 –1 1 1 1 0 –1 1 1 1 1 1
  • 37. Example 2 – cont.
      • A lossless compression scheme would be to send the first number, then send the small subsequent differences, each of which in this case can be represented in 2 bits.
      • To calculate the nth value, the decoder adds the value sent to the previous value
      • This technique is called predictive coding: it uses past values of a sequence to predict future values
      • What is encoded in these schemes is the residual
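A minimal sketch of this predictive scheme, where the prediction for each sample is simply the previous sample, might look like the following. The function names are hypothetical.

```python
def delta_encode(seq):
    """Return the first value and the list of successive differences."""
    return seq[0], [b - a for a, b in zip(seq, seq[1:])]


def delta_decode(first, diffs):
    """Rebuild the sequence by adding each difference to the previous value."""
    out = [first]
    for d in diffs:
        out.append(out[-1] + d)
    return out


readings = [27, 28, 29, 30, 29, 28, 27, 27, 27, 26, 25,
            26, 27, 28, 28, 27, 28, 29, 30, 31, 32]
first, diffs = delta_encode(readings)
diff_bits = 2 * len(diffs)   # 2 bits cover the three values -1, 0, +1
print(first, diffs, diff_bits)
```

The 20 differences cost 40 bits in total, plus whatever is spent on the first value.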
  • 38. Example 3 – Statistical Redundancy
      • Consider the following sequence of 8 different symbols
          – quwuueurytturituuriewiieuriewiieurytyqiwueyt
      • With 8 different symbols, we could use 3 bits per symbol to represent the 44-symbol sequence, which would take 132 bits
      • Let us instead use a code where the most common symbols are assigned a short code (as little as 1 bit) and the others are assigned longer codes
  • 39. Symbol counts and variable-length coding for the sequence on the previous slide

      Symbol   Count   Code   Bits needed
      u         10     1         10
      i          8     01        16
      e          6     11        12
      r          5     001       15
      t          5     011       15
      w          4     111       12
      y          4     0001      16
      q          2     0011       8
      Total     44              104

      • 44 symbols represented as 104 bits: 2.36 bits/symbol
      • Compression ratio 132 : 104 corresponds to 1.269 : 1
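The bit count in the table can be reproduced with a short script. One caveat worth checking at the same time: the codewords as listed are not prefix-free (for example, 1 is the start of both 11 and 111), so a decoder could not split the bit stream unambiguously. For comparison, the sketch also computes the total length of a Huffman code, a standard prefix-free construction that is not named on the slide, for the same symbol counts.

```python
import heapq
from collections import Counter

sequence = "quwuueurytturituuriewiieuriewiieurytyqiwueyt"
slide_code = {"u": "1", "i": "01", "e": "11", "r": "001",
              "t": "011", "w": "111", "y": "0001", "q": "0011"}

counts = Counter(sequence)
slide_bits = sum(counts[s] * len(slide_code[s]) for s in counts)


def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def huffman_total_bits(freqs):
    """Total encoded length of a Huffman code: the sum of all merged weights."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


huffman_bits = huffman_total_bits(counts.values())
print(len(sequence), slide_bits, is_prefix_free(slide_code), huffman_bits)
```

With a decodable prefix code the same sequence needs 128 bits (about 2.91 bits/symbol), still below the 132 bits of the fixed 3-bit code but a smaller saving than the table suggests.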
  • 40. Text / Statistical Redundancy
      • When dealing with text there can be redundancy among words. Words that repeat often ("the", "and", etc.) can be placed on a list, and efficient encoding can be applied to these items.
      • This is called a dictionary compression scheme
          – We will study these techniques
      • In effect, an important part of modeling data is data characterization
          – Different characterizations lead to different schemes
      • We will explore adaptive schemes that assign codes to structures based on previous experience.
  • 41. Compression and Standardization
      • With the increasing use of compression there has been an increasing need for standards
      • Standards allow different products by different vendors to interoperate with each other
      • International standards organizations have responded to this by standardizing different compression schemes.
      • Compression is an ART more than a SCIENCE
          – It requires practice and judgement
          – To develop a good sense of compression we will develop our own algorithms for it.
  • 42. Summary
      • Defined important terms
          – Compression
          – Lossless
          – Lossy
      • Provided some common examples of compression
          – Morse code
          – Dictionary method (LZ)
          – Vocoder (LPC)
          – GIF
          – MP3
          – JPEG
      • Examined some ways of modeling data structures and providing rudimentary compression algorithms
      • Discussed the need for compression standards for wide-scale implementation