By
Subeer Rangra
(08EBKCS059)
      &
Mukul Ranjan
 (08EBKCS029)
Index
1.   Introduction to Data Compression
2.   Introduction to Text Compression
3.   LZW
     3.1 LZW Encoding Algorithm
     3.2 Encoding a String Example
     3.3 LZW Decoding Algorithm
     3.4 Decoding a String Example
4.   Flate Compression
     4.1 Decomposition
        4.1.1 Huffman Coding
        4.1.2 LZ77 Compression
        4.1.3 Putting both together
5.   Advantages and Disadvantages
     5.1 LZW
     5.2 Flate
6.   Conclusion
1. Introduction to Data
Compression
 Encoding information using fewer bits than the
 original representation.
 Data Compression is achieved when redundancies are
 reduced or eliminated
 Lossless where no information is lost.

 Lossy where some information is lost.

 Compression reduces the data storage space.
Introduction to Data
Compression…. Contd.
 Reduces transmission time needed over the network.

 Data must be decompressed or decoded to be reused.

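 Can be symmetrical (compression and decompression take similar
  effort) or asymmetrical (one direction is costlier than the other).

 Can be implemented in software or in hardware.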
2. Introduction to Text
Compression
 The compression of Text based data.

 There is a major difference between text and image compression:
  databases, binary programs and text must be restored exactly, whereas
  sound, image and video signals can tolerate some loss.

 Text compression therefore needs lossless compression.

 Needed in literary works, product catalogues, genomic
  databases, raw text databases.
3. LZW (Lempel-Ziv-Welch)
 Starts with a dictionary of all the single characters and gradually
  builds the dictionary as the information is sent through.

 Lossless compression, hence works well for text compression.

 A dictionary or code table based encoding algorithm.

 Uses a code table with 4096 as a common choice for number of
  entries.

 It tries to identify repeated sequences of data and adds them to
  the code table.
LZW (Lempel-Ziv-Welch)….contd.
 A general compression algorithm capable of working
  on almost any type of data.

 Large text files in the English language can typically be
  compressed to about half their original size.

 Used in GIF (Graphics Interchange Format) to reduce
  the size without degrading the visual quality.
3.1 LZW Encoding Algorithm
STRING = get input character
WHILE not end of input stream DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
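
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `lzw_encode` and the `ALPHABET` constant are assumptions made for this example, and codes are returned as plain integers without any bit packing.

```python
def lzw_encode(text, alphabet):
    """Return the list of LZW codes for text (hypothetical helper)."""
    # Initial string table: one entry per single character.
    table = {ch: code for code, ch in enumerate(alphabet)}
    next_code = len(table)
    codes = []
    if not text:
        return codes
    string = text[0]                        # STRING = get input character
    for character in text[1:]:              # WHILE not end of input stream
        if string + character in table:
            string += character             # keep extending the match
        else:
            codes.append(table[string])     # output the code for STRING
            table[string + character] = next_code
            next_code += 1
            string = character
    codes.append(table[string])             # output the code for the last STRING
    return codes


# Assumed alphabet for illustration: '#' = 0, 'A' = 1, ..., 'Z' = 26.
ALPHABET = "#ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", ALPHABET))
```

With this alphabet the trailing `#` is treated like any other character, so the last code emitted is its own code, 0.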
LZW Encoding Flowchart
3.2 Encoding a String example
 To encode a string of characters:
1.   First generate an initial dictionary of single characters

              Symbol      Binary      Decimal
              #           00000       0
              A           00001       1
              B           00010       2
              C           00011       3
              D           00100       4
              E           00101       5
              ... and so on, up to
              Z           11010       26
Encoding a String Example …..contd
2. Example: TOBEORNOTTOBEORTOBEORNOT# (the trailing # is the stop symbol, code 0)

   Current    Next    Output    Output    Extended
   Sequence   Char    Code      Bits      Dictionary    Comments
   NULL       T
   T          O       20        10100     27: TO        27 = first available code after 0 through 26
   O          B       15        01111     28: OB
   B          E       2         00010     29: BE
   E          O       5         00101     30: EO
   O          R       15        01111     31: OR
   R          N       18        10010     32: RN        32 requires 6 bits, so for next output use 6 bits
   N          O       14        001110    33: NO
   O          T       15        001111    34: OT
   T          T       20        010100    35: TT
   TO         B       27        011011    36: TOB
   BE         O       29        011101    37: BEO
Encoding a String Example …..contd
   Current    Next    Output    Output    Extended
   Sequence   Char    Code      Bits      Dictionary    Comments
   OR         T       31        011111    38: ORT
   TOB        E       36        100100    39: TOBE
   EO         R       30        011110    40: EOR
   RN         O       32        100000    41: RNO
   OT         #       34        100010                  # stops the algorithm; send the current sequence
                      0         000000                  and the stop code
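
The switch from 5-bit to 6-bit output can be checked with a short sketch. The list of codes is copied from the Output Code column of the tables above; the function name `pack_variable_width` and its parameters are assumptions made for this illustration.

```python
# Output codes copied from the encoding tables above.
codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]


def pack_variable_width(codes, first_free_code=27, start_width=5):
    """Return each code as a bit string, widening when the table outgrows the width."""
    width = start_width
    next_code = first_free_code      # dictionary entry created after each output
    chunks = []
    for code in codes:
        chunks.append(format(code, f"0{width}b"))
        # The entry just created (next_code) must still fit in `width` bits;
        # if it does not, all following outputs use one more bit.
        if next_code >= 2 ** width:
            width += 1
        next_code += 1
    return chunks


chunks = pack_variable_width(codes)
encoded_bits = sum(len(c) for c in chunks)
unencoded_bits = 25 * 5              # 25 symbols (including '#') at 5 bits each
print(encoded_bits, unencoded_bits)
```

Six codes go out at 5 bits and eleven at 6 bits, so the message takes 96 bits instead of the 125 bits needed at a fixed 5 bits per symbol.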
3.3 LZW Decoding Algorithm
Read OLD_CODE
output translation of OLD_CODE
CHARACTER = translation of OLD_CODE
WHILE there are still input codes DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add translation of OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
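
A Python sketch of the decoding loop, under the same assumptions as the slides (single-character initial table, integer codes). The name `lzw_decode` is hypothetical. The branch for a code that is not yet in the table covers the case where the encoder uses an entry immediately after creating it.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from a list of LZW codes (hypothetical helper)."""
    # Initial translation table: code -> single character.
    table = {code: ch for code, ch in enumerate(alphabet)}
    next_code = len(table)
    if not codes:
        return ""
    old_code = codes[0]
    character = table[old_code]
    output = [table[old_code]]
    for new_code in codes[1:]:
        if new_code not in table:
            # Code was created by the encoder in its previous step:
            # it must be the old string plus that string's first character.
            string = table[old_code] + character
        else:
            string = table[new_code]
        output.append(string)
        character = string[0]
        table[next_code] = table[old_code] + character
        next_code += 1
        old_code = new_code
    return "".join(output)


codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]
print(lzw_decode(codes, "#ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
```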
LZW Decoding Flowchart
3.4 Decoding a String Example
 To decode an LZW-compressed archive, one needs to know
   in advance the initial dictionary used, but additional
   entries can be reconstructed as they are always simply
   concatenations of previous entries.
   Input               Output      New Dictionary Entry
   Bits      Code      Sequence    Full        Conjecture    Comments
   10100     20        T                       27: T?
   01111     15        O           27: TO      28: O?
   00010     2         B           28: OB      29: B?
   00101     5         E           29: BE      30: E?
   01111     15        O           30: EO      31: O?
   10010     18        R           31: OR      32: R?        created code 31 (last to fit in 5 bits)
   001110    14        N           32: RN      33: N?        so start reading input at 6 bits
4. Flate Compression
 A lossless data compression method (the Deflate format).
 Can discover and exploit many patterns in the input
  data.
 An improvement over LZW compression: Flate-encoded
  data is usually much more compact than LZW-encoded
  output.
 It was originally defined by Phil Katz for version 2 of
  his PKZIP archiving tool and was later specified in RFC
  1951.
 Used in PDF files, where data streams can be compressed
  with a Flate filter.
4.1 Decomposition
 The Flate specification defines a lossless data format that
  compresses data using a combination of the LZ77 algorithm
  and Huffman coding.
 The format can be implemented readily in a manner
  not covered by patents.
 The way each of these two algorithms works is explained
  below, followed by how the two combine to produce
  Flate compression.
4.1.1 Huffman Coding
 A type of entropy encoding algorithm.

 Used for lossless data compression.

 Can be used to generate variable-length codes.

 The variable length codes are generated based on the
 frequency of the occurrence of the characters.
 The idea is to assign the shortest code to the character
 with the highest probability of occurrence.
Huffman Coding…. contd.
 The algorithm starts by assigning each element a
  'weight', a number that represents its relative
  frequency within the data to be compressed.
Taking an example for the set of weights {1,2,3,3,4}




1.   They are assigned to be the nodes or leaves of the
     Huffman tree to be formed
Huffman Coding…. contd.
2. During the first step, the two nodes with the lowest weights
   (lowest probability, hence highest priority), 1 and 2, are
   merged to create a new tree with a root of weight 3.
Huffman Coding…. contd.
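3. Now we have three nodes with weight 3 at their roots (the
   merged tree and two single nodes). We choose two of them, here
   the merged tree and one single node, and merge them into a
   tree of weight 6.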
Huffman Coding…. contd.
4. Now our two minimum trees are the two singleton
   nodes of weights 3 and 4. We will combine these to
   form a new tree of weight 7.
Huffman Coding…. contd.
5. Finally we merge our last two remaining trees (weights 6 and 7)
   into a single tree of weight 13.
Huffman Coding…. contd.
 When all nodes have been recombined into a single
  "Huffman tree", then by starting at the root and
  selecting 0 or 1 at each step, you can reach any element
  in the tree.
 Each element now has a Huffman code, which is the
  sequence of 0's and 1's that represents that path
  through the tree.
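
The merging steps above can be sketched with a priority queue. This is a minimal illustration, not the slides' own procedure: the symbol names `a` to `e` are made up to carry the weights {1,2,3,3,4}, and ties between equal weights may be broken differently than on the slides, which changes individual codes but not the total encoded length.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Map each symbol to a Huffman code string, given {symbol: weight}.

    Symbols must not be tuples, because tuples are used for internal nodes.
    """
    tiebreak = count()  # unique counter so equal weights never compare nodes
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two trees with the smallest root weights.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the path as the code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


weights = {"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}   # hypothetical symbols
codes = huffman_codes(weights)
total_bits = sum(weights[s] * len(codes[s]) for s in weights)
print(codes, total_bits)
```

The total of 29 bits equals the sum of the merged root weights from the steps above (3 + 6 + 7 + 13).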
4.1.2 LZ77 Compression
 Works by finding sequences of data that are
  repeated.
 A lossless data compression algorithm.
 Maintains a 'sliding window' during compression,
  which means that the compressor keeps a record of
  the most recent characters.
 Goes through the text in a sliding window consisting
  of a search buffer and a look-ahead buffer.
 The search buffer is used as the dictionary.
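
A sliding-window coder along these lines might look like the sketch below. It is a simple greedy version under stated assumptions: the function names, the (offset, length, next character) triple layout and the small `window` and `lookahead` sizes are choices made for this illustration, not values taken from the slides or from the Flate specification.

```python
def lz77_encode(data, window=16, lookahead=8):
    """Greedy LZ77 sketch: emit (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)                    # left edge of the search buffer
        max_len = min(lookahead, len(data) - i - 1)   # keep one char for next_char
        for j in range(start, i):
            length = 0
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    """Invert lz77_encode by copying from the already decoded text."""
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])    # one character at a time, so overlaps work
        out.append(ch)
    return "".join(out)


sample = "AABABBBABAABABBBABBABB"
triples = lz77_encode(sample)
print(triples)
print(lz77_decode(triples) == sample)
```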
LZ77 Compression…. contd.
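1. Suppose the input text is
    AABABBBABAABABBBABBABB
2. The first block found is simply A, encoded as (0,A),
   where 0 means "no earlier phrase". The next is AB,
   encoded as (1,B) where 1 is a reference to A:
    A|AB|ABBBABAABABBBABBABB
3. The next block is ABB, which is encoded as (2,B)
   where 2 is a reference to AB, entered in the
   dictionary one iteration ago. Continuing this way, the
   string parses into
   A|AB|ABB|B|ABA|ABAB|BB|ABBA|BB
   Note: numbering the parsed phrases and emitting
   (reference, character) pairs, as done here, is the LZ78
   way of building the dictionary; LZ77 proper encodes each
   repeat as a pointer (distance and length) into the
   search buffer.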
LZ77 Compression…. Contd.
 At the end of the algorithm, the dictionary is:
              Reference   Phrase   Encoding
              1           A        (0,A)
              2           AB       (1,B)
              3           ABB      (2,B)
              4           B        (0,B)
              5           ABA      (2,A)
              6           ABAB     (5,B)
              7           BB       (4,B)
              8           ABBA     (3,A)
              9           BB       (7,0)
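
The phrase table above can be reproduced with a short parser. This sketch follows the (reference, character) scheme of the worked example; the function name `parse_phrases` is an assumption, and the final pair uses an empty string where the slides write 0 for "no character follows".

```python
def parse_phrases(text):
    """Split text into dictionary phrases and (reference, character) pairs."""
    dictionary = {"": 0}          # reference 0 = empty phrase
    phrases, pairs = [], []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch         # keep extending a known phrase
        else:
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)   # next free reference
            phrases.append(current + ch)
            current = ""
    if current:
        # Input ended inside a known phrase: emit its reference alone.
        pairs.append((dictionary[current], ""))
        phrases.append(current)
    return phrases, pairs


phrases, pairs = parse_phrases("AABABBBABAABABBBABBABB")
print("|".join(phrases))
print(pairs)
```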
4.1.3 Putting Both Together
Flate is a smart algorithm that adapts the way it
compresses data to the data itself. There are
three modes of compression that the compressor has
available:
1. No compression at all: an intelligent choice when the
    data has already been compressed.
2. Compression, first with LZ77 and then with a slightly
    modified version of Huffman coding. The trees that
    are used are defined by the Flate specification itself.
Putting Both Together….contd.
3. Compression, first with LZ77 and then with Huffman
   coding, using trees that the compressor creates and stores
   along with the data.
   The data is broken up into blocks; each block uses a
   single mode of compression.
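
The per-block mode can be observed with Python's standard `zlib` module, which implements this format. In a raw stream, the first bit of a block says whether it is the last block and the next two bits give its mode (0 = not compressed, 1 = predefined trees, 2 = stored trees), as laid out in RFC 1951. The helper names and the sample text below are assumptions made for this sketch; which of the two Huffman modes the library picks at level 9 depends on the input, so the sketch prints it rather than assuming it.

```python
import zlib


def raw_deflate(data, level):
    """Compress to a raw stream (negative wbits: no zlib header or checksum)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def first_block_type(raw):
    """Return the 2-bit mode of the first block (bits 1 and 2 of byte 0)."""
    return (raw[0] >> 1) & 0b11


text = b"TOBEORNOTTOBEORTOBEORNOT" * 4
stored = raw_deflate(text, 0)     # level 0 asks for no compression
packed = raw_deflate(text, 9)     # level 9 asks for the best compression

print(first_block_type(stored), len(stored))
print(first_block_type(packed), len(packed))
print(zlib.decompress(packed, -15) == text)
```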
5. Advantages & Disadvantages
5.1 LZW
Advantage
   LZW is a lossless compression algorithm, hence no information is lost.
   The code table need not be passed between the compressor and
    the decompressor; the decoder rebuilds it from the codes.
   Simple, fast and good compression.
Disadvantage
   The dictionary can become too large.
   One approach is to throw the dictionary away when it reaches
    a certain size.
   Useful only for a large amount of text data where redundancy
    is high.
Advantages & Disadvantages
5.2 Flate Compression
Advantage
    Huffman coding is easy to implement.
    Flate is a lossless compression technique, hence no loss of text.
    Simple, fast and good compression.
    Freedom to choose the type of compression based on the needs of the
     content.
Disadvantage
    Overhead is generated by Huffman tree generation.
    The resulting compression code becomes complex, as it
     combines LZ77 and Huffman.
    It is quite tricky to understand and apply the right
     combination of LZ77 and Huffman.
6. Conclusion
 LZW has various advantages when used to compress
  large amounts of text in the English language, which
  has high redundancy.
 Both LZW and Flate are software-based, dictionary-based
  and lossless methods of compression.
 Text compression needs a lossless compression
  technique.
 Flate, which is widely used in PDF files, is an adaptive,
  flexible and complex way to compress text.
Thank You

Text compression in LZW and Flate
