Copyright 2011 Trend Micro Inc. 1
Taxonomy of Differential Compression
Liwei Ren, Ph.D, Trend Micro™
SNIA SDC 2015, Santa Clara, California, Sept, 2015
Background
• Liwei Ren, Ph.D
– Research interests
• Data security, network security, data compression, math modeling & algorithms.
– Major works
• 10+ academic papers;
• 20+ US patents granted, and a few more pending;
• Co-founded a data security company in Silicon Valley with successful exit.
– Education
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
• Trend Micro™
– Global security software company with headquarters in Tokyo, and R&D centers in
Silicon Valley, Nanjing and Taipei;
– One of the top security software vendors.
– A leader in cloud security.
Agenda
• Introduction
• A Math Model for Describing File Differences
• Categorizing Differential Compression
• Advanced Topics
• Summary
Classification 9/21/2015 3
Introduction
• Objectives of this talk:
– Understand what differential compression is and what its applications are.
– Learn a mathematical model for describing differential compression
– Know categories of differential compression
– Be aware of a few advanced topics
Introduction
• Lossless data compression --- three categories
• Two purposes:
– Network data transfer acceleration
– Storage space reduction
Introduction
• Today we talk about Differential Compression (DC).
• Why do I talk about differential compression?
– I have been designing various algorithms for differential compression since 2002
for a few domains:
• FOTA (Firmware Over The Air) for mobile phones
• Incremental update of data files for security software.
• File synchronization & transfer over WAN
• Differential compression for executable files
• …
– It is time to summarize the various problems & techniques in a systematic view:
• I may write an academic book on this.
Introduction
• What is differential compression?
Introduction
• To reduce network bandwidth cost :
• To reduce storage cost:
Introduction
• Applications:
– Data backup
• Remote data backup
– Revision control systems
– Software vulnerability & patch management
• FOTA (Firmware Over The Air)
• Malware signature update
– File synchronization and transfer
– Distributed file systems
– Cloud data migration
A Math Model for Describing File Differences
• In the formal representation Δ = T - R, what do we mean by “-” and Δ?
• There are a few approaches to describing DIFF.
– Here is one.
• Diff Model: A math model to describe the “differences” of T and R:
– Δ is basically a procedure that transforms reference file R to target file T.
• To be specific, Δ is a sequence of string edit operations for reconstructing T
from R.
– Two edit operations COPY & ADD :
• COPY (addrSrc, size, addrDest) --- to copy a block of data from the reference file
to the target file.
• ADD (dataBlock, size, addrDest) --- to add a block of data to the target file.
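The two edit operations can be sketched in a few lines of Python. This is only an illustration of the model above, not the author's implementation: the class names `Copy` and `Add`, the function `apply_delta`, and the explicit `target_size` argument are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Copy:
    addr_src: int   # offset in the reference file R
    size: int       # number of bytes to copy
    addr_dest: int  # offset in the target file T


@dataclass
class Add:
    data: bytes     # literal bytes that do not come from R
    size: int       # len(data), kept to mirror ADD(dataBlock, size, addrDest)
    addr_dest: int  # offset in the target file T


def apply_delta(reference: bytes, delta: List[Union[Copy, Add]], target_size: int) -> bytes:
    """Rebuild T from R by executing every edit operation in delta."""
    target = bytearray(target_size)
    for op in delta:
        if isinstance(op, Copy):
            chunk = reference[op.addr_src:op.addr_src + op.size]
        else:
            chunk = op.data
        if len(chunk) != op.size:
            raise ValueError(f"operation {op} does not supply {op.size} bytes")
        target[op.addr_dest:op.addr_dest + op.size] = chunk
    return bytes(target)


R = b"hello world"
delta = [
    Copy(addr_src=0, size=6, addr_dest=0),      # "hello " taken from R
    Add(data=b"brave ", size=6, addr_dest=6),   # new bytes carried in the delta
    Copy(addr_src=6, size=5, addr_dest=12),     # "world" taken from R
]
T = apply_delta(R, delta, target_size=17)
```

Only the ADD payload (6 bytes here) has to travel with Δ; the COPY operations cost a few integers each.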
A Math Model for Describing File Differences
• Look at an example:
A Math Model for Describing File Differences
For better illustration, let us assume:
– Block 1 has 100 bytes
– Block 2 has 60 bytes
– Block 3 has 50 bytes
– Block 4 has 50 bytes
– Block 5 has 120 bytes
– Block 6 has 60 bytes
– Block 7 has 70 bytes
– Block 8 has 150 bytes
– Block 9 has 80 bytes
– Block 10 has 180 bytes
• A sequence of edit
operations:
1. COPY <0,100,0>
2. ADD <2nd block,60,100>
3. COPY <100,50,160>
4. ADD <4th block,50,210>
5. COPY <380,120,260>
6. ADD <6th block,60,380>
7. COPY <680,70,440>
8. COPY <150,150,510>
• This sequence is Δ.
A Math Model for Describing File Differences
• To make the representation of Δ more compact: if the edit operations are
arranged in ascending order of addrDest, no addrDest needs to be stored
explicitly, because each one equals the total size of the preceding operations.
• We can rewrite the two operations in the following formats:
• COPY <addrSrc, size>
• ADD <dataBlock, size>
Δ is then represented by:
1. COPY <0,100>
2. ADD <2nd block,60>
3. COPY <100,50>
4. ADD <4th block,50>
5. COPY <380,120>
6. ADD <6th block,60>
7. COPY <680,70>
8. COPY <150,150>
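As a quick check that nothing is lost by dropping addrDest, the destination addresses of the eight operations can be recomputed as a running sum of the sizes. The tuple layout and variable names below are assumptions made for this sketch; the numbers are the ones from the example.

```python
from itertools import accumulate

# (operation, addrSrc or block label, size) in the order of the example
ops = [
    ("COPY", 0, 100),
    ("ADD", "2nd block", 60),
    ("COPY", 100, 50),
    ("ADD", "4th block", 50),
    ("COPY", 380, 120),
    ("ADD", "6th block", 60),
    ("COPY", 680, 70),
    ("COPY", 150, 150),
]

sizes = [size for _, _, size in ops]
# addrDest of operation k is the total size of operations 0..k-1
addr_dest = [0] + list(accumulate(sizes))[:-1]
target_size = sum(sizes)

print(addr_dest)    # [0, 100, 160, 210, 260, 380, 440, 510]
print(target_size)  # 660
```

The recomputed values match the addrDest column of the original eight operations.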
A Math Model for Describing File Differences
• Two tasks remain to be solved:
1. How to create Δ, i.e., the sequence of
edit operations?
2. How to encode Δ into a file (we refer to it
as the DIFF package)?
• The primary task is an effective algorithm to
identify the common blocks, e.g., the
blocks {1,3,5,7,8} in the example.
• I don’t think I should talk about
algorithms at this conference… the
details may take half an hour.
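Without going into the algorithms the talk leaves out, a deliberately naive matcher gives a feel for task 1. The sketch below uses greedy hash-index matching: it indexes every short window of R, then scans T and emits COPY for matches and ADD for everything else. It is not the author's algorithm; `make_delta`, `apply_delta`, and the window length `block` are hypothetical names chosen for the example, and the compact `<addrSrc, size>` / `<dataBlock, size>` form is used.

```python
from typing import Dict, List, Tuple

Op = Tuple  # ("COPY", addr_src, size) or ("ADD", data, size)


def make_delta(reference: bytes, target: bytes, block: int = 4) -> List[Op]:
    """Greedy matcher: index every `block`-byte window of R, then scan T."""
    index: Dict[bytes, int] = {}
    for i in range(len(reference) - block + 1):
        index.setdefault(reference[i:i + block], i)  # keep the first occurrence

    delta: List[Op] = []
    literal = bytearray()  # bytes of T not found in R, flushed as one ADD
    pos = 0
    while pos < len(target):
        src = index.get(target[pos:pos + block])
        if src is None:
            literal.append(target[pos])
            pos += 1
            continue
        # extend the match for as long as R and T agree
        size = block
        while (src + size < len(reference) and pos + size < len(target)
               and reference[src + size] == target[pos + size]):
            size += 1
        if literal:
            delta.append(("ADD", bytes(literal), len(literal)))
            literal.clear()
        delta.append(("COPY", src, size))
        pos += size
    if literal:
        delta.append(("ADD", bytes(literal), len(literal)))
    return delta


def apply_delta(reference: bytes, delta: List[Op]) -> bytes:
    """Replay the operations in order; addrDest is implicit."""
    out = bytearray()
    for kind, arg, size in delta:
        out += reference[arg:arg + size] if kind == "COPY" else arg
    return bytes(out)


R = b"the quick brown fox jumps"
T = b"the quick red fox jumps over"
delta = make_delta(R, T)
```

A production algorithm would bound memory, pick the longest of several candidate matches, and weigh each COPY against a cost model; none of that is attempted here.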
A Math Model for Describing File Differences
Designing a diff package: an example:
A Math Model for Describing File Differences
• We answered two basic questions:
– What is differential compression?
– How to describe differential compression mathematically?
• A few questions remain:
– How to design an algorithm for differential compression?
– How to measure the efficiency of an algorithm?
• We need to introduce a cost model.
– Can we design the most efficient algorithm in terms of the cost model?
Categorizing Differential Compression
• Depending on the application, differential compression can be
categorized in different ways.
Categorizing Differential Compression
• Continued:
Categorizing Differential Compression
• Summary:
                     LDC   RDC            IDC
General file         Yes   Yes            Yes
Executable file      Yes   No study yet   No study yet
General firmware     Yes   No study yet   No study yet
Executable firmware  Yes   No study yet   No study yet
Advanced Topics
• Let us investigate three topics in depth:
1. LDC vs RDC vs IDC for general files
2. LDC for executable files
3. LDC for in-place merging
Advanced Topics
LDC vs RDC vs IDC : use cases
– How to implement Δ = T - R for three different cases in terms of space & time?
– LDC : both R and T appear in the same location at the same time.
– RDC : R and T appear in different locations at the same time.
– IDC : R and T appear in the same location at different times.
Advanced Topics
• LDC vs RDC vs IDC : architecture
Advanced Topics
• LDC vs RDC vs IDC :
Advanced Topics
• LDC vs RDC vs IDC :
Advanced Topics
• Notes:
– V(F) = view of F :
• It is a data structure that summarizes a file in an abstract yet efficient way. It
takes much less space than the original file.
• This concept VIEW was proposed by a local startup to describe the IDC
scheme.
– I found it applies to RDC scheme too.
• There are different implementations of VIEW. For example, to describe
rsync and RDC protocols, we would have two different VIEWs for the same
file.
– F2 = F1 + Δ
• where Δ = F2 - V(F1) instead of Δ = F2 - F1
– This makes RDC & IDC possible.
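One possible VIEW, shown purely as an illustration, is a list of digests of fixed-size blocks, in the spirit of rsync-style signatures. The side holding F2 computes Δ from V(F1) alone, and the side holding F1 applies it. The block size, the use of SHA-256, and the names `view`, `diff_against_view`, and `merge` are all assumptions of this sketch; the talk does not specify how V(F) is built, and real protocols use a cheap rolling checksum rather than hashing at every byte offset.

```python
import hashlib
from typing import List, Tuple

BLOCK = 4  # illustrative block size; real systems use far larger blocks


def view(data: bytes) -> List[bytes]:
    """V(F): one digest per fixed-size block of F."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def diff_against_view(f2: bytes, v1: List[bytes]) -> List[Tuple]:
    """Delta = F2 - V(F1): computed without access to F1 itself."""
    # map digest -> block number, keeping the first block for duplicate digests
    where = {digest: n for n, digest in reversed(list(enumerate(v1)))}
    delta: List[Tuple] = []
    literal = bytearray()
    pos = 0
    while pos < len(f2):
        n = where.get(hashlib.sha256(f2[pos:pos + BLOCK]).digest())
        if n is None:
            literal.append(f2[pos])
            pos += 1
            continue
        if literal:
            delta.append(("ADD", bytes(literal)))
            literal = bytearray()
        delta.append(("COPY", n))  # copy block number n of F1
        pos += BLOCK
    if literal:
        delta.append(("ADD", bytes(literal)))
    return delta


def merge(f1: bytes, delta: List[Tuple]) -> bytes:
    """F2 = F1 + delta, executed where F1 is stored."""
    out = bytearray()
    for kind, arg in delta:
        out += f1[arg * BLOCK:(arg + 1) * BLOCK] if kind == "COPY" else arg
    return bytes(out)


F1 = b"AAAABBBBCCCCDDDD"
F2 = b"AAAAxxCCCCBBBB"
delta = diff_against_view(F2, view(F1))
```

With realistic block sizes the view is a small fraction of the file, which is what allows it to be sent to another location (RDC) or kept for a later time (IDC) in place of F1.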
Advanced Topics
• LDC for executable files:
Advanced Topics
LDC for executable files: for a specific
ISA
– Differential compression algorithms
identify changes between files.
– General files: all changes are just
changes.
– Executable files:
• Primary change: instructions are altered
due to source code changes.
• Secondary change: an instruction is altered
at the byte level due to code change
happening at other addresses.
• We use JUMP as an example to illustrate
the concept. A JUMP instruction is a few
bytes that encode the distance between
the source and destination.
Advanced Topics
• LDC for executable files: how to reduce the diff?
– A mathematical model is necessary.
– For example: removing secondary change for JUMP:
• The secondary code change causes instr1 ≠ instr2.
• Given the file R, if we can derive instr2 from instr1, we can replace instr1 in R
with instr2.
• Applying the same substitution to all such instructions in R transforms R into
another file, denoted P(R,H(R,T)), where H stands for hints. We have the
new formal representation:
Advanced Topics
• LDC for executable files
• Mathematical Modeling:
– Let's start with common code blocks between two versions:
• Assume we can identify them with symbol tables or code alignment algorithms.
Advanced Topics
• A Mathematical Model:
– How to derive a new JUMP instruction from an old one?
Advanced Topics
• Mathematical Modeling:
– How to derive a new JUMP instruction from an old one?
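• instr2 = Encode(destAddr2 - srcAddr2)
• destAddr2 - srcAddr2
  = (destAddr1 - srcAddr1) + (destAddr2 - srcAddr2) - (destAddr1 - srcAddr1)
  = Decode(instr1) + (destAddr2 - destAddr1) - (srcAddr2 - srcAddr1)
  = Decode(instr1) + (destBlkAddr2 - destBlkAddr1) - (srcBlkAddr2 - srcBlkAddr1)
• The last step holds because the source and the destination each lie in a common
code block at the same offset in both versions, so each address moves by exactly
as much as the start of its block.
– instr2 = Encode(Decode(instr1) + (destBlkAddr2 - destBlkAddr1) - (srcBlkAddr2 -
srcBlkAddr1))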
– We can do the same for other instructions, such as those that hold data pointers.
– All such instructions, whether JUMPs or data pointers, are called profitable
instructions.
– A solution is an algorithm that identifies all the profitable instructions and
removes all secondary changes accordingly.
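The derivation can be checked numerically with a toy encoding. The sketch below assumes, purely for illustration, that a JUMP stores its displacement as a little-endian signed 32-bit integer; actual instruction encodings depend on the ISA. The function names and the example addresses are hypothetical.

```python
import struct


def decode(instr: bytes) -> int:
    """Assumed encoding: the displacement as a little-endian signed 32-bit integer."""
    return struct.unpack("<i", instr)[0]


def encode(displacement: int) -> bytes:
    return struct.pack("<i", displacement)


def derive_instr2(instr1: bytes,
                  src_blk_addr1: int, src_blk_addr2: int,
                  dest_blk_addr1: int, dest_blk_addr2: int) -> bytes:
    """instr2 = Encode(Decode(instr1) + dest block shift - src block shift)."""
    shift = (dest_blk_addr2 - dest_blk_addr1) - (src_blk_addr2 - src_blk_addr1)
    return encode(decode(instr1) + shift)


# Version 1: a JUMP at 0x1010 (block starting at 0x1000) targets 0x2020
# (block starting at 0x2000).
instr1 = encode(0x2020 - 0x1010)

# Version 2: 0x40 bytes were inserted between the two blocks, so the
# destination block now starts at 0x2040; the source block did not move.
instr2 = derive_instr2(instr1,
                       src_blk_addr1=0x1000, src_blk_addr2=0x1000,
                       dest_blk_addr1=0x2000, dest_blk_addr2=0x2040)
```

The derived bytes equal what a direct encoding of the new displacement would produce, so once the four block addresses are known the altered JUMP carries no new information.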
Advanced Topics
• LDC for in-place merging: the 3rd topic
• Use case:
– Mobile phone FOTA (Firmware Over The Air).
– A firmware image can be considered a file consisting of a sequence of
fixed-size code blocks.
– A phone has limited memory space for firmware updating.
• Updating must be implemented using an in-place algorithm.
NOTE: work space required for general
merge: Size(R)+Size(T)+Size(Δ)
Advanced Topics
• LDC for in-place merging
Advanced Topics
• LDC for in-place merging:
– Block-based differential compression
• Block dependency between the two versions of the firmware
• Topological sorting to create a sequence of block numbers based on block
precedence.
– In-place merging
• Block-based merging
• Blocks are written in the order given by the sequence of block numbers.
That is a very interesting technique!
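The ordering step can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The sketch assumes that each new block is assembled in a small RAM buffer before being written, and that writing new block j overwrites old block j. The name `write_order` and the `reads` mapping are hypothetical; how the dependencies are extracted from Δ, and how cyclic dependencies are broken, are not covered here.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def write_order(reads: Dict[int, Set[int]]) -> List[int]:
    """reads[i] = old blocks of R that new block i of T copies from.

    Writing new block j destroys old block j, so every block i (i != j) that
    still needs old block j has to be written before block j.
    """
    must_precede: Dict[int, Set[int]] = {i: set() for i in reads}
    for i, sources in reads.items():
        for j in sources:
            if j != i and j in must_precede:
                must_precede[j].add(i)  # block i is written before block j
    # static_order() raises graphlib.CycleError if the dependencies are cyclic
    return list(TopologicalSorter(must_precede).static_order())


reads = {
    0: {0},    # new block 0 is built from old block 0 only
    1: {2},    # new block 1 needs old block 2  -> write 1 before 2
    2: {0},    # new block 2 needs old block 0  -> write 2 before 0
    3: set(),  # new block 3 is pure ADD data
}
order = write_order(reads)
```

If two blocks need each other's old contents, no valid order exists and the sorter raises `CycleError`; a common way out is to turn one of the offending COPY operations into an ADD, at the cost of a larger Δ.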
Summary
• Background of differential compression
• A mathematical model for differential compression
• Categorizing differential compression from two perspectives:
– <time, location>
– <file type, merging style>
• Three advanced topics:
– Comparing three differential compression schemes
– Differential compression of executable files
– In-place file merging with LDC
Q & A
Thank You for Your Attention!

Contenu connexe

Similaire à Taxonomy of Differential Compression Techniques

Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsLiwei Ren任力偉
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentTasktop
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsLiwei Ren任力偉
 
NTC 300 Enhance teaching - snaptutorial.com
NTC 300  Enhance teaching - snaptutorial.comNTC 300  Enhance teaching - snaptutorial.com
NTC 300 Enhance teaching - snaptutorial.comDavisMurphyA64
 
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and OptimizeISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and OptimizeChristoph Adler
 
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos Engineering
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos EngineeringNTC/300 ENTIRE CLASS UOP TUTORIALS Drotos Engineering
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos EngineeringSharon Reynolds
 
Software Design PatternsConsider a company migrating to a third-p.pdf
Software Design PatternsConsider a company migrating to a third-p.pdfSoftware Design PatternsConsider a company migrating to a third-p.pdf
Software Design PatternsConsider a company migrating to a third-p.pdfarorastores
 
Devry CIS 246 Full Course Latest
Devry CIS 246 Full Course LatestDevry CIS 246 Full Course Latest
Devry CIS 246 Full Course LatestAtifkhilji
 
[2015/2016] Software systems engineering PRINCIPLES
[2015/2016] Software systems engineering PRINCIPLES[2015/2016] Software systems engineering PRINCIPLES
[2015/2016] Software systems engineering PRINCIPLESIvano Malavolta
 
Dita ot pipeline webinar
Dita ot pipeline webinarDita ot pipeline webinar
Dita ot pipeline webinarSuite Solutions
 
Key Skills Required for Data Engineering
Key Skills Required for Data EngineeringKey Skills Required for Data Engineering
Key Skills Required for Data EngineeringFibonalabs
 
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docxCHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docxcravennichole326
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowDatabricks
 
Informatica_Rajesh-CV 28_03_16
Informatica_Rajesh-CV 28_03_16Informatica_Rajesh-CV 28_03_16
Informatica_Rajesh-CV 28_03_16Rajesh Dheeti
 

Similaire à Taxonomy of Differential Compression Techniques (20)

Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
 
NTC 300 Enhance teaching - snaptutorial.com
NTC 300  Enhance teaching - snaptutorial.comNTC 300  Enhance teaching - snaptutorial.com
NTC 300 Enhance teaching - snaptutorial.com
 
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and OptimizeISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
 
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos Engineering
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos EngineeringNTC/300 ENTIRE CLASS UOP TUTORIALS Drotos Engineering
NTC/300 ENTIRE CLASS UOP TUTORIALS Drotos Engineering
 
Software Design PatternsConsider a company migrating to a third-p.pdf
Software Design PatternsConsider a company migrating to a third-p.pdfSoftware Design PatternsConsider a company migrating to a third-p.pdf
Software Design PatternsConsider a company migrating to a third-p.pdf
 
Devry CIS 246 Full Course Latest
Devry CIS 246 Full Course LatestDevry CIS 246 Full Course Latest
Devry CIS 246 Full Course Latest
 
Shivaprasada_Kodoth
Shivaprasada_KodothShivaprasada_Kodoth
Shivaprasada_Kodoth
 
[2015/2016] Software systems engineering PRINCIPLES
[2015/2016] Software systems engineering PRINCIPLES[2015/2016] Software systems engineering PRINCIPLES
[2015/2016] Software systems engineering PRINCIPLES
 
Day1
Day1Day1
Day1
 
IEC.ppt
IEC.pptIEC.ppt
IEC.ppt
 
Dita ot pipeline webinar
Dita ot pipeline webinarDita ot pipeline webinar
Dita ot pipeline webinar
 
Key Skills Required for Data Engineering
Key Skills Required for Data EngineeringKey Skills Required for Data Engineering
Key Skills Required for Data Engineering
 
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docxCHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflow
 
Informatica_Rajesh-CV 28_03_16
Informatica_Rajesh-CV 28_03_16Informatica_Rajesh-CV 28_03_16
Informatica_Rajesh-CV 28_03_16
 

Plus de Liwei Ren任力偉

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural NetworkLiwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based SecurityLiwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemLiwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsLiwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsLiwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemLiwei Ren任力偉
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsLiwei Ren任力偉
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and ApplicationsLiwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyLiwei Ren任力偉
 

Plus de Liwei Ren任力偉 (20)

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
 
Math stories
Math storiesMath stories
Math stories
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
 

Dernier

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Dernier (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
WordPress Websites for Engineers: Elevate Your Brand

Taxonomy of Differential Compression Techniques

  • 1. Taxonomy of Differential Compression. Liwei Ren, Ph.D, Trend Micro™. SNIA SDC 2015, Santa Clara, California, Sept. 2015. Copyright 2011 Trend Micro Inc.
  • 2. Background
      • Liwei Ren, Ph.D
          – Research interests
              • Data security, network security, data compression, math modeling & algorithms.
          – Major works
              • 10+ academic papers;
              • 20+ US patents granted, and a few more pending;
              • Co-founded a data security company in Silicon Valley with a successful exit.
          – Education
              • MS/BS in mathematics, Tsinghua University, Beijing
              • Ph.D in mathematics, MS in information science, University of Pittsburgh
      • Trend Micro™
          – Global security software company with headquarters in Tokyo, and R&D centers in Silicon Valley, Nanjing and Taipei;
          – One of the top security software vendors.
          – A leader in cloud security.
  • 3. Agenda
      • Introduction
      • A Math Model for Describing File Differences
      • Categorizing Differential Compression
      • Advanced Topics
      • Summary
  • 4. Introduction
      • Objectives of this talk:
          – Understand what differential compression is and what its applications are.
          – Learn a mathematical model for describing differential compression
          – Know the categories of differential compression
          – Be aware of a few advanced topics
  • 5. Introduction
      • Lossless data compression: three categories
      • Two purposes:
          – Network data transfer acceleration
          – Storage space reduction
  • 6. Introduction
      • Today we talk about Differential Compression (DC).
      • Why do I talk about differential compression?
          – I have been designing various algorithms for differential compression since 2002 for a few domains:
              • FOTA (Firmware Over The Air) for mobile phones
              • Incremental update of data files for security software.
              • File synchronization & transfer over WAN
              • Differential compression for executable files
              • …
          – It is time to summarize the various problems & techniques in a systematic view:
              • I may write an academic book on this.
  • 7. Introduction
      • What is differential compression?
  • 8. Introduction
      • To reduce network bandwidth cost:
      • To reduce storage cost:
  • 9. Introduction
      • Applications:
          – Data backup
              • Remote data backup
          – Revision control systems
          – Software vulnerability & patch management
              • FOTA (firmware over the air)
              • Malware signature update
          – File synchronization and transfer
          – Distributed file systems
          – Cloud data migration
  • 10. A Math Model for Describing File Differences
      • In the formal representation Δ = T – R, what do we mean by “–” and Δ?
      • There are a few approaches to describing DIFF.
          – Here is one.
      • Diff Model: a math model to describe the “differences” between T and R:
          – Δ is basically a procedure that transforms the reference file R into the target file T.
              • To be specific, Δ is a sequence of string edit operations for reconstructing T from R.
          – Two edit operations, COPY & ADD:
              • COPY (addrSrc, size, addrDest): copy a block of size bytes from offset addrSrc in the reference file to offset addrDest in the target file.
              • ADD (dataBlock, size, addrDest): add a block of new data of size bytes to the target file at offset addrDest.
  • 11. A Math Model for Describing File Differences
      • Look at an example:
  • 12. A Math Model for Describing File Differences
      • For better illustration, let us assume:
          – Block 1 has 100 bytes
          – Block 2 has 60 bytes
          – Block 3 has 50 bytes
          – Block 4 has 50 bytes
          – Block 5 has 120 bytes
          – Block 6 has 60 bytes
          – Block 7 has 70 bytes
          – Block 8 has 150 bytes
          – Block 9 has 80 bytes
          – Block 10 has 180 bytes
      • A sequence of edit operations:
          1. COPY <0,100,0>
          2. ADD <2nd block,60,100>
          3. COPY <100,50,160>
          4. ADD <4th block,50,210>
          5. COPY <380,120,260>
          6. ADD <6th block,60,380>
          7. COPY <680,70,440>
          8. COPY <150,150,510>
      • This sequence is Δ.
  • 13. A Math Model for Describing File Differences
      • To optimize the representation of Δ, arrange the edit operations in ascending order of addrDest. When the operations tile T without gaps, as in the example, the addrDest values need not be stored explicitly: each one equals the sum of the sizes of the preceding operations.
      • We can rewrite the two operations in the following formats:
          – COPY <addrSrc, size>
          – ADD <dataBlock, size>
      • Δ is then represented by:
          1. COPY <0,100>
          2. ADD <2nd block,60>
          3. COPY <100,50>
          4. ADD <4th block,50>
          5. COPY <380,120>
          6. ADD <6th block,60>
          7. COPY <680,70>
          8. COPY <150,150>
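The merge side of the model can be illustrated with a minimal Python sketch that replays the compact COPY <addrSrc, size> / ADD <dataBlock, size> sequence from slide 13. The function name `apply_delta`, the tuple encoding of the operations, and the filler bytes are assumptions made for illustration. The layout of R is not given in the transcript (it was shown in a figure); the order 1, 3, 8, 9, 5, 10, 7 used here is only an inference from the COPY offsets and block sizes on slide 12.

```python
# Block sizes from slide 12; block k is filled with the byte value k (hypothetical content).
SIZES = {1: 100, 2: 60, 3: 50, 4: 50, 5: 120, 6: 60, 7: 70, 8: 150, 9: 80, 10: 180}
BLOCKS = {k: bytes([k]) * n for k, n in SIZES.items()}

# Inferred layout of the reference file R (assumption, consistent with the COPY offsets).
R = b"".join(BLOCKS[k] for k in (1, 3, 8, 9, 5, 10, 7))
# The target file T consists of blocks 1..8 in order.
T = b"".join(BLOCKS[k] for k in range(1, 9))

# Delta in the compact form: ("COPY", addrSrc, size) or ("ADD", dataBlock).
DELTA = [
    ("COPY", 0, 100),
    ("ADD", BLOCKS[2]),
    ("COPY", 100, 50),
    ("ADD", BLOCKS[4]),
    ("COPY", 380, 120),
    ("ADD", BLOCKS[6]),
    ("COPY", 680, 70),
    ("COPY", 150, 150),
]


def apply_delta(reference: bytes, delta: list) -> bytes:
    """Rebuild the target by appending the result of each operation in order."""
    target = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, addr_src, size = op
            if addr_src < 0 or addr_src + size > len(reference):
                raise ValueError("COPY reads outside the reference file")
            target += reference[addr_src:addr_src + size]
        elif op[0] == "ADD":
            target += op[1]
        else:
            raise ValueError(f"unknown operation: {op[0]!r}")
    return bytes(target)


rebuilt = apply_delta(R, DELTA)
```

Because the operations are applied in ascending destination order, the current length of `target` plays the role of the implicit addrDest.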
  • 14. A Math Model for Describing File Differences
      • Two tasks remain to be solved:
          1. How to create Δ, i.e., the sequence of edit operations?
          2. How to encode Δ into a file (we refer to it as the DIFF package)?
      • The first task calls for an effective algorithm to identify the common blocks, e.g., the blocks {1,3,5,7,8} shown on the right side.
      • I don’t think I should talk about algorithms at this conference… the details may take half an hour.
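The talk deliberately skips the algorithm for the first task. Purely as an illustration of what "identifying common blocks" can look like, here is a minimal greedy sketch in Python: it indexes every fixed-length seed of R in a dictionary, scans T, and extends each seed hit into a longer COPY. This is a generic textbook-style approach swapped in for illustration, not the speaker's algorithm; the names `make_delta`, `apply_delta`, and the `seed` parameter are assumptions.

```python
import random


def make_delta(reference: bytes, target: bytes, seed: int = 16) -> list:
    """Greedy sketch: emit ("COPY", addrSrc, size) for matches, ("ADD", data) otherwise."""
    # Index the first occurrence of every seed-length substring of the reference.
    index = {}
    for i in range(len(reference) - seed + 1):
        index.setdefault(reference[i:i + seed], i)

    delta, literal, pos = [], bytearray(), 0
    while pos < len(target):
        src = index.get(target[pos:pos + seed]) if pos + seed <= len(target) else None
        if src is None:
            literal.append(target[pos])  # no match here: byte becomes part of an ADD
            pos += 1
            continue
        # Extend the match forward as far as both files agree.
        size = seed
        while (src + size < len(reference) and pos + size < len(target)
               and reference[src + size] == target[pos + size]):
            size += 1
        if literal:
            delta.append(("ADD", bytes(literal)))
            literal.clear()
        delta.append(("COPY", src, size))
        pos += size
    if literal:
        delta.append(("ADD", bytes(literal)))
    return delta


def apply_delta(reference: bytes, delta: list) -> bytes:
    """Replay the operations to rebuild the target (used here for a round-trip check)."""
    out = bytearray()
    for op in delta:
        out += reference[op[1]:op[1] + op[2]] if op[0] == "COPY" else op[1]
    return bytes(out)


rng = random.Random(0)
a, b, c = (bytes(rng.randrange(256) for _ in range(200)) for _ in range(3))
ref = a + b + c
tgt = a + b"brand new data" + c + b   # one insertion and one block move
delta = make_delta(ref, tgt)
```

The sketch ignores the cost model raised on slide 16: it takes the first seed hit rather than the best one, so the delta it produces is valid but not necessarily the smallest.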
  • 15. A Math Model for Describing File Differences
      • Designing a diff package: an example:
  • 16. A Math Model for Describing File Differences
      • We answered two basic questions:
          – What is differential compression?
          – How to describe differential compression mathematically?
      • A few questions remain:
          – How to design an algorithm for differential compression?
          – How to measure the efficiency of an algorithm?
              • We need to introduce a cost model.
          – Can we design the most efficient algorithm in terms of the cost model?
  • 17. Categorizing Differential Compression
      • Depending on the application, differential compression can be categorized in different ways.
  • 18. Categorizing Differential Compression
      • Continued:
  • 19. Categorizing Differential Compression
      • Summary:

                                LDC    RDC             IDC
          General file          Yes    Yes             Yes
          Executable file       Yes    No study yet    No study yet
          General firmware      Yes    No study yet    No study yet
          Executable firmware   Yes    No study yet    No study yet
  • 20. Advanced Topics
      • Let us investigate three topics in depth:
          1. LDC vs RDC vs IDC for general files
          2. LDC for executable files
          3. LDC for in-place merging
  • 21. Advanced Topics
      • LDC vs RDC vs IDC: use cases
          – How to implement Δ = T – R for three different cases in terms of space & time?
          – LDC: both R and T appear in the same location at the same time.
          – RDC: R and T appear in different locations at the same time.
          – IDC: R and T appear in the same location at different times.
  • 22. Advanced Topics
      • LDC vs RDC vs IDC: architecture
  • 23. Advanced Topics
      • LDC vs RDC vs IDC:
  • 24. Advanced Topics
      • LDC vs RDC vs IDC:
  • 25. Advanced Topics
      • Notes:
          – V(F) = view of F:
              • It is a data structure that summarizes a file in an abstract yet efficient way. It takes much less space than the original file.
              • The VIEW concept was proposed by a local startup to describe the IDC scheme.
                  – I found it applies to the RDC scheme too.
              • There are different implementations of VIEW. For example, to describe the rsync and RDC protocols, we would have two different VIEWs for the same file.
          – F2 = F1 + Δ
              • where Δ = F2 – V(F1) instead of Δ = F2 – F1
          – This makes RDC & IDC possible.
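To make Δ = F2 – V(F1) concrete, here is a minimal Python sketch in which the view is simply a table of SHA-256 digests of the fixed-size blocks of F1. The diff side sees only F2 and V(F1); the merge side holds F1 and applies Δ. This particular view, the block size, and the names `make_view`, `diff_against_view`, and `merge` are assumptions chosen for illustration. It is not the VIEW of the startup mentioned on the slide, and it is not the rsync protocol, which additionally uses a rolling weak checksum so that the sliding scan is cheap.

```python
import hashlib
import random


def make_view(f1: bytes, block: int) -> dict:
    """V(F1): map the digest of each aligned block of F1 to its offset."""
    view = {}
    for off in range(0, len(f1) - block + 1, block):
        view.setdefault(hashlib.sha256(f1[off:off + block]).digest(), off)
    return view


def diff_against_view(f2: bytes, view: dict, block: int) -> list:
    """Compute a delta for F2 using only the view, never F1 itself."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(f2):
        chunk = f2[pos:pos + block]
        src = view.get(hashlib.sha256(chunk).digest()) if len(chunk) == block else None
        if src is None:
            literal.append(f2[pos])  # unmatched byte goes into an ADD
            pos += 1
        else:
            if literal:
                delta.append(("ADD", bytes(literal)))
                literal.clear()
            delta.append(("COPY", src, block))
            pos += block
    if literal:
        delta.append(("ADD", bytes(literal)))
    return delta


def merge(f1: bytes, delta: list) -> bytes:
    """F2 = F1 + delta, performed where F1 is actually available."""
    out = bytearray()
    for op in delta:
        out += f1[op[1]:op[1] + op[2]] if op[0] == "COPY" else op[1]
    return bytes(out)


rng = random.Random(0)
f1 = bytes(rng.randrange(256) for _ in range(4096))
f2 = f1[:1024] + b"new data" + f1[2048:]      # a local edit in the middle
view = make_view(f1, block=256)
delta = diff_against_view(f2, view, block=256)
view_bytes = 32 * len(view)                   # 32-byte digest per block
```

In this example the view occupies 512 bytes of digests for a 4096-byte file, which is the sense in which a view "takes much less space than the original file"; larger blocks shrink the view further at the price of coarser matching.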
  • 26. Advanced Topics
      • LDC for executable files:
  • 27. Advanced Topics
      • LDC for executable files: for a specific ISA
          – Differential compression algorithms identify changes between files.
          – General files: all changes are just changes.
          – Executable files:
              • Primary change: instructions are altered due to source code changes.
              • Secondary change: an instruction is altered at the byte level due to code changes at other addresses.
              • We use JUMP as an example to illustrate the concept. A JUMP instruction is a few bytes that encode the distance between the source and destination.
  • 28. Advanced Topics
      • LDC for executable files: how to reduce the diff?
          – A mathematical model is necessary.
          – For example: removing secondary changes for JUMP:
              • The secondary code change causes instr1 ≠ instr2.
              • Given the file R, if we can derive instr2 from instr1, we can replace instr1 in R with instr2.
              • If we do the same substitution for all such instructions in R, we transform R into another file, denoted P(R,H(R,T)), where H stands for hints. We have the new formal representation:
  • 29. Advanced Topics
      • LDC for executable files
      • Mathematical modeling:
          – Let's start with the common code blocks between two versions:
              • Assume we can identify them with symbol tables or code alignment algorithms.
  • 30. Advanced Topics
      • A mathematical model:
          – How to derive a new JUMP instruction from an old one?
  • 31. Advanced Topics
      • Mathematical modeling:
          – How to derive a new JUMP instruction from an old one?
              • instr2 = Encode(destAddr2 – srcAddr2)
              • destAddr2 – srcAddr2
                  = (destAddr1 – srcAddr1) + (destAddr2 – srcAddr2) – (destAddr1 – srcAddr1)
                  = Decode(instr1) + (destAddr2 – destAddr1) – (srcAddr2 – srcAddr1)
                  = Decode(instr1) + (destBlkAddr2 – destBlkAddr1) – (srcBlkAddr2 – srcBlkAddr1)
              • The last step holds because the instruction and its target keep the same offsets inside their common blocks, so each address shifts by exactly as much as its block does.
          – instr2 = Encode(Decode(instr1) + (destBlkAddr2 – destBlkAddr1) – (srcBlkAddr2 – srcBlkAddr1))
          – Other instructions, such as data pointers, can be handled similarly.
          – All such instructions, e.g., JUMP or data pointers, are called profitable instructions.
          – A solution is an algorithm that identifies all the profitable instructions and removes all secondary changes accordingly.
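The derivation on slide 31 can be checked numerically with a small Python sketch. The instruction format is a hypothetical toy encoding (one opcode byte followed by a signed 32-bit little-endian displacement equal to destAddr – srcAddr), chosen only so that Encode and Decode are concrete; real ISAs differ in where the displacement is measured from. The result is wrapped in Encode, since instr2 is an encoded instruction rather than a raw distance. The function and parameter names are assumptions.

```python
import struct

OPCODE = 0xE9  # hypothetical opcode byte for the toy JUMP


def encode(disp: int) -> bytes:
    """Toy Encode: opcode + signed 32-bit little-endian displacement (5 bytes)."""
    return struct.pack("<Bi", OPCODE, disp)


def decode(instr: bytes) -> int:
    """Toy Decode: recover the displacement from an encoded JUMP."""
    opcode, disp = struct.unpack("<Bi", instr)
    if opcode != OPCODE:
        raise ValueError("not a JUMP instruction")
    return disp


def derive_jump(instr1: bytes,
                src_blk_addr1: int, src_blk_addr2: int,
                dest_blk_addr1: int, dest_blk_addr2: int) -> bytes:
    """Derive instr2 from instr1 using only how far the two common blocks moved."""
    shift = (dest_blk_addr2 - dest_blk_addr1) - (src_blk_addr2 - src_blk_addr1)
    return encode(decode(instr1) + shift)


# Version 1: the JUMP sits at offset 0x10 of a block at 0x1000 and targets
# offset 0x20 of a block at 0x2000.
src_addr1, dest_addr1 = 0x1000 + 0x10, 0x2000 + 0x20
# Version 2: inserted code elsewhere moved the blocks to 0x1040 and 0x2100.
src_addr2, dest_addr2 = 0x1040 + 0x10, 0x2100 + 0x20

instr1 = encode(dest_addr1 - src_addr1)
instr2_direct = encode(dest_addr2 - src_addr2)
instr2_derived = derive_jump(instr1, 0x1000, 0x1040, 0x2000, 0x2100)
```

Here the bytes of `instr1` and `instr2_direct` differ even though the source code of the jump did not change, which is exactly a secondary change; deriving `instr2_derived` on the merge side means those bytes do not have to be carried in Δ.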
  • 32. Advanced Topics
      • LDC for in-place merging: the 3rd topic
      • Use case:
          – Mobile phone FOTA (Firmware Over The Air).
          – A firmware image can be considered as a file consisting of a sequence of code blocks (of fixed size).
          – A phone has limited memory space for firmware updating.
              • Updating must be implemented using an in-place algorithm.
      • NOTE: work space required for a general merge: Size(R)+Size(T)+Size(Δ)
  • 33. Advanced Topics
      • LDC for in-place merging
  • 34. Advanced Topics
      • LDC for in-place merging:
          – Block-based differential compression
              • Block dependency between two versions of firmware
              • Topological sorting to create a sequence of block numbers based on block precedence.
          – In-place merging
              • Block-based merging
              • Blocks are written in the order given by the sequence of block numbers.
      • That is a very interesting technique!
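The block-ordering idea can be sketched in a few lines of Python using the standard-library `graphlib` module (Python 3.9+). The model is deliberately simplified and is an assumption, not the speaker's design: the old and new images have the same number of fixed-size blocks, and each new block is either a whole-block COPY of one old block or an ADD of new data. If new block i copies old block j, then block i must be written before block j is overwritten, which gives the precedence "i before j".

```python
from graphlib import TopologicalSorter, CycleError


def write_order(plan: list) -> list:
    """Order block writes so that every old block is read before it is overwritten."""
    ts = TopologicalSorter()
    for i, op in enumerate(plan):
        ts.add(i)
        if op[0] == "COPY" and op[1] != i:
            ts.add(op[1], i)  # block i reads old block op[1], so i precedes op[1]
    return list(ts.static_order())  # raises CycleError on cyclic dependencies


def merge_in_place(image: bytearray, plan: list, block: int) -> None:
    """Rewrite the image block by block; only one block of work space is needed."""
    for i in write_order(plan):
        op = plan[i]
        if op[0] == "COPY":
            src = op[1] * block
            data = bytes(image[src:src + block])
        else:
            data = op[1]
        assert len(data) == block, "every block must have the fixed size"
        image[i * block:(i + 1) * block] = data


# Old image: 4 blocks of 4 bytes. The new image shifts blocks 0 and 1 to the right.
image = bytearray(b"AAAABBBBCCCCDDDD")
plan = [("ADD", b"XXXX"), ("COPY", 0), ("COPY", 1), ("COPY", 3)]
order = write_order(plan)
merge_in_place(image, plan, block=4)

# A swap of two blocks is a cyclic dependency that ordering alone cannot resolve.
try:
    write_order([("COPY", 1), ("COPY", 0)])
    swap_is_cyclic = False
except CycleError:
    swap_is_cyclic = True
```

Writing the blocks in ascending order would corrupt this example, because block 1 would be overwritten before block 2 copies it. Cycles such as the swap above need separate handling; one option, stated here as an assumption rather than something the slides describe, is to turn one COPY in the cycle into an ADD that carries the data in Δ.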
  • 35. Summary
      • Background of differential compression
      • A mathematical model for differential compression
      • Categorizing differential compression from two perspectives:
          – <time, location>
          – <file type, merging style>
      • Three advanced topics:
          – Comparing three differential compression schemes
          – Differential compression of executable files
          – In-place file merging with LDC
  • 36. Q & A
      • Thank You for Your Attention!