Virtual machine (VM) replication (replicating the state of a primary VM running on a primary node to a secondary VM running on a secondary node) is a well-known technique for providing application-agnostic, non-stop service. Unfortunately, existing VM replication approaches suffer from excessive replication overhead, and, for client-server systems, the secondary VM does not actually need to match the primary VM's machine state at all times.
In this paper, we propose COLO (COarse-grain LOck-stepping virtual machine solution for non-stop service), a generic and highly efficient non-stop service solution based on on-demand VM replication. COLO monitors the output responses of the primary and secondary VMs, and considers the secondary VM a valid replica of the primary VM as long as the network responses generated by the secondary VM match those of the primary. The primary VM's state is propagated to the secondary VM if and only if the outputs from the secondary and primary servers no longer match.
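To make the comparison idea concrete, here is a minimal sketch of the output-comparison loop, assuming hypothetical packet streams and a checkpoint callback (the real COLO compares network packets inside the hypervisor, not Python lists):

```python
# Toy model of COLO's on-demand replication decision (illustrative only).
from typing import Callable, Iterable, List

def colo_compare(primary_out: Iterable[bytes],
                 secondary_out: Iterable[bytes],
                 checkpoint: Callable[[], None]) -> List[bytes]:
    """Release packets while both VMs agree; checkpoint on divergence."""
    released = []
    for p, s in zip(primary_out, secondary_out):
        if p != s:
            # Outputs diverged: propagate primary state to the secondary
            # on demand, then keep serving from the primary's output.
            checkpoint()
        released.append(p)  # the primary's response is what the client sees
    return released

# Two response streams that diverge at the third packet:
primary   = [b"HTTP/1.1 200 OK", b"len=42", b"ts=1001"]
secondary = [b"HTTP/1.1 200 OK", b"len=42", b"ts=1007"]
print(colo_compare(primary, secondary, lambda: print("checkpoint!")))
```

The on-demand checkpoint is the key saving: state is transferred only when the replicas actually disagree.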
4. Non-Stop Service with VM Replication
• Typical Non-stop Service Requires
– Expensive hardware for redundancy
– Extensive software customization
• VM Replication: A Cheap, Application-agnostic Solution
5. Existing VM Replication Approaches
• Replication Per Instruction: Lock-stepping
– Execute in parallel for deterministic instructions
– Lock and step for non-deterministic instructions
• Replication Per Epoch: Continuous Checkpoint (see the sketch below)
– Secondary VM is synchronized with the Primary VM once per epoch
– Output is buffered within an epoch
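For contrast, a toy sketch of the "replication per epoch" flow; the names and prints merely simulate the mechanism, but they show why output must be buffered until each checkpoint commits:

```python
# Toy model of a Remus-style epoch loop (simulation, not Xen/Remus code).

def epoch_loop(epochs: int) -> None:
    buffered = []                              # output held back within an epoch
    for epoch in range(epochs):
        # 1. Run the guest for one epoch; buffer any output it produces,
        #    since clients must not see output from unreplicated state.
        buffered.append(f"response from epoch {epoch}")
        # 2. Pause the guest and ship the dirty pages/device state.
        print(f"epoch {epoch}: checkpoint sent to secondary")
        # 3. Release buffered output only after the checkpoint commits --
        #    this buffering is the source of the extra client latency.
        for pkt in buffered:
            print("  release:", pkt)
        buffered.clear()

epoch_loop(3)
```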
6. Problems
• Lock-stepping
– Excessive replication overhead
• Memory accesses in an MP (multi-processor) guest are non-deterministic (see the sketch below)
• Continuous Checkpoint
– Extra network latency
– Excessive VM checkpoint overhead
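A toy illustration of why MP-guest memory accesses are non-deterministic: the same two vCPU operations on shared memory give different results under different interleavings, which is exactly what lock-stepping must serialize (a simulation, not guest code):

```python
# Two "vCPUs" touch the same memory word; the result depends on ordering.

def run(schedule: str) -> int:
    mem = {"x": 1}
    ops = {
        "A": lambda: mem.update(x=mem["x"] + 1),   # vCPU0: x += 1
        "B": lambda: mem.update(x=mem["x"] * 2),   # vCPU1: x *= 2
    }
    for cpu in schedule:
        ops[cpu]()
    return mem["x"]

# Lock-stepping must serialize such accesses, which is the overhead:
print(run("AB"), run("BA"))   # 4 vs 3 -- order changes the result
```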
8. Why COarse-grain LOck-stepping (COLO)
• VM Replication is an overly strong condition
– Why do we care about the VM state?
• The client cares about responses only
– Can we fail over without "precise VM state replication"?
• Coarse-grain lock-stepping VMs
– Secondary VM is a valid replica as long as it has generated the same responses as the primary so far
• Able to fail over without stopping the service
Non-stop service focuses on server responses, not internal machine state!
9. How COLO Works
• Response Model for C/S System (reconstructed below)
– I_j & E_j are the j-th request and the execution result of a non-deterministic instruction
– Each response packet R_i from the equation is a semantic response
• Failover at the k-th packet succeeds if C_i = R_i^S for all i <= k
(C is the packet series the client received; R^S is the secondary VM's response series)
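The formulas on this slide did not survive extraction; a plausible reconstruction, consistent with the bullet text above, is:

```latex
% Response model of a client/server system (reconstructed notation;
% R^P / R^S denote the primary's / secondary's response streams).
\[
  R_i \;=\; f(I_1, E_1,\; I_2, E_2,\; \ldots,\; I_i, E_i)
\]
% Failover at the k-th packet succeeds when every packet the client has
% already received is one the secondary would also have generated:
\[
  C_i \;=\; R^{S}_i \qquad \text{for all } i \le k
\]
```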
10. Architecture of COLO
COarse-grain LOck-stepping Virtual Machine for Non-stop Service
11. Why Better
• Compared with continuous VM checkpointing
– No buffering-introduced latency
– Lower checkpoint frequency
• On-demand vs. periodic
• Compared with lock-stepping
– Eliminates the excessive overhead of non-deterministic instruction execution caused by MP-guest memory accesses
13. Performance Challenges
• Frequency of Checkpoint
– Highly dependent on the output similarity, or response similarity
• Key focus is TCP packets!
• Cost of Checkpoint
– Xen/Remus uses passive checkpoints
• Secondary VM is not resumed until failover (slow path)
– COLO implements active checkpoints (see the sketch below)
• Secondary VM resumes frequently
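A toy contrast of the two checkpoint styles, with illustrative names (real checkpoints move dirty pages and device state rather than a Python dict):

```python
from typing import Callable, Dict

def passive_checkpoint(image: Dict, delta: Dict) -> None:
    # Remus-style: the stored image is updated, but the secondary VM
    # stays suspended and only boots from it at failover (slow path).
    image.update(delta)

def active_checkpoint(image: Dict, delta: Dict,
                      resume: Callable[[Dict], None]) -> None:
    # COLO-style: after applying the delta, the secondary VM resumes
    # immediately so it keeps executing and producing comparable output.
    image.update(delta)
    resume(image)

state = {"pc": 0x1000}
active_checkpoint(state, {"pc": 0x2000},
                  lambda s: print("secondary resumed at", hex(s["pc"])))
```

With active checkpoints the secondary keeps executing between synchronizations, which is what lets COLO compare its output against the primary's.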
14. Improving Response Similarity
• Minor Modification to Guest TCP/IP Stack
– Coarse-grain timestamp (sketched below)
– Highly deterministic ACK mechanism
– Coarse-grain notification window size
– Per-connection comparison
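As an example of these guest-side tweaks, a sketch of the coarse-grain timestamp idea; the 100 ms granularity is an assumed value, and the real change lives inside the guest TCP/IP stack:

```python
# Sketch of coarse-grain timestamps (granularity value is an assumption).

GRANULARITY_MS = 100   # both VMs round to the same coarse tick

def coarse_timestamp(raw_ms: int) -> int:
    """Round a TCP timestamp down to the coarse tick so two VMs whose
    clocks differ by less than one tick emit identical packets."""
    return raw_ms - (raw_ms % GRANULARITY_MS)

# Clocks 7 ms apart still produce matching on-the-wire timestamps:
print(coarse_timestamp(12_345), coarse_timestamp(12_352))   # 12300 12300
```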
15. Similarities after Optimization
[Figure: response similarity after optimization. Left: Web server (WebBench run in the client), packet count and duration (ms) vs. number of threads (1-32). Right: FTP server, packet count and duration (ms) for PUT and GET.]
16. Reducing the Cost of Active-checkpoint
• Lazy Device State Update (see the sketch below)
– Lazy network interface up/down
– Lazy event channel up/down
• Fast Path Communication
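A minimal sketch of the "lazy" idea, with hypothetical names (the actual patch keeps Xen's network interface and event channels attached across checkpoints instead of a blind down/up cycle):

```python
# Toy model of lazy device state update across checkpoints.

class Netif:
    def __init__(self) -> None:
        self.up = False
    def set_state(self, want_up: bool) -> None:
        if want_up != self.up:          # lazy: touch state only on change
            self.up = want_up
            print("netif", "up" if want_up else "down")   # the expensive path

nic = Netif()
for checkpoint in range(3):
    nic.set_state(True)   # brings the interface up once; later calls are no-ops
```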
17. Checkpoint Cost with Optimizations
[Figure: per-checkpoint time (ms), stacked into Suspend, Resume, Netif-Up, Netif-Down, and Mem Xmit, across four configurations: Baseline; Lazy Netif (leave net up); Lazy Event Channel (leave event channel up); Efficient Comm. (replace XenStore access with event channel).]
Final cost: 74 ms/checkpoint (1/3 on page transmission, 2/3 on suspend/resume)
19. Configurations
• Hardware
– Intel® Core™ i7 platform, 2.8 GHz quad-core processor
– 2 GB RAM
– 2 × Intel® 82576 1 Gbps NICs (internal & external)
• Software
– Xen 4.1
– Domain 0: RHEL5U5
– Guest: 32-bit BusyBox 1.20.0, Linux kernel 2.6.32
• 256 MB RAM and a ramdisk for storage (config sketch below)
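For orientation, a hypothetical xm-style Xen 4.1 guest config matching this setup; the paths, bridge name, and vCPU count are assumptions, not taken from the talk:

```python
# Hypothetical xm-style guest config (illustrative values only).
name    = "colo-guest"
kernel  = "/boot/vmlinuz-2.6.32"        # 32-bit guest kernel (Linux 2.6.32)
ramdisk = "/boot/busybox-1.20.0.img"    # BusyBox ramdisk used as storage
memory  = 256                           # 256 MB RAM, as on the slide
vcpus   = 1                             # assumed
vif     = ["bridge=xenbr0"]             # assumed bridge name
```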
20. Bandwidth of NetPerf
[Figure: TCP (left) and UDP (right) bandwidth (Mb/s) vs. message size (54, 1500, 64K bytes) for Native, Remus-20ms, Remus-40ms, and COLO.]
21. FTP Server
[Figure: FTP PUT and GET time (s, log scale) for Native, Remus-20ms, Remus-40ms, and COLO.]
22. Web Server - Concurrency
[Figure: Web server throughput (Mbps, log scale) vs. number of threads (1-32) for Native, Remus-20ms, Remus-40ms, and COLO; WebBench run in the client.]
23. Web Server - Throughput
[Figure: Web server throughput (responses/s) vs. offered load (100-1500 requests/s) for Native, Remus-20ms, Remus-40ms, and COLO; httperf run in the client.]
24. Latency in Netperf/Ping
[Figure: average latency (ms) for Netperf-TCP-RR, Netperf-UDP-RR, and Ping under Native, Remus-20ms, Remus-40ms, and COLO; the sub-millisecond bars are labeled 0.28, 0.40, 0.28, 0.4, 0.38, and 0.55 ms.]
25. Web Server - Latency
[Figure: Web server response latency (ms) vs. offered load (100-1500 requests/s) for Native, Remus-20ms, Remus-40ms, and COLO; httperf run in the client.]
27. Summary
• COLO is an ideal application-agnostic solution for non-stop service
– Web server: 67% of native performance
– CPU, memory, and netperf: near-native performance
• Next steps:
– Merge into Xen
– More optimizations