Short time-to-localize and time-to-fix for production bugs are extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that many failures in SOA systems (e.g., many semantic and performance bugs) can be reproduced automatically solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandboxed “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting the performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead.
Replay without Recording of Production Bugs for Service Oriented Applications
1. ASE, Sept 6, 2018 (@_jon_bell_)
Nipun Arora, Jonathan Bell, Franjo Ivancic, Gail Kaiser and Baishakhi Ray
Dropbox, George Mason University, Google and Columbia University
Fork Parikshan on GitHub
2. Performance Bugs
[Figure: a developer watching average app throughput decline over time]
Why is performance getting worse?
3. Debugging Production Bugs
• What CAN we do?
• Inspect logs
• Heap dumps and telemetry
• Sample application inputs and performance
• What CAN’T we do?
• Anything that introduces more overhead (can’t use debugger)
• Anything that might impact application correctness (can’t change code)
4. Debugging Production Failures in Development
[Figure: the production SOA app's throughput degrades over time, but the same app in a testing environment shows no such degradation]
Bug does not appear in the debugging/testing environment
5. Problem: Distributed App State
[Figure: production SOA topology: DNS, Apache/NGINX, three Glassfish App Servers, Cache, and Database]
Each component has its own accumulated state; it’s unknown which
component(s) are buggy and which state is relevant!
6. Live Debugging
• What if our developer could attach their favorite debugging tools
directly to the production environment?
• Would allow existing state-of-the-art tools used to debug programs in
the lab to be applied directly to field failures
• But we are constrained from making modifications or introducing latency to the production service
7. Live Debugging with Parikshan
[Figure: the production SOA topology (DNS, Apache/NGINX, Cache, Database, three Glassfish App Servers) mirrored by an identical debugging topology]
The “debug” environment mirrors production and contains the same bad state that caused the bug
8. Challenge: Maintaining Synchronization
[Figure: the production topology and its debugging clone side by side; the clone's Glassfish App Servers, Cache, and Database begin to drift from production]
The moment after it’s created, the debug environment will diverge!
9. Parikshan
• To enable live debugging, Parikshan must address two key challenges:
• 1: How to create the debug container?
• 2: How to keep the debug container in sync with production?
10. Key Insight: Containers are Everywhere!
Parikshan adopts live migration technology from
containers/VMs to do live cloning
[Figure: live migration moves a container running a Glassfish App Server from one physical machine to another]
11. Live Cloning
• Live Cloning vs. Live Migration
• Both containers are encapsulated in different networks (via NAT), but the internal network ports and addresses remain the same for the processes running within each container
• Unlike live migration, live cloning leaves both the original container and the debug container running at the end of the cloning (a rough checkpoint/restore sketch follows below)
[Figure: after live cloning, two physical machines each run a container with a Glassfish App Server: the original and its debug clone]
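The slides do not show the cloning mechanism itself, only that it adapts container/VM live migration. As a rough illustration of the "clone rather than migrate" idea, the sketch below drives CRIU-style checkpoint/restore from Python; the image directory, the copy-to-a-debug-host step, and the exact flag choices are assumptions about a typical CRIU workflow, not Parikshan's actual container-level implementation.

```python
import subprocess
from pathlib import Path

def checkpoint_leave_running(pid: int, image_dir: Path) -> None:
    """Checkpoint a process tree while leaving the original running.

    --leave-running is what turns a migration into a clone: the production
    process keeps serving users while its state is captured to disk.
    """
    image_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump",
         "-t", str(pid),        # root of the process tree to capture
         "-D", str(image_dir),  # directory for checkpoint images
         "--leave-running",     # do not stop the original after the dump
         "--shell-job"],        # tolerate a controlling terminal
        check=True)

def restore_clone(image_dir: Path) -> None:
    """Restore the captured state into a new process tree.

    Intended to run on the debug host after the image directory has been
    copied over; the clone keeps its internal addresses and ports, so (as on
    the slide) it must sit behind NAT to avoid colliding with production.
    """
    subprocess.run(
        ["criu", "restore", "-D", str(image_dir), "--shell-job"],
        check=True)
```

The essential difference from migration is the leave-running step: production keeps serving while the clone, restored elsewhere and placed behind NAT, retains the same internal addresses and ports.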
12. Parikshan Relies on Network Duplication
[Figure: users' requests reach the production SOA app, and a copy of the traffic reaches the debugging SOA app]
• Key insight: ditch traditional high-fidelity record and replay of thread scheduling decisions and system calls, and duplicate only the network traffic
13. Network Duplication
[Figure: user requests flow through an asynchronous network duplicator to the production app; a buffered copy is relayed to the debugging app, and a network aggregator collects the responses from both sides]
• 1: Replica environment created with live cloning (e.g. of a VM or container)
• 2: Install network proxies
• 3: Monitor responses for divergence
• The asynchronous duplicator buffers requests, allowing the debug environment to be completely paused (until the buffer is full); the developer debugs without fear of breaking production
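To make the duplicator concrete, here is a minimal asyncio sketch of an asynchronous duplicating TCP proxy. This is not Parikshan's implementation: the host names, port, and buffer bound are placeholders, and a real deployment would also need reconnection handling, TLS, and per-protocol session management. Clients only ever see the production path; a bounded queue plays the role of the buffer, and if the debug clone falls behind, its copy of the traffic is dropped rather than slowing production.

```python
import asyncio

PRODUCTION = ("prod-app.internal", 8080)      # placeholder production backend
DEBUG_CLONE = ("debug-clone.internal", 8080)  # placeholder debug replica
QUEUE_LIMIT = 1024  # bounded per-connection buffer toward the debug clone

async def pump(reader, writer, copy_queue=None):
    """Forward bytes from reader to writer, optionally queueing a copy."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
            if copy_queue is not None:
                try:
                    copy_queue.put_nowait(data)  # never block the production path
                except asyncio.QueueFull:
                    pass  # debug clone fell behind; drop its copy, not the user's
    finally:
        writer.close()

async def feed_debug(copy_queue):
    """Asynchronously replay buffered client bytes to the debug clone."""
    writer = None
    try:
        _, writer = await asyncio.open_connection(*DEBUG_CLONE)
    except OSError:
        pass  # clone unreachable (e.g. paused too long); keep draining the queue
    while True:
        data = await copy_queue.get()
        if data is None:
            break
        if writer is not None:
            try:
                writer.write(data)
                await writer.drain()
            except OSError:
                writer = None  # clone connection dropped; keep draining
    if writer is not None:
        writer.close()
    # The clone's responses are consumed by the aggregator, never sent to users.

async def handle_client(client_reader, client_writer):
    """One user connection: relay to production, duplicate toward the clone."""
    prod_reader, prod_writer = await asyncio.open_connection(*PRODUCTION)
    copy_queue = asyncio.Queue(maxsize=QUEUE_LIMIT)
    debug_task = asyncio.create_task(feed_debug(copy_queue))
    await asyncio.gather(
        pump(client_reader, prod_writer, copy_queue),  # client -> production (+ copy)
        pump(prod_reader, client_writer),              # production -> client
        return_exceptions=True,
    )
    await copy_queue.put(None)  # signal end-of-connection to the debug feeder
    await debug_task

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 9000)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

Because writes toward the clone happen on a separate task fed from the queue, pausing the debug environment (e.g. while a debugger is attached) simply lets the buffer fill without adding latency for users.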
14. Detecting Divergence
[Figure: the same duplicator/aggregator pipeline; the network aggregator keeps a response history for the production and debugging apps]
• The network aggregator checks packets on responses to measure user-perceived divergence
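The slides do not specify how the aggregator compares packets, so the following is only a plausible sketch: it keeps a bounded response history and flags a request as divergent when the production and debug responses fall below a similarity threshold. The class name, threshold, and comparison metric are all illustrative assumptions.

```python
import difflib
from collections import deque

HISTORY_LIMIT = 1000  # how many recent response pairs to remember

class ResponseHistory:
    """Keeps recent (production, debug) response pairs and flags divergence."""

    def __init__(self, threshold: float = 0.95):
        self.pairs = deque(maxlen=HISTORY_LIMIT)
        self.threshold = threshold  # minimum similarity before flagging divergence

    def record(self, request_id, prod_response: bytes, debug_response: bytes) -> bool:
        # Byte-level similarity between the two responses, in [0.0, 1.0].
        similarity = difflib.SequenceMatcher(
            None, prod_response, debug_response).ratio()
        diverged = similarity < self.threshold
        self.pairs.append((request_id, similarity, diverged))
        return diverged

    def divergence_rate(self) -> float:
        if not self.pairs:
            return 0.0
        return sum(1 for _, _, d in self.pairs if d) / len(self.pairs)

# Toy example: an identical response passes, an error page is flagged.
history = ResponseHistory()
history.record(1, b"HTTP/1.1 200 OK\r\n\r\nhello",
                  b"HTTP/1.1 200 OK\r\n\r\nhello")
history.record(2, b"HTTP/1.1 200 OK\r\n\r\nhello",
                  b"HTTP/1.1 500 Internal Server Error\r\n\r\n")
print(history.divergence_rate())  # 0.5 in this toy example
```

A real aggregator would likely normalize protocol-level noise (timestamps, session IDs) before comparing, so that only user-visible differences count as divergence.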
15. Is network data enough to reproduce real bugs?
16. Network Data IS Often Enough!
• Studied 217 real-world bugs from issue trackers: Apache (45), MySQL (96), HDFS (76)
• Only excluded: feature requests, misunderstandings, etc.
Key Takeaway:
• Approx. 80% semantic bugs, 6% non-deterministic
• Manually confirmed these bugs could be triggered by network input
[Chart: bug categories (Performance, Semantic, Concurrency, Resource Leak) for Apache HTTPD, MySQL, and HDFS]
17. Reproducing Real Bugs
Category | Bug ID | Application | Symptom/Cause | Deterministic | Crash | Trigger
Performance Bugs | MySQL #15811 | mysql-5.0.15 | Bug caused due to multiple calls in a loop | Yes | No | Repeated insert into table
Performance Bugs | MySQL #26527 | mysql-5.1.14 | Load data is slow in a partitioned table | Yes | No | Create table with partition and load data
Performance Bugs | MySQL #49491 | mysql-5.1.38 | Calculation of hash values is inefficient | Yes | No | MySQL client select requests
Performance Bugs | Redis #614 | redis-2.6.0 | Master + replica, not replicated correctly | Yes | No | Set up replication, push and pop some elements
Resource Leaks | Redis #417 | redis-2.4.9 | Memory leak in master | Yes | No | Concurrent key set requests
Resource Leaks | Redis #487 | redis-2.6.14 | KEYS* command duplicates or omits keys | Yes | No | Set keys to execute a specific set of requests
Semantic Bugs | Cassandra #5225 | cassandra-1.5.2 | Missing columns from wide row | Yes | No | Fetch columns from Cassandra
Semantic Bugs | Cassandra #1837 | cassandra-0.7.0 | Deleted columns become available after flush | Yes | No | Insert, delete and flush columns
Semantic Bugs | Redis #761 | redis-2.6.0 | Crash with a large integer input | Yes | Yes | Query with a large integer input
18. Reproducing Real Bugs
Category | Bug ID | Application | Symptom/Cause | Deterministic | Crash | Trigger
Concurrency Bugs | Apache #25520 | httpd-2.0.4 | Per-child buffer management not thread safe | No | No | Continuous concurrent requests initiated by the client
Concurrency Bugs | Apache #21287 | httpd-2.0.48, php-4.4.1 | Dangling pointer due to atomicity violation | No | Yes | Concurrent requests initiated by the client
Concurrency Bugs | MySQL #644 | mysql-4.1 | Data race leading to crash | No | Yes | Concurrent select queries
Concurrency Bugs | MySQL #169 | mysql-3.23 | Race condition leading to out-of-order logging | No | No | Delete and insert requests
Concurrency Bugs | MySQL #791 | mysql-4.0 | Race visible in logging | No | No | Concurrent flush log and insert requests
Configuration Bugs | Redis #957 | redis-2.6.11 | Slave cannot sync with master | Yes | No | Load a very large database
Configuration Bugs | HDFS #1904 | hdfs-0.23.0 | Create a directory in wrong location | Yes | No | Create new directory
19. End-to-End Evaluation
• Recreated a partial Wikipedia DB and ran a workload trace from 2008 while maintaining a replica (a minimal replay sketch follows below)
• The difference in latencies between the proxy and the duplicate was less than 2%
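The replay harness itself is not part of the slides; below is a minimal, hedged sketch of replaying a URL trace against a front-end and collecting per-request latencies. The trace file name, its one-URL-path-per-line format, the front-end address, and the request limit are illustrative assumptions only.

```python
import time
import urllib.request

TRACE_FILE = "wiki_trace_2008.txt"          # hypothetical trace: one URL path per line
TARGET = "http://proxy-frontend.internal"   # placeholder front-end address

def replay(trace_file: str, target: str, limit: int = 1000):
    """Replay requests from a trace and collect per-request latencies (seconds)."""
    latencies = []
    with open(trace_file) as trace:
        for i, line in enumerate(trace):
            if i >= limit:
                break
            path = line.strip()              # expected to look like "/wiki/Some_Article"
            if not path:
                continue
            start = time.perf_counter()
            try:
                with urllib.request.urlopen(target + path, timeout=5) as resp:
                    resp.read()              # drain the body so timing covers the full response
            except OSError:
                continue                     # count only completed requests
            latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    lat = replay(TRACE_FILE, TARGET)
    if lat:
        print(f"requests: {len(lat)}, mean latency: {sum(lat) / len(lat) * 1000:.1f} ms")
```

Running the same trace against the proxy-only setup and the duplicating setup is the kind of comparison behind the less-than-2% latency difference reported above.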
20. Microbenchmark: Network Forwarding
• (Bonus, not in paper)
• Measured network-level performance using iPerf
• Native Mode: Direct network communication
• Proxy Mode: Network communication via a Proxy
• Duplication Mode: Network communication with a proxy duplicating traffic
• The bandwidth difference between the proxy and duplication modes was less than 0.5%
• The latency difference between HTTP ping requests was less than 0.3%
[Chart: throughput (Mbps) for Native, Proxy, and Duplication modes]
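As a sketch of how this microbenchmark can be scripted (the slide mentions iPerf; this sketch assumes the iperf3 client and its JSON output, and the endpoint names and ports are placeholders, none of which come from the paper):

```python
import json
import subprocess

# Placeholder endpoints for the three modes on the slide; each is assumed to
# run an iperf3 server. The paper's actual harness is not shown here.
ENDPOINTS = {
    "native": ("prod-host.internal", 5201),
    "proxy": ("proxy-host.internal", 5201),
    "duplication": ("proxy-host.internal", 5202),
}

def measure(host: str, port: int, seconds: int = 10) -> float:
    """Run an iperf3 client and return received throughput in Mbps."""
    out = subprocess.run(
        ["iperf3", "-c", host, "-p", str(port), "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True).stdout
    result = json.loads(out)
    return result["end"]["sum_received"]["bits_per_second"] / 1e6

if __name__ == "__main__":
    for mode, (host, port) in ENDPOINTS.items():
        print(f"{mode}: {measure(host, port):.0f} Mbps")
```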
21. Benchmarks: Live Cloning
• Measured the time to do a live clone on five applications, plus an empty container (“Basic”)
• Suspend time ranged from 2-3 seconds for Apache and Thttpd to ~10 seconds for TradeBeans/TradeSoap (large JVM heaps)
• Note that the time is mostly spent doing the actual copy
[Figure: time to clone, broken down by step]
22. Replay without Recording of Production Bugs for Service Oriented Applications
Nipun Arora, Jonathan Bell, Franjo Ivancic, Gail Kaiser and Baishakhi Ray
Dropbox, George Mason University, Google and Columbia University
https://github.com/Programming-Systems-Lab/parikshan