SlideShare une entreprise Scribd logo
1  sur  45
Copyright© 2016 NTT Corp. All Rights Reserved.
Tackling non-determinism in Hadoop
Testing and debugging distributed systems with Earthquake
GitHub : https://github.com/osrg/earthquake/
Twitter: @EarthquakeDMCK
Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
NTT Software Innovation Center
30/1/2016, at Brussels
2Copyright© 2016 NTT Corp. All Rights Reserved.
• What are distributed systems? (e.g., Hadoop)
Why is it difficult to test them?
• Bugs we found/reproduced with Earthquake
https://github.com/osrg/earthquake/
• How Earthquake controls non-determinism
• Lessons we learned
What I will talk about
3Copyright© 2016 NTT Corp. All Rights Reserved.
WHAT ARE DISTRIBUTED SYSTEMS?
WHY IS IT DIFFICULT TO TEST THEM?
And there are many more..
Machine Learning
Recommend attractive contents to users
Clustered Containers
Scalable DevOps
Big Data Analysis
Improve business strategy
NoSQL & NewSQL
Highly available and reliable service
5Copyright© 2016 NTT Corp. All Rights Reserved.
Scalability is the charm of distributed systems
But... it forces the system to be composed of many machines and many software..
Something in the system definitely goes awry!
We need to test whether the system is tolerant of awry things,
but it’s difficult to non-deternimism
Why is it difficult to test distributed systems?
Some disk can crash
Some packet can be delayed/lost
Some process can run slowly
(although not specific to dist sys)
Clip arts are from openclipart.org
6Copyright© 2016 NTT Corp. All Rights Reserved.
But..
Good News: distributed systems are well tested!
Software Production code (LOC) Test code (LOC)
MapReduce 95K 87K
YARN 178K 121K
HDFS 152K 150K
ZooKeeper 33K 27K
Spark 167K 128K
Flume 46K 34K
Cassandra 168K 78K
HBase 571K 222K
Kubernetes 309K 153K
etcd 34K 27K
Data are measured at 14/01/2016, using CLOC
Prod Test
7Copyright© 2016 NTT Corp. All Rights Reserved.
• over 3 bug reports per day, on average
• 50% of bugs take several months to get resolved
• 26% of bugs are "flaky", i.e., hard to reproduce due to
non-deternimism
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems [ Gunawi et.al., SoCC'14]
Simple Testing Can Prevent Most Critical Failures.. [Yuan et. al., OSDI'14]
Bad News: they are still buggy
8Copyright© 2016 NTT Corp. All Rights Reserved.
https://builds.apache.org/job/%s-trunk/
MapReduce YARN HDFS
ZooKeeper
Data are captured at 14/01/2016
HBase
Build
Build Time
Red = Fail
Never seen fully successful build, even on my local machine
9Copyright© 2016 NTT Corp. All Rights Reserved.
• Sometimes we can see test failures on Jenkins
• So.. if we run test codes repeatedly on a PC,
we can locally reproduce the bug, right..?
No! Just repeating tests does not make much sense,
due to poor non-determinism variation
We need to control non-determinism!
Our Earthquake
increases
non-determinism
We revisit the plot later
higher is better
10Copyright© 2016 NTT Corp. All Rights Reserved.
Earthquake: programmable fuzzy scheduler
Earthquake
Disk access events
(FUSE)
Ethernet packet events
(iptables, Openflow)
Function Call/Return events
(byteman for Java, clang for C)
Linux threads
(sched_setattr(2))
11Copyright© 2016 NTT Corp. All Rights Reserved.
How to use Earthquake?
For Thread scheduling
$ git clone https://github.com/AkihiroSuda/microearthquake
$ sudo microearthquake pid $PID
For Ethernet scheduling
$ go get github.com/osrg/earthquake/earthquake-container
$ sudo earthquake-container run -i -t ubuntu $COMMAND
Will be unified to `earthquake-container` in February
Docker-like CLI
(For non-Dockerized version, please refer to docs)
12Copyright© 2016 NTT Corp. All Rights Reserved.
BUGS WE FOUND/REPRODUCED WITH EARTHQUAKE
Reproduction codes are included in the github repo:
https://github.com/osrg/earthquake/tree/master/example
Details of Earthquake
is followed later
13Copyright© 2016 NTT Corp. All Rights Reserved.
• Provides distributed lock for HA in Hadoop
• HA of Hadoop relies on ZooKeeper
• Also used by Spark, Mesos, Docker Swarm, Kafka, ...
• Similar softwares: etcd, Consul, Atomix,..
ZooKeeper: distributed lock service for Hadoop (and so on)
YARN
Resource Manager YARN
Resource Manager
YARN
Resource Manager
Active Stand-by Stand-by
14Copyright© 2016 NTT Corp. All Rights Reserved.
• Bug: New node cannot participate to ZK cluster properly
New node cannot become a leader of ZK cluster itself
• Cause: distributed race (ZAB packet vs FLE packet)
• ZAB.. broadcast protocol for data
• FLE.. leader election protocol for ZK cluster itself
• Code is becoming spaghetti due to many contributions
which had not been planned in the first release
Interaction between ZAB and FLE is not obvious
Uses different TCP connection
Non-deterministic packet order
Found https://issues.apache.org/jira/browse/ZOOKEEPER-2212
Leader of ZK cluster New ZK node
ZAB Packet [2888/tcp]
FLE Packet [3888/tcp]
Photo: Wikipedia
https://issues.apache.org/jira/browse/ZOOKEEPER-2212
Data are captured at 22/01/2016
16Copyright© 2016 NTT Corp. All Rights Reserved.
• Expected: ZK cluster works even when 𝑵/𝟐 nodes crashed
(𝑵 is an odd number, 3 or 5 in most cases)
• Real: it doesn't work!
 Example: No MapReduce job can be submitted to Hadoop
Affect to production
MapReduce
Not participating properly
17Copyright© 2016 NTT Corp. All Rights Reserved.
• We permuted some specific Ethernet packets in random order using Earthquake
• Unlike Linux netem (or FreeBSD dummynet), Earthquake provides much more
programmable interface
• Reproducibility: 0.0%  21.8% (tested 1,000 times)
• We could not reproduce the bug even after 5,000 times traditional testing (60
hours!)
How hard is it to reproduce?
Earthquake
Open vSwitch
+ ryu SDN framework
ZooKeeper cluster
18Copyright© 2016 NTT Corp. All Rights Reserved.
We define the distributed execution pattern based on code coverage:
𝑷 =
𝒑 𝟏,𝟏 ⋯ 𝒑 𝟏,𝑵
⋮ ⋱ ⋮
𝒑 𝑳,𝟏 ⋯ 𝒑 𝑳,𝑵
• 𝐿: Number of line of codes
• 𝑁: Number of nodes
• 𝑝 𝑖,𝑗 : 1 if the node 𝑗 covers the branch in line 𝑖, otherwise 0
• We used JaCoCo: Java Code Coverage Library
Distributed Execution Pattern
Earthquake achieves faster pattern growth
That's why we can hit the bug
19Copyright© 2016 NTT Corp. All Rights Reserved.
• YARN assigns MapReduce jobs to computation nodes
• Computation nodes reports their health status for
fault-tolerant job scheduling
YARN: the kernel of Hadoop
YARN
Node Manager YARN
Node Manager
YARN
Node Manager
YARN
Resource Manager
ZooKeeper
20Copyright© 2016 NTT Corp. All Rights Reserved.
• Bug: YARN cannot detect disk failure cases where mkdir()/rmdir()
blocks
• We noticed that the bug can occur theoretically
when we are reading the code, and actually produced the bug using
Earthquake
• When we should inject the fault is pre-known;
so we manually wrote a concrete scenario using Earthquake API
• Much more realistic than mocking
Found https://issues.apache.org/jira/browse/YARN-4301
mkdir
EIO
mkdir
...
21Copyright© 2016 NTT Corp. All Rights Reserved.
R: ZOOKEEPER-2080: CnxManager race
• Not reproduced nor analyzed for past 2 years
• Can affect production, not only in JUnit test
R: YARN-1978: NM TestLogAggregationService race
R: YARN-4168: NM TestLogAggregationService race
F: YARN-4543: NM TestNodeStatusUpdater race
F: YARN-4548: RM TestCapacityScheduler race
R: YARN-4556: RM TestFifoScheduler race
R: etcd #4006: kvstore_test race
R: etcd #4039: kv_test race
and more..
Found/Reproduced flaky tests (excluding simple "timed out" fail)
22Copyright© 2016 NTT Corp. All Rights Reserved.
It still matters!
For developers..
It's a barrier to promotion of CI
• If many tests are flaky, developers tend to ignore CI failure
 overlook real bugs
For users..
It's a barrier to risk assessment for production
• No one can tell flaky tests from real bugs
Flaky test doesn't matter, as it doesn't affect production?
23Copyright© 2016 NTT Corp. All Rights Reserved.
HOW EARTHQUAKE CONTROLS NON-DETERNIMISM
24Copyright© 2016 NTT Corp. All Rights Reserved.
Earthquake: programmable fuzzy scheduler
Earthquake
Disk access events
(FUSE)
Ethernet packet events
(iptables, Openflow)
Function Call/Return events
(byteman for Java, clang for C)
Linux threads
(sched_setattr(2))
25Copyright© 2016 NTT Corp. All Rights Reserved.
• Non-invasiveness
• Incremental Adoptability
• Language Independence
Three Major Principles of Earthquake
26Copyright© 2016 NTT Corp. All Rights Reserved.
• It's possible to insert extra hook codes for controlling non-
determinism
• Great academic works: SAMC[OSDI'14], DEMi[NSDI'16],..
• But we avoid such "invasive" modifications:
• We need to test multiple versions
• For evaluating which version is suitable for production
• For bisection testing (`git bisect`)
• We don't like to rebase the modification patch!
• We don't like to see false-positives!
• modification itself can be a false bug
• debugging false bugs is vain
Principle 1 of 3: Non-invasiveness
27Copyright© 2016 NTT Corp. All Rights Reserved.
• Real implementation of distributed systems are very complicated and difficult to
understand
• So we need to start testing without complex knowledge
• But after we got the details, we want to incrementally utilize the new knowledge for
improving test planning
Principle 2 of 3: Incremental Adoptability
Software Production code (LOC)
MapReduce 95K
YARN 178K
HDFS 152K
ZooKeeper 33K
Spark 167K
Flume 46K
Cassandra 168K
HBase 571K
Kubernetes 309K
etcd 34K
Paper Pages
Google's MapReduce
[OSDI '04]
13
Chubby (==ZooKeeper, etcd)
[OSDI'06]
16
BigTable (==HBase)
[OSDI'06]
16
Borg (==Kubernetes)
[Eurosys'15]
18
Spark
[NSDI'12]
14
28Copyright© 2016 NTT Corp. All Rights Reserved.
• Many distributed systems are written in Java
• But the world won't be unified under Java
• Scala.. Spark, Kafka
• Go.. etcd, Consul
• Erlang.. Riak, RabbitMQ
• Clojure.. Storm, ..
• C++.. Kudu,
• ..
• We avoid language-dependent technique so that we can test
any software written in any language
• Earthquake has some Java/C dependent components, but they are
just extensions, not mandatory.
Principle 3 of 3: Language Independence
29Copyright© 2016 NTT Corp. All Rights Reserved.
• Random is the default behavior
• No state transition model is required
• If you are suspicious of some specific scenarios,
you can program the scenarios!
• e.g., If FuncFoo() and FuncBar() happen concurrently (in XX msecs
window) ..
• You can ensure that FuncFoo() returns before FuncBar() returns
• You can inject a fault between FuncFoo() and FuncBar()
How Earthquake schedules events and faults
node1:FuncFoo()
node2:FuncBar()
Split network node1<->node2
Events:
• PacketEvent (iptables, Openflow)
• FilesystemEvent (FUSE)
• JavaFunctionEvent(byteman)
• SyslogEvent ..
30Copyright© 2016 NTT Corp. All Rights Reserved.
Earthquake API (Go)
type ExplorePolicy interface {
QueueNextEvent(Event)
GetNextActionChan() chan Action
}
func (p *MyPolicy) QueueNextEvent(event Event) {
action := event.DefaultAction()
p.timeBoundedQ.Enqueue(action,
10 * Millisecond, 30 * Millisecond)
}
func (p *MyPolicy) GetNextActionChan() chan Action {.. }
func NewMyPolicy() ExplorePolicy { return &MyPolicy {..} }
func main() {
RegisterPolicy("mypolicy", NewMyPolicy)
os.Exit(CLIMain(os.Args))
}
Events:
• PacketEvent (iptables, Openflow)
• FilesystemEvent (FUSE)
• JavaFunctionEvent(byteman)
• SyslogEvent ..
Actions:
• (Default action)
• PacketFaultAction
• FilesystemFaultAction
• ShellAction ..
Fired in [10ms, 30ms]
Go is poor at DLL..
So user-written policy plugin is statically linked
(as in Docker Machine drivers)
31Copyright© 2016 NTT Corp. All Rights Reserved.
• RPC mode
• Best for complex scenarios
• Doesn't scale so much, but it doesn't matter for tests
• Auto-pilot, single-binary mode
• Best for the default randomized scenario
Earthquake API
Orchestrator
Inspector Inspector Inspector
32Copyright© 2016 NTT Corp. All Rights Reserved.
• Linux 3.14 introduced SCHED_DEADLINE scheduler
• Earthquake is not related to performance testing;
but SCHED_DEADLINE is much more configurable than renice(8) with default CFS
scheduler; so we use it
• Earthquake changes runtime of threads, using SCHED_DEADLINE +
sched_setattr(2)
• Finds runtime set that satisfies the deadline constraint using
Dirichlet-distributed random values (∑𝑟𝑖 = 1000)
How Earthquake sets scheduling attributes?
thread1
300us
thread2
200us
thread3
400us
thr4
100us
deadline: 1,000us
33Copyright© 2016 NTT Corp. All Rights Reserved.
• Jepsen
• specialized in network partition
• specialized in testing linearizability (strong consistency)
• Famous for "Call Me Maybe" blog: http://jepsen.io/
• Earthquake is much more generalized
• The bugs we found/reproduced are basically beyond the scope of Jepsen
• Earthquake can be also combined with Jepsen!
It will be our next work..
Similar great tool: Jepsen
causes network partition
tests linearizablity
increases non-determinism
(especially of thread scheduling)
Jepsen Earthquake ...
34Copyright© 2016 NTT Corp. All Rights Reserved.
LESSONS WE LEARNED
35Copyright© 2016 NTT Corp. All Rights Reserved.
"Liveness": something good eventually happens
• c.f. "safety": nothing bad happens (typical assertions are safety)
Problem: Flaky test expects something good happens in a certain timeout, which depends
on the performance of the machine
Solution: make sure any test succeeds on slow machines
(of course performance testing is important, but it should be separated)
Performance should be split from liveness
invokeAsyncProcess();
sleep(certainTimeout); // some tests lack even this sleep for async proc
assertTrue(checkSomethingGoodHasHappened());
invokeAsyncProcess();
bool b;
for (int i = 0; i < MANY_RETRIES; i++) {
sleep(certainTimeout);
b = checkSomethingGoodHasHappened(); // idempotent
if (b) break;
}
assertTrue(b);
36Copyright© 2016 NTT Corp. All Rights Reserved.
• Of course it is definitely better than nothing
• Problem: But sometimes it's useless for distributed systems, because it's too hard for
humans to inspect logs from multiple nodes
• Unclear ordering
• Unclear structure (needs an alternative to log4j!)
• enormous non-interesting logs
• Solution: So we made a tool that picks up some branches that only occur in failed
experiments for analyzing the root cause of the bug
LOG.debug() is sometimes useless for debugging
Analyzer
Scanning 00000000: experiment successful=true
Scanning 00000001: experiment successful=false
..
Suspicious: QuorumCnxManager::receiveConnection() line 382-382
Suspicious: QuorumCnxManager$Listener::run() line 657-657
Suspicious: QuorumCnxManager$SendWorker::run() line 809-811
37Copyright© 2016 NTT Corp. All Rights Reserved.
Each of the Hadoop components has plenty of JUnit test codes
But we also need integration tests with multiple components!
• e.g. ZOOKEEPER-2347: critical deadlock which leaded to withdrawal of ZooKeeper release
v3.4.7
• Found (not by us nor with Earthquake) in an integration test with ZooKeeper and HBase
Apache BigTop is an interesting project about this!
• But.. how should we prepare realistic workloads systematically?
It's an open question..
Need for integration test framework
38Copyright© 2016 NTT Corp. All Rights Reserved.
CONCLUSION
39Copyright© 2016 NTT Corp. All Rights Reserved.
• Distributed systems are difficult to test and debug, due to
non-determinism
• Earthquake controls non-determinism
so that you can find and reproduce such "flaky" bugs
• Please try and give us your feedback:
https://github.com/osrg/earthquake/
Conclusion
40Copyright© 2016 NTT Corp. All Rights Reserved.
How to use Earthquake?
For Thread scheduling
$ git clone https://github.com/AkihiroSuda/microearthquake
$ sudo microearthquake pid $PID
For Ethernet scheduling
$ go get github.com/osrg/earthquake/earthquake-container
$ sudo earthquake-container run -i -t ubuntu $COMMAND
Will be unified to `earthquake-container` in February
Docker-like CLI
(For non-Dockerized version, please refer to docs)
41Copyright© 2016 NTT Corp. All Rights Reserved.
ADDENDUM
42Copyright© 2016 NTT Corp. All Rights Reserved.
Terminal 1
Quick start for reproducing flaky bugs in Hadoop [1/3]
$ git clone https://github.com/apache/hadoop.git
$ cd hadoop
$ ./start-build-env.sh
build-env$ (switch to Terminal 2)
• `start-build-env.sh` starts the build environment in a Docker container.
Suppose the started container name is $HADOOP_BUILD_ENV.
• If you have a package dependency issue, you may have to run
`mvn package –Pdist –DskipTests -Dtar` in the container .
43Copyright© 2016 NTT Corp. All Rights Reserved.
Terminal 2
$ git clone https://github.com/AkihiroSuda/microearthquake
$ cd microearthquake
$ git checkout tag20160203
$ sudo apt-get install –y python3-{pip,dev} lib{cap,ffi}-dev
$ sudo pip3 install cffi colorama numpy python-prctl psutil
$ (cd microearthquake; python3 native_build.py)
$ docker inspect --format '{{.State.Pid}}' $HADOOP_BUILD_ENV
4242
$ sudo ./bin/microearthquake pid 4242
(switch to Terminal 1)
Quick start for reproducing flaky bugs in Hadoop [2/3]
• `microearthauke` fuzzes the thread scheduling for all the threads under the
process tree (pid=4242).
• `microearthquake` will be rewritten in Go, and unified to `earthquake-container`.
(planned in February)
44Copyright© 2016 NTT Corp. All Rights Reserved.
Terminal 1
build-env$ while mvn test –Dtest=TestFooBar; do true; done
(the failure will be eventually reproduced)
• Suppose TestFooBar.java is failing intermittently on Jenkins.
• You can find failing tests at
https://hadoop.apache.org/issue_tracking.html
Quick start for reproducing flaky bugs in Hadoop [3/3]
45Copyright© 2016 NTT Corp. All Rights Reserved.
Q. How can I write a test?
A. Basically you don't have to write any new test, as you can use existing
xUnit tests.
If you want to test a real cluster rather than an xUnit pseudo cluster,
please refer to examples:
https://github.com/osrg/earthquake/tree/master/example .
Q. Can I test standalone multi-threaded programs?
A. Yes, by fuzzing thread scheduling.
Q. Do I need to write ExplorePolicy manually? It's bothersome.
A. No, basically the default random behavior is enough.
FAQs
Please feel free to open an issue in GitHub:
https://github.com/osrg/earthquake/issues

Contenu connexe

Tendances

DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるKohei Tokunaga
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterKohei Tokunaga
 
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupDaneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupShannon McFarland
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能Kohei Tokunaga
 
eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動Kohei Tokunaga
 
Faster Container Image Distribution on a Variety of Tools with Lazy Pulling
Faster Container Image Distribution on a Variety of Tools with Lazy PullingFaster Container Image Distribution on a Variety of Tools with Lazy Pulling
Faster Container Image Distribution on a Variety of Tools with Lazy PullingKohei Tokunaga
 
BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話Kohei Tokunaga
 
Ceph, Xen, and CloudStack: Semper Melior
Ceph, Xen, and CloudStack: Semper MeliorCeph, Xen, and CloudStack: Semper Melior
Ceph, Xen, and CloudStack: Semper MeliorPatrick McGarry
 
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)VirtualTech Japan Inc.
 
[DockerCon 2019] Hardening Docker daemon with Rootless mode
[DockerCon 2019] Hardening Docker daemon with Rootless mode[DockerCon 2019] Hardening Docker daemon with Rootless mode
[DockerCon 2019] Hardening Docker daemon with Rootless modeAkihiro Suda
 
Develop QNAP NAS App by Docker
Develop QNAP NAS App by DockerDevelop QNAP NAS App by Docker
Develop QNAP NAS App by DockerTerry Chen
 
Using docker to develop NAS applications
Using docker to develop NAS applicationsUsing docker to develop NAS applications
Using docker to develop NAS applicationsTerry Chen
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin Kuberton
 
Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned  Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned RightScale
 
Docker1.12イングレスロードバランサ
Docker1.12イングレスロードバランサDocker1.12イングレスロードバランサ
Docker1.12イングレスロードバランサYuji Oshima
 
What's new in FreeBSD 10
What's new in FreeBSD 10What's new in FreeBSD 10
What's new in FreeBSD 10Gleb Smirnoff
 
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeAcademy
 
The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)Casey Bisson
 

Tendances (20)

DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐる
 
Neutron CI Run on Docker
Neutron CI Run on DockerNeutron CI Run on Docker
Neutron CI Run on Docker
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
 
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupDaneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
 
eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動
 
Faster Container Image Distribution on a Variety of Tools with Lazy Pulling
Faster Container Image Distribution on a Variety of Tools with Lazy PullingFaster Container Image Distribution on a Variety of Tools with Lazy Pulling
Faster Container Image Distribution on a Variety of Tools with Lazy Pulling
 
BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話
 
Ceph, Xen, and CloudStack: Semper Melior
Ceph, Xen, and CloudStack: Semper MeliorCeph, Xen, and CloudStack: Semper Melior
Ceph, Xen, and CloudStack: Semper Melior
 
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)
Introduction of private cloud in LINE - OpenStack最新情報セミナー(2019年2月)
 
RKT
RKTRKT
RKT
 
[DockerCon 2019] Hardening Docker daemon with Rootless mode
[DockerCon 2019] Hardening Docker daemon with Rootless mode[DockerCon 2019] Hardening Docker daemon with Rootless mode
[DockerCon 2019] Hardening Docker daemon with Rootless mode
 
Develop QNAP NAS App by Docker
Develop QNAP NAS App by DockerDevelop QNAP NAS App by Docker
Develop QNAP NAS App by Docker
 
Using docker to develop NAS applications
Using docker to develop NAS applicationsUsing docker to develop NAS applications
Using docker to develop NAS applications
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin
 
Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned  Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned
 
Docker1.12イングレスロードバランサ
Docker1.12イングレスロードバランサDocker1.12イングレスロードバランサ
Docker1.12イングレスロードバランサ
 
What's new in FreeBSD 10
What's new in FreeBSD 10What's new in FreeBSD 10
What's new in FreeBSD 10
 
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
 
The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)
 

En vedette

Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopMark Johnson
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
Sly and the RoarVM: Exploring the Manycore Future of Programming
Sly and the RoarVM: Exploring the Manycore Future of ProgrammingSly and the RoarVM: Exploring the Manycore Future of Programming
Sly and the RoarVM: Exploring the Manycore Future of ProgrammingStefan Marr
 
Ravikanth_CV_10 yrs_ETL-BI-BigData-Testing
Ravikanth_CV_10 yrs_ETL-BI-BigData-TestingRavikanth_CV_10 yrs_ETL-BI-BigData-Testing
Ravikanth_CV_10 yrs_ETL-BI-BigData-TestingRavikanth Marpuri
 
Testing Centralization
Testing CentralizationTesting Centralization
Testing CentralizationCognizant
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!Edureka!
 
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin TasksTop 5 Hadoop Admin Tasks
Top 5 Hadoop Admin TasksEdureka!
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastImpetus Technologies
 
Infosys growth strategy - a case analysis
Infosys growth strategy - a case analysisInfosys growth strategy - a case analysis
Infosys growth strategy - a case analysisAchal Raghavan
 
Knowledge Management at Infosys
Knowledge Management at InfosysKnowledge Management at Infosys
Knowledge Management at InfosysShuhab Tariq
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...RTTS
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeROHIT KHARABE
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
 
Ryu Learning Guide
Ryu Learning GuideRyu Learning Guide
Ryu Learning Guide呈 李
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 

En vedette (17)

Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and Hadoop
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Sly and the RoarVM: Exploring the Manycore Future of Programming
Sly and the RoarVM: Exploring the Manycore Future of ProgrammingSly and the RoarVM: Exploring the Manycore Future of Programming
Sly and the RoarVM: Exploring the Manycore Future of Programming
 
Ravikanth_CV_10 yrs_ETL-BI-BigData-Testing
Ravikanth_CV_10 yrs_ETL-BI-BigData-TestingRavikanth_CV_10 yrs_ETL-BI-BigData-Testing
Ravikanth_CV_10 yrs_ETL-BI-BigData-Testing
 
Testing Centralization
Testing CentralizationTesting Centralization
Testing Centralization
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!
 
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin TasksTop 5 Hadoop Admin Tasks
Top 5 Hadoop Admin Tasks
 
Ryu sdn framework
Ryu sdn framework Ryu sdn framework
Ryu sdn framework
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus Webcast
 
Infosys growth strategy - a case analysis
Infosys growth strategy - a case analysisInfosys growth strategy - a case analysis
Infosys growth strategy - a case analysis
 
Knowledge Management at Infosys
Knowledge Management at InfosysKnowledge Management at Infosys
Knowledge Management at Infosys
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Infosys final
Infosys finalInfosys final
Infosys final
 
Ryu Learning Guide
Ryu Learning GuideRyu Learning Guide
Ryu Learning Guide
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 

Similaire à Tackling non-determinism in Hadoop - Testing and debugging distributed systems with Earthquake -

HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...Yuji Kubota
 
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...VirtualTech Japan Inc.
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wilddatamantra
 
More Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoMore Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoKota Tsuyuzaki
 
Arakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeArakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeNicolas Trangez
 
Usernetes: Kubernetes as a non-root user
Usernetes: Kubernetes as a non-root userUsernetes: Kubernetes as a non-root user
Usernetes: Kubernetes as a non-root userAkihiro Suda
 
SequenceL Auto-Parallelizing Toolset Intro slideshare
SequenceL Auto-Parallelizing Toolset Intro slideshareSequenceL Auto-Parallelizing Toolset Intro slideshare
SequenceL Auto-Parallelizing Toolset Intro slideshareDoug Norton
 
SequenceL intro slideshare
SequenceL intro slideshareSequenceL intro slideshare
SequenceL intro slideshareDoug Norton
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD{code}
 
Building Cloud Virtual Topologies with Ravello and Ansible
Building Cloud Virtual Topologies with Ravello and AnsibleBuilding Cloud Virtual Topologies with Ravello and Ansible
Building Cloud Virtual Topologies with Ravello and AnsibleDamien Garros
 
No one puts java in the container
No one puts java in the containerNo one puts java in the container
No one puts java in the containerkensipe
 
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019Jakarta_EE
 
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019The Eclipse Foundation
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...Ambassador Labs
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Kendrick Coleman
 
Software Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVSoftware Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVYoshihiro Nakajima
 
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...The Linux Foundation
 

Similaire à Tackling non-determinism in Hadoop - Testing and debugging distributed systems with Earthquake - (20)

HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
 
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...
NTTドコモ様 導入事例 OpenStack Summit 2015 Tokyo 講演「After One year of OpenStack Cloud...
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
More Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoMore Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit Juno
 
Arakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeArakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value store
 
Usernetes: Kubernetes as a non-root user
Usernetes: Kubernetes as a non-root userUsernetes: Kubernetes as a non-root user
Usernetes: Kubernetes as a non-root user
 
Container Security
Container SecurityContainer Security
Container Security
 
SequenceL Auto-Parallelizing Toolset Intro slideshare
SequenceL Auto-Parallelizing Toolset Intro slideshareSequenceL Auto-Parallelizing Toolset Intro slideshare
SequenceL Auto-Parallelizing Toolset Intro slideshare
 
SequenceL intro slideshare
SequenceL intro slideshareSequenceL intro slideshare
SequenceL intro slideshare
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
 
Building Cloud Virtual Topologies with Ravello and Ansible
Building Cloud Virtual Topologies with Ravello and AnsibleBuilding Cloud Virtual Topologies with Ravello and Ansible
Building Cloud Virtual Topologies with Ravello and Ansible
 
No one puts java in the container
No one puts java in the containerNo one puts java in the container
No one puts java in the container
 
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
 
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
Kubernetes Native Java and Eclipse MicroProfile | EclipseCon Europe 2019
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
Surge2012
Surge2012Surge2012
Surge2012
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016
 
Software Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVSoftware Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFV
 
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
 

Plus de Akihiro Suda

20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_Akihiro Suda
 
20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdf20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdfAkihiro Suda
 
20240201 [HPC Containers] Rootless Containers.pdf
20240201 [HPC Containers] Rootless Containers.pdf20240201 [HPC Containers] Rootless Containers.pdf
20240201 [HPC Containers] Rootless Containers.pdfAkihiro Suda
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless PodmanAkihiro Suda
 
[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilion[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilionAkihiro Suda
 
[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilion[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilionAkihiro Suda
 
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdfAkihiro Suda
 
[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2Akihiro Suda
 
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...Akihiro Suda
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesAkihiro Suda
 
[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilion[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilionAkihiro Suda
 
[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilion[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilionAkihiro Suda
 
[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?Akihiro Suda
 
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with DockerfileAkihiro Suda
 
[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] Lima[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] LimaAkihiro Suda
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOSAkihiro Suda
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Akihiro Suda
 
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...Akihiro Suda
 
[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10Akihiro Suda
 

Plus de Akihiro Suda (20)

20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_
 
20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdf20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdf
 
20240201 [HPC Containers] Rootless Containers.pdf
20240201 [HPC Containers] Rootless Containers.pdf20240201 [HPC Containers] Rootless Containers.pdf
20240201 [HPC Containers] Rootless Containers.pdf
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman
 
[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilion[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilion
 
[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilion[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilion
 
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
 
[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2
 
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimes
 
[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilion[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilion
 
[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilion[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilion
 
[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?
 
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
 
[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] Lima[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] Lima
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
 
[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10
 

Dernier

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Dernier (20)

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Tackling non-determinism in Hadoop - Testing and debugging distributed systems with Earthquake -

  • 1. Copyright© 2016 NTT Corp. All Rights Reserved. Tackling non-determinism in Hadoop Testing and debugging distributed systems with Earthquake GitHub : https://github.com/osrg/earthquake/ Twitter: @EarthquakeDMCK Akihiro Suda <suda.akihiro@lab.ntt.co.jp> NTT Software Innovation Center 30/1/2016, at Brussels
  • 2. 2Copyright© 2016 NTT Corp. All Rights Reserved. • What are distributed systems? (e.g., Hadoop) Why is it difficult to test them? • Bugs we found/reproduced with Earthquake https://github.com/osrg/earthquake/ • How Earthquake controls non-determinism • Lessons we learned What I will talk about
  • 3. 3Copyright© 2016 NTT Corp. All Rights Reserved. WHAT ARE DISTRIBUTED SYSTEMS? WHY IS IT DIFFICULT TO TEST THEM?
  • 4. And there are many more.. Machine Learning Recommend attractive contents to users Clustered Containers Scalable DevOps Big Data Analysis Improve business strategy NoSQL & NewSQL Highly available and reliable service
  • 5. 5Copyright© 2016 NTT Corp. All Rights Reserved. Scalability is the charm of distributed systems But... it forces the system to be composed of many machines and many software.. Something in the system definitely goes awry! We need to test whether the system is tolerant of awry things, but it’s difficult to non-deternimism Why is it difficult to test distributed systems? Some disk can crash Some packet can be delayed/lost Some process can run slowly (although not specific to dist sys) Clip arts are from openclipart.org
  • 6. 6Copyright© 2016 NTT Corp. All Rights Reserved. But.. Good News: distributed systems are well tested! Software Production code (LOC) Test code (LOC) MapReduce 95K 87K YARN 178K 121K HDFS 152K 150K ZooKeeper 33K 27K Spark 167K 128K Flume 46K 34K Cassandra 168K 78K HBase 571K 222K Kubernetes 309K 153K etcd 34K 27K Data are measured at 14/01/2016, using CLOC Prod Test
  • 7. 7Copyright© 2016 NTT Corp. All Rights Reserved. • over 3 bug reports per day, on average • 50% of bugs take several months to get resolved • 26% of bugs are "flaky", i.e., hard to reproduce due to non-deternimism What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems [ Gunawi et.al., SoCC'14] Simple Testing Can Prevent Most Critical Failures.. [Yuan et. al., OSDI'14] Bad News: they are still buggy
  • 8. 8Copyright© 2016 NTT Corp. All Rights Reserved. https://builds.apache.org/job/%s-trunk/ MapReduce YARN HDFS ZooKeeper Data are captured at 14/01/2016 HBase Build Build Time Red = Fail Never seen fully successful build, even on my local machine
  • 9. 9Copyright© 2016 NTT Corp. All Rights Reserved. • Sometimes we can see test failures on Jenkins • So.. if we run test codes repeatedly on a PC, we can locally reproduce the bug, right..? No! Just repeating tests does not make much sense, due to poor non-determinism variation We need to control non-determinism! Our Earthquake increases non-determinism We revisit the plot later higher is better
  • 10. 10Copyright© 2016 NTT Corp. All Rights Reserved. Earthquake: programmable fuzzy scheduler Earthquake Disk access events (FUSE) Ethernet packet events (iptables, Openflow) Function Call/Return events (byteman for Java, clang for C) Linux threads (sched_setattr(2))
  • 11. 11Copyright© 2016 NTT Corp. All Rights Reserved. How to use Earthquake? For Thread scheduling $ git clone https://github.com/AkihiroSuda/microearthquake $ sudo microearthquake pid $PID For Ethernet scheduling $ go get github.com/osrg/earthquake/earthquake-container $ sudo earthquake-container run -i -t ubuntu $COMMAND Will be unified to `earthquake-container` in February Docker-like CLI (For non-Dockerized version, please refer to docs)
  • 12. 12Copyright© 2016 NTT Corp. All Rights Reserved. BUGS WE FOUND/REPRODUCED WITH EARTHQUAKE Reproduction codes are included in the github repo: https://github.com/osrg/earthquake/tree/master/example Details of Earthquake is followed later
  • 13. 13Copyright© 2016 NTT Corp. All Rights Reserved. • Provides distributed lock for HA in Hadoop • HA of Hadoop relies on ZooKeeper • Also used by Spark, Mesos, Docker Swarm, Kafka, ... • Similar softwares: etcd, Consul, Atomix,.. ZooKeeper: distributed lock service for Hadoop (and so on) YARN Resource Manager YARN Resource Manager YARN Resource Manager Active Stand-by Stand-by
  • 14. 14Copyright© 2016 NTT Corp. All Rights Reserved. • Bug: New node cannot participate to ZK cluster properly New node cannot become a leader of ZK cluster itself • Cause: distributed race (ZAB packet vs FLE packet) • ZAB.. broadcast protocol for data • FLE.. leader election protocol for ZK cluster itself • Code is becoming spaghetti due to many contributions which had not been planned in the first release Interaction between ZAB and FLE is not obvious Uses different TCP connection Non-deterministic packet order Found https://issues.apache.org/jira/browse/ZOOKEEPER-2212 Leader of ZK cluster New ZK node ZAB Packet [2888/tcp] FLE Packet [3888/tcp] Photo: Wikipedia
  • 16. 16Copyright© 2016 NTT Corp. All Rights Reserved. • Expected: ZK cluster works even when 𝑵/𝟐 nodes crashed (𝑵 is an odd number, 3 or 5 in most cases) • Real: it doesn't work!  Example: No MapReduce job can be submitted to Hadoop Affect to production MapReduce Not participating properly
  • 17. 17Copyright© 2016 NTT Corp. All Rights Reserved. • We permuted some specific Ethernet packets in random order using Earthquake • Unlike Linux netem (or FreeBSD dummynet), Earthquake provides much more programmable interface • Reproducibility: 0.0%  21.8% (tested 1,000 times) • We could not reproduce the bug even after 5,000 times traditional testing (60 hours!) How hard is it to reproduce? Earthquake Open vSwitch + ryu SDN framework ZooKeeper cluster
  • 18. 18Copyright© 2016 NTT Corp. All Rights Reserved. We define the distributed execution pattern based on code coverage: 𝑷 = 𝒑 𝟏,𝟏 ⋯ 𝒑 𝟏,𝑵 ⋮ ⋱ ⋮ 𝒑 𝑳,𝟏 ⋯ 𝒑 𝑳,𝑵 • 𝐿: Number of line of codes • 𝑁: Number of nodes • 𝑝 𝑖,𝑗 : 1 if the node 𝑗 covers the branch in line 𝑖, otherwise 0 • We used JaCoCo: Java Code Coverage Library Distributed Execution Pattern Earthquake achieves faster pattern growth That's why we can hit the bug
  • 19. 19Copyright© 2016 NTT Corp. All Rights Reserved. • YARN assigns MapReduce jobs to computation nodes • Computation nodes reports their health status for fault-tolerant job scheduling YARN: the kernel of Hadoop YARN Node Manager YARN Node Manager YARN Node Manager YARN Resource Manager ZooKeeper
  • 20. 20Copyright© 2016 NTT Corp. All Rights Reserved. • Bug: YARN cannot detect disk failure cases where mkdir()/rmdir() blocks • We noticed that the bug can occur theoretically when we are reading the code, and actually produced the bug using Earthquake • When we should inject the fault is pre-known; so we manually wrote a concrete scenario using Earthquake API • Much more realistic than mocking Found https://issues.apache.org/jira/browse/YARN-4301 mkdir EIO mkdir ...
  • 21. 21Copyright© 2016 NTT Corp. All Rights Reserved. R: ZOOKEEPER-2080: CnxManager race • Not reproduced nor analyzed for past 2 years • Can affect production, not only in JUnit test R: YARN-1978: NM TestLogAggregationService race R: YARN-4168: NM TestLogAggregationService race F: YARN-4543: NM TestNodeStatusUpdater race F: YARN-4548: RM TestCapacityScheduler race R: YARN-4556: RM TestFifoScheduler race R: etcd #4006: kvstore_test race R: etcd #4039: kv_test race and more.. Found/Reproduced flaky tests (excluding simple "timed out" fail)
  • 22. 22Copyright© 2016 NTT Corp. All Rights Reserved. It still matters! For developers.. It's a barrier to promotion of CI • If many tests are flaky, developers tend to ignore CI failure  overlook real bugs For users.. It's a barrier to risk assessment for production • No one can tell flaky tests from real bugs Flaky test doesn't matter, as it doesn't affect production?
  • 23. 23Copyright© 2016 NTT Corp. All Rights Reserved. HOW EARTHQUAKE CONTROLS NON-DETERNIMISM
  • 24. 24Copyright© 2016 NTT Corp. All Rights Reserved. Earthquake: programmable fuzzy scheduler Earthquake Disk access events (FUSE) Ethernet packet events (iptables, Openflow) Function Call/Return events (byteman for Java, clang for C) Linux threads (sched_setattr(2))
  • 25. 25Copyright© 2016 NTT Corp. All Rights Reserved. • Non-invasiveness • Incremental Adoptability • Language Independence Three Major Principles of Earthquake
  • 26. 26Copyright© 2016 NTT Corp. All Rights Reserved. • It's possible to insert extra hook codes for controlling non- determinism • Great academic works: SAMC[OSDI'14], DEMi[NSDI'16],.. • But we avoid such "invasive" modifications: • We need to test multiple versions • For evaluating which version is suitable for production • For bisection testing (`git bisect`) • We don't like to rebase the modification patch! • We don't like to see false-positives! • modification itself can be a false bug • debugging false bugs is vain Principle 1 of 3: Non-invasiveness
  • 27. 27Copyright© 2016 NTT Corp. All Rights Reserved. • Real implementation of distributed systems are very complicated and difficult to understand • So we need to start testing without complex knowledge • But after we got the details, we want to incrementally utilize the new knowledge for improving test planning Principle 2 of 3: Incremental Adoptability Software Production code (LOC) MapReduce 95K YARN 178K HDFS 152K ZooKeeper 33K Spark 167K Flume 46K Cassandra 168K HBase 571K Kubernetes 309K etcd 34K Paper Pages Google's MapReduce [OSDI '04] 13 Chubby (==ZooKeeper, etcd) [OSDI'06] 16 BigTable (==HBase) [OSDI'06] 16 Borg (==Kubernetes) [Eurosys'15] 18 Spark [NSDI'12] 14
  • 28. 28Copyright© 2016 NTT Corp. All Rights Reserved. • Many distributed systems are written in Java • But the world won't be unified under Java • Scala.. Spark, Kafka • Go.. etcd, Consul • Erlang.. Riak, RabbitMQ • Clojure.. Storm, .. • C++.. Kudu, • .. • We avoid language-dependent technique so that we can test any software written in any language • Earthquake has some Java/C dependent components, but they are just extensions, not mandatory. Principle 3 of 3: Language Independence
  • 29. 29Copyright© 2016 NTT Corp. All Rights Reserved. • Random is the default behavior • No state transition model is required • If you are suspicious of some specific scenarios, you can program the scenarios! • e.g., If FuncFoo() and FuncBar() happen concurrently (in XX msecs window) .. • You can ensure that FuncFoo() returns before FuncBar() returns • You can inject a fault between FuncFoo() and FuncBar() How Earthquake schedules events and faults node1:FuncFoo() node2:FuncBar() Split network node1<->node2 Events: • PacketEvent (iptables, Openflow) • FilesystemEvent (FUSE) • JavaFunctionEvent(byteman) • SyslogEvent ..
  • 30. 30Copyright© 2016 NTT Corp. All Rights Reserved. Earthquake API (Go) type ExplorePolicy interface { QueueNextEvent(Event) GetNextActionChan() chan Action } func (p *MyPolicy) QueueNextEvent(event Event) { action := event.DefaultAction() p.timeBoundedQ.Enqueue(action, 10 * Millisecond, 30 * Millisecond) } func (p *MyPolicy) GetNextActionChan() chan Action {.. } func NewMyPolicy() ExplorePolicy { return &MyPolicy {..} } func main() { RegisterPolicy("mypolicy", NewMyPolicy) os.Exit(CLIMain(os.Args)) } Events: • PacketEvent (iptables, Openflow) • FilesystemEvent (FUSE) • JavaFunctionEvent(byteman) • SyslogEvent .. Actions: • (Default action) • PacketFaultAction • FilesystemFaultAction • ShellAction .. Fired in [10ms, 30ms] Go is poor at DLL.. So user-written policy plugin is statically linked (as in Docker Machine drivers)
  • 31. 31Copyright© 2016 NTT Corp. All Rights Reserved. • RPC mode • Best for complex scenarios • Doesn't scale so much, but it doesn't matter for tests • Auto-pilot, single-binary mode • Best for the default randomized scenario Earthquake API Orchestrator Inspector Inspector Inspector
  • 32. 32Copyright© 2016 NTT Corp. All Rights Reserved. • Linux 3.14 introduced SCHED_DEADLINE scheduler • Earthquake is not related to performance testing; but SCHED_DEADLINE is much more configurable than renice(8) with default CFS scheduler; so we use it • Earthquake changes runtime of threads, using SCHED_DEADLINE + sched_setattr(2) • Finds runtime set that satisfies the deadline constraint using Dirichlet-distributed random values (∑𝑟𝑖 = 1000) How Earthquake sets scheduling attributes? thread1 300us thread2 200us thread3 400us thr4 100us deadline: 1,000us
  • 33. 33Copyright© 2016 NTT Corp. All Rights Reserved. • Jepsen • specialized in network partition • specialized in testing linearizability (strong consistency) • Famous for "Call Me Maybe" blog: http://jepsen.io/ • Earthquake is much more generalized • The bugs we found/reproduced are basically beyond the scope of Jepsen • Earthquake can be also combined with Jepsen! It will be our next work.. Similar great tool: Jepsen causes network partition tests linearizablity increases non-determinism (especially of thread scheduling) Jepsen Earthquake ...
  • 34. 34Copyright© 2016 NTT Corp. All Rights Reserved. LESSONS WE LEARNED
  • 35. 35Copyright© 2016 NTT Corp. All Rights Reserved. "Liveness": something good eventually happens • c.f. "safety": nothing bad happens (typical assertions are safety) Problem: Flaky test expects something good happens in a certain timeout, which depends on the performance of the machine Solution: make sure any test succeeds on slow machines (of course performance testing is important, but it should be separated) Performance should be split from liveness invokeAsyncProcess(); sleep(certainTimeout); // some tests lack even this sleep for async proc assertTrue(checkSomethingGoodHasHappened()); invokeAsyncProcess(); bool b; for (int i = 0; i < MANY_RETRIES; i++) { sleep(certainTimeout); b = checkSomethingGoodHasHappened(); // idempotent if (b) break; } assertTrue(b);
  • 36. 36Copyright© 2016 NTT Corp. All Rights Reserved. • Of course it is definitely better than nothing • Problem: But sometimes it's useless for distributed systems, because it's too hard for humans to inspect logs from multiple nodes • Unclear ordering • Unclear structure (needs an alternative to log4j!) • enormous non-interesting logs • Solution: So we made a tool that picks up some branches that only occur in failed experiments for analyzing the root cause of the bug LOG.debug() is sometimes useless for debugging Analyzer Scanning 00000000: experiment successful=true Scanning 00000001: experiment successful=false .. Suspicious: QuorumCnxManager::receiveConnection() line 382-382 Suspicious: QuorumCnxManager$Listener::run() line 657-657 Suspicious: QuorumCnxManager$SendWorker::run() line 809-811
  • 37. 37Copyright© 2016 NTT Corp. All Rights Reserved. Each of the Hadoop components has plenty of JUnit test codes But we also need integration tests with multiple components! • e.g. ZOOKEEPER-2347: critical deadlock which leaded to withdrawal of ZooKeeper release v3.4.7 • Found (not by us nor with Earthquake) in an integration test with ZooKeeper and HBase Apache BigTop is an interesting project about this! • But.. how should we prepare realistic workloads systematically? It's an open question.. Need for integration test framework
  • 38. 38Copyright© 2016 NTT Corp. All Rights Reserved. CONCLUSION
  • 39. 39Copyright© 2016 NTT Corp. All Rights Reserved. • Distributed systems are difficult to test and debug, due to non-determinism • Earthquake controls non-determinism so that you can find and reproduce such "flaky" bugs • Please try and give us your feedback: https://github.com/osrg/earthquake/ Conclusion
  • 40. 40Copyright© 2016 NTT Corp. All Rights Reserved. How to use Earthquake? For Thread scheduling $ git clone https://github.com/AkihiroSuda/microearthquake $ sudo microearthquake pid $PID For Ethernet scheduling $ go get github.com/osrg/earthquake/earthquake-container $ sudo earthquake-container run -i -t ubuntu $COMMAND Will be unified to `earthquake-container` in February Docker-like CLI (For non-Dockerized version, please refer to docs)
  • 41. 41Copyright© 2016 NTT Corp. All Rights Reserved. ADDENDUM
  • 42. 42Copyright© 2016 NTT Corp. All Rights Reserved. Terminal 1 Quick start for reproducing flaky bugs in Hadoop [1/3] $ git clone https://github.com/apache/hadoop.git $ cd hadoop $ ./start-build-env.sh build-env$ (switch to Terminal 2) • `start-build-env.sh` starts the build environment in a Docker container. Suppose the started container name is $HADOOP_BUILD_ENV. • If you have a package dependency issue, you may have to run `mvn package –Pdist –DskipTests -Dtar` in the container .
  • 43. 43Copyright© 2016 NTT Corp. All Rights Reserved. Terminal 2 $ git clone https://github.com/AkihiroSuda/microearthquake $ cd microearthquake $ git checkout tag20160203 $ sudo apt-get install –y python3-{pip,dev} lib{cap,ffi}-dev $ sudo pip3 install cffi colorama numpy python-prctl psutil $ (cd microearthquake; python3 native_build.py) $ docker inspect --format '{{.State.Pid}}' $HADOOP_BUILD_ENV 4242 $ sudo ./bin/microearthquake pid 4242 (switch to Terminal 1) Quick start for reproducing flaky bugs in Hadoop [2/3] • `microearthauke` fuzzes the thread scheduling for all the threads under the process tree (pid=4242). • `microearthquake` will be rewritten in Go, and unified to `earthquake-container`. (planned in February)
  • 44. 44Copyright© 2016 NTT Corp. All Rights Reserved. Terminal 1 build-env$ while mvn test –Dtest=TestFooBar; do true; done (the failure will be eventually reproduced) • Suppose TestFooBar.java is failing intermittently on Jenkins. • You can find failing tests at https://hadoop.apache.org/issue_tracking.html Quick start for reproducing flaky bugs in Hadoop [3/3]
  • 45. 45Copyright© 2016 NTT Corp. All Rights Reserved. Q. How can I write a test? A. Basically you don't have to write any new test, as you can use existing xUnit tests. If you want to test a real cluster rather than an xUnit pseudo cluster, please refer to examples: https://github.com/osrg/earthquake/tree/master/example . Q. Can I test standalone multi-threaded programs? A. Yes, by fuzzing thread scheduling. Q. Do I need to write ExplorePolicy manually? It's bothersome. A. No, basically the default random behavior is enough. FAQs Please feel free to open an issue in GitHub: https://github.com/osrg/earthquake/issues