High-performance Hadoop Clusters
with Joyent’s SmartOS
NoSQL Now!
Aug 21, 2013

Ben Wen, Joyent
Renat Khasanshyn, Altoros

About Altoros
• Cloud Foundry PaaS consulting and integration
• Hadoop/NoSQL performance engineering
• Cluster automation and server templates on Joyent, AWS, SoftLayer, Rackspace, CloudStack and OpenStack using Chef/Puppet and RightScale
• 200+ employees globally (US, Eastern Europe, Argentina, UK, Denmark, Switzerland, Norway)
• Vertical application experience:
  - Automated device analytics
  - Advertising analytics
  - Big data warehouse

Customers

Partners

About Joyent
• The high-performance public cloud infrastructure provider
• Cloud IaaS virtual machines:
  - Linux, Windows, BSD, SmartOS (fka Solaris) with Zones
• Core founding sponsors of Node.js
• Four global datacenters
• Key markets:
  - Big data, mobile, e-commerce, financial services, SaaS
• Open source contributions:
  - Node.js, KVM, DTrace, ZFS, SmartOS

The Problem
• Running on bare metal is practical only for some organizations
• Performance varies significantly across job types
• In fact, for many jobs, less = more
• Utilization of most production clusters is low
• Optimizing Hadoop/MapReduce performance is hard

Hadoop Vendors
• Get upset when the truth comes out!
• Biased (to the shiny side of the coin)
• Often add controversy and confusion

Goals of the Study
- For Hadoop, what is the impact of container-based virtualization vs hardware emulation (KVM)?*
- What are the Hadoop optimization strategies? Is there a “rule of thumb” for determining the optimization approach?
- What are the optimal Hadoop cluster settings for the 1TB TeraSort benchmark on 100- and 400-node clusters running Linux and SmartOS on the Joyent Public Cloud?

Factors Influencing Performance
- Physical (disks, CPU, network)
- OS/hypervisor (especially for virtualized environments)
- Hadoop/MapReduce (tons of settings)
- Algorithmic (data structures, join strategies, big-O…)
- Implementation (code efficiency, architecture decisions that fit all other factors)

Benchmarking tool set:

Ubuntu: an operating system based on the Debian Linux distribution and distributed as free and open source software.

SmartOS: an open source Unix operating system for the cloud, based on illumos, the active fork of OpenSolaris technology. It uses containerized OS virtualization called Zones (think of a mature LXC with secure RBAC and auditing).

Apache Hadoop: an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Derived from Google's MapReduce and Google File System (GFS) papers, Hadoop enables applications to work with thousands of computationally independent computers and petabytes of data.

Benchmarking tool set:

Chef: written by Opscode and released as open source under the Apache License 2.0, Chef is a DevOps tool used for configuring cloud services or streamlining the configuration of a company's internal servers. Chef automatically sets up and tweaks the operating systems and programs that run in massive data centers.

Unravel: developed by the creators of the Starfish project at Duke University, Unravel brings run-time profiling of Hadoop jobs followed by cost-based database query optimization. Unravel connects to streams of Hadoop and system instrumentation data and applies statistical machine learning to optimize the cost of Hadoop jobs and increase cluster utilization.

Comparing I/O Path on Bare Metal Unix vs Zones vs KVM

Bare-metal

Kernel Virtualization
• KVM is encapsulated by hypervisor
• Code path is much more circuitous in a KVM process
• Performance is impacted

OS Virtualization
• Zones partition at the OS level
• Code path is essentially the same as bare metal
• Performance is higher

No overhead for Zones: stack traces show how a network packet is transmitted from Bare Metal vs Joyent Zone vs Fedora VM on KVM

Bare Metal
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x14

Joyent Zone (aka SmartMachine)
Note that a Joyent Zone is exactly the same as “Bare Metal”. It skips stepping through the 39 functions required when Fedora is running on KVM/qemu.
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x14

Fedora VM on KVM
kernel`start_xmit
kernel`dtrace_int3_handler+0xd2
kernel`kmem_cache_free+0x2f
kernel`dtrace_int3+0x3a
kernel`eth_header
kernel`__kfree_skb+0x47
kernel`start_xmit+0x1
kernel`dev_hard_start_xmit+0x322
kernel`sch_direct_xmit+0xef
kernel`dev_queue_xmit+0x184
kernel`eth_header+0x3a
kernel`neigh_resolve_output+0x11e
kernel`nf_hook_slow+0x75
kernel`ip_finish_output
kernel`ip_finish_output+0x17e
kernel`ip_output+0x98
kernel`__ip_local_out+0xa4
kernel`ip_local_out+0x29
kernel`ip_queue_xmit+0x14f
kernel`tcp_transmit_skb+0x3e4
kernel`__kmalloc_node_track_caller+0x185
kernel`sk_stream_alloc_skb+0x41
kernel`tcp_write_xmit+0xf7
kernel`__alloc_skb+0x8c
kernel`__tcp_push_pending_frames+0x26
kernel`tcp_sendmsg+0x895
kernel`inet_sendmsg+0x64
kernel`sock_aio_write+0x13a
kernel`do_sync_write+0xd2
kernel`security_file_permission+0x2c
kernel`rw_verify_area+0x61
kernel`vfs_write+0x16d
kernel`sys_write+0x4a
kernel`sys_rt_sigprocmask+0x84
kernel`system_call_fastpath+0x16
igb`igb_tx_ring_send+0x33
mac`mac_hwring_tx+0x1d
mac`mac_tx_send+0x5dc
mac`mac_tx_single_ring_mode+0x6e
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x149
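
The stack traces above use the `module`function+offset` frame notation. A minimal sketch of how such frames could be parsed and tallied per kernel module follows; the helper names `parse_frame` and `frames_per_module` are illustrative and not part of any tool named in the slides.

```python
from collections import Counter

# Transmit stack shared by bare metal and a Joyent Zone, copied from the traces above.
BARE_METAL_STACK = """\
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x14
"""


def parse_frame(line):
    """Split 'module`function+0xoffset' into (module, function, offset or None)."""
    module, _, rest = line.strip().partition("`")
    function, _, offset = rest.partition("+")
    return module, function, int(offset, 16) if offset else None


def frames_per_module(stack_text):
    """Count how many frames of a stack belong to each module."""
    frames = [parse_frame(line) for line in stack_text.splitlines() if line.strip()]
    return Counter(module for module, _, _ in frames)


counts = frames_per_module(BARE_METAL_STACK)
total = sum(counts.values())
print(total, dict(counts))
```

Counting the Fedora-on-KVM trace the same way gives 55 frames against 16 here, which matches the 39 extra functions called out on the slide.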

Benchmarking setup:
Three identical Apache Hadoop 1.0.4 clusters were provisioned on Joyent infrastructure using the Joyent REST API and Opscode Chef. Each cluster was tuned for optimal performance following best practices for the TeraSort benchmark.

Benchmarking:
1) Cluster of 100 virtual machines
A script launches the virtual machines and stores information about them in a JSON file.
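
The slides do not show the launch script itself. As a rough sketch of the idea only, the loop below launches a number of machines and writes their records to a JSON inventory file. The `launch_machine` callback, the `fake_launch` stand-in, the node naming scheme and the record fields are all hypothetical; no real Joyent API call is shown.

```python
import json
from pathlib import Path


def launch_cluster(count, launch_machine, inventory_path):
    """Launch `count` machines and record them in a JSON inventory file.

    `launch_machine` is a hypothetical callable standing in for the real
    provisioning request; it takes a machine name and returns a dict.
    """
    machines = []
    for index in range(1, count + 1):
        name = f"hadoop-node-{index}"  # assumed naming scheme
        machines.append(launch_machine(name))
    Path(inventory_path).write_text(json.dumps(machines, indent=2))
    return machines


def fake_launch(name):
    # Placeholder record; a real launcher would return what the cloud API reports.
    return {"name": name, "state": "provisioning"}


if __name__ == "__main__":
    launched = launch_cluster(100, fake_launch, "cluster.json")
    print(f"recorded {len(launched)} machines")
```

Keeping the inventory in a plain JSON file means later steps, such as role assignment during configuration, can read the same file without querying the cloud again.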

Benchmarking:
2) We used Chef to install and configure Hadoop
Each machine in the cluster is configured according to its role in the cluster using Chef cookbooks.

Benchmarking:
3) We ran the TeraGen program to generate 1TB of data
As part of the TeraSort benchmark, a dataset is generated using the TeraGen utility included in Apache Hadoop.
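
TeraGen takes a row count rather than a byte size, and each row it writes is 100 bytes, so a 1TB dataset corresponds to ten billion rows. The sketch below does that arithmetic and assembles a command line. The jar file name, the output directory and the choice of a decimal terabyte are assumptions, since the slides do not give them.

```python
ROW_BYTES = 100   # TeraGen writes fixed-size 100-byte rows
ONE_TB = 10**12   # assumption: "1TB" means a decimal terabyte


def teragen_rows(target_bytes):
    """Number of TeraGen rows needed to produce target_bytes of input."""
    return target_bytes // ROW_BYTES


def teragen_command(rows, output_dir, examples_jar="hadoop-examples-1.0.4.jar"):
    """Build the argument list for a TeraGen run (jar name is an assumption)."""
    return ["hadoop", "jar", examples_jar, "teragen", str(rows), output_dir]


rows = teragen_rows(ONE_TB)
print(rows)
print(" ".join(teragen_command(rows, "/benchmarks/terasort-input")))
```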

Benchmarking:
4) We ran the TeraSort benchmark
A Hadoop TeraSort job using the previously generated dataset is submitted on one of the nodes.
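
The benchmark reports run time in seconds. One simple way to capture wall-clock time around a job submission is sketched below; this is an illustration, not the method the authors describe. A stand-in command is used so the sketch runs anywhere.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command and return (exit_code, elapsed_seconds)."""
    started = time.monotonic()
    completed = subprocess.run(command)
    return completed.returncode, time.monotonic() - started


# Stand-in command; on a cluster node this would be the TeraSort submission,
# e.g. ["hadoop", "jar", <examples jar>, "terasort", <input dir>, <output dir>].
exit_code, seconds = timed_run([sys.executable, "-c", "pass"])
print(f"exit={exit_code} elapsed={seconds:.2f}s")
```

Hadoop's own job report also records start and finish times, which excludes client-side startup and is usually the better figure to quote.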

Benchmarking:
6) The Hadoop output file was as follows
See: Hadoop job_201210261134_0010 on hadoop-smartos-r-1.html
The key difference between the two clusters was revealed when monitoring I/O and CPU utilization. The Ubuntu cluster was spending too much time in the OS kernel while performing I/O operations, as demonstrated in Figure 1.

Hadoop Cluster Specifications for Linux and SmartOS
The SmartOS cluster used CPU much more efficiently and was able to utilize a larger number of Hadoop mappers and reducers. Key configuration parameters for Hadoop:

Parameter                                  Linux                        SmartOS
Operating system base                      Linux                        SmartOS
Memory                                     32 GB                        32 GB
Image                                      sdc:jpc:ubuntu12.04:2.0.2    sdc:sdc:base64:1.8.1
CPUs                                       4 VCPUs                      4 VCPUs
Nodes (virtual instances)                  98                           100
Input size                                 1T                           1T
Run time (seconds, lower is better)        819                          360
Mappers                                    6                            10
Reducers                                   3                            8
io.sort.mb                                 610                          610
io.sort.factor                             300                          300
dfs.block.size (MB)                        512                          512
mapred.reduce.child.java.opts              -Xmx=2700m                   -Xmx=2500m
mapred.job.shuffle.input.buffer.percent    1                            1
mapred.reduce.slowstart.completed.maps     0.5                          0.5
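
The two run times reported above, 819 seconds on Linux and 360 seconds on SmartOS, can be turned into a speedup and an approximate aggregate throughput. This is a quick arithmetic sketch; treating "1T" as a decimal terabyte is an assumption.

```python
RUN_TIME_SECONDS = {"Linux": 819, "SmartOS": 360}  # from the specifications above
INPUT_BYTES = 10**12  # assumption: "1T" means 10^12 bytes

speedup = RUN_TIME_SECONDS["Linux"] / RUN_TIME_SECONDS["SmartOS"]
reduction = 1 - RUN_TIME_SECONDS["SmartOS"] / RUN_TIME_SECONDS["Linux"]

print(f"SmartOS speedup over Linux: {speedup:.3f}x")
print(f"Run time reduction: {reduction:.1%}")
for name, secs in RUN_TIME_SECONDS.items():
    gb_per_second = INPUT_BYTES / secs / 1e9
    print(f"{name}: {gb_per_second:.2f} GB/s aggregate sort throughput")
```

The comparison is not perfectly like-for-like: the Linux cluster had 98 nodes and the SmartOS cluster 100, and the two ran different numbers of mappers and reducers.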
Measure: CPU, Memory, Disk, Network (Ganglia, Cacti)
CPU metrics example
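
The slides cite Ganglia and Cacti for these measurements. As a hedged illustration of the underlying metric on Linux only, the share of CPU time spent in the kernel can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The sample values below are made up for the example and are not benchmark data.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(value) for value in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # older kernels report fewer fields
    return dict(zip(FIELDS, values))


def kernel_share(before, after):
    """Fraction of elapsed CPU ticks spent in system, irq and softirq time."""
    delta = {key: after[key] - before[key] for key in FIELDS}
    total = sum(delta.values())
    kernel = delta["system"] + delta["irq"] + delta["softirq"]
    return kernel / total if total else 0.0


# Made-up samples for illustration only.
sample_a = parse_cpu_line("cpu  1000 0 500 8000 400 10 90 0 0 0")
sample_b = parse_cpu_line("cpu  1600 0 1100 8500 700 20 180 0 0 0")
share = kernel_share(sample_a, sample_b)
print(f"kernel share: {share:.1%}")
```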
A Sample Hive Query Plan

Measure/Optimize MapReduce jobs using 3rd party tools

Measure/Optimize MapReduce jobs

OS/hypervisor choice matters – more benchmarks coming?

The key difference between the clusters was revealed when monitoring I/O and CPU utilization. The Ubuntu cluster was spending too much time in the OS kernel while performing I/O. (For copies of config files and job reports, email renat.k@altoros.com.)

Simple ways to increase Hadoop performance
1) Basic cluster configuration is key (one-time effort for typical workloads)
   - DATA DISK SCALING
   - COMPRESSION
   - JVM REUSE POLICY
   - HDFS BLOCK SIZE
   - MAP-SIDE SPILLS
   - COPY/SHUFFLE PHASE TUNING
   - REDUCE-SIDE SPILLS
2) Tune the number of map and reduce tasks appropriately
3) Consider GPU for some workloads
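
The tuning areas listed above correspond roughly to Hadoop 1.x configuration properties. The grouping below is an interpretation, not something the slides state, and the sketch only shows how a handful of properties could be rendered in Hadoop's `<configuration>` XML layout. The sample values are taken from the SmartOS column of the cluster specifications.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the slide's tuning areas to Hadoop 1.x property names.
TUNING_AREAS = {
    "DATA DISK SCALING": ["dfs.data.dir", "mapred.local.dir"],
    "COMPRESSION": ["mapred.compress.map.output"],
    "JVM REUSE POLICY": ["mapred.job.reuse.jvm.num.tasks"],
    "HDFS BLOCK SIZE": ["dfs.block.size"],
    "MAP-SIDE SPILLS": ["io.sort.mb", "io.sort.factor", "io.sort.spill.percent"],
    "COPY/SHUFFLE PHASE TUNING": [
        "mapred.reduce.parallel.copies",
        "mapred.job.shuffle.input.buffer.percent",
    ],
    "REDUCE-SIDE SPILLS": ["mapred.job.reduce.input.buffer.percent"],
}


def render_site_xml(properties):
    """Render a {name: value} dict as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values reported for the SmartOS cluster.
smartos_settings = {
    "io.sort.mb": 610,
    "io.sort.factor": 300,
    "mapred.job.shuffle.input.buffer.percent": 1,
    "mapred.reduce.slowstart.completed.maps": 0.5,
}
xml_text = render_site_xml(smartos_settings)
print(xml_text)
```

In a Hadoop 1.x installation the `mapred.*` and `io.sort.*` properties would normally go in `mapred-site.xml` and the `dfs.*` properties in `hdfs-site.xml`.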

Performance book by Joyent's Brendan Gregg
Systems Performance
• Forthcoming in October
• Includes cloud performance
• Co-author of the DTrace book
• More on his techniques: http://dtrace.org/blogs/brendan/

Thank you!
Ben Wen: benwen@joyent.com

Renat Khasanshyn: renat.k@altoros.com

@renatco (650) 395-7002


High-performance Hadoop Clusters with Joyent’s SmartOS (10, 100 & 200 nodes)
