This document discusses optimizing Dell PowerEdge server configurations for Hadoop deployments. It recommends tested server configurations, such as the PowerEdge R720 and R720XD, that balance compute and storage, and describes a reference architecture that combines these servers with networking best practices and validated Cloudera software configurations to provide a proven, optimized big data platform.
2. What is Big Data?
• Big Data is when the data itself is part of the problem
• Volume
– A large amount of data, growing at high rates
• Velocity
– The speed at which the data must be processed
• Variety
– The range of data types and data structures
4. Dell | Cloudera Apache Hadoop Solution
• A Proven Big Data Platform
– Cloudera CDH4 Hadoop Distribution with Cloudera Manager
– Validated and Supported Reference Architecture
– Production deployments across all verticals
• Dell Crowbar provides deployment and management at scale
– Integrated with Cloudera Manager
– Bare metal to deployed cluster in hours
– Lifecycle management for ongoing operations
• Dell Partner Ecosystem
– Pentaho for Data Integration
– Pentaho for Reporting and Visualization
– Datameer for spreadsheet-style analytics and visualization
– Clarity and Dell Implementation Services
5. The Problem with Big Data Projects
• Customers want results
– Performance
– Predictability
– Reliability
– Availability
– Management
– Monitoring
• Customers want value
• Big Data has many options
– Servers
– Networking
– Software
– Tools
– Application code
– Fast evolution
• Wide range of applications
6. A Reference Architecture Fills the Gap
• Tested server configurations
• Tested network configurations
• Base software configuration
– Big Data software
– OS infrastructure
– Operational infrastructure
• Predefined configuration
– Recommended starting point
• Patterns, use cases, and best practices are emerging in Big Data
• Reference architectures help package this knowledge for reuse
7. Reference Architecture: Servers
• PowerEdge R720, R720XD
– Balanced compute and storage
• PowerEdge C6105
– Scale-out computing
– Large disk capacity
• PowerEdge C8000
– Scale-out computing
– Flexible configuration
8. Reference Architecture: Networking
[Diagram] Two validated options, 1GbE and 10GbE, with bonded connections and redundant networking:
• Top of rack: Force10 S60 (1GbE) or Force10 S4810 (10GbE)
• Cluster aggregation: Force10 S4810 (both options)
13. Learning the Reference Architecture
• Read it!
– Read it again
– Keep it under your pillow
• Three documents
– Reference Architecture
– Deployment Guide
– User's Guide
• Deploy it
– Works on 4 or 5 nodes
• Available through the Dell Sales Team
14. Leveraging the Reference Architecture
• Start with the base configuration
– It works, and eliminates mix-and-match problems
– There are a lot of subtle details hidden behind the configurations
• Easy changes: processor, memory, disk
– Will generally not break anything
– Will affect performance, however
• Harder changes: Hadoop configuration
– Mainly, you need to know what you're doing here
– We have experience and recommendations
• Hardest changes: optimization for workloads
– The default configuration is a general-purpose one
– Specific workloads must be tested and benchmarked
15. Selecting Processors
• Assume 1.5 Hadoop tasks per physical core
– Turn hyperthreading on
– This allows headroom for other processes
• Configure Hadoop task slots
– 2/3 map tasks
– 1/3 reduce tasks
• Dual-socket, 6-core Xeon example (12 physical cores × 1.5 ≈ 18 task slots; see the sketch after this slide)
› mapred.tasktracker.map.tasks.maximum: 12
› mapred.tasktracker.reduce.tasks.maximum: 6
• Faster is better
– Hadoop compression uses processor cycles
– Most Hadoop jobs are I/O bound, not processor bound
– The map/reduce balance depends on the actual workload
– It's hard to optimize further without knowing the actual workload
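As a rough illustration of the task-slot arithmetic above, here is a minimal Python sketch, assuming the dual-socket, 6-core Xeon node from the example; the function name and script are illustrative only, not part of the reference architecture.

    # Sketch only: derive Hadoop 1.x / CDH4-era task-slot settings from physical core count.
    # Socket and core counts below match the dual-socket, 6-core Xeon example.

    def task_slots(sockets=2, cores_per_socket=6, tasks_per_core=1.5):
        physical_cores = sockets * cores_per_socket          # 12 in the example
        total_slots = int(physical_cores * tasks_per_core)   # 1.5 tasks/core -> 18 slots
        map_slots = round(total_slots * 2 / 3)               # 2/3 map    -> 12
        reduce_slots = total_slots - map_slots               # 1/3 reduce -> 6
        return {
            "mapred.tasktracker.map.tasks.maximum": map_slots,
            "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        }

    for prop, value in task_slots().items():
        print(f"{prop} = {value}")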
16. Spindle / Core / Storage Depth Optimization
• Hadoop scales processing and storage together
– The cluster grows by adding more data nodes
– The ratio of processors to storage is the main adjustment
• Generally, aim for a 1 spindle / 1 core ratio
– I/O is large blocks (64 MB to 256 MB)
– Primarily sequential read/write, very little random I/O
– 8 tasks will be reading or writing 8 individual spindles
• Drive sizes and types
– NL-SAS or enterprise SATA, 6 Gb/s
– Drive size is mainly a price decision
• Depth per node (see the sketch after this slide)
– Up to 48 TB/node is common
– 112 TB/node is possible
– Consider how much data is 'active'
– Very deep storage impacts recovery performance
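To make the 1 spindle / 1 core target and the storage-depth numbers concrete, here is a minimal Python sketch; the 4 TB drive size and 3x HDFS replication factor are assumptions chosen for illustration, not recommendations from the reference architecture.

    # Sketch only: rough per-node capacity and spindle/core ratio for a candidate data node.
    # Drive size and HDFS replication factor below are assumed values for illustration.

    def node_profile(cores, spindles, drive_tb, replication=3):
        raw_tb = spindles * drive_tb        # raw capacity per node
        usable_tb = raw_tb / replication    # rough usable HDFS capacity after replication
        ratio = spindles / cores            # target is roughly 1 spindle per core
        return raw_tb, usable_tb, ratio

    # Example: 12 cores with 12 x 4 TB NL-SAS drives gives the 48 TB/node cited above.
    raw, usable, ratio = node_profile(cores=12, spindles=12, drive_tb=4)
    print(f"raw: {raw} TB, usable at 3x replication: {usable:.1f} TB, spindles/core: {ratio:.2f}")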
18. Workload Optimization: Hadoop Has Widely Varying Workloads
• Workload optimization requires profiling and benchmarking
• HBase and pure MapReduce workloads are different
– I/O patterns are different
– HBase requires more memory
– Cloudera RTQ (Impala) is I/O intensive
• MapReduce usage varies
– From I/O intensive to CPU intensive
• Ingestion and transfer impact the edge (gateway) nodes
• Heterogeneous cluster versus dedicated clusters?
– Cloudera has added support for heterogeneous clusters and nodes
– A dedicated cluster makes sense if the workload is consistent
› Primarily for 'data' businesses
19. Reference Architecture Options
• High availability
– Networking configuration
– Master / secondary NameNode configuration
• Alternative switches
– It's possible
– Contact us for advice
• Cluster size
– The reference architecture scales easily to around 720 nodes
– Beyond that, a network engineer needs to take a closer look
• Node size
– Memory recommendations are a starting point
– Disk / core balance is a never-ending debate
20. In the Wild – Dell Customer Hadoop Configurations
Model         | Data Node Configuration                      | Comments
R720XD        | Dual socket, 12 cores, 24 x 2.5" spindles    | Most popular platform for Hadoop
C8000         | Dual socket, 16 cores, 16 x 3.5" spindles    | Popular for deep/dense Hadoop applications
C6100 / C6105 | Dual socket, 8/12 cores, 12 x 3.5" spindles  | Two-node version; C6100 is hardware EOL
C2100         | Dual socket, 12 cores, 12 x 3.5" spindles    | Popular; hardware EOL but often repurposed for Hadoop
R620          | Dual socket, 8 cores, 10 x 2.5" spindles     | 1U form factor
C6220         | Dual socket, 8 cores, 6 x 2.5" spindles      | Core/spindle ratio is not ideal for Hadoop (see the sketch after this table)
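The comment on the C6220 is easier to see as a number. This hypothetical Python snippet simply computes the spindle-per-core ratio for each configuration in the table above; the core and spindle counts come from the table, everything else is illustrative.

    # Sketch only: spindle-per-core ratio for the data node configurations listed above.
    configs = {
        "R720XD": (12, 24),       # (cores, spindles)
        "C8000": (16, 16),
        "C6100/C6105": (12, 12),  # using the 12-core variant
        "C2100": (12, 12),
        "R620": (8, 10),
        "C6220": (8, 6),
    }

    for model, (cores, spindles) in configs.items():
        print(f"{model}: {spindles / cores:.2f} spindles per core")

    # The C6220 works out to 0.75 spindles per core, below the ~1:1 target from slide 16,
    # which is why its core/spindle ratio is flagged as not ideal for Hadoop.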
21. SecureWorks: Based on the R720XD Reference Architecture
SecureWorks helps protect the security of its customers' assets in real time, 24 hours a day, 365 days a year.
Challenge
• Collecting, processing, and analyzing massive amounts of data from customer environments
Results
• Reduced cost of data storage to ~21 cents per gigabyte
• 80% savings over previous proprietary solution
• 6 months faster deployment
• < 1 year payback on entire investment
• Data doubles every 18 months, magnifying savings
22. Further Information
• Dell Hadoop Home Page
– http://www.dell.com/hadoop
• Dell Cloudera Apache Hadoop install with Crowbar (video)
– http://www.youtube.com/watch?v=ZWPJv_OsjEk
• Cloudera CDH4 Documentation
– http://ccp.cloudera.com/display/CDH4DOC/CDH4+Documentation
• Crowbar homepage and documentation on GitHub
– http://github.com/dellcloudedge/crowbar/wiki
• Open Source Crowbar Installers
– http://crowbar.zehicle.com/