This document discusses optimizing Dell PowerEdge server configurations for Hadoop deployments. It recommends tested server configurations, such as the PowerEdge R720 and R720XD, that balance compute and storage, and describes a reference architecture that combines these servers with networking best practices and validated Cloudera software configurations to provide a proven, optimized big data platform.
2. What is Big Data?
• Big Data is when the data itself is part of the problem
• Volume
– A large amount of data, growing at high rates
• Velocity
– The speed at which the data must be processed
• Variety
– The range of data types and data structures
4. Dell | Cloudera Apache Hadoop Solution
• A Proven Big Data Platform
– Cloudera CDH4 Hadoop Distribution with Cloudera Manager
– Validated and Supported Reference Architecture
– Production deployments across all verticals
• Dell Crowbar provides deployment and management at scale
– Integrated with Cloudera Manager
– Bare metal to deployed cluster in hours
– Lifecycle management for ongoing operations
• Dell Partner Ecosystem
– Pentaho for Data Integration
– Pentaho for Reporting and Visualization
– Datameer for spreadsheet-style analytics and visualization
– Clarity and Dell Implementation Services
5. The Problem with Big Data Projects
• Customers want results
– Performance
– Predictability
– Reliability
– Availability
– Management
– Monitoring
• Customers want value
• Big Data has many options
– Servers
– Networking
– Software
– Tools
– Application code
– Fast evolution
• Wide range of applications
6. A Reference Architecture Fills the Gap
• Tested server configurations
• Tested network configurations
• Base software configuration
– Big Data software
– OS infrastructure
– Operational infrastructure
• Predefined configuration
– Recommended starting point
• Patterns, use cases, and best practices are emerging in Big Data
• Reference architectures help package this knowledge for reuse
7. Reference Architecture: Servers
• PowerEdge R720, R720XD
– Balanced compute and storage
• PowerEdge C6105
– Scale-out computing
– Large disk capacity
• PowerEdge C8000
– Scale-out computing
– Flexible configuration
8. Reference Architecture: Networking
[Diagram] Two validated options, 1GbE and 10GbE, with bonded connections and redundant networking:
• Top of rack: Force10 S60 (1GbE) or Force10 S4810 (10GbE)
• Cluster aggregation: Force10 S4810 (both options)
13. Learning the Reference Architecture
• Read it!
– Read it again
– Keep it under your pillow
• Three documents
– Reference Architecture
– Deployment Guide
– User's Guide
• Deploy it
– Works on 4 or 5 nodes
• Available through the Dell Sales Team
14. Leveraging the Reference Architecture
• Start with the base configuration
– It works, and eliminates mix-and-match problems
– There are a lot of subtle details hidden behind the configurations
• Easy changes: processor, memory, disk
– Will generally not break anything
– Will affect performance, however
• Harder changes: Hadoop configuration
– Mainly, you need to know what you're doing here
– We have experience and recommendations
• Hardest changes: optimization for workloads
– The default configuration is a general-purpose one
– Specific workloads must be tested and benchmarked
15. Selecting Processors
• Assume 1.5 Hadoop tasks per physical core
– Turn hyperthreading on
– This allows headroom for other processes
• Configure Hadoop task slots
– 2/3 map tasks
– 1/3 reduce tasks
• Dual-socket, 6-core Xeon example (12 physical cores × 1.5 ≈ 18 task slots; see the sketch after this slide)
› mapred.tasktracker.map.tasks.maximum: 12
› mapred.tasktracker.reduce.tasks.maximum: 6
• Faster is better
– Hadoop compression uses processor cycles
– Most Hadoop jobs are I/O bound, not processor bound
– The map/reduce balance depends on the actual workload
– It's hard to optimize further without knowing the actual workload
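As a rough illustration of the task-slot arithmetic above, here is a minimal Python sketch, assuming the dual-socket, 6-core Xeon node from the example; the function name and script are illustrative only, not part of the reference architecture.

    # Sketch only: derive Hadoop 1.x / CDH4-era task-slot settings from physical core count.
    # Socket and core counts below match the dual-socket, 6-core Xeon example.

    def task_slots(sockets=2, cores_per_socket=6, tasks_per_core=1.5):
        physical_cores = sockets * cores_per_socket          # 12 in the example
        total_slots = int(physical_cores * tasks_per_core)   # 1.5 tasks/core -> 18 slots
        map_slots = round(total_slots * 2 / 3)               # 2/3 map    -> 12
        reduce_slots = total_slots - map_slots               # 1/3 reduce -> 6
        return {
            "mapred.tasktracker.map.tasks.maximum": map_slots,
            "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        }

    for prop, value in task_slots().items():
        print(f"{prop} = {value}")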
16. Spindle / Core / Storage Depth Optimization
• Hadoop scales processing and storage together
– The cluster grows by adding more data nodes
– The ratio of processors to storage is the main adjustment
• Generally, aim for a 1 spindle / 1 core ratio
– I/O is large blocks (64 MB to 256 MB)
– Primarily sequential read/write, very little random I/O
– 8 tasks will be reading or writing 8 individual spindles
• Drive sizes and types
– NL-SAS or enterprise SATA, 6 Gb/s
– Drive size is mainly a price decision
• Depth per node (see the sketch after this slide)
– Up to 48 TB/node is common
– 112 TB/node is possible
– Consider how much data is 'active'
– Very deep storage impacts recovery performance
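To make the 1 spindle / 1 core target and the storage-depth numbers concrete, here is a minimal Python sketch; the 4 TB drive size and 3x HDFS replication factor are assumptions chosen for illustration, not recommendations from the reference architecture.

    # Sketch only: rough per-node capacity and spindle/core ratio for a candidate data node.
    # Drive size and HDFS replication factor below are assumed values for illustration.

    def node_profile(cores, spindles, drive_tb, replication=3):
        raw_tb = spindles * drive_tb        # raw capacity per node
        usable_tb = raw_tb / replication    # rough usable HDFS capacity after replication
        ratio = spindles / cores            # target is roughly 1 spindle per core
        return raw_tb, usable_tb, ratio

    # Example: 12 cores with 12 x 4 TB NL-SAS drives gives the 48 TB/node cited above.
    raw, usable, ratio = node_profile(cores=12, spindles=12, drive_tb=4)
    print(f"raw: {raw} TB, usable at 3x replication: {usable:.1f} TB, spindles/core: {ratio:.2f}")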
18. Workload Optimization: Hadoop Has Widely Varying Workloads
• Workload optimization requires profiling and benchmarking
• HBase and pure MapReduce workloads are different
– I/O patterns are different
– HBase requires more memory
– Cloudera RTQ (Impala) is I/O intensive
• MapReduce usage varies
– From I/O intensive to CPU intensive
• Ingestion and transfer impact the edge (gateway) nodes
• Heterogeneous cluster versus dedicated clusters?
– Cloudera has added support for heterogeneous clusters and nodes
– A dedicated cluster makes sense if the workload is consistent
› Primarily for 'data' businesses
19. Reference Architecture Options
• High availability
– Networking configuration
– Master / secondary NameNode configuration
• Alternative switches
– It's possible
– Contact us for advice
• Cluster size
– The reference architecture scales easily to around 720 nodes
– Beyond that, a network engineer needs to take a closer look
• Node size
– Memory recommendations are a starting point
– Disk / core balance is a never-ending debate
20. In the Wild – Dell Customer Hadoop Configurations
Model         | Data Node Configuration                      | Comments
R720XD        | Dual socket, 12 cores, 24 x 2.5" spindles    | Most popular platform for Hadoop
C8000         | Dual socket, 16 cores, 16 x 3.5" spindles    | Popular for deep/dense Hadoop applications
C6100 / C6105 | Dual socket, 8/12 cores, 12 x 3.5" spindles  | Two-node version; C6100 is hardware EOL
C2100         | Dual socket, 12 cores, 12 x 3.5" spindles    | Popular; hardware EOL but often repurposed for Hadoop
R620          | Dual socket, 8 cores, 10 x 2.5" spindles     | 1U form factor
C6220         | Dual socket, 8 cores, 6 x 2.5" spindles      | Core/spindle ratio is not ideal for Hadoop (see the sketch after this table)
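The comment on the C6220 is easier to see as a number. This hypothetical Python snippet simply computes the spindle-per-core ratio for each configuration in the table above; the core and spindle counts come from the table, everything else is illustrative.

    # Sketch only: spindle-per-core ratio for the data node configurations listed above.
    configs = {
        "R720XD": (12, 24),       # (cores, spindles)
        "C8000": (16, 16),
        "C6100/C6105": (12, 12),  # using the 12-core variant
        "C2100": (12, 12),
        "R620": (8, 10),
        "C6220": (8, 6),
    }

    for model, (cores, spindles) in configs.items():
        print(f"{model}: {spindles / cores:.2f} spindles per core")

    # The C6220 works out to 0.75 spindles per core, below the ~1:1 target from slide 16,
    # which is why its core/spindle ratio is flagged as not ideal for Hadoop.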
21. SecureWorks: Based on the R720XD Reference Architecture
SecureWorks helps protect the security of its customers' assets in real time, 24 hours a day, 365 days a year.
Challenge
• Collecting, processing, and analyzing massive amounts of data from customer environments
Results
• Reduced cost of data storage to ~21 cents per gigabyte
• 80% savings over previous proprietary solution
• 6 months faster deployment
• < 1 year payback on entire investment
• Data doubles every 18 months, magnifying savings
22. Further Information
• Dell Hadoop Home Page
– http://www.dell.com/hadoop
• Dell Cloudera Apache Hadoop install with Crowbar (video)
– http://www.youtube.com/watch?v=ZWPJv_OsjEk
• Cloudera CDH4 Documentation
– http://ccp.cloudera.com/display/CDH4DOC/CDH4+Documentation
• Crowbar homepage and documentation on GitHub
– http://github.com/dellcloudedge/crowbar/wiki
• Open Source Crowbar Installers
– http://crowbar.zehicle.com/