Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters

2. Agenda
Digital Content, today and tomorrow
What is Big Data?
Information as an Asset
A Solution to the Problem
The Moving Bottleneck
Hadoop on 10GbE
Testing Configurations and Objectives
Testing Results
Comparison Analysis – The Tale of the Tape
Q&A
© 2011 Emulex Corporation 2
3. Digital Content – Big Data’s Singularity
A Decade of Digital Universe Growth: Storage in Exabytes
[Chart: storage in exabytes, 2005–2015; y-axis 0–10,000]
Sources of growth:
– Consumer participation
– Photo and video archiving
– eCommerce
– Social media
– Social networking
– Mobile applications
– Search engine indexing
– Web logs
– Medical records
– Financial transactions
– Scientific research
– Surveillance
Source: IDC's Digital Universe Study, sponsored by EMC, June 2011
4. What is Big Data?
Collections of data exceeding the capabilities of traditional
database management tools…
– growing dynamically and incrementally on top of the data that precedes it
– scaling with advances in technology
– from a growing number of sources
Think Big Bang theory…
– but on the order of bytes
Spawning an entire ecosystem of
new technologies and services
– Powerful
– Dynamic
– Scalable
5. Tapping into Information as an Asset
Organizations actively analyze data rather than just store it
Increased Velocity Actionable Data
Larger Volume Competitive Differentiation
Greater Variety Unlocking Value
6. A Solution to the Problem – Hadoop
A powerful, fault-tolerant, self-healing open source
platform enabling distributed computing on commodity
clusters
Scaling to thousands of compute nodes, and efficiently
managing petabytes of data
Leverages two key pieces of technology:
– Hadoop Distributed File System (HDFS)
– Hadoop MapReduce
Capable of being deployed alongside legacy systems
– Enabling old and new data to be combined in powerful ways
– Accessed by data-intensive applications
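The MapReduce model the deck refers to can be illustrated with a minimal, in-memory word-count sketch (plain Python, no Hadoop required; the function names are our own for illustration, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(records):
    # Emit (word, 1) pairs for each input line, like a word-count mapper
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, like a word-count reducer
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

In a real cluster, the map and reduce phases run in parallel across DataNodes, with HDFS providing the distributed storage underneath.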
8. Agenda
Digital Content, today and tomorrow
What is Big Data?
Information as an Asset
A Solution to the Problem
The Moving Bottleneck
Hadoop on 10GbE
Testing Configurations and Objectives
Testing Results
Comparison Analysis – The Tale of the Tape
Q&A
9. The Moving Bottleneck in Hadoop Clusters
Hadoop was designed to run on 1GbE, chosen for its:
– Ubiquity
– Availability
– Cost
Today’s commodity servers deliver astounding performance
gains over their predecessors
Multi-core, multi-threaded processors, fast DDR memory, expanded
memory space, and faster, larger internal system drives have
moved the bottleneck to the legacy 1GbE network
Performance characteristics available on today’s servers:
– Processor (4 cores, 8 threads): 25.6GB/s max. memory bandwidth
– PCIe 3.0 bus: 8GT/s bit rate
– DDR4 memory modules: up to 3,200 MT/s
– Storage: SSDs capable of 6Gb/s; SATA drives capable of 600MB/s
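A back-of-the-envelope comparison makes the bottleneck visible (the ~100MBps-per-disk figure comes from the Q&A later in this deck; the link rates are decimal bit rates with protocol overhead ignored):

```python
# Convert link bit rates to usable byte rates (1 byte = 8 bits)
gbe_1_mbs = 1 * 1000 / 8    # 1GbE  -> 125 MB/s
gbe_10_mbs = 10 * 1000 / 8  # 10GbE -> 1250 MB/s

# Aggregate local disk bandwidth: 6 commodity disks at ~100 MB/s each
disk_mbs = 6 * 100          # 600 MB/s

print(gbe_1_mbs, disk_mbs, gbe_10_mbs)  # 125.0 600 1250.0
# A single 1GbE link (125 MB/s) cannot keep up with the local disks
# (600 MB/s aggregate); a 10GbE link (1250 MB/s) comfortably can.
```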
10. Hadoop Cluster Hardware – Then and Now
4 Processor Generations
DDR2 to DDR3 Transition
Higher Density Drives & SSDs
No Change – 1GbE
11. Hadoop on 10GbE
Network I/O performance must scale with the increase in…
– Processing power
– Memory capacity
– Storage performance
Network performance is essential to support larger and faster
systems
Migrating from a 1GbE to a 10GbE network, leveraging Emulex
OneConnect adapters resulted in a massive performance gain
12. Fine Tuning Hadoop
Hadoop workloads vary greatly
– No “one size fits all” approach
– 200+ cluster-wide and job-specific parameters that can be fine tuned
With the workload variety comes a disparity in the distribution
of resource demands, which can be classed as:
CPU Intensive:
– Machine learning
– Complex data/text mining
– Natural language processing
– Feature extraction

I/O Intensive:
– Indexing
– Searching
– Grouping
– Decoding/decompressing
– Data importing/exporting
13. The Setup
Servers:
– HP ML350 G6
• Dual, quad-core Xeon 2GHz
• 16GB DDR3
• Broadcom 1GbE BCM5715
• Emulex OneConnect 10GbE OCe11102 Ethernet Adapter

Storage:
– SATA II 500GB 7200rpm disk drives, 6 per node
– HP Smart Array G6 RAID controller (JBOD – no RAID configured)

OS and Software:
– Ubuntu 64-bit
– Hadoop (Cloudera Distribution)

Cluster Configuration:
– 15 servers with discrete roles
• 1 NameNode
• 11 DataNodes
• 3 Clients
– 1GbE and 10GbE switches
14. The Setup
[Diagram: cluster topology – NameNode, DataNodes 1–11, and Clients 1–3 connected via a 1Gb switch and a 10Gb switch]
15. Test Objective
Measure HDFS throughput ingesting data into a Hadoop cluster
– Examining multiple client configurations
– Raising HDFS 'put' operations per client
– Transferring a constant 5GB file
– Replication factor set to three
– Duplicated for 1GbE and 10GbE networks

Clients            1                2                  3
DataNodes          11               11                 11
'Put' Operations   1, 2, 4, 6, 8    1, 2, 4, 6, 8      1, 2, 4, 6, 8
Total Operations   1, 2, 4, 6, 8    2, 4, 8, 12, 16    3, 6, 12, 18, 24
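The cluster-wide data volume implied by this setup can be checked with simple arithmetic (5GB per 'put', replication factor 3; the 270GB result matches the "Data Size" quoted in the comparison slides for 3 clients running 6 operations each):

```python
FILE_GB = 5      # constant file size per 'put' operation
REPLICATION = 3  # HDFS replication factor used in the tests

def cluster_data_gb(clients, puts_per_client):
    # Each 'put' ingests one 5GB file; HDFS then stores 3 replicas,
    # so the cluster writes 3x the ingested bytes.
    return clients * puts_per_client * FILE_GB * REPLICATION

print(cluster_data_gb(3, 6))  # 270 (GB) – the comparison-slide data size
print(cluster_data_gb(1, 1))  # 15 (GB) for the single-client, single-put case
```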
16. Test Results – Legacy 1GbE
Data Import – Single Client, Single 'Put' Operation
[Chart: MBps vs. time (sec), 1 operation]
A single client running a single operation makes maximal use of the network.
HDFS efficiently transfers data to DataNodes within the cluster, averaging 108MBps out of the client server.
17. Test Results – Legacy 1GbE
Data Import – Single Client, Multiple 'Put' Operations
[Chart: MBps vs. time (sec); 1, 4, and 8 operations]
When more than one 'put' operation runs on a client, the 1GbE network becomes the bottleneck.
Increasing the number of operations did not increase client throughput, which was restricted by the network connection.
18. Test Results – Legacy 1GbE
Data Import – Multiple Clients, Multiple 'Put' Operations
[Chart: MBps vs. time (sec); 1, 4, and 8 operations]
We expected to observe throughput scale with additional clients.
Combined in and out traffic averaged 225MBps.
19. Test Results – Legacy 1GbE
Data Import – Multiple Clients, Multiple 'Put' Operations
[Chart: MBps vs. time (sec); 1, 4, and 8 operations]
As network load increases, 1GbE quickly reaches saturation and becomes the system bottleneck.
20. Test Results – Emulex OneConnect 10GbE
Data Import – Single Client, Single 'Put' Operation
[Chart: MBps vs. time (sec); 1GbE vs. 10GbE]
Immediate performance improvement of 50% compared to the 1GbE network.
Data transfer completed in less than three quarters of the time.
21. Test Results – Emulex OneConnect 10GbE
Data Import – Single Client, Multiple 'Put' Operations
[Chart: MBps vs. time (sec); 1, 4, and 8 operations]
Increased network load is met with increased throughput.
Achieved transfer rates of 800MBps, nearly 8X the observed throughput of the 1GbE configuration.
22. Test Results – Emulex OneConnect 10GbE
Data Import – Multiple Clients, Multiple 'Put' Operations
[Chart: MBps vs. time (sec); 1, 4, and 8 operations]
Throughput scales as additional clients are brought on-line.
The 10GbE network does not limit transfer rates as the clients and their operations increase.
23. Tale of the Tape – 1GbE vs 10GbE
Maximum Throughput Achieved
[Chart: MBps vs. time (sec); 1G vs. 10G]
Clients: 3
DataNodes: 11
'Put' Operations: 6
Total Operations: 18
Data Size: 270GB
1GbE Max MBps: 250
10GbE Max MBps: 1,674 (6.7X faster)
24. Tale of the Tape – 1GbE vs 10GbE
Average Throughput Achieved
[Chart: MBps vs. number of 'put' operations (1, 2, 4, 8, 12, 18); 1G vs. 10G]
~4X throughput enables more efficient real-time analysis.
Clients: 3
DataNodes: 11
'Put' Operations: 6
Total Operations: 18
Data Size: 270GB
1GbE Avg MBps: 216
10GbE Avg MBps: 831 (3.85X faster)
25. Tale of the Tape – 1GbE vs 10GbE
Time to Completion (seconds)
[Chart: time (sec) vs. number of 'put' operations (1, 2, 4, 8, 12, 18); 1G vs. 10G]
Load times reduced by 75%, improving batch analysis.
Clients: 3
DataNodes: 11
'Put' Operations: 6
Total Operations: 18
Data Size: 270GB
1GbE Completion: 453 seconds
10GbE Completion: 115 seconds (3.94X faster)
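The speedup figures in the comparison slides follow directly from the measured numbers:

```python
def ratio(a, b):
    # Improvement factor: better result divided by baseline, to 2 decimals
    return round(a / b, 2)

max_tp = ratio(1674, 250)  # maximum throughput, 10GbE MBps vs. 1GbE MBps
avg_tp = ratio(831, 216)   # average throughput, 10GbE MBps vs. 1GbE MBps
compl = ratio(453, 115)    # completion time, 1GbE seconds vs. 10GbE seconds
print(max_tp, avg_tp, compl)  # 6.7 3.85 3.94
```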
26. Key Takeaways
Hadoop runs faster with 10G
– Up to 8 times faster in some scenarios
Fine tuning parameters is important for performance
– Improvements may not be possible without proper configuration
Future performance gains are possible
– Hadoop was designed for 1GbE, but small changes will enable the full
potential of 10GbE
Hadoop is better with Emulex OneConnect Ethernet Adapters
– “It just works” – right out of the box
– Leverage our expertise to configure your Hadoop installation for
maximum performance
28. Questions
Which 1GbE and 10GbE switches were included in our tests?
And would we see better performance with a switch that had
lower latency?
We used several different models of Cisco switches, each with
different latency characteristics. We found that latency did not
impact throughput performance in a significant way. In one
case, moving to a switch with twice the latency performance
yielded only a roughly 1% increase in throughput. Within the
construct of our tests, latency was not critical to the
performance results.
29. Questions
Did we find the network being the bottleneck prior to the disk
subsystem becoming the bottleneck?
Yes, and it comes through in our graphs. It is important to note
that at the beginning of our tests we encountered some disk
performance bottlenecks due to configuration issues, which proves
that it is essential to understand the configuration settings of
your Hadoop cluster in order to tap the full potential of your
disks. With commodity disks, the standard performance
characteristic is 100MBps per disk; typical environments have
6 disks per node, totaling 600MBps of performance potential. In
some cases, disk operations do not actually happen and data is
moved from memory to memory, but in most cases data is moved
from disk to disk on different machines, and then disk
performance matters. In our test cases, however, disk
performance was not a bottleneck.
30. Questions
How many 1GbE NICs were used? Were multiple 1GbE NICs
bridged together, or just a single 10GbE NIC?
Our configuration used a single 1GbE NIC with two ports, which is
the typical commodity server configuration. Theoretically, you
can install multiple cards and get better performance, but it is
a more difficult proposition and would cost more than a single
10GbE NIC, aside from the fact that there would likely not be
enough slots on the motherboard to accommodate that many cards.
31. Questions
What is the maximum throughput of 10GbE?
10GbE maximum throughput is 1.25GB/s for a single-direction data
transfer; aggregated with receiving data, the maximum is 2.5GB/s.
Hadoop is not yet designed to accommodate this speed; hopefully
it will be there soon. It is important to mention that most 10GbE
solutions today come with two ports, which means you can achieve
up to 5GB/s. Of course, to leverage that performance you need a
disk sub-system that operates close to that level; where two
10GbE ports are used, that means on the order of 12
high-performance disks. Today that is not necessary because
Hadoop does not use the network efficiently, so even with 6 disks
you will see a significant performance gain.
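The throughput ceilings quoted here are straightforward bit-to-byte arithmetic (decimal units, ignoring Ethernet framing overhead):

```python
LINK_GBPS = 10  # 10GbE line rate in gigabits per second

one_way = LINK_GBPS / 8   # 1.25 GB/s in a single direction
full_duplex = one_way * 2 # 2.5 GB/s sending plus receiving on one port
two_port = full_duplex * 2  # 5.0 GB/s on a dual-port adapter
print(one_way, full_duplex, two_port)  # 1.25 2.5 5.0
```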
32. Questions
Do we have a list of the parameters that need to be tuned within
Hadoop in order to maximize the performance of our 10GbE
NICs?
The settings will vary depending on the environment. There
isn’t a one-size-fits-all approach. Some of these parameters
have been published in our white paper, and we will review that
paper to ensure that all of those parameters are addressed.
33. Questions
Are these results comparable to other 10GbE NICs or is this
something unique to the Emulex technology portfolio?
We included multiple cards from our competitors in this research
project. Emulex cards did offer a performance advantage over the
competition of approximately 10%. The more important observation
was that competitors' cards were more prone to failures: servers
stopped responding, system reboots were needed, and so on. Emulex
cards were far more reliable across the board, which we believe
matters more than fractional performance gains.
34. Questions
If the tests did not saturate the bandwidth of a 1GbE link, is
the performance increase with 10GbE attributable to the "bursty"
nature of the transfer itself?
Hadoop is not optimized for networking, which is why there are
odd observations from time to time. Even on 1GbE connections it
is sometimes possible not to reach 50% of maximum throughput, a
by-product of its design. Hadoop was designed to run multiple
jobs and operations, and in those instances these performance
issues do not manifest themselves.
35. Questions
Would a round-robin bonding configuration be possible with
10GbE, and would there be a performance gain from that?
Theoretically, it is possible. Practically, it is unlikely due to the
underlying disk system becoming the bottleneck (for the
moment). If there are SSDs, or more than 6 disks being
used, there is potential for performance improvement.
36. Questions
Have we run tests with SSDs, higher RPM spindles, or larger
spindle configurations?
Yes, and we encountered some interesting results. While we did
see improvements of approximately 40%, we anticipated much better
results with SSDs. The biggest issue with SSDs is the way Hadoop
interfaces with them: it does not tap the full potential of the
disk. Ultimately, we concluded that throughput, not necessarily
I/O, is the most important factor for performance.