It is a common belief that Hadoop should run on physical servers. However, this requires a huge up-front capital investment with no guarantee of returns, so things usually end up as "proving big data" with not-that-big data. One way around this dilemma is to run Hadoop in the cloud: with the elasticity that AWS provides, you can spend little but run big!! However, is it really a good idea? In this talk, we try to answer that question, based on the results of a one-year journey with a real application and real big data.
2. Jeff Hung
• Trend Micro
  – Manager of SPN-Infra Team
  – SPN compute/data infrastructure such as Hadoop
• Experience
  – Working with Hadoop since 2009
  – Distributed systems, cloud, and big data: 10+ years
• github.com/jeffhung
4. Wheat and Chessboard Problem
• The ruler of India wanted to reward the wise man who invented the game of chess.
• The wise man asked for just one grain of rice on the first square of the chessboard, double the grains on the second square, and so on…
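The doubling above can be checked directly; a minimal sketch (the function name is illustrative):

```python
def grains_on(square):
    """Grains of rice on a given square (1-indexed): 1, 2, 4, 8, ..."""
    return 2 ** (square - 1)

# Summing all 64 squares gives 2**64 - 1 grains: data that doubles
# per step outgrows any fixed-size "granary" almost immediately.
total_grains = sum(grains_on(s) for s in range(1, 65))
```

The last square alone holds 2**63 grains, which is the point of the parable: exponential data growth quickly dwarfs any up-front capacity plan.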
6. Solutions?
Migrating to another datacenter
• Better infrastructure
• Optimized configuration
• Reduced running cost
Evaluate if AWS is a viable solution
• Much cheaper storage cost
• More elastic than a datacenter
• No more CAPEX burst
Introduced in HadoopCon 2015:
7. Is It Really a Good Idea?
Common belief: a Hadoop cluster running in a virtual environment performs significantly worse than a cluster running on physical machines.
Reference: http://www.cs.wustl.edu/~jain/cse570-13/ftp/bigdatap/index.html
8. Hadoop on AWS: EC2 + EBS
Run the existing SPN Hadoop software stack as-is on EC2, with EBS for persistence.
→ Cost estimation shows it is not practical.

Production workload, with 3-year heavily reserved instances:

EBS IOPS tier | Cost vs. datacenter
300           | 5x
2000          | 9x
4000          | 14x

Cost is too high!!
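The estimate above can be sketched as simple scaling arithmetic. Only the multipliers come from the table; the baseline datacenter cost below is a hypothetical placeholder:

```python
# EBS provisioned-IOPS tier -> estimated cost multiple vs. the datacenter,
# taken from the table above (3-year heavily reserved instances).
COST_MULTIPLIER = {300: 5, 2000: 9, 4000: 14}

def aws_cost_estimate(datacenter_monthly_cost, iops):
    """Estimated AWS monthly cost for a given EBS IOPS tier."""
    return datacenter_monthly_cost * COST_MULTIPLIER[iops]

# Hypothetical baseline of 100,000 (any currency unit):
estimate = aws_cost_estimate(100_000, 2000)  # 9x the baseline
```

Even at the lowest IOPS tier, the 5x multiple is what made EC2 + EBS a non-starter.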
9. Hadoop on AWS: EMR + S3
Use the AWS Elastic MapReduce (EMR) managed service, with data persisted in S3 (EMR for computing, S3 for storage).
Experiments:
1. Benchmarks comparing the current PROD cluster and EMR
2. Evaluate business readiness with a real application
12. Disk I/O Comparison (fio)
• IOPS for sequential access
• IOPS for random access
• Workload: 70% read / 30% write, file size 64 MB
[Chart: IOPS (0–5000) for sequential read, sequential write, random read, and random write, comparing Datacenter, EMR root (/), and EMR SSD (/mnt)]
13. Network I/O Comparison (iperf)
• Run for 30 minutes to see how fast it can go
• Datacenter: cross-rack communication
[Chart: Mbits/sec (0–140) for A → B and B → A, across Datacenter #1, Datacenter #2, EMR #1, EMR #2, and EMR #3]
14. TestDFSIO
• 70 MB, 140 MB, and 1 GB files, based on the datacenter file-size and block-count distribution
• Run with 70 mappers
[Chart: throughput in MBytes/sec (0–100) for 70 MB, 140 MB, and 1 GB files, comparing Datacenter, EMR default, and EMR custom configurations]
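For context on what the chart measures: TestDFSIO-style aggregate throughput is conventionally reported as total bytes moved divided by the sum of the individual task times. A minimal sketch (the numbers below are illustrative, not the deck's measurements):

```python
def dfsio_throughput_mb_s(file_size_mb, n_files, task_times_sec):
    """Aggregate throughput (MB/s): total data over summed task times."""
    return (file_size_mb * n_files) / sum(task_times_sec)

# e.g. 70 mappers, each handling one 140 MB file in roughly 12 s:
tp = dfsio_throughput_mb_s(140, 70, [12.0] * 70)
```

Because the denominator sums all task times, a few slow stragglers drag the reported number down sharply, which is one reason cloud-storage latency outliers matter so much in these benchmarks.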
15. mrbench
• 10 runs, 70 data lines
• 70 maps, 42 reduces

Avg. Time (sec)  | Datacenter | EMR
On Map Tasks     | 2.8        | 4.9
On Shuffle Tasks | 4.6        | 7.3
On Reduce Tasks  | 1.0        | 1.0
Job Running Time | 12.7       | 21.1

EMR is slower than the datacenter.
16. TeraSort
• 70 mappers, 1 reducer
• EMR with local HDFS, S3, and S3-encrypted

Avg. Time (sec)  | Datacenter | EMR: HDFS | EMR: S3 | EMR: S3 Enc
On Map Tasks     | 6.9        | 13.4      | 12.6    | 12.4
On Shuffle Tasks | 45.3       | 38.2      | 40.6    | 40.4
On Reduce Tasks  | 32.7       | 56.0      | 510.8   | 608.6
Job Running Time | 110.8      |           | 755.4   | 836.0

EMR mappers are slower than the datacenter.
17. RandomWriter / RandomTextWriter
• 70 mappers, 1 reducer
• EMR with local HDFS, S3, and S3-encrypted
• Outliers are due to S3 hangs – easy to reproduce
  – After reporting to AWS, this problem has been fixed

Avg. Time (sec)  | Datacenter | EMR: HDFS | EMR: S3 | EMR: S3 Enc
On Map Tasks     | 110.6      | 134.4     | 101.7   | 116.8
Job Running Time | 229.0      | 213.0     | 463.6   | 300.0
18. Observations
• EC2 performs very well
  – Thanks to SSDs
  – Current hardware vs. our 4-year-old cluster
• EMR is unexpectedly slower
  – Not mature enough at the time of testing
  – AWS evolves fast, most of the time
20. PoC with a Real-World Application
• PE file metadata and distribution in the real world
  – Analyzed 500 billion log entries so far
  – Identified 850 million distinct files
  – Serves 750 million requests per day
• One of the biggest applications we have
  – Consumes a huge share of the workload on the primary cluster
  – Validates that AWS is a viable solution in terms of volume
21. Data Processing Flow
[Diagram: Data Ingestion → Hadoop → API Service / Solr Cloud]
• Run hourly and daily jobs to analyze the data, then install the Solr index for real-time query
• Plan: run the analysis and indexing jobs in EMR instead
• Ingestion and query tests are skipped, since that part of the architecture is commonly seen
22. EMR + S3
• Store persistent data in S3 – low cost
• Process in EMR – easy to upgrade
• Allows multiple clusters and resized clusters
[Diagram: EMR v1 and EMR v2 clusters reading from and writing to S3]
23. EMR Instance Groups
• Master node: runs NN, RM; no HA support; no Kerberos security
• Core nodes: run DN, NM; data is volatile; cannot scale in
• Task nodes: run NM only; used to resize the cluster; spot instances!
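The three-group layout above maps directly onto the instance-group structure that EMR's API expects. A minimal sketch of how it might be declared for boto3's `run_job_flow` call; the instance type, counts, and bid price are illustrative placeholders, not the deck's production settings:

```python
def build_instance_groups(core_count, task_count, instance_type="c3.4xlarge"):
    """Return an EMR InstanceGroups list: one master, on-demand core
    nodes for HDFS + compute, and resizable task nodes on spot."""
    return [
        {"Name": "master", "InstanceRole": "MASTER",   # runs NN, RM
         "InstanceType": instance_type, "InstanceCount": 1,
         "Market": "ON_DEMAND"},
        {"Name": "core", "InstanceRole": "CORE",       # runs DN, NM
         "InstanceType": instance_type, "InstanceCount": core_count,
         "Market": "ON_DEMAND"},                       # cannot scale in
        {"Name": "task", "InstanceRole": "TASK",       # runs NM only
         "InstanceType": instance_type, "InstanceCount": task_count,
         "Market": "SPOT", "BidPrice": "0.50"},        # elastic resize
    ]

groups = build_instance_groups(core_count=40, task_count=25)
```

The design point is that only task nodes are safe to shrink, since core nodes hold HDFS blocks, so that is where spot instances and resizing belong.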
24. How to Evaluate?
Optimize across three axes: Scope (features, data to process), Time (processing time), and Cost (resources, money).
• Given the same amount of data and workload…
• Jobs must be finished within their time constraints
• Would the cost be competitive?
25. The Jobs and Time Constraints
• There are 2 hourly jobs and 6 daily jobs
• Find a combination of instance types and EMR cluster size that has a low-enough cost

# | Job         | Program                 | Time Constraint
1 | Hourly jobs | census_hourly.pig       | 55 mins
2 |             | census_index_hourly.pig |
3 | Daily jobs  | census_daily.pig        | 2 hours
4 |             | census_index_daily.pig  | 6 hours
5 |             | vsapi_stats.pig         | 30 mins
6 |             | vsapi_index.pig         | 60 mins
7 |             | vsapi_dname_stats.pig   | 20 mins
8 |             | vsapi_dname_index.pig   | 10 mins
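The evaluation loop implied by the table can be sketched as a simple constraint check: for each candidate cluster configuration, run the jobs and flag any that miss their deadline. The measured times below are hypothetical, not figures from the deck:

```python
# Per-job deadlines from the table above, in minutes.
CONSTRAINTS_MIN = {
    "census_hourly.pig": 55,
    "census_daily.pig": 120,
    "census_index_daily.pig": 360,
    "vsapi_stats.pig": 30,
    "vsapi_index.pig": 60,
    "vsapi_dname_stats.pig": 20,
    "vsapi_dname_index.pig": 10,
}

def violations(measured_min):
    """Jobs whose measured run time (minutes) exceeds the constraint."""
    return [job for job, t in measured_min.items()
            if t > CONSTRAINTS_MIN.get(job, float("inf"))]

# Hypothetical measurements for one cluster configuration:
measured = {"census_hourly.pig": 48, "vsapi_index.pig": 72}
over = violations(measured)  # ["vsapi_index.pig"]
```

A configuration is acceptable only when `violations(...)` is empty; among acceptable configurations, the cheapest one wins.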
27. AWS Bugs Discovered
Confidential | Copyright 2013 TrendMicro Inc.

Job launch / job initialization:
• S3 reading – [Performance] S3 access takes a long time; e.g., some census_hourly MR tasks spend two to three minutes in this part. (Unsolved → Fixed)
• Pig analysis – [Performance] Pig analysis takes a long time; e.g., some census_hourly MR tasks spend four minutes in this part. (Unsolved → Fixed)
• Submit job & assign AM – [Performance] The job already shows on the RM Web UI but has not been assigned to any AM; it may stay pending in this status for about 5 minutes. (Fixed)
Computation:
• Mapper phase – [Performance] Mapper utilization is very low after the job has been initialized. (Fixed)
• Reducer phase – [Performance] Reducers start up too early. (Fixed)
• [Bug] census_daily_index.pig hit the 5 GB upload limit while writing output to S3; the job failed. (Not sure → Fixed)
• [Bug] Most of the index Pig scripts hit a multipleUpload error while writing output to S3; jobs failed on AMI 3.1.1. (Unsolved → Fixed)
Finalization:
• [Performance] Even though all mappers/reducers are finished, the job still seeks through all the S3 files for a long time; e.g., some census_hourly MR tasks spend three to four minutes in this part. (Unsolved → Fixed)
28. The Final Result
• c3.4xlarge
  – 40 core nodes running 24 hr/day
  – 25 task nodes running 2 hr/day
• Cost is only slightly greater than the datacenter cost (and there are other hidden costs in the DC)
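The instance-hour math behind the final cluster shape is straightforward. A back-of-the-envelope sketch; the hourly price is a hypothetical placeholder, and EMR fees, EBS, and S3 storage/request charges are ignored for simplicity:

```python
def monthly_cost(core_nodes, core_hours_per_day,
                 task_nodes, task_hours_per_day,
                 price_per_hour, days=30):
    """EC2 cost of a core + task fleet over one month."""
    core_hours = core_nodes * core_hours_per_day * days  # always-on fleet
    task_hours = task_nodes * task_hours_per_day * days  # burst fleet
    return (core_hours + task_hours) * price_per_hour

# 40 core nodes at 24 hr/day plus 25 task nodes at 2 hr/day:
cost = monthly_cost(40, 24, 25, 2, price_per_hour=0.84)
```

Note how little the task nodes add: 1,500 instance-hours per month versus 28,800 for the core fleet, which is why putting the elastic portion on cheap spot instances barely moves the total.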
29. The Near Future…
Data Center
• TM-Hadoop stack
• Optimized for data apps
• SolrCloud for query
Public Cloud
• Amazon EMR/S3
• Optimized for ad-hoc use
• Big-data query service
Goals: streamlined architecture, end-to-end data processing, light-speed provisioning, flexible scalability
31. On-Premises (Physical) vs. EMR (Virtual)
• The gap is not that big
  – EMR is a good choice for startups
  – There are other benefits, like elasticity
• The key is optimization
  – Engineers' duty and nature is to optimize!!
  – Apps optimized for the DC run costly on AWS
32. AWS Could Be the Way to Go
• AWS evolves fast and listens to customers
  – Lots of issues were fixed during the test period
  – New features are realized when there are true needs
• The AWS model is more flexible in configuration
  – More low-cost options to leverage
  – Less lead time for configuration changes
33. Mindset Change
• Think in terms of business goals
  – Instead of a limited set of performance metrics
  – Hidden issues can then be measured
• Optimize for cost instead of time
  – In the on-premises DC, we optimize for time
  – On AWS, we optimize for cost