4. Design Choices
1. AWS Elastic MapReduce (EMR) vs. Mesosphere (with Docker)
zero deployment effort
EMR with YARN: scale your cluster dynamically
2. batch-only design
lower-maintenance (no Kafka, no Storm/Spark Streaming)
3. S3+HDFS together
cheap unlimited storage with HDFS batch performance
safely shut down your cluster
5. System Scalability Test With Spot Instances
Slave Nodes   Batch Process Time   Overall Throughput   Cost Rate   Total Cost
5             48 mins              10.4 GB/min          $2.1/h      $1.7
10            32 mins              15.6 GB/min          $3.8/h      $2.0
15            21 mins              23.8 GB/min          $5.5/h      $1.9
Load: +500 GB / +19M videos
Cluster Setup: Master: m3.xlarge
Slave: m2.4xlarge (bid at $0.1/h, down from $1.0/h)
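The throughput and total-cost columns can be reproduced from the batch time, the cost rate, and the load size. The sketch below assumes the load is exactly 500 GB (the slide says "+500 GB", so the real figure is somewhat higher) and that total cost is simply the hourly rate multiplied by the batch duration; the function and variable names are illustrative only.

```python
# Figures copied from the table: (slave nodes, batch minutes, cost rate in $/h).
LOAD_GB = 500  # assumption: the slide states "+500 GB", so this is a lower bound
RUNS = [(5, 48, 2.1), (10, 32, 3.8), (15, 21, 5.5)]


def summarize(nodes, minutes, rate_per_hour, load_gb=LOAD_GB):
    """Derive throughput (GB/min) and total cost ($) for one test run."""
    throughput = load_gb / minutes             # data processed per minute
    total_cost = rate_per_hour * minutes / 60  # hourly rate prorated to the run
    return {
        "nodes": nodes,
        "throughput_gb_min": round(throughput, 1),
        "total_cost_usd": round(total_cost, 1),
    }


summaries = [summarize(*run) for run in RUNS]
for s in summaries:
    print(s)
```

Under these assumptions the derived values agree with the table: tripling the slave count more than doubles throughput while the total cost per batch stays around $2.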
6. Scalability Challenges & Salted Tables
Data Columns: Channel Name, Channel ID, Video Title, Transcript Word Set, Word Frequency
RowKey = SALT+Brand+MetricType+…
…NumMetric+VideoID
SALTING = predefining “table” (region) splits using the SALT
HBase Schema
RowKey0
RowKey1
RowKey2
…
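A minimal sketch of the salting idea, not the deck's actual implementation. It assumes the salt is a bucket number derived from a hash of the video ID, that row-key fields are joined with a `|` delimiter, and that there are 16 buckets; all of these, along with the function names, are hypothetical. The fields elided by "…" in the row-key formula are left out rather than guessed.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count; one region per bucket


def salt_for(video_id, num_buckets=NUM_BUCKETS):
    """Map a video ID to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return digest[0] % num_buckets


def make_row_key(brand, metric_type, num_metric, video_id):
    """Compose SALT + Brand + MetricType + NumMetric + VideoID.

    The source formula elides some fields between MetricType and NumMetric;
    they are omitted here. The numeric metric is zero-padded so that
    lexicographic order matches numeric order.
    """
    salt = salt_for(video_id)
    return f"{salt:02d}|{brand}|{metric_type}|{num_metric:010d}|{video_id}"


def split_points(num_buckets=NUM_BUCKETS):
    """Region split keys to predefine at table creation: one boundary per salt prefix."""
    return [f"{bucket:02d}" for bucket in range(1, num_buckets)]


print(make_row_key("acme", "views", 1234, "abc123"))
print(split_points())
```

Because the salt prefix is spread evenly over the buckets, writes land on all predefined regions instead of piling onto a single one. The trade-off is that a scan for one brand and metric type has to be issued once per salt prefix.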
7. Born in Brazil
Love mixing with different cultures!
Former Data Analytics Engineer
Boston MA
Studying machine learning (part-time)
OMSCS, Georgia Tech
MSc, Geophysics
King Abdullah University of Science and Tech.
BSc, Mechanical Engineering
University of Massachusetts Lowell
8. System Scalability Test With Spot Instances[2]
Core EC2 Instances   Spark Executors   Spark Jobs Time   Batch Process Time   Overall Throughput   Total Cost      Savings
5                    119               23 mins           48 mins (25)         10.4 GB/min          $2.1/h ($1.7)   68%
10                   239               12 mins           32 mins (20)         15.6 GB/min          $3.8/h ($2.0)   70%
15                   359               7 mins            21 mins (14)         23.8 GB/min          $5.5/h ($1.9)   70%
Load: +500 GB / +19M videos
Cluster Setup: Master: m3.xlarge, 15 GB RAM, 13 units, 4 cores
Core: m2.4xlarge, 68 GB RAM, 26 units, 8 cores
Spark Setup: 2 GB Memory/Executor
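The table separates the Spark job time from the full batch time, which makes it possible to see where scaling helps most. The sketch below assumes the parenthesized number in the batch column is the non-Spark remainder (batch time minus Spark jobs time), which matches the arithmetic for all three rows; speedups are taken relative to the 5-instance run. Names are illustrative.

```python
# Figures copied from the table: (core instances, Spark jobs minutes, batch minutes).
SPARK_RUNS = [(5, 23, 48), (10, 12, 32), (15, 7, 21)]


def scaling_report(runs):
    """Compute the non-Spark remainder and speedups relative to the first run."""
    _, base_spark, base_batch = runs[0]
    report = []
    for nodes, spark_min, batch_min in runs:
        report.append({
            "nodes": nodes,
            "other_min": batch_min - spark_min,  # assumed meaning of "(25)", "(20)", "(14)"
            "spark_speedup": round(base_spark / spark_min, 2),
            "batch_speedup": round(base_batch / batch_min, 2),
        })
    return report


report = scaling_report(SPARK_RUNS)
for row in report:
    print(row)
```

On these numbers, going from 5 to 15 core instances speeds up the Spark jobs by roughly 3.3x but the whole batch by only about 2.3x, since the non-Spark portion shrinks more slowly (25 to 14 minutes).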