1. BIG DATA ON THE CLOUD
@ugurarpaci
@SercanKaraoglu
2. CONTENTS
3V Model
Development & Operational Challenges
Distributed Processing
Hadoop & Spark
AWS Spot Instance Management
Use Case: Apache Zeppelin, Spark
3. WHO WE ARE
Financial Data Provider Merging Different Markets
Applications on Different Platforms (Web, Mobile, Desktop, APIs)
Software Development Team ~50 People, 130 Total
Financial Data Application Management
4. 3V MODEL
VOLUME (high): 90% of the data in the world today has been created over the last two years alone
VELOCITY (high): High data generation speed
VARIETY (high): Data can arrive in any shape or format
5. METADATA, EVENTS, ACTIONS ARE BIG DATA
What you see is not the whole picture!
The tweet delivered to the end user looks roughly like this:
{
  "text": "This is a 140 chars",
  "created_at": <date>,
  "favourited": <boolean>
}
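The fields above are only a sliver of what each tweet actually carries. A hypothetical Scala model of the hidden metadata (field names loosely follow the public Twitter API; this is an illustration, not the exact schema):

```scala
// Hypothetical sketch of the metadata a single tweet drags along;
// field names loosely follow the public Twitter API, not an exact schema.
case class User(id: Long, screenName: String, followersCount: Int)

case class Tweet(
  text: String,        // the ~140 chars the end user actually sees
  createdAt: String,   // timestamp metadata
  favourited: Boolean, // per-viewer interaction state
  retweetCount: Int,   // engagement metadata
  lang: String,        // detected language
  user: User           // the full embedded author record
)

val t = Tweet("This is a 140 chars", "2016-01-01T00:00:00Z",
  favourited = false, retweetCount = 0, lang = "en",
  user = User(42L, "example", 100))
```

Multiply this per-tweet envelope by the event stream, and the metadata alone is big data.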
9. HADOOP - DISTRIBUTED PROCESSING
Hadoop Common: The common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
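To make the MapReduce programming model concrete, here is a toy word count in plain Scala collections, with the map, shuffle (group-by-key), and reduce phases spelled out. This only mimics the model on one machine; it does not use the Hadoop API:

```scala
// Toy word count illustrating the MapReduce phases on local collections.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // Map phase: emit a (word, 1) pair for every word
    val mapped = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
    // Shuffle phase: group all pairs sharing the same key
    val grouped = mapped.groupBy(_._1)
    // Reduce phase: sum the counts for each key
    grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }
}
```

In real Hadoop, the shuffle moves pairs across the network between mapper and reducer nodes; the per-phase logic stays this simple.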
15. RESOURCE MANAGEMENT ON THE CLOUD
Resource
Requirement
Orchestrated Cluster
Management
Accessibility
16. CLOUD STORAGE (AMAZON S3)
Separate compute and storage
Resize and shut down Spark instances (EMR, EC2) with no data loss
Point multiple Spark Clusters at the
same data in S3
Easily evolve your analytic infrastructure
as technology evolves
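One way to wire a cluster to S3 is Hadoop's s3a connector; a sketch of `spark-defaults.conf` entries (the property keys are real S3A settings, the values are placeholders — on EMR/EC2, IAM instance roles usually make the key entries unnecessary):

```
spark.hadoop.fs.s3a.access.key   <YOUR_ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key   <YOUR_SECRET_KEY>
spark.hadoop.fs.s3a.endpoint     s3.amazonaws.com
```

With this in place, any cluster can point `sc.textFile("s3a://...")` at the same bucket, which is what makes compute disposable.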
17. SPOT INSTANCE PROVISIONING PROCESS
Provisioning
Spinning-up
Service Discovery
Service Registry
Data Persistence
18. val conf = new SparkConf()
  .setAppName("Trading Statistics")
  .setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)
[Diagram: each executor runs tasks that read an HDFS/S3 block, process & cache the data, and return results to the driver]
USE CASE: SPARK + ZEPPELIN + S3
val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")
val trades = logFile.filter(line => line.startsWith("t;"))
  .map(toTradeObject)
  .groupBy(_.getSecurityName)
println(trades.count())
19. USE CASE: SPARK + ZEPPELIN + S3
Access logs are uploaded to S3
The Spark cluster pulls the access logs from s3://../../2016/*/*.log.gz
Data engineers write the necessary queries for the marketing department
The marketing department can view and evaluate the resulting analytics graphics and statistics directly in Zeppelin
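The per-security aggregation behind those Zeppelin graphs can be sketched in plain Scala collections (the `"t;"` trade-line format, the field layout, and the `Trade`/`tradesPerSecurity` names here are hypothetical stand-ins for the deck's `toTradeObject` pipeline):

```scala
// Local sketch of the log-crunching step; in production this runs as a
// Spark RDD pipeline over the S3 logs.
case class Trade(securityName: String)

// Assumes ';'-separated lines with the security name in field 1.
def toTradeObject(line: String): Trade =
  Trade(line.split(";")(1))

def tradesPerSecurity(logLines: Seq[String]): Map[String, Int] =
  logLines
    .filter(_.startsWith("t;"))   // keep trade lines only
    .map(toTradeObject)
    .groupBy(_.securityName)      // shuffle-equivalent: group by security
    .map { case (sec, ts) => (sec, ts.size) }
```

Swapping `Seq` for an RDD read from `sc.textFile` gives the distributed version shown on the previous slide.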