Cloud Computing in the Cloud
Jeff Hung, Trend Micro SPN
November 23, 2015
Jeff Hung
•  Trend Micro
    –  Manager of SPN-Infra Team
    –  SPN compute/data infra like Hadoop
•  Experience
    –  Working with Hadoop since 2009
    –  Distributed System, Cloud, and Big-data: 10+ years
•  github.com/jeffhung
#TrendInsight
Story of the Journey
Wheat and Chessboard problem
•  The ruler of India wanted to reward the wise man who invented the game of chess.
•  The wise man asked for just one grain of rice on the first square of the chessboard, double that on the second square, and so on…
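The doubling in the story is a geometric series, so the total can be worked out directly. The sketch below assumes a standard 64-square board (the slide does not state the board size); the function names are illustrative only.

```python
def grains_on_square(k: int) -> int:
    """Grains on square k (1-based): 1, 2, 4, ... doubling each square."""
    return 2 ** (k - 1)


def total_grains(squares: int = 64) -> int:
    """Sum over the board: 1 + 2 + ... + 2**(squares-1) = 2**squares - 1."""
    return sum(grains_on_square(k) for k in range(1, squares + 1))


if __name__ == "__main__":
    print(total_grains())  # 18446744073709551615
```

The last square alone holds 2**63 grains, more than all previous squares combined, which is the point of the story for any volume that keeps doubling.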
Data Volume Forecast
•  Volume increases 1.5 ~ 2x every year
[Chart: data volume in HDFS over time]
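A rough way to see what 1.5 ~ 2x yearly growth means is to compound it over a few years. This is a minimal sketch: the starting volume of 1.0 is a normalized unit, not a figure from the talk, and the five-year horizon is an arbitrary choice.

```python
def project_volume(current: float, yearly_factor: float, years: int) -> float:
    """Volume after `years` of compounding growth at `yearly_factor` per year."""
    return current * yearly_factor ** years


if __name__ == "__main__":
    # Lower and upper growth bounds quoted on the slide.
    for factor in (1.5, 2.0):
        sizes = [round(project_volume(1.0, factor, y), 2) for y in range(6)]
        print(f"{factor}x per year:", sizes)
```

After five years the same cluster would hold roughly 7.6 to 32 times today's volume, depending on which bound holds.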
  
Solutions?
Migrating to another datacenter
•  Better infrastructure
•  Optimized configuration
•  Reduced running cost
Evaluate if AWS is a viable solution
•  Much cheaper storage cost
•  More elastic than datacenter
•  No more CAPEX burst
Introduced in HadoopCon 2015:
Is it really a good idea?
Common Belief:
A Hadoop cluster running in a virtual environment performs significantly worse than a cluster running on physical machines
Reference: http://www.cs.wustl.edu/~jain/cse570-13/ftp/bigdatap/index.html
Hadoop on AWS: EC2 + EBS
Run the existing SPN Hadoop software stack as is on EC2 with EBS persistence.
→ Cost estimation shows it is not practical
Configuration: production workload with 3-year heavily reserved instances
EBS IOPS    vs. Datacenter
300         5 x
2000        9 x
4000        14 x
Cost is too High!!
Hadoop on AWS: EMR + S3
Use the AWS Elastic MapReduce (EMR) managed service with data persisted in S3.
[Diagram: EMR for computing, S3 for storage]
Experiments:
1.  Benchmark to compare current PROD and EMR
2.  Evaluate business readiness with a real application
Benchmarking
Benchmarks
•  Server/OS Level
    –  Disk I/O (fio)
    –  Network I/O (iperf)
•  Hadoop Level
    –  TestDFSIO
    –  mrbench
    –  TeraSort
    –  RandomWriter / RandomTextWriter
Disk I/O Comparison (fio)
•  IOPS for sequential access
•  IOPS for random access
-  70% Read
-  30% Write
-  File Size: 64 MB
[Chart: IOPS (axis 0 to 5000) for Sequential Read, Sequential Write, Random Read, Random Write; series: Datacenter, EMR root: /, EMR SSD: /mnt]
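The slide gives only the workload mix, not the fio job itself. As a hedged sketch, the stated settings (70% read, 30% write, 64 MB file) could map onto fio's `--rw`, `--rwmixread`, and `--size` options as below. The job names and the helper function are hypothetical, and block size, I/O engine, and queue depth are left at fio defaults because the talk does not mention them.

```python
import shlex


def fio_command(name, rw, directory, size="64m", read_pct=70):
    """Build an fio argument list for one mixed read/write job.

    rw="rw" is a sequential read/write mix, rw="randrw" a random mix.
    """
    return [
        "fio",
        f"--name={name}",
        f"--rw={rw}",
        f"--rwmixread={read_pct}",   # percentage of reads in the mix
        f"--size={size}",            # 64 MB file, as on the slide
        f"--directory={directory}",  # e.g. / or /mnt on an EMR node
        "--output-format=json",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; fio may not be installed.
    for mode in ("rw", "randrw"):
        print(shlex.join(fio_command(f"{mode}-test", mode, "/mnt")))
```

With a mixed job, fio reports read and write IOPS separately, which would match the four bars per target shown in the chart.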
Network I/O Comparison (iperf)
•  Run 30 minutes to see how fast it can be
•  Datacenter: cross-rack communication
[Chart: throughput in Mbits/Sec (axis 0 to 140) for Datacenter #1, Datacenter #2, EMR #1, EMR #2, EMR #3; series: A -> B, B -> A]
TestDFSIO
•  70MB, 140MB, and 1GB files
•  Run in 70 mappers
Based on datacenter file size and block # distribution
[Chart: throughput in MBytes/Sec (axis 0 to 100) for 70 MB, 140 MB, and 1 GB files; series: Datacenter, EMR: default, EMR: custom]
mrbench
•  10 Runs, 70 DataLines
•  70 Maps, 42 Reduces
Avg. Time (sec)     Datacenter   EMR
On Map Tasks        2.8          4.9
On Shuffle Tasks    4.6          7.3
On Reduce Tasks     1.0          1.0
Job Running Time    12.7         21.1
EMR is slower than Datacenter
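To put "slower" into numbers, the mrbench averages can be turned into EMR-to-datacenter ratios. The figures below are copied from the mrbench table; the dictionary and function names are illustrative only.

```python
# (Datacenter seconds, EMR seconds) per row of the mrbench table.
MRBENCH_SEC = {
    "On Map Tasks": (2.8, 4.9),
    "On Shuffle Tasks": (4.6, 7.3),
    "On Reduce Tasks": (1.0, 1.0),
    "Job Running Time": (12.7, 21.1),
}


def slowdown(datacenter: float, emr: float) -> float:
    """How many times longer EMR took than the datacenter cluster."""
    return emr / datacenter


if __name__ == "__main__":
    for row, (dc, emr) in MRBENCH_SEC.items():
        print(f"{row}: {slowdown(dc, emr):.2f}x")
```

The overall job takes about 1.66 times as long on EMR, with the map phase (1.75x) contributing the largest relative gap.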
TeraSort
•  70 Mappers, 1 Reducer
•  EMR with Local HDFS, S3, and S3-Encrypted
Avg. Time (sec)     Datacenter   EMR: HDFS   EMR: S3   EMR: S3 Enc
On Map Tasks        6.9          13.4        12.6      12.4
On Shuffle Tasks    45.3         38.2        40.6      40.4
On Reduce Tasks     32.7         56.0        510.8     608.6
Job Running Time    110.8   755.4   836.0   [only three of the four values survive in the source]
EMR mappers are slower than Datacenter
RandomWriter / RandomTextWriter
•  70 Mappers, 1 Reducer
•  EMR with Local HDFS, S3, and S3-Encrypted
•  Outliers are due to S3 hang – easy to reproduce
    –  After reporting to AWS, this problem has been fixed
Avg. Time (sec)     Datacenter   EMR: HDFS   EMR: S3   EMR: S3 Enc
On Map Tasks        110.6        134.4       101.7     116.8
Job Running Time    229.0        213.0       463.6     300.0
Observations
•  EC2 performs very well
    –  Thanks to SSD
    –  Current hardware vs. 4-year old cluster
•  EMR is unexpectedly slower
    –  Not mature enough at the time of testing
    –  AWS evolves fast, most of the time
PoC: Real Application
PoC with Real World Application
•  PE file metadata and distribution in real world
    –  Analyzed 500 billion log entries so far
    –  Identifies 850 million distinct files
    –  Serves 750 million requests per day
•  One of the biggest applications we have
    –  Accounts for a huge share of the workload in the primary cluster
    –  Validates that AWS is a viable solution in terms of volume
Data Processing Flow
[Diagram components: Hadoop, Data Ingestion, API Service, Solr Cloud]
•  Run hourly and daily jobs to analyze the data. Then install the Solr index for real-time query.
•  Run analysis and indexing jobs in EMR instead.
•  Skip tests since the architecture is commonly seen.
EMR + S3
•  Store persistent data in S3 – low cost
•  Process in EMR – easy to upgrade
•  Allow multiple clusters and resized clusters
[Diagram: EMR v1 and EMR v2 clusters both read from and write to S3]
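EMR Instance Groups
[Diagram: three instance groups inside the AWS Cloud]
•  Master Node
    –  Runs NN, RM (NameNode, ResourceManager)
    –  No HA Support
    –  No KRB (Kerberos) Security
•  Core Nodes
    –  Runs DN, NM (DataNode, NodeManager)
    –  Data is volatile
    –  Cannot scale-in
•  Task Nodes
    –  Runs NM only
    –  Resize cluster
    –  Spot Instance!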
How to evaluate?
[Diagram: triangle of Scope, Time, and Cost]
•  Scope (features, data-to-process): given the same amount of data & workload…
•  Time (processing time): jobs must be finished within time constraints
•  Cost (resource, money): would the cost be competitive?
Optimize for...
The Jobs and Time Constraints
•  There are 2 hourly jobs and 6 daily jobs:
•  Find a combination of instance types and EMR cluster size that has low-enough cost
#   Job           Program                    Time Constraint
1   Hourly Jobs   census_hourly.pig          55 mins (cell spans jobs 1 and 2)
2   Hourly Jobs   census_index_hourly.pig
3   Daily Jobs    census_daily.pig           2 hours
4   Daily Jobs    census_index_daily.pig     6 hours
5   Daily Jobs    vsapi_stats.pig            30 mins
6   Daily Jobs    vsapi_index.pig            60 mins
7   Daily Jobs    vsapi_dname_stats.pig      20 mins
8   Daily Jobs    vsapi_dname_index.pig      10 mins
The combinations that failed
•  It's a trial-and-error process…
(Master, Core, Task: EMR instance groups. #1 to #8: Job Finish Time (min) for each job.)
Test  Master  Core          Task  AMI Version         #1   #2   #3   #4   #5   #6   #7   #8   Special Parameters
1     i2.2xL  i2.2xL * 30   -     3.1.0               74   24                                 slowstart=0.05
2     c3.4xL  c3.4xL * 60   -     (private)           96   32                                 slowstart=0.05
3     i2.2xL  i2.2xL * 30   -     3.1.0               60   23                                 slowstart=0.9
4     c3.4xL  c3.4xL * 60   -     3.1.0               30   19   17   17+  14   24   27   4    slowstart=0.9
5     c3.4xL  c3.4xL * 70   -     3.1.0               35   18                                 slowstart=0.9
6     c3.4xL  c3.4xL * 60   -     3.1.0 (ba73a7d2)    96   29                                 slowstart=0.9
7     c3.4xL  c3.4xL * 60   -     3.1.1 (fcad7f94)    89   67                                 slowstart=0.9
8     c3.4xL  c3.4xL * 60   -     3.1.1 (fcad7f94)    32   16   16   xxx  16   21   10   2    slowstart=0.9, y.s.c.node-locality-delay = -1
9     c3.4xL  c3.4xL * 60   -     3.1.1 (fcad7f94)    30   17   15   95   17   22   17   21   slowstart=0.9, y.s.c.node-locality-delay = -1
10    c3.4xL  c3.4xL * 60   -     3.1.1 (fcad7f94)    26   17                                 slowstart=0.9, y.s.c.node-locality-delay = -1
11    c3.4xL  c3.4xL * 40   -     3.1.1 (fcad7f94)    27   17                                 slowstart=0.9, y.s.c.node-locality-delay = -1
12    c3.4xL  c3.4xL * 100  -     3.1.1 (fcad7f94)    19   16                                 slowstart=0.9, y.s.c.node-locality-delay = -1
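Checking a test run against the time constraints is mechanical, so it can be sketched in a few lines. Two assumptions are made here and are not confirmed by the slides: the shared 55-minute limit is applied to each hourly job separately, and the "xxx" entry in test 8 is treated as a job that did not finish. The names are illustrative only.

```python
# Time limits in minutes, keyed by job number (from the constraints table).
# Assumption: the shared 55-minute cell applies to each hourly job on its own.
CONSTRAINT_MIN = {1: 55, 2: 55, 3: 120, 4: 360, 5: 30, 6: 60, 7: 20, 8: 10}


def violations(finish_min):
    """Return job numbers that exceeded their limit or have no finish time.

    finish_min maps job number -> minutes, or None when no usable time
    was recorded (assumed here to mean the job did not finish).
    """
    late = []
    for job, limit in sorted(CONSTRAINT_MIN.items()):
        t = finish_min.get(job)
        if t is None or t > limit:
            late.append(job)
    return late


# Finish times copied from tests 8 and 9 of the table of failed combinations.
TEST_8 = {1: 32, 2: 16, 3: 16, 4: None, 5: 16, 6: 21, 7: 10, 8: 2}
TEST_9 = {1: 30, 2: 17, 3: 15, 4: 95, 5: 17, 6: 22, 7: 17, 8: 21}

if __name__ == "__main__":
    print("test 8 violations:", violations(TEST_8))  # [4]
    print("test 9 violations:", violations(TEST_9))  # [8]
```

Under these assumptions test 8 fails only on job 4 and test 9 only on job 8 (21 minutes against a 10-minute limit), which shows how close some combinations came before being rejected.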
  	
  
AWS Bugs Discovered
Confidential | Copyright 2013 TrendMicro Inc.
(Columns: Job Stage, Job Stage Action, Issue, Status)
•  Job launch
•  Job initialization
    –  S3 reading: [Performance] S3 access takes a long time; for example, some census_hourly MR jobs spend two to three minutes in this part. Status: Unsolved → Fixed
    –  Pig analysis: [Performance] The Pig analysis takes a long time; for example, some census_hourly MR jobs spend four minutes in this part. Status: Unsolved → Fixed
    –  Submit job & assign AM: [Performance] The job already shows on the RM Web UI but has not been assigned to any AM. It may stay pending in this status for about 5 minutes. Status: Fixed
•  Computation
    –  Mapper phase: [Performance] Mapper utilization is very low while the job has been initialized. Status: Fixed
    –  Reducer phase: [Performance] Reducers start up too early. Status: Fixed
    –  [Bug] Census_daily_index.pig hit the 5G upload limitation while writing output to S3. Job failed. Status: Not sure → Fixed
    –  [Bug] Most of the index Pig scripts hit a multipleUpload error while writing output to S3. Job failed in AMI 3.1.1. Status: Unsolved → Fixed
•  Finalization
    –  [Performance] Even though all mappers/reducers are finished, the job still seeks through all S3 files for a long time. For example, some census_hourly MR jobs spend three to four minutes in this part. Status: Unsolved → Fixed
The final result
•  c3.4xlarge
    –  40 core nodes running 24hr/day
    –  25 task nodes running 2hr/day
Only slightly greater than Datacenter cost
(but there are other hidden costs in DC)
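The final sizing translates into a fixed number of instance-hours per day, which is what the AWS bill scales with. The sketch below only does that arithmetic; the talk gives no prices, so `hourly_rate` is a hypothetical parameter, and reserved or spot pricing differences between core and task nodes are ignored.

```python
# (node count, hours per day) for each instance group of the final setup.
FINAL_SETUP = [
    (40, 24),  # core nodes, always on
    (25, 2),   # task nodes, only during the daily peak
]


def daily_instance_hours(groups) -> int:
    """Total instance-hours per day across all instance groups."""
    return sum(nodes * hours for nodes, hours in groups)


def daily_cost(groups, hourly_rate: float) -> float:
    """Daily cost for a single flat hourly rate (hypothetical input)."""
    return daily_instance_hours(groups) * hourly_rate


if __name__ == "__main__":
    print(daily_instance_hours(FINAL_SETUP))   # 1010
    print(daily_instance_hours([(65, 24)]))    # 1560 if all 65 nodes ran all day
```

Running the 25 task nodes for only 2 hours a day keeps the total at 1010 instance-hours instead of 1560, which is the kind of saving a fixed datacenter cluster cannot capture.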
The Near Future…
Data Center
• TM-Hadoop Stack
• Optimize for Data App
• SolrCloud for Query
Public Cloud
• Amazon EMR/S3
• Optimize for Ad-hoc Use
• Big-data Query Service
[Diagram labels: Streamline Architecture, End-to-end Data Processing, Light-speed Provisioning, Flexible Scalability]
Lessons Learned
On-premises (physical) vs. EMR (virtual)
•  The gap is not that big
    –  EMR is a good choice for startups
    –  There are other benefits like elasticity
•  The key is optimization
    –  Engineers' duty and nature is to optimize!!
    –  Apps optimized for DC are costly to run in AWS
AWS could be the way to go
•  AWS evolves fast and listens to customers
    –  Lots of issues were fixed during the test period
    –  New features are realized if there are true needs
•  AWS model is more flexible in configurations
    –  More low-cost options to leverage
    –  Less lead time for configuration change
Mindset Change
•  Think in terms of business goal
    –  Instead of limited performance metrics
    –  Hidden issues could be measured
•  Optimize for cost instead of time
    –  In on-premises DC we optimize for time
    –  On AWS we optimize for cost
Questions?
THANK YOU~

Similaire à Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)

Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Amazon Web Services
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
DAT202_Getting started with Amazon Aurora
DAT202_Getting started with Amazon AuroraDAT202_Getting started with Amazon Aurora
DAT202_Getting started with Amazon AuroraAmazon Web Services
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuningMongoDB
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...Amazon Web Services
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...DataStax Academy
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 
Scylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
Scylla Summit 2019 Keynote - Dor Laor - Beyond CassandraScylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
Scylla Summit 2019 Keynote - Dor Laor - Beyond CassandraScyllaDB
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...DataStax Academy
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 

Similaire à Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23) (20)

Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
DAT202_Getting started with Amazon Aurora
DAT202_Getting started with Amazon AuroraDAT202_Getting started with Amazon Aurora
DAT202_Getting started with Amazon Aurora
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuning
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Scylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
Scylla Summit 2019 Keynote - Dor Laor - Beyond CassandraScylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
Scylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 

Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)

  • 1. Cloud Computing in the Cloud. Jeff Hung, Trend Micro SPN. November 23, 2015
  • 2. Jeff Hung • Trend Micro – Manager of SPN-Infra Team – SPN compute/data infra like Hadoop • Experience – Has worked with Hadoop since 2009 – Distributed systems, cloud, and big data: 10+ years • github.com/jeffhung
  • 3. #TrendInsight Story of the Journey
  • 4. Wheat and Chessboard problem • The ruler of India wanted to reward the wise man who invented the game of chess. • The wise man asked for just one grain of rice on the first square of the chessboard, double that on the second square, and so on…
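To make the scale of the chessboard story concrete, here is a minimal Python sketch of the arithmetic. It assumes the standard 64-square board and the doubling rule stated on the slide; the function names are illustrative only.

```python
# Wheat and chessboard: square k (1-based) holds 2**(k-1) grains.
def grains_on_square(k: int) -> int:
    """Grains on the k-th square when each square doubles the previous one."""
    return 2 ** (k - 1)


def total_grains(squares: int = 64) -> int:
    """Total grains over the first `squares` squares of the board."""
    return sum(grains_on_square(k) for k in range(1, squares + 1))


if __name__ == "__main__":
    # The sum of a doubling series 1 + 2 + ... + 2**63 equals 2**64 - 1.
    print(total_grains())
```

The total is 2^64 - 1 grains, which is why steady doubling quickly outgrows any fixed capacity.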
  • 5. Data Volume Forecast (chart: data volume in HDFS) • Volume increases 1.5x to 2x every year
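The forecast above is plain compounding. The sketch below projects a relative volume under the two growth factors quoted on the slide; the starting volume of 1.0 and the five-year horizon are assumptions chosen only for illustration, not figures from the talk.

```python
def projected_volume(initial: float, yearly_factor: float, years: int) -> float:
    """Volume after `years` of growth at a constant yearly multiplier."""
    return initial * yearly_factor ** years


if __name__ == "__main__":
    # 1.5x and 2x are the bounds quoted on the slide; 1.0 is a placeholder unit.
    for factor in (1.5, 2.0):
        series = [round(projected_volume(1.0, factor, y), 2) for y in range(6)]
        print(f"{factor}x per year: {series}")
```

After five years the relative volume lands between roughly 7.6x and 32x the starting point.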
  • 6. Solutions? Migrating to another datacenter • Better infrastructure • Optimized configuration • Reduced running cost Evaluate if AWS is a viable solution • Much cheaper storage cost • More elastic than datacenter • No more CAPEX burst Introduced in HadoopCon 2015:
  • 7. Is it really a good idea? Common belief: a Hadoop cluster running in a virtual environment performs significantly worse than a cluster running on physical machines. Reference: http://www.cs.wustl.edu/~jain/cse570-13/ftp/bigdatap/index.html
  • 8. Hadoop on AWS: EC2 + EBS. Run the existing SPN Hadoop software stack as is on EC2 with EBS persistence. → Cost estimation shows it is not practical.
      Configuration: production workload with 3-year heavily reserved instances
      EBS IOPS | vs. Datacenter
      300      | 5x
      2000     | 9x
      4000     | 14x
      Cost is too high!
  • 9. Hadoop on AWS: EMR + S3. Use the AWS Elastic MapReduce (EMR) managed service with data persisted in S3. Experiments: 1. Benchmark to compare current PROD and EMR 2. Evaluate business readiness with a real application. Diagram labels: Computing, Storage
  • 11. Benchmarks • Server/OS Level – Disk I/O (fio) – Network I/O (iperf) • Hadoop Level – TestDFSIO – mrbench – TeraSort – RandomWriter / RandomTextWriter
  • 12. Disk I/O Comparison (fio) • IOPS for sequential access • IOPS for random access. Settings: 70% read, 30% write, file size 64 MB. Chart: IOPS (axis 0 to 5000) for Sequential Read, Sequential Write, Random Read, Random Write; series: Datacenter, EMR root: /, EMR SSD: /mnt
  • 13. Network I/O Comparison (iperf) • Run for 30 minutes to see how fast it can be • Datacenter: cross-rack communication. Chart: Mbits/sec (axis 0 to 140) for Datacenter #1, Datacenter #2, EMR #1, EMR #2, EMR #3; series: A -> B, B -> A
  • 14. TestDFSIO • 70 MB, 140 MB, and 1 GB files (based on datacenter file size and block # distribution) • Run in 70 mappers. Chart: MBytes/sec (axis 0 to 100) for 70 MB, 140 MB, 1 GB; series: Datacenter, EMR: default, EMR: custom
  • 15. mrbench • 10 runs, 70 data lines • 70 maps, 42 reduces
      Avg. Time (sec)  | Datacenter | EMR
      On Map Tasks     | 2.8        | 4.9
      On Shuffle Tasks | 4.6        | 7.3
      On Reduce Tasks  | 1.0        | 1.0
      Job Running Time | 12.7       | 21.1
      EMR is slower than Datacenter
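One way to read the mrbench numbers is as a per-stage slowdown factor (EMR time divided by datacenter time). The sketch below uses only the averages from the slide; the dictionary layout and function name are illustrative assumptions.

```python
# Average times in seconds from the mrbench slide: (datacenter, EMR).
MRBENCH_AVG_SEC = {
    "map": (2.8, 4.9),
    "shuffle": (4.6, 7.3),
    "reduce": (1.0, 1.0),
    "job": (12.7, 21.1),
}


def slowdown(datacenter_sec: float, emr_sec: float) -> float:
    """How many times longer EMR took than the datacenter for one stage."""
    return emr_sec / datacenter_sec


ratios = {
    stage: round(slowdown(dc, emr), 2)
    for stage, (dc, emr) in MRBENCH_AVG_SEC.items()
}

if __name__ == "__main__":
    for stage, ratio in ratios.items():
        print(f"{stage}: {ratio}x")
```

Under these numbers the whole job runs about 1.66 times longer on EMR, with the map stage showing the largest relative gap.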
  • 16. TeraSort • 70 mappers, 1 reducer • EMR with local HDFS, S3, and S3-Encrypted
      Avg. Time (sec), columns: Datacenter / EMR: HDFS / EMR: S3 / EMR: S3 Enc
      On Map Tasks: 6.9 / 13.4 / 12.6 / 12.4
      On Shuffle Tasks: 45.3 / 38.2 / 40.6 / 40.4
      On Reduce Tasks: 32.7 / 56.0 / 510.8 / 608.6
      Job Running Time: 110.8 755.4 836.0
      EMR mappers are slower than Datacenter
  • 17. RandomWriter / RandomTextWriter • 70 mappers, 1 reducer • EMR with local HDFS, S3, and S3-Encrypted • Outliers are due to S3 hangs, which are easy to reproduce – After we reported it to AWS, this problem was fixed
      Avg. Time (sec)  | Datacenter | EMR: HDFS | EMR: S3 | EMR: S3 Enc
      On Map Tasks     | 110.6      | 134.4     | 101.7   | 116.8
      Job Running Time | 229.0      | 213.0     | 463.6   | 300.0
  • 18. Observations • EC2 performs very well – Thanks to SSD – Current hardware vs. a 4-year-old cluster • EMR is unexpectedly slower – Not mature enough at the time of testing – AWS evolves fast, most of the time
  • 19. #TrendInsight PoC: Real Application
  • 20. PoC with a Real-World Application • PE file metadata and distribution in the real world – Analyzed 500 billion log entries so far – Identifies 850 million distinct files – Serves 750 million requests per day • One of the biggest applications we have – Consumes a huge share of the workload in the primary cluster – Validates that AWS is a viable solution in terms of volume
  • 21. Data Processing Flow. Diagram labels: Hadoop, Data Ingestion, API Service, Solr Cloud. Run hourly and daily jobs to analyze the data, then install the Solr index for real-time queries. Run the analysis and indexing jobs in EMR instead. Skip tests, since the architecture is commonly seen.
  • 22. EMR + S3 • Store persistent data in S3 – low cost • Process in EMR – easy to upgrade • Allows multiple clusters and resized clusters. Diagram labels: EMR v1, Read, Write, S3, EMR v2
  • 23. EMR Instance Groups (AWS Cloud)
      Master Node: runs NN, RM; no HA support; no KRB security
      Core Nodes: run DN, NM; data is volatile; cannot scale in
      Task Nodes: run NM only; resize cluster; Spot Instance!
  • 24. How to evaluate? Triangle: Scope (features, data-to-process), Cost (resource, money), Time (processing time). Given the same amount of data and workload… Jobs must be finished within time constraints. Would the cost be competitive? Optimize for...
  • 25. The Jobs and Time Constraints • There are 2 hourly jobs and 6 daily jobs: • Find a combination of instance types and EMR cluster size that has a low-enough cost
      # | Job         | Program                 | Time Constraint
      1 | Hourly Jobs | census_hourly.pig       | 55 mins
      2 | Hourly Jobs | census_index_hourly.pig |
      3 | Daily Jobs  | census_daily.pig        | 2 hours
      4 | Daily Jobs  | census_index_daily.pig  | 6 hours
      5 | Daily Jobs  | vsapi_stats.pig         | 30 mins
      6 | Daily Jobs  | vsapi_index.pig         | 60 mins
      7 | Daily Jobs  | vsapi_dname_stats.pig   | 20 mins
      8 | Daily Jobs  | vsapi_dname_index.pig   | 10 mins
  • 26. The combinations that failed • It's a trial-and-error process…
      Columns: Test # | Master | Core | Task | AMI Version | Job Finish Time (min) for jobs #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | Special Parameters
      1  | i2.2xL | i2.2xL * 30  | - | 3.1.0            | 74 | 24 |    |     |    |    |    |    | slowstart=0.05
      2  | c3.4xL | c3.4xL * 60  | - | (private)        | 96 | 32 |    |     |    |    |    |    | slowstart=0.05
      3  | i2.2xL | i2.2xL * 30  | - | 3.1.0            | 60 | 23 |    |     |    |    |    |    | slowstart=0.9
      4  | c3.4xL | c3.4xL * 60  | - | 3.1.0            | 30 | 19 | 17 | 17+ | 14 | 24 | 27 | 4  | slowstart=0.9
      5  | c3.4xL | c3.4xL * 70  | - | 3.1.0            | 35 | 18 |    |     |    |    |    |    | slowstart=0.9
      6  | c3.4xL | c3.4xL * 60  | - | 3.1.0 (ba73a7d2) | 96 | 29 |    |     |    |    |    |    | slowstart=0.9
      7  | c3.4xL | c3.4xL * 60  | - | 3.1.1 (fcad7f94) | 89 | 67 |    |     |    |    |    |    | slowstart=0.9
      8  | c3.4xL | c3.4xL * 60  | - | 3.1.1 (fcad7f94) | 32 | 16 | 16 | xxx | 16 | 21 | 10 | 2  | slowstart=0.9, y.s.c.node-locality-delay = -1
      9  | c3.4xL | c3.4xL * 60  | - | 3.1.1 (fcad7f94) | 30 | 17 | 15 | 95  | 17 | 22 | 17 | 21 | slowstart=0.9, y.s.c.node-locality-delay = -1
      10 | c3.4xL | c3.4xL * 60  | - | 3.1.1 (fcad7f94) | 26 | 17 |    |     |    |    |    |    | slowstart=0.9, y.s.c.node-locality-delay = -1
      11 | c3.4xL | c3.4xL * 40  | - | 3.1.1 (fcad7f94) | 27 | 17 |    |     |    |    |    |    | slowstart=0.9, y.s.c.node-locality-delay = -1
      12 | c3.4xL | c3.4xL * 100 | - | 3.1.1 (fcad7f94) | 19 | 16 |    |     |    |    |    |    | slowstart=0.9, y.s.c.node-locality-delay = -1
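Checking a test run against the time constraints can be sketched as below. The limits come from the jobs slide and the finish times from tests 4, 8, and 9. Two points are assumptions rather than facts from the talk: the single 55-minute figure is treated as a budget shared by both hourly jobs, and the "17+" and "xxx" cells are treated as runs with no usable finish time.

```python
from typing import List, Optional

# Assumption: the one 55-minute constraint covers hourly jobs #1 and #2 together.
HOURLY_BUDGET_MIN = 55
# Daily job limits in minutes, keyed by job number from the jobs slide.
DAILY_LIMITS_MIN = {3: 120, 4: 360, 5: 30, 6: 60, 7: 20, 8: 10}

# Finish times (minutes) for jobs #1..#8; None marks "17+" or "xxx" cells.
RUNS = {
    4: [30, 19, 17, None, 14, 24, 27, 4],
    8: [32, 16, 16, None, 16, 21, 10, 2],
    9: [30, 17, 15, 95, 17, 22, 17, 21],
}


def violations(times: List[Optional[int]]) -> List[str]:
    """Return a description of every constraint a run fails to meet."""
    problems = []
    hourly = times[0:2]
    if None in hourly:
        problems.append("hourly jobs: no finish time")
    elif sum(hourly) > HOURLY_BUDGET_MIN:
        problems.append(f"hourly jobs: {sum(hourly)} min > {HOURLY_BUDGET_MIN} min")
    for job, limit in DAILY_LIMITS_MIN.items():
        finished = times[job - 1]
        if finished is None:
            problems.append(f"job {job}: no finish time")
        elif finished > limit:
            problems.append(f"job {job}: {finished} min > {limit} min")
    return problems


if __name__ == "__main__":
    for test_id, times in RUNS.items():
        print(test_id, violations(times) or "all constraints met")
```

Under these assumptions each of the three complete runs still misses at least one constraint, which matches the slide's framing of them as failed combinations.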
  • 27. AWS Bugs Discovered (Confidential | Copyright 2013 TrendMicro Inc.)
      Columns: Job Stage | Job Stage Action | Issue | Status
      Job launch
      Job initialization | S3 reading | [Performance] S3 access takes a long time; for example, some census_hourly MR jobs take two to three minutes in this part. | Unsolved → Fixed
      Job initialization | Pig analysis | [Performance] Pig analysis takes a long time; for example, some census_hourly MR jobs take four minutes in this part. | Unsolved → Fixed
      Job initialization | Submit job & assign AM | [Performance] The job already shows on the RM Web UI but has not been assigned to any AM. It may stay pending in this status for about 5 minutes. | Fixed
      Computation | Mapper phase | [Performance] Mapper utilization is very low while the job has been initialized. | Fixed
      Computation | Reducer phase | [Performance] Reducers start up too early. | Fixed
      Computation | Reducer phase | [Bug] Census_daily_index.pig hit the 5G upload limitation while writing output to S3. Job failed. | Not sure → Fixed
      Computation | Reducer phase | [Bug] Most of the index Pig scripts hit a multipleUpload error while writing output to S3. Job failed in AMI 3.1.1. | Unsolved → Fixed
      Finalization | | [Performance] Even though all mappers/reducers are finished, the job still seeks through all S3 files for a long time. For example, some census_hourly MR jobs take three to four minutes in this part. | Unsolved → Fixed
  • 28. The final result • c3.4xlarge – 40 core nodes running 24 hr/day – 25 task nodes running 2 hr/day. Cost is only slightly greater than the datacenter cost (but there are other hidden costs in the DC)
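The final configuration translates into a daily instance-hour count that can be priced out. This is a rough sketch: the node counts and hours are from the slide, while `hourly_price_usd` is a hypothetical parameter (no price appears in the talk), and the master node, S3 storage, and data transfer are deliberately left out.

```python
def daily_instance_hours(
    core_nodes: int = 40,
    core_hours_per_day: float = 24,
    task_nodes: int = 25,
    task_hours_per_day: float = 2,
) -> float:
    """Instance-hours per day for the core and task groups only."""
    return core_nodes * core_hours_per_day + task_nodes * task_hours_per_day


def daily_cost(hourly_price_usd: float) -> float:
    """Daily compute cost at a flat, hypothetical per-instance hourly price."""
    return daily_instance_hours() * hourly_price_usd


if __name__ == "__main__":
    print(daily_instance_hours())          # core + task instance-hours per day
    print(daily_cost(hourly_price_usd=1.0))  # placeholder price, not a real quote
```

The task group adds only 50 of the 1,010 daily instance-hours, which illustrates why short-lived task nodes are a cheap way to absorb the daily job burst.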
  • 29. The Near Future… Data Center • TM-Hadoop Stack • Optimize for Data App • SolrCloud for Query. Public Cloud • Amazon EMR/S3 • Optimize for Ad-hoc Use • Big-data Query Service. Goals: Streamline Architecture, End-to-end Data Processing, Light-speed Provisioning, Flexible Scalability
  • 31. On-premises (physical) vs. EMR (virtual) • The gap is not that big – EMR is a good choice for startups – There are other benefits like elasticity • The key is optimization – Engineers' duty and nature is to optimize!! – Apps optimized for the DC are costly to run in AWS
  • 32. AWS could be the way to go • AWS evolves fast and listens to customers – Lots of issues were fixed during the test period – New features are realized if there are true needs • The AWS model is more flexible in configurations – More low-cost options to leverage – Less lead time for configuration changes
  • 33. Mindset Change • Think in terms of business goals – Instead of limited performance metrics – Hidden issues could be measured • Optimize for cost instead of time – In the on-premises DC we optimize for time – On AWS we optimize for cost