SlideShare a Scribd company logo
1 of 74
Download to read offline
From 0 to Cassandra on AWS in 30 days
Tsunami alerting with Cassandra
Andrei Arion - Geoffrey Bérard - Charles Herriau
@BeastieFurets
1
1 Context
2 The project, step by step
3 Feedbacks
4 Conclusion
2
3
LesFurets.com
•  1st independant insurance aggregator
•  A unique place to compare and buy hundreds of insurance products
•  Key figures:
–  2 500 000 quotes/year
–  31% market share on car comparison (january 2015)
–  more than 50 insurers
4
LesFurets.com
5
@BeastieFurets
6
4 Teams | 25 Engineers | Lean Kanban | Daily delivery
LesFurets.com
7
Bigdata@Telcom ParisTech
8
Context
•  Master Level course
•  Big Data Analysis and Management
–  Non relational Databases
–  30 hours
•  Lectures: 25 %
•  Hand’s on: 75 %
9http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
Context
•  30 students:
–  various backgrounds: students and professionals
–  various skill levels
–  various expertise area: devops, data engineer, marketing
–  SQL and Hadoop exposure
10http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
Main topics:
Project goals
–  Choose the right technology
–  Implement from scratch a full stack solution
–  Optimize data model for a particular use case
–  Deploy on cloud
–  Discuss tradeoffs
One month to deliver
11http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
The project,
Step by step
12
Let’s all become students
13
14
Labor day
1
15
Discovering the subject
16
Discovering the subject
An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast.
Notify all the population within a 500km range as soon as possible.
Constraints:
•  Team work (3 members)
•  Choose a technology seen in the module
•  Deploy on AWS (300€ budget)
•  The closest data center is lost
•  Pre-load the data
2
17
Discovering Data
18
How to get the data ?
3
•  Generated logs from telecom providers
•  One month of position tracking
•  1 Mb, 1 Gb, 10 Gb, 100 Gb files
•  Available on Amazon S3
•  10 biggest cities
•  100 network cells/city
•  1.000.000 different phone numbers
19
Data distribution
4
20
Data format
Timestamp Cell ID Coordinates Phone Number
4
2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
Full dataset (100 GB):
•  240 000 logs per hour / city
•  ~ 2 000 000 000 rows
21
Data distribution
5
22
Data distribution
5
2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737
2015-01-04 17:10:52,423;Tok_14;35.683104;139.755020;731737
2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343
2015-01-04 17:10:53,153;Kyo_06;34.980933;135.777283;431737
...
2015-01-04 17:10:55,928;Yok_99;35.030989;140.126021;829924
23
Data distribution
5
Choosing the stack
24
25
Storage Technology short list
•  Efficient for logs
•  Fault Tolerance
–  Geolocation
–  Easy transform logs into documents
6
26
Install a local Cassandra cluster
7
27
Data Modeling
28
Naive model
Osa_19
2015-05-19 15:25:57,369 2015-05-21 15:00:57,551
456859 012564
CREATE TABLE phones1 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY (cell, instant)
);
9
29
Avoid overwrites
Osa_19
2015-05-19 15:25:57,369 2015-05-21 15:00:57,551
456859, 659842 012564
9
CREATE TABLE phones2 (
cell text,
instant timestamp,
phoneNumbers Set<text>,
PRIMARY KEY (cell, instant)
);
30
Compound partition key
Osa_19 : 2015-05-19 15:25:57,369
456859 659842
- -
10
CREATE TABLE phones3 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY ((cell, instant), phoneNumber)
);
34
Hourly bucketing
Osa_19 : 2015-05-19 15:00
456859 659842
- -
10
CREATE TABLE phones4 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY ((cell, instant), phoneNumber)
);
CREATE TABLE phones5 (
cell text,
instant timestamp,
numbers text,
PRIMARY KEY ((cell, instant))
);
32
Query first
Osa_19, 2015-05-19 15:00
456859,659842
-
11
33
Importing Data
34
Import data into Cassandra:
13
100GB
2 billions rows
?
35
Import data into Cassandra
13
36
Import data into Cassandra: Copy
13
•  Pre-­‐processing:	
  
–  project	
  only	
  the	
  needed	
  informa9on	
  
–  cleanup	
  date	
  format	
  (ISO-8601 vs RFC3339)
37
Import data into Cassandra: Copy
14
100 hours*200 hours*
Pre-process COPY
100GB
2 billions rows
sed 's/,/./g' | awk -F',' '{ printf ""%s",
"%s","%s"",$2,$1,$5;print ""}'
•  Naive	
  parallelized	
  COPY	
  	
  
–  launch	
  one	
  copy	
  per	
  node	
  
–  launch	
  mul9ple	
  COPY	
  threads	
  /	
  node	
  
38
Import data into Cassandra: parallelize Copy
14
pre-processing,
splitting
parallelized copy
100GB
2 billions rows
39
Import data into Cassandra: Copy
15
 
•  What	
  really	
  happens	
  during	
  the	
  import?	
  
•  IMPORT	
  =	
  WRITE 	
  	
  
•  What	
  really	
  happens	
  during	
  a	
  WRITE	
  
	
  
40
How to speed up Cassandra import?
16
41
Cassandra Write Path Documentation
17
•  How	
  to	
  speed	
  up	
  Cassandra	
  import	
  =>	
  SSTableLoader	
  	
  
–  generate	
  SSTables	
  using	
  the	
  SSTableWriter	
  API	
  
–  stream	
  the	
  SSTables	
  to	
  your	
  Cassandra	
  cluster	
  using	
  sstableloader	
  
42
SSTableLoader
17
SSTableWriter sstableloader
100GB
2 billions rows
300 hours* 2 hours*
•  S3	
  is	
  distributed!	
  
–  op9mized	
  for	
  high-­‐throughput	
  when	
  used	
  in	
  parallel	
  
–  very	
  high	
  throughput	
  to/from	
  EC2	
  	
  
43
Import data into Cassandra from Amazon S3
17
©2011 http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
•  S3	
  is	
  a	
  distributed	
  file	
  system	
  
–  op9mized	
  for	
  high-­‐throughput	
  when	
  used	
  in	
  parallel	
  
•  Use	
  Spark	
  as	
  ETL	
  
44
Import data into Cassandra from Amazon S3
18
S3
100GB
2 billions rows
2 billions
inserts
•  use	
  Spark	
  to	
  read/pre-­‐process/write	
  to	
  Cassandra	
  in	
  parallel	
  
•  map	
  
	
  
45
Using Spark to parallelize the insert
18
S3
100GB
2 billions rows
2 billions
inserts
•  instead	
  of	
  impor9ng	
  all	
  the	
  lines	
  from	
  the	
  csv,	
  use	
  Spark	
  to	
  reduce	
  the	
  lines	
  
•  map+reduce	
  
	
  
46
Cheating / Modeling before inserting
18
S3
100GB
2 billions rows
2 billions
inserts
•  instead	
  of	
  impor9ng	
  all	
  the	
  lines	
  from	
  the	
  csv,	
  use	
  Spark	
  to	
  reduce	
  the	
  lines	
  
•  map+reduce	
  
	
  
47
Cheating / Modeling before inserting
19
S3
100GB
2 billions rows
1 million
inserts
48
Using AWS
49
Choosing the right AWS Instance type
22
50
Shopping on Amazon
•  What can 300E get you: xHours of y Instance Types + Z EBS
22
Good starting point: 6 analytics nodes, M3.large or M3.xlarge, SSD
Finally : not so difficult to choose
51
23
Capacity planning
52
•  Avantages
–  install an open source stack
–  full control
–  latest version
•  Inconvenients
–  devops skills required
–  lot of configuration to do
–  version compatibility
–  time / budget consuming
Install from scratch
24
•  DataStax Enterprise AMI : 1 analytics node shared between Cassandra and Spark
53
Start small
25
•  Requirement:
–  Preload data
•  Goal:
–  Minimize cluster uptime
•  Strategies:
–  use EBS -> persistent storage (expensive, slow)
–  load the dataset on a node and turn off the others (free, time consuming)
–  do everything “the night before” (free, risky)
54
Optimizing budget
26
55
Preparing for D-Day
•  Preload the data into Cassandra on the AWS cluster
•  Input: earthquake date and location
•  Kill the closest node
•  Start simulation
–  monitor in realtime the alerting performance
•  Tuning (model, import)
Lather, rinse, repeat...
56
Preparing for the D Day
28
•  Alternatives:
–  stop/kill the Cassandra process
–  turn off the gossip
–  AWS instance shutdown
•  The effect is the same
57
How to stop a specific node of the cluster ?
29
58
Pre-loading data
29
59
Waiting our turn
30
Conclusion
60
61
62
Data Model
No model
Use Input File Model
Query First
Optimized Query First
Smart Cheaters
0 1 2 3 4
63
Modelling in SQL vs CQL
●  SQL => CQL : state of mind shifting
○  3NF vs denormalization
○  model first vs query first
●  SQL and CQL are syntactically very similar that can be confusing
○  no Joins, range queries on partition keys...
64
Modeling data: cqlsh vs cli
65
Importing Data
0 Mo
10 Mo
1,000 Mo
100,000 Mo
66
Development
Notification System
Visualisation System
Real Time Notification
Killing node
Manage Aftershoks
0 1 2 3 4 5 6 7 8 9 10
Yes No
67
Architecture
70%
10%
20%
Cassandra
MongoDB
Both
68
Amazon Web Services
80%
20%
Yes
No
69
Quality
100%
Yes
No
70
What is the main problem ?
Modeling Data
Importing Data
Architecture
Development
Deployment
Quality
=> Multidisciplinary Approach
71
Multidisciplinary approach
Team 1 2 3 4 5 6 7 8 9 10
Modeling Data
√	
   √	
   √	
   √	
   √	
   √	
   √	
   √	
  
Importing Data
√	
   √	
   √	
  
Architecture
√	
   √	
   √	
   √	
   √	
   √	
  
Development
√	
   √	
   √	
   √	
   √	
   √	
  
Deployment
√	
   √	
   √	
   √	
   √	
   √	
   √	
   √	
  
Quality
√	
   √	
   √	
   √	
  
Grade 8 10 12 13 13 14 14.5 16 16.5 17
72
Team Skills Balance
0
1
2
3
4
5
Data Modeling
Data import
ArchitectureDevelopment
Deployment
17/20
14/20
08/20
Conclusion
•  Easy to learn and easy to teach
•  Easy to deploy on AWS
•  Performance monitor for tuning the models
•  Passing from SQL vs CQL can be challenging
•  Fun to implement on a concrete use-case with time/budget constraints
73
74
Thank you
Andrei Arion - aar@lesfurets.com
Geoffrey Bérard - gbe@lesfurets.com
Charles Herriau - che@lesfurets.com
@BeastieFurets

More Related Content

What's hot

Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Julia Angell
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendsDataStax Academy
 
Measuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesMeasuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesScyllaDB
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAwaretwilmes
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkDatabricks
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
Understanding Storage I/O Under Load
Understanding Storage I/O Under LoadUnderstanding Storage I/O Under Load
Understanding Storage I/O Under LoadScyllaDB
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleEric Lubow
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexesDaniel Lemire
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kitehuguk
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaminghuguk
 

What's hot (20)

Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
 
Measuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesMeasuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS Instances
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
Understanding Storage I/O Under Load
Understanding Storage I/O Under LoadUnderstanding Storage I/O Under Load
Understanding Storage I/O Under Load
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary Tale
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
 

Similar to LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français)

[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Crate.io
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Dataconomy Media
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerAndrew Yongjoon Kong
 
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackAdam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackShapeBlue
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...DataStax Academy
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsJulien Anguenot
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot modePrakash Chockalingam
 
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...RightScale
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
 
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Coburn Watson
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 

Similar to LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français) (20)

[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
Dice presents-feb2014
Dice presents-feb2014Dice presents-feb2014
Dice presents-feb2014
 
Self-Service Supercomputing
Self-Service SupercomputingSelf-Service Supercomputing
Self-Service Supercomputing
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackAdam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français)

  • 1. From 0 to Cassandra on AWS in 30 days Tsunami alerting with Cassandra Andrei Arion - Geoffrey Bérard - Charles Herriau @BeastieFurets 1
  • 2. 1 Context 2 The project, step by step 3 Feedbacks 4 Conclusion 2
  • 3. 3
  • 4. LesFurets.com •  1st independant insurance aggregator •  A unique place to compare and buy hundreds of insurance products •  Key figures: –  2 500 000 quotes/year –  31% market share on car comparison (january 2015) –  more than 50 insurers 4
  • 6. @BeastieFurets 6 4 Teams | 25 Engineers | Lean Kanban | Daily delivery
  • 9. Context •  Master Level course •  Big Data Analysis and Management –  Non relational Databases –  30 hours •  Lectures: 25 % •  Hand’s on: 75 % 9http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
  • 10. Context •  30 students: –  various backgrounds: students and professionals –  various skill levels –  various expertise area: devops, data engineer, marketing –  SQL and Hadoop exposure 10http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html Main topics:
  • 11. Project goals –  Choose the right technology –  Implement from scratch a full stack solution –  Optimize data model for a particular use case –  Deploy on cloud –  Discuss tradeoffs One month to deliver 11http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
  • 13. Let’s all become students 13
  • 16. 16 Discovering the subject An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast. Notify all the population within a 500km range as soon as possible. Constraints: •  Team work (3 members) •  Choose a technology seen in the module •  Deploy on AWS (300€ budget) •  The closest data center is lost •  Pre-load the data 2
  • 18. 18 How to get the data ? 3 •  Generated logs from telecom providers •  One month of position tracking •  1 Mb, 1 Gb, 10 Gb, 100 Gb files •  Available on Amazon S3
  • 19. •  10 biggest cities •  100 network cells/city •  1.000.000 different phone numbers 19 Data distribution 4
  • 20. 20 Data format Timestamp Cell ID Coordinates Phone Number 4 2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
  • 21. Full dataset (100 GB): •  240 000 logs per hour / city •  ~ 2 000 000 000 rows 21 Data distribution 5
  • 22. 22 Data distribution 5 2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924 2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737 2015-01-04 17:10:52,423;Tok_14;35.683104;139.755020;731737 2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343 2015-01-04 17:10:53,153;Kyo_06;34.980933;135.777283;431737 ... 2015-01-04 17:10:55,928;Yok_99;35.030989;140.126021;829924
  • 25. 25 Storage Technology short list •  Efficient for logs •  Fault Tolerance –  Geolocation –  Easy transform logs into documents 6
  • 26. 26 Install a local Cassandra cluster 7
  • 28. 28 Naive model Osa_19 2015-05-19 15:25:57,369 2015-05-21 15:00:57,551 456859 012564 CREATE TABLE phones1 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY (cell, instant) ); 9
  • 29. 29 Avoid overwrites Osa_19 2015-05-19 15:25:57,369 2015-05-21 15:00:57,551 456859, 659842 012564 9 CREATE TABLE phones2 ( cell text, instant timestamp, phoneNumbers Set<text>, PRIMARY KEY (cell, instant) );
  • 30. 30 Compound partition key Osa_19 : 2015-05-19 15:25:57,369 456859 659842 - - 10 CREATE TABLE phones3 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY ((cell, instant), phoneNumber) );
  • 31. 34 Hourly bucketing Osa_19 : 2015-05-19 15:00 456859 659842 - - 10 CREATE TABLE phones4 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY ((cell, instant), phoneNumber) );
  • 32. CREATE TABLE phones5 ( cell text, instant timestamp, numbers text, PRIMARY KEY ((cell, instant)) ); 32 Query first Osa_19, 2015-05-19 15:00 456859,659842 - 11
  • 34. 34 Import data into Cassandra: 13 100GB 2 billions rows ?
  • 35. 35 Import data into Cassandra 13
  • 36. 36 Import data into Cassandra: Copy 13
  • 37. •  Pre-­‐processing:   –  project  only  the  needed  informa9on   –  cleanup  date  format  (ISO-8601 vs RFC3339) 37 Import data into Cassandra: Copy 14 100 hours*200 hours* Pre-process COPY 100GB 2 billions rows sed 's/,/./g' | awk -F',' '{ printf ""%s", "%s","%s"",$2,$1,$5;print ""}'
  • 38. •  Naive  parallelized  COPY     –  launch  one  copy  per  node   –  launch  mul9ple  COPY  threads  /  node   38 Import data into Cassandra: parallelize Copy 14 pre-processing, splitting parallelized copy 100GB 2 billions rows
  • 39. 39 Import data into Cassandra: Copy 15
  • 40.   •  What  really  happens  during  the  import?   •  IMPORT  =  WRITE     •  What  really  happens  during  a  WRITE     40 How to speed up Cassandra import? 16
  • 41. 41 Cassandra Write Path Documentation 17
  • 42. •  How  to  speed  up  Cassandra  import  =>  SSTableLoader     –  generate  SSTables  using  the  SSTableWriter  API   –  stream  the  SSTables  to  your  Cassandra  cluster  using  sstableloader   42 SSTableLoader 17 SSTableWriter sstableloader 100GB 2 billions rows 300 hours* 2 hours*
  • 43. •  S3  is  distributed!   –  op9mized  for  high-­‐throughput  when  used  in  parallel   –  very  high  throughput  to/from  EC2     43 Import data into Cassandra from Amazon S3 17 ©2011 http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
  • 44. •  S3  is  a  distributed  file  system   –  op9mized  for  high-­‐throughput  when  used  in  parallel   •  Use  Spark  as  ETL   44 Import data into Cassandra from Amazon S3 18 S3 100GB 2 billions rows 2 billions inserts
  • 45. •  use  Spark  to  read/pre-­‐process/write  to  Cassandra  in  parallel   •  map     45 Using Spark to parallelize the insert 18 S3 100GB 2 billions rows 2 billions inserts
  • 46. •  instead  of  impor9ng  all  the  lines  from  the  csv,  use  Spark  to  reduce  the  lines   •  map+reduce     46 Cheating / Modeling before inserting 18 S3 100GB 2 billions rows 2 billions inserts
  • 47. •  instead  of  impor9ng  all  the  lines  from  the  csv,  use  Spark  to  reduce  the  lines   •  map+reduce     47 Cheating / Modeling before inserting 19 S3 100GB 2 billions rows 1 million inserts
  • 49. 49 Choosing the right AWS Instance type 22
  • 50. 50 Shopping on Amazon •  What can 300E get you: xHours of y Instance Types + Z EBS 22
  • 51. Good starting point: 6 analytics nodes, M3.large or M3.xlarge, SSD Finally : not so difficult to choose 51 23 Capacity planning
  • 52. 52 •  Avantages –  install an open source stack –  full control –  latest version •  Inconvenients –  devops skills required –  lot of configuration to do –  version compatibility –  time / budget consuming Install from scratch 24
  • 53. •  DataStax Enterprise AMI : 1 analytics node shared between Cassandra and Spark 53 Start small 25
  • 54. •  Requirement: –  Preload data •  Goal: –  Minimize cluster uptime •  Strategies: –  use EBS -> persistent storage (expensive, slow) –  load the dataset on a node and turn off the others (free, time consuming) –  do everything “the night before” (free, risky) 54 Optimizing budget 26
  • 56. •  Preload the data into Cassandra on the AWS cluster •  Input: earthquake date and location •  Kill the closest node •  Start simulation –  monitor in realtime the alerting performance •  Tuning (model, import) Lather, rinse, repeat... 56 Preparing for the D Day 28
  • 57. •  Alternatives: –  stop/kill the Cassandra process –  turn off the gossip –  AWS instance shutdown •  The effect is the same 57 How to stop a specific node of the cluster ? 29
  • 61. 61
  • 62. 62 Data Model No model Use Input File Model Query First Optimized Query First Smart Cheaters 0 1 2 3 4
  • 63. 63 Modelling in SQL vs CQL ●  SQL => CQL : state of mind shifting ○  3NF vs denormalization ○  model first vs query first ●  SQL and CQL are syntactically very similar that can be confusing ○  no Joins, range queries on partition keys...
  • 65. 65 Importing Data 0 Mo 10 Mo 1,000 Mo 100,000 Mo
  • 66. 66 Development Notification System Visualisation System Real Time Notification Killing node Manage Aftershoks 0 1 2 3 4 5 6 7 8 9 10 Yes No
  • 70. 70 What is the main problem ? Modeling Data Importing Data Architecture Development Deployment Quality => Multidisciplinary Approach
  • 71. 71 Multidisciplinary approach Team 1 2 3 4 5 6 7 8 9 10 Modeling Data √   √   √   √   √   √   √   √   Importing Data √   √   √   Architecture √   √   √   √   √   √   Development √   √   √   √   √   √   Deployment √   √   √   √   √   √   √   √   Quality √   √   √   √   Grade 8 10 12 13 13 14 14.5 16 16.5 17
  • 72. 72 Team Skills Balance 0 1 2 3 4 5 Data Modeling Data import ArchitectureDevelopment Deployment 17/20 14/20 08/20
  • 73. Conclusion •  Easy to learn and easy to teach •  Easy to deploy on AWS •  Performance monitor for tuning the models •  Passing from SQL vs CQL can be challenging •  Fun to implement on a concrete use-case with time/budget constraints 73
  • 74. 74 Thank you Andrei Arion - aar@lesfurets.com Geoffrey Bérard - gbe@lesfurets.com Charles Herriau - che@lesfurets.com @BeastieFurets