Real-time analytics on transactional data
= Cassandra + Spark 20/02/15
Victor Coustenoble, Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics
How do you use Cassandra?
By monitoring your energy consumption
By streaming films
By browsing websites
By shopping online
By paying with a smartphone
By playing very well-known video games
• Collections/Playlists
• Recommendation/Personalization
• Fraud Detection
• Messaging
• Connected Objects (IoT)
Overview
Founded in April 2010
Santa Clara, Austin, New York, London, Paris, Sydney
Figures on the slide: ~35, 500+, 400+ (labels: Employees, Percent, Customers)
Straightening the road
Cassandra / DataStax | RELATIONAL DATABASES
CQL | SQL
OpsCenter / DevCenter | Management tools
DSE for search & analytics | Integration
Security | Security
Support, consulting & training | 30-year ecosystem
Apache Cassandra™
• Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, mission-critical online applications that must scale massively.
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless (peer-to-peer), with no single point of failure (no SPOF)
• Distributed, optionally across data centers
• 100% availability
• Massively scalable
• Linear scalability
• High performance (reads AND writes)
• Multi-data-center
• Simple to operate
• CQL language (similar to SQL)
• OpsCenter / DevCenter tools
Dynamo
BigTable
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[Diagram: a ring of five nodes, Node 1 to Node 5]
High Availability and Consistency
• The failure of a single node must not bring down the system
• Consistency is chosen at the client level
• Replication Factor (RF) + Consistency Level (CL) = Success
• Example:
• RF = 3
• CL = QUORUM (a strict majority of the replicas, i.e. 2 of 3)
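To make the QUORUM arithmetic concrete, here is a minimal sketch in plain Python (no Cassandra required). The function name `quorum` is only illustrative; the formula, the integer part of RF/2 plus one, is the usual definition of a quorum for a single data center.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at CL=QUORUM (single data center)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Integer division: a strict majority of the replicas.
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    needed = quorum(rf)
    # Replicas that can be down while QUORUM requests still succeed.
    print(f"RF={rf}: quorum={needed}, tolerated replica failures={rf - needed}")
```

With RF = 3 the quorum is 2, so one replica can be unavailable without failing QUORUM reads or writes.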
©2014 DataStax Confidential. Do not distribute without consent. 7
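[Diagram: a write at CL=QUORUM is sent in parallel to the three replicas in a five-node ring (Node 1: 1st copy, Node 2: 2nd copy, Node 3: 3rd copy); acknowledgements arrive after 5 μs, 12 μs and 12 μs]
More than half of the replicas have responded, so the request succeeds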
CL(Read) + CL(Write) > RF => immediate (strong) consistency
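The rule above can be checked with a small Python sketch. The helper names and the handful of consistency levels covered here are assumptions made for illustration; only the inequality itself comes from the slide.

```python
# Replicas that must respond for a few common consistency levels,
# as a function of the replication factor (RF).
REPLICAS_REQUIRED = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def replicas_required(level: str, rf: int) -> int:
    """Replicas that must answer a request at the given consistency level."""
    return REPLICAS_REQUIRED[level](rf)


def is_strongly_consistent(read_cl: str, write_cl: str, rf: int) -> bool:
    """True when every read set overlaps every write set: R + W > RF."""
    return replicas_required(read_cl, rf) + replicas_required(write_cl, rf) > rf


print(is_strongly_consistent("QUORUM", "QUORUM", 3))  # 2 + 2 > 3
print(is_strongly_consistent("ONE", "ONE", 3))        # 1 + 1 > 3 does not hold
```

Reading at ONE and writing at ALL also satisfies the inequality, at the cost of write availability.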
DataStax Enterprise
Certified, enterprise-ready Cassandra
[Diagram labels: Security, Analytics, Search, Visual Monitoring, Management Services, In-Memory, Dev. IDE & Drivers, Professional Services, Support & Training]
Confidence in use
Enterprise features
DataStax Enterprise - Analytics
• Designed to run analytics on Cassandra data
• There are 4 ways to run analytics on Cassandra data:
1. Search (Solr)
2. Batch analytics (Hadoop)
3. Batch analytics with external tools (Cloudera, Hortonworks)
4. Real-time analytics
Partnership
Why Spark on Cassandra?
• Analytics on transactional data and operational applications
• Data model independent queries
• Cross-table operations (JOIN, UNION, etc.)
• Complex analytics (e.g. machine learning)
• Data transformation, aggregation, etc.
• Stream processing
• Better performance than Hadoop MapReduce
Real-time Big Data
[Diagram: data enrichment, batch processing, machine learning and pre-computed aggregates applied to the data, with NO ETL]
Real-Time Big Data Use Cases
• Recommendation Engine
• Internet of Things
• Fraud Detection
• Risk Analysis
• Buyer Behaviour Analytics
• Telematics, Logistics
• Business Intelligence
• Infrastructure Monitoring
• …
Spark Components
• Shark or Spark SQL: structured data
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph
• Spark (general execution engine)
• Cassandra-compatible
Resource isolation
[Diagram: each node runs Cassandra alongside a Spark Worker (JVM) with its Executors]
DSE Spark Integration Architecture
[Diagram: Nodes 1 to 4 each run Cassandra and a Spark Worker (JVM) with its Executors; the cluster also has a Spark Master (JVM) and an App Driver]
Spark Cassandra Connector
[Diagram: inside a Spark Executor, the User Application sits on the Spark-Cassandra Connector, which uses the C* Java Driver to reach the Cassandra (C*) nodes]
Cassandra Spark Driver
• Cassandra tables exposed as Spark RDDs
• Load data from Cassandra to Spark
• Write data from Spark to Cassandra
• Object mapper: mapping of C* tables and rows to Scala objects
• Type conversions: all Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual Nodes support
• Scala and Java APIs
DSE Spark Interactive Shell
$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.
scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD[com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48
scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo})
cqlsh> select * from test.kv;
 k | v
---+-----
 1 | foo
(1 rows)
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
// Spark connection options (property names as used by connector 1.x)
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)
Reading Data
val table = sc
  .cassandraTable[CassandraRow]("db", "tweets") // row representation, keyspace, table
  .select("user_name", "message")               // server-side column selection
  .where("user_name = ?", "ewa")                // server-side row selection
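To illustrate what server-side selection means, the Python sketch below builds the kind of CQL statement that a `select(...)` plus `where(...)` chain corresponds to. `build_select` is a hypothetical helper written for this example; the real connector generates its own queries, which this sketch does not attempt to reproduce.

```python
def build_select(keyspace, table, columns=(), where=None):
    """Build a CQL SELECT so that filtering happens in Cassandra, not in Spark.

    Hypothetical helper for illustration only.
    """
    column_list = ", ".join(columns) if columns else "*"
    cql = f"SELECT {column_list} FROM {keyspace}.{table}"
    if where:
        # The '?' placeholder is bound to a value when the statement is executed.
        cql += f" WHERE {where}"
    return cql


print(build_select("db", "tweets", ["user_name", "message"], "user_name = ?"))
```

Only the requested columns and matching rows leave Cassandra, which keeps the data transferred to Spark small.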
Writing Data
CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT);
val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
cqlsh:test> select * from words;
word | count
------+-------
bar | 5
foo | 2
(2 rows)
Mapping Rows to Objects
CREATE TABLE test.cars (
id text PRIMARY KEY,
model text,
fuel_type text,
year int
);
case class Vehicle(
id: String,
model: String,
fuelType: String,
year: Int
)
sc.cassandraTable[Vehicle]("test", "cars").collect()
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala case classes
* CQL underscore-case columns are mapped to Scala camel-case properties
* Custom mapping functions (see docs)
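The naming convention behind this mapping can be sketched in a few lines of Python. The function `to_camel_case`, the `Vehicle` dataclass and `row_to_vehicle` are illustrative stand-ins, not the connector's implementation; they only show how `fuel_type` lines up with `fuelType`.

```python
from dataclasses import dataclass


def to_camel_case(column_name: str) -> str:
    """Convert a CQL-style underscore name to a camel-case property name."""
    head, *tail = column_name.split("_")
    return head + "".join(part.capitalize() for part in tail)


@dataclass
class Vehicle:
    id: str
    model: str
    fuelType: str  # camel case on purpose, mirroring the Scala case class
    year: int


def row_to_vehicle(row: dict) -> Vehicle:
    """Map a row keyed by CQL column names onto the Vehicle fields."""
    return Vehicle(**{to_camel_case(name): value for name, value in row.items()})


row = {"id": "KF334L", "model": "Ford Mondeo", "fuel_type": "Petrol", "year": 2009}
print(row_to_vehicle(row))
```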
Type Mapping
CQL Type | Scala Type
ascii | String
bigint | Long
boolean | Boolean
counter | Long
decimal | BigDecimal, java.math.BigDecimal
double | Double
float | Float
inet | java.net.InetAddress
int | Int
list | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map | Map, TreeMap, java.util.HashMap
set | Set, TreeSet, java.util.HashSet
text, varchar | String
timestamp | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid | java.util.UUID
uuid | java.util.UUID
varint | BigInt, java.math.BigInteger
*nullable values | Option
Shark
• SQL query engine on top of Spark
• Not part of Apache Spark
• Hive compatible (JDBC, UDFs, types, metadata, etc.)
• Supports in-memory tables
• Available as a part of DataStax Enterprise
Spark SQL
• Spark SQL supports a subset of the SQL-92 language
• Spark SQL is optimized for Spark internals (e.g. RDDs), with better performance than Shark
• Support for in-memory computation
• From Spark command line
• Mapping of Cassandra keyspaces and tables
• Read and write on Cassandra tables
Usage of Spark SQL & HiveQL query
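import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._
// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)
// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)
// Execute SQL query
val inserted = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")
// Execute HQL query (distinct name: a second `val rdd` would not compile outside the shell)
val joined = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")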
Spark Streaming
• For real time analytics
• Push or pull model
• Stream TO and FROM Cassandra
• Micro batching (each batch represented as RDD)
• Fault tolerant
• Data processed in small batches
• Exactly-once processing
• Unified stream and batch processing framework
• Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers
Usage of Spark Streaming
• Because Spark uses one architecture for batch and streaming, portions of batch and streaming code can be reused
• Because Spark Streaming is backed by Cassandra, there is no need to depend on solutions like Apache ZooKeeper™ in production
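import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations (needed on Spark 1.2 and earlier)
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input (serverIP and serverPort are defined elsewhere)
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()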
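The micro-batch behaviour of this word count can be imitated in plain Python, without Spark or Cassandra. In this sketch the list `batches` stands in for the one-second windows and the dict `words_table` stands in for the `test.words` table; both are assumptions made for illustration. Because Cassandra writes are upserts, each batch overwrites the count stored for a word rather than adding to it.

```python
from collections import Counter


def count_words(batch_lines):
    """Word count for one micro-batch (flatMap + map + reduceByKey in one step).

    Unlike the Scala snippet, empty strings produced by repeated spaces are skipped.
    """
    return Counter(word for line in batch_lines for word in line.split(" ") if word)


# Two simulated one-second batches of input lines.
batches = [["foo bar", "foo"], ["bar bar baz"]]

words_table = {}  # stand-in for test.words, keyed by the primary key `word`
for batch in batches:
    # Upsert semantics: the latest batch's count replaces the stored one.
    words_table.update(count_words(batch))

print(words_table)
```

Keeping running totals across batches would need a different design, for example a counter column or a stateful stream operation.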
Python API
$ dse pyspark
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
SparkContext available as sc.
>>> sc.cassandraTable("test", "kv").collect()
[Row(k=1, v=u'foo')]
DataStax Enterprise + Spark Special Features
• Easy setup and config
  • no need to set up a separate Spark cluster
  • no need to tweak classpaths or config files
• High availability of Spark Master
• Enterprise security
  • Password / Kerberos / LDAP authentication
  • SSL for all Spark to Cassandra connections
• CFS integration (no SPOF distributed file system)
• Cassandra access through Spark Python API
• Certified and supported on Cassandra
• Shark availability
DataStax Enterprise - High Availability
• All nodes are Spark Workers
• By default resilient to Worker failures
• First Spark node promoted as Spark Master (state saved in CFS, no SPOF)
• Standby Master promoted on failure (new Spark Master reconnects to Workers and the driver app and continues the job)
Without DataStax Enterprise
[Diagram: five Cassandra (C*) nodes, each with a Spark Worker (SparkW); one node also hosts the Spark Master (SparkM)]
With DataStax Enterprise
[Diagram: the same five-node cluster; the Master state is kept in C*, and one Worker node (SparkW*) is a spare Master for H/A]
Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
DataStax Enterprise
© 2014 DataStax, All Rights Reserved. Company Confidential
[Diagram labels: External Hadoop Distribution (Cloudera, Hortonworks); OpsCenter Services (Monitoring, Operations); Hadoop; Operational Application; Real Time Search; Real Time Analytics; Batch Analytics; RDBMS; Analytics Transformations; Advanced Security; In-Memory]
Cassandra Cluster - Nodes Ring - Column Family Storage
High Performance - Always Available - Massive Scalability
How to Spark on Cassandra?
DataStax Cassandra Spark driver
https://github.com/datastax/cassandra-driver-spark
Compatible with
• Spark 1.2
• Cassandra 2.0.x and 2.1.x
• DataStax Enterprise 4.5 and 4.6
DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1
Spark 1.2 in next DSE 4.7 version (March)
Thank you. Questions?
We power the big data apps
that transform business.
©2013 DataStax Confidential. Do not distribute without consent.
victor.coustenoble@datastax.com
@vizanalytics

Contenu connexe

Tendances

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache CassandraDataStax
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMongoDB
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversScyllaDB
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialColin Charles
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraGokhan Atil
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 

Tendances (20)

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Spark overview
Spark overviewSpark overview
Spark overview
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete Tutorial
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 

Similaire à Spark + Cassandra = Real Time Analytics on Operational Data

5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandranickmbailey
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developersChristopher Batey
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Cassandra 2.0 to 2.1
Cassandra 2.0 to 2.1Cassandra 2.0 to 2.1
Cassandra 2.0 to 2.1Johnny Miller
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkTim Vincent
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introductionfardinjamshidi
 

Similaire à Spark + Cassandra = Real Time Analytics on Operational Data (20)

5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Cassandra 2.0 to 2.1
Cassandra 2.0 to 2.1Cassandra 2.0 to 2.1
Cassandra 2.0 to 2.1
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introduction
 

Plus de Victor Coustenoble

Préparation de Données pour la Détection de Fraude
Préparation de Données pour la Détection de FraudePréparation de Données pour la Détection de Fraude
Préparation de Données pour la Détection de FraudeVictor Coustenoble
 
Préparation de Données dans le Cloud
Préparation de Données dans le CloudPréparation de Données dans le Cloud
Préparation de Données dans le CloudVictor Coustenoble
 
Préparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaPréparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaVictor Coustenoble
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - TrifactaVictor Coustenoble
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTVictor Coustenoble
 
DataStax et Cassandra dans Azure au Microsoft Techdays
DataStax et Cassandra dans Azure au Microsoft TechdaysDataStax et Cassandra dans Azure au Microsoft Techdays
DataStax et Cassandra dans Azure au Microsoft TechdaysVictor Coustenoble
 
Quelles stratégies de Recherche avec Cassandra ?
Quelles stratégies de Recherche avec Cassandra ?Quelles stratégies de Recherche avec Cassandra ?
Quelles stratégies de Recherche avec Cassandra ?Victor Coustenoble
 
DataStax Enterprise - La plateforme de base de données pour le Cloud
DataStax Enterprise - La plateforme de base de données pour le CloudDataStax Enterprise - La plateforme de base de données pour le Cloud
DataStax Enterprise - La plateforme de base de données pour le CloudVictor Coustenoble
 
Datastax Cassandra + Spark Streaming
Datastax Cassandra + Spark StreamingDatastax Cassandra + Spark Streaming
Datastax Cassandra + Spark StreamingVictor Coustenoble
 
DataStax Enterprise et Cas d'utilisation de Apache Cassandra
DataStax Enterprise et Cas d'utilisation de Apache CassandraDataStax Enterprise et Cas d'utilisation de Apache Cassandra
DataStax Enterprise et Cas d'utilisation de Apache CassandraVictor Coustenoble
 

Plus de Victor Coustenoble (13)

Préparation de Données pour la Détection de Fraude
Préparation de Données pour la Détection de FraudePréparation de Données pour la Détection de Fraude
Préparation de Données pour la Détection de Fraude
 
Préparation de Données dans le Cloud
Préparation de Données dans le CloudPréparation de Données dans le Cloud
Préparation de Données dans le Cloud
 
Préparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaPréparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec Trifacta
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - Trifacta
 
DataStax Enterprise BBL
DataStax Enterprise BBLDataStax Enterprise BBL
DataStax Enterprise BBL
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoT
 
DataStax et Cassandra dans Azure au Microsoft Techdays
DataStax et Cassandra dans Azure au Microsoft TechdaysDataStax et Cassandra dans Azure au Microsoft Techdays
DataStax et Cassandra dans Azure au Microsoft Techdays
 
Webinar Degetel DataStax
Webinar Degetel DataStaxWebinar Degetel DataStax
Webinar Degetel DataStax
 
Quelles stratégies de Recherche avec Cassandra ?
Quelles stratégies de Recherche avec Cassandra ?Quelles stratégies de Recherche avec Cassandra ?
Quelles stratégies de Recherche avec Cassandra ?
 
Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0
 
DataStax Enterprise - La plateforme de base de données pour le Cloud
DataStax Enterprise - La plateforme de base de données pour le CloudDataStax Enterprise - La plateforme de base de données pour le Cloud
DataStax Enterprise - La plateforme de base de données pour le Cloud
 
Datastax Cassandra + Spark Streaming
Datastax Cassandra + Spark StreamingDatastax Cassandra + Spark Streaming
Datastax Cassandra + Spark Streaming
 
DataStax Enterprise et Cas d'utilisation de Apache Cassandra
DataStax Enterprise et Cas d'utilisation de Apache CassandraDataStax Enterprise et Cas d'utilisation de Apache Cassandra
DataStax Enterprise et Cas d'utilisation de Apache Cassandra
 


Spark + Cassandra = Real Time Analytics on Operational Data

  • 1. Real-time analytics on transactional data = Cassandra + Spark. 20/02/15. Victor Coustenoble, Solutions Engineer, victor.coustenoble@datastax.com, @vizanalytics
  • 2.
  • 3. How do you use Cassandra? By monitoring your energy consumption; by streaming films; by browsing websites; by shopping online; by paying with a smartphone; by playing very well-known video games. • Collections/Playlists • Recommendation/Personalization • Fraud detection • Messaging • Connected objects
  • 4. Overview • Founded in April 2010 • 400+ employees • ~35 percent • 500+ customers • Santa Clara, Austin, New York, London, Paris, Sydney
  • 5. Straightening the road (DataStax offering → relational database equivalent) • CQL → SQL • OpsCenter / DevCenter → Management tools • DSE for search & analytics → Integration • Security → Security • Support, consulting & training → 30-year ecosystem
  • 6. Apache Cassandra™ • Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, mission-critical online applications that must scale massively • Written in Java; a hybrid of Amazon Dynamo and Google BigTable • No master/slave roles (peer-to-peer), no single point of failure (no SPOF) • Distributed, with optional multi-data-center deployment • 100% available • Massively scalable • Linear scalability • High performance (reads AND writes) • Multi data center • Simple to operate • CQL language (SQL-like) • OpsCenter / DevCenter tools • BigTable: http://research.google.com/archive/bigtable-osdi06.pdf • Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf • (Diagram: a ring of five nodes, Node 1 to Node 5.)
  • 7. High Availability and Consistency • The failure of a single node must not bring down the system • Consistency is chosen at the client level • Replication Factor (RF) + Consistency Level (CL) = Success • Example: RF = 3, CL = QUORUM (a majority of replicas, here 2 of 3) • (Diagram: a write at CL=QUORUM is sent in parallel to the three replicas on Node 1, Node 2 and Node 3; acks return after 5 μs, 12 μs and 12 μs.) A majority of replicas has answered, so the request succeeds • CL(Read) + CL(Write) > RF => immediate/strong consistency ©2014 DataStax Confidential. Do not distribute without consent.
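The arithmetic behind slide 7 can be sketched in a few lines of plain Scala. This is only an illustration: `ConsistencyMath`, `quorum` and `isStronglyConsistent` are hypothetical names, not part of any Cassandra driver, and the sketch assumes a single data center where QUORUM means a strict majority of the RF replicas.

```scala
// Hypothetical helper, for illustration only (not a Cassandra API).
object ConsistencyMath {
  /** Number of replicas that must acknowledge at QUORUM: a strict majority of RF. */
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  /** The slide's rule: reads see the latest write when CL(Read) + CL(Write) > RF. */
  def isStronglyConsistent(readReplicas: Int, writeReplicas: Int, replicationFactor: Int): Boolean =
    readReplicas + writeReplicas > replicationFactor
}
```

With RF = 3, `ConsistencyMath.quorum(3)` is 2, so QUORUM reads plus QUORUM writes give 2 + 2 > 3 and satisfy the rule, whereas reading and writing at ONE (1 + 1) does not.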
  • 8. DataStax Enterprise: Certified, Enterprise-Ready Cassandra • Enterprise features: Security, Analytics, Search, Visual Monitoring, Management Services, In-Memory • Confidence in use: Dev. IDE & Drivers, Professional Services, Support & Training
  • 9. DataStax Enterprise - Analytics • Designed for running analytics on Cassandra data • There are 4 ways to do analytics on Cassandra data: 1. Search (Solr) 2. Batch analytics (Hadoop) 3. Batch analytics with external tools (Cloudera, Hortonworks) 4. Real-time analytics
  • 10. Partnership
  • 11. Why Spark on Cassandra? • Analytics on transactional data and operational applications • Data model independent queries • Cross-table operations (JOIN, UNION, etc.) • Complex analytics (e.g. machine learning) • Data transformation, aggregation, etc. • Stream processing • Better performance than Hadoop MapReduce
  • 12. Real-time Big Data • Data enrichment • Batch processing • Machine learning • Pre-computed aggregates • Data • No ETL
  • 13. Real-Time Big Data Use Cases • Recommendation Engine • Internet of Things • Fraud Detection • Risk Analysis • Buyer Behaviour Analytics • Telematics, Logistics • Business Intelligence • Infrastructure Monitoring • …
  • 14. Spark Components • Shark or Spark SQL (structured) • Spark Streaming (real-time) • MLlib (machine learning) • GraphX (graph) • Spark (general execution engine) • Cassandra compatible
  • 16. DSE Spark Integration Architecture (diagram): Node 1 to Node 4 each run Cassandra alongside a Spark Worker (JVM) with its Executors, plus a Spark Master (JVM) and the App Driver
  • 17. Spark Cassandra Connector (diagram): inside the Spark Executor, the User Application sits on the Spark-Cassandra Connector, which uses the C* Java Driver to talk to the Cassandra (C*) nodes
  • 18. Cassandra Spark Driver •Cassandra tables exposed as Spark RDDs •Load data from Cassandra to Spark •Write data from Spark to Cassandra •Object mapper : Mapping of C* tables and rows to Scala objects •Type conversions : All Cassandra types supported and converted to Scala types •Server side data selection •Virtual Nodes support •Scala and Java APIs
  • 19. DSE Spark Interactive Shell
      $ dse spark
      ...
      Spark context available as sc.
      HiveSQLContext available as hc.
      CassandraSQLContext available as csc.

      scala> sc.cassandraTable("test", "kv")
      res5: com.datastax.spark.connector.rdd.CassandraRDD[com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48

      scala> sc.cassandraTable("test", "kv").collect
      res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo})

      cqlsh> select * from test.kv;
       k | v
      ---+-----
       1 | foo
      (1 rows)
  • 20. Connecting to Cassandra
      import org.apache.spark.{SparkConf, SparkContext}
      // Import Cassandra-specific functions on SparkContext and RDD objects
      import com.datastax.spark.connector._

      // Spark connection options
      val conf = new SparkConf(true)
        .setMaster("spark://192.168.123.10:7077")
        .setAppName("cassandra-demo")
        .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact
        .set("spark.cassandra.auth.username", "cassandra")
        .set("spark.cassandra.auth.password", "cassandra")

      val sc = new SparkContext(conf)
  • 21. Reading Data
      val table = sc
        .cassandraTable[CassandraRow]("db", "tweets") // row representation, keyspace, table
        .select("user_name", "message")               // server-side column selection
        .where("user_name = ?", "ewa")                // server-side row selection
  • 22. Writing Data
      CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT);

      val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
      collection.saveToCassandra("test", "words", SomeColumns("word", "count"))

      cqlsh:test> select * from words;
       word | count
      ------+-------
        bar |     5
        foo |     2
      (2 rows)
  • 23. Mapping Rows to Objects
      CREATE TABLE test.cars (
        id text PRIMARY KEY,
        model text,
        fuel_type text,
        year int
      );

      case class Vehicle(
        id: String,
        model: String,
        fuelType: String,
        year: Int
      )

      // collect() brings the mapped rows back to the driver as an Array
      sc.cassandraTable[Vehicle]("test", "cars").collect()
      // Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
      //       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

      * Mapping rows to Scala case classes
      * CQL underscore-case columns are mapped to Scala camel-case properties
      * Custom mapping functions (see docs)
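The underscore-to-camel-case naming rule mentioned on slide 23 can be illustrated with a small standalone function. This is a sketch of the convention only, not the connector's own implementation; `ColumnNames` and `toCamelCase` are hypothetical names.

```scala
// Hypothetical helper showing the naming convention; the connector has its own mapper.
object ColumnNames {
  /** Converts a CQL-style column name such as "fuel_type" to "fuelType". */
  def toCamelCase(column: String): String = {
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) column
    else parts.head + parts.tail.map(_.capitalize).mkString
  }
}
```

Applied to the `test.cars` columns, `fuel_type` becomes `fuelType`, while `id`, `model` and `year` are unchanged, which matches the field names of the `Vehicle` case class.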
  • 24. Type Mapping
       CQL Type          | Scala Type
      -------------------+----------------------------------------------------------------
       ascii             | String
       bigint            | Long
       boolean           | Boolean
       counter           | Long
       decimal           | BigDecimal, java.math.BigDecimal
       double            | Double
       float             | Float
       inet              | java.net.InetAddress
       int               | Int
       list              | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
       map               | Map, TreeMap, java.util.HashMap
       set               | Set, TreeSet, java.util.HashSet
       text, varchar     | String
       timestamp         | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
       timeuuid          | java.util.UUID
       uuid              | java.util.UUID
       varint            | BigInt, java.math.BigInteger
       * nullable values | Option
  • 25. Shark • SQL query engine on top of Spark • Not part of Apache Spark • Hive compatible (JDBC, UDFs, types, metadata, etc.) • Supports in-memory tables • Available as a part of DataStax Enterprise
  • 26. Spark SQL • Spark SQL supports a subset of the SQL-92 language • Spark SQL is optimized for Spark internals (e.g. RDDs) and performs better than Shark • Supports in-memory computation
  • 27. Usage of Spark SQL & HiveQL queries • From the Spark command line • Mapping of Cassandra keyspaces and tables • Read and write on Cassandra tables
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.cassandra.CassandraSQLContext
      import com.datastax.spark.connector._

      // Connect to the Spark cluster
      val conf = new SparkConf(true)...
      val sc = new SparkContext(conf)

      // Create Cassandra SQL context
      val cc = new CassandraSQLContext(sc)

      // Execute SQL query
      val sqlResult = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")

      // Execute HQL query
      val hqlResult = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
  • 28. Spark Streaming • For real time analytics • Push or pull model • Stream TO and FROM Cassandra • Micro batching (each batch represented as RDD) • Fault tolerant • Data processed in small batches • Exactly-once processing • Unified stream and batch processing framework • Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers
  • 29. Usage of Spark Streaming • Thanks to the unified Spark architecture, portions of batch and streaming code can be reused • Because Spark Streaming is backed by Cassandra, there is no need to depend on solutions like Apache Zookeeper™ in production
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._ // pair operations such as reduceByKey
      import com.datastax.spark.connector.streaming._

      // Spark connection options
      val conf = new SparkConf(true)...

      // streaming with 1 second batch window
      val ssc = new StreamingContext(conf, Seconds(1))

      // stream input
      val lines = ssc.socketTextStream(serverIP, serverPort)

      // count words
      val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

      // stream output
      wordCounts.saveToCassandra("test", "words")

      // start processing
      ssc.start()
      ssc.awaitTermination()
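What the slide 29 job computes for each one-second micro-batch can be tried without a cluster. The sketch below applies the same flatMap / map / reduce-by-key steps to an in-memory `Seq`, standing in for one batch of socket lines. `WordCountSketch` and `countWords` are hypothetical names, and the filter on empty tokens is an addition that the slide's code does not have.

```scala
// Plain-Scala stand-in for one micro-batch; no Spark dependency required.
object WordCountSketch {
  /** Counts occurrences of each space-separated word across the given lines. */
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // one element per token
      .filter(_.nonEmpty)             // assumption: ignore empty tokens from repeated spaces
      .map(word => (word, 1))         // pair each word with a count of 1
      .groupBy(_._1)                  // local equivalent of grouping by key
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

For a batch containing the lines "foo bar" and "foo", `countWords` yields foo -> 2 and bar -> 1, the same (word, count) pairs that the streaming job would write to the `test.words` table.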
  • 30. Python API
      $ dse pyspark
      Python 2.7.8 (default, Oct 20 2014, 15:05:19)
      [GCC 4.9.1] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      Welcome to Spark version 1.1.0
      Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
      SparkContext available as sc.
      >>> sc.cassandraTable("test", "kv").collect()
      [Row(k=1, v=u'foo')]
  • 31. DataStax Enterprise + Spark Special Features •Easy setup and config • no need to setup a separate Spark cluster • no need to tweak classpaths or config files •High availability of Spark Master •Enterprise security • Password / Kerberos / LDAP authentication • SSL for all Spark to Cassandra connections •CFS integration (no SPOF distributed file system) •Cassandra access through Spark Python API •Certified and Supported on Cassandra •Shark availability
  • 32. DataStax Enterprise - High Availability • All nodes are Spark Workers • By default resilient to Worker failures • First Spark node promoted as Spark Master (state saved in CFS, no SPOF) • Standby Master promoted on failure (New Spark Master reconnects to Workers and the driver app and continues the job)
  • 33. Without DataStax Enterprise (diagram): five Cassandra (C*) nodes, each with a Spark Worker (SparkW); one node also hosts the only Spark Master (SparkM)
  • 34. With DataStax Enterprise (diagram): five Cassandra (C*) nodes, each with a Spark Worker (SparkW); one node also hosts the Spark Master (SparkM), with the master state stored in C* and another node (SparkW*) acting as spare master for H/A
  • 35. Spark Use Cases • Load data from various sources • Analytics (join, aggregate, transform, …) • Sanitize, validate, normalize data • Schema migration, data conversion
  • 36. DataStax Enterprise (diagram) • Operational Application • Real Time Search • Real Time Analytics • Batch Analytics • External Hadoop Distribution (Cloudera, Hortonworks) • RDBMS • Analytics Transformations • OpsCenter Services: Hadoop Monitoring, Operations • Cassandra Cluster – Nodes Ring – Column Family Storage • High Performance – Always Available – Massive Scalability • Advanced Security • In-Memory © 2014 DataStax, All Rights Reserved. Company Confidential
  • 37. How to Spark on Cassandra? DataStax Cassandra Spark driver https://github.com/datastax/cassandra-driver-spark Compatible with • Spark 1.2 • Cassandra 2.0.x and 2.1.x • DataStax Enterprise 4.5 and 4.6 • DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1 • Spark 1.2 in the next DSE version, 4.7 (March)
  • 38. Thank you. Questions? We power the big data apps that transform business. ©2013 DataStax Confidential. Do not distribute without consent. victor.coustenoble@datastax.com @vizanalytics

Editor's notes

  1. Who here knows us? In fact, you use DataStax technology in your daily life without knowing it: eBay for product recommendations, soon Netflix for streaming films, a smartphone purchase thanks to a new service offered by a large mutual bank, an instant-message exchange through a service of the largest telephone operator in France, etc. In the end, you use every day the different kinds of applications offered by our 500 customers, which rely on our database technology. We are growing so fast, and in so many ways, that I'm willing to bet you've used our technology several times in just the past few days without even realizing it. Whether you did some online banking, browsed news sites, did a bit of retail shopping, filled a few prescriptions, or watched movies online (basically, if you lived your life), you used the kinds of applications that we power for over 400 customers, including over 20 of the Fortune 100.
  2. Key Takeaway: Introduce the company, our incredible growth and global presence, that we are in about 25% of the FORTUNE 100, and the fact that many of the online and mobile applications you already use every day are actually built on DataStax. Talk Track: DataStax, the leading distributed database technology, delivers Apache Cassandra to the world's most innovative companies such as Netflix, Rackspace, Pearson Education and Constant Contact. DataStax is built to be agile, always-on, and predictably scalable to any size. We were founded in April 2010, so we are a little over 4 years old. We are headquartered in Santa Clara, California, and have offices in Austin, TX; New York; London, England; and Sydney, Australia. We now have over 330 employees; this number will reach well over 400 by the end of our fiscal year (Jan 31 2015) and double by the end of FY16. Currently 25% of the Fortune 100 use us. Our success has been built on our customers' success, and today we have over 500 customers worldwide, in over 40 countries. The logos you see here are ones that you are already using every day. These applications are all built on DataStax and Apache Cassandra. So how have we come so far in such a short time?
  3. In fact, DataStax's mission is to free you from these uncertainties and smooth your way along this new road. To that end, we offer a DML/DDL language called CQL, very close to the SQL your teams already master, along with complete administration and monitoring tools. So, what DataStax is doing is trying to straighten that bend in the road. We are providing things like CQL, and management tools called DevCenter and OpsCenter. DataStax Enterprise provides integration into analytics and search capabilities, and we do it all within a secure environment. We also provide consultants and training courses, including free virtual training to help get you up to speed.
  4. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance. It uses aspects of Dynamo's partitioning and replication and a log-structured data model similar to Bigtable's. It takes its distribution algorithm from Dynamo and its data model from Bigtable. Cassandra is a reinvented database that is lightning-fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up. This means that in today's world, Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.
  5. Key Takeaway: DataStax Enterprise delivers the commercial confidence and the additional enterprise functionality that you need to support your online business applications. Talk Track: DataStax takes the latest version of open source Apache Cassandra, then certifies and prepares it for bullet-proof enterprise deployment. We deliver commercial confidence in the form of training and support, development tools and drivers, and professional implementation services to ensure that you have everything you need to successfully deploy Cassandra in support of your mainstream business applications. We also offer additional functionality such as: Management Services, which allow you to automatically manage administration and performance; Security and encryption, to ensure that your data remains perfectly safe and free from corruption; an In-Memory option that allows you to deliver online applications with lightning-fast response times; Analytics that allow you to gain valuable insights into data center performance; Search, which easily allows you to search your database; and Visual Monitoring, our OpsCenter product that allows you to easily manage and monitor data center performance from anywhere, and on any device.
  6. Databricks is the company behind Apache Spark.
  7. Predictive analytics. Does this simple architecture look familiar to you? It is the Lambda architecture (Nathan Marz).
  8. Shark is Hive compatible: you can run the same application on Shark. Shark integration is only available in DSE; otherwise you have to wait for Spark SQL. They are separate projects: Shark is a totally different project from Spark. Spark SQL has borrowed from Shark. Both promise to be Hive compatible.
  9. The Cassandra Spark driver will NOT connect to a remote DC (different nodes, profile, etc.).
  10. Master HA out of the box with DSE. A Spark Master controls the workflow, and a Spark Worker launches executors responsible for executing part of the job submitted to the Spark Master.
  11. DUYHAI