SlideShare une entreprise Scribd logo
1  sur  46
codecentric AG
Matthias Niehoff
Big Data Analytics with Cassandra, Spark & MLLib
codecentric AG
– Spark
 Basics
 In A Cluster
 Cassandra Spark Connector
 Use Cases
– Spark Streaming
– Spark SQL
– Spark MLLib
– Live Demo
AGENDA
codecentric AG
SELECT * FROM performer WHERE name = 'ACDC’
ok
SELECT * FROM performer WHERE name = 'ACDC’
and country = ’Australia’
not ok
SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER
BY quantity DESC
not possible
CQL – QUERYING LANGUAGE WITH LIMITATIONS
18.06.2015 5
performer
name text K
style text
country text
type text
codecentric AG
– Open Source & Apache project since 2010
– Data processing framework
 Batch processing
 Stream processing
WHAT IS APACHE SPARK?
18/06/2015 6
codecentric AG
– Fast
 up to 100 times faster than Hadoop
 a lot of in-memory processing
 scalable about nodes
– Easy
 Scala, Java and Python API
 Clean Code (e.g. with lambdas in Java 8)
 expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey,
groupByKey, sample, take, first, count ..
– Fault-Tolerant
 easily reproducible
WHY USE SPARK?
18.06.2015 7
codecentric AG
– RDD‘s – Resilient Distributed Dataset
 Read – Only Collection of Objects
 Distributed among the Cluster (on memory or disk)
 Determined through transformations
 Automatically rebuild on failure
– Operations
 Transformations (map,filter,reduce...)  new RDD
 Actions (count, collect, save)
– Only Actions start processing!
EASILY REPRODUCIBLE?
18.06.2015 8
codecentric AG
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126
RDD EXAMPLE
18.06.2015 9
codecentric AG
REPRODUCE RDD USING A TREE
18.06.2015 10
Datenquelle
rdd1
rdd3
val1 rdd5
rdd2
rdd4
val2
rdd6
val3
map(..)filter(..)
union(..)
count()
count() count()
sample(..)
cache()
codecentric AG
– Spark Cassandra Connector by Datastax
 https://github.com/datastax/spark-cassandra-connector
– Cassandra tables as Spark RDD (read & write)
– Mapping of C* tables and rows onto Java/Scala objects
– Server-Side filtering („where“)
– Compatible with
 Spark ≥ 0.9
 Cassandra ≥ 2.0
– Clone & Compile with SBT or download at Maven Central
SPARK ON CASSANDRA
18.06.2015 16
codecentric AG
– Start Spark Shell
bin/spark-shell
--jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
--conf spark.cassandra.connection.host=localhost
--driver-memory 2g
--executor-memory 3g
– Import Cassandra Classes
scala> import com.datastax.spark.connector._,
USE SPARK CONNECTOR
18.06.2015 17
codecentric AG
– Read complete table
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
– Read selected columns
val movies = sc.cassandraTable("movie", "movies").select("title","year")
– Filter Rows
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
READ TABLE
18.06.2015 18
codecentric AG
– Access columns in ResultSet
movies.collect.foreach(r => println(r.get[String]("title")))
– As Scala tuple
sc.cassandraTable[(String,Int)]("movie","movies")
.select("title","year")
or
sc.cassandraTable("movie","movies").select("title","year")
.as((_: String, _:Int))
CassandraRDD[(String,Int)]
READ TABLE
18.06.2015 19
getString,
getStringOption,
getOrElse, getMap,
codecentric AG
case class Movie(name: String, year: Int)
sc.cassandraTable[Movie]("movie","movies")
.select("title","year")
sc.cassandraTable("movie","movies").select("title","year")
.as(Movie)
CassandraRDD[Movie]
READ TABLE – CASE CLASS
18.06.2015 20
codecentric AG
– Every RDD can be saved (not only CassandraRDD)
– Tuple
val tuples = sc.parallelize(Seq(("Hobbit",2012),
("96 Hours",2008)))
tuples.saveToCassandra("movie","movies",
SomeColumns("title","year")
- Case Class
case class Movie (title:String, year: int)
val objects = sc.parallelize(Seq( Movie("Hobbit",2012),
Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")
WRITE TABLE
18.06.2015 21
codecentric AG
( [atomic,collection,object] , [atomic,collection,object])
val fluege= List( ("Thomas", "Berlin") ,("Mark", "Paris"), ("Thomas", "Madrid") )
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas")
.collect
.foreach(t => println(t._1 + " flog nach " + t._2))
PAIR RDD‘S
18.06.2015 23
key – not unique value
codecentric AG
– Parallelization!
 keys are use for partitioning
 pairs with different keys are distributed across the cluster
– Efficient processing of
 aggregate by key
 group by key
 sort by key
 joins, union based on keys
WHY USE PAIR RDD'S?
18.06.2015 24
codecentric AG
– keys(), values()
– mapValues(func), flatMapValues(func)
– lookup(key), collectAsMap(), countByKey()
– reduceByKey(func), foldByKey(zeroValue)(func)
– groupByKey(), cogroup(otherDataset)
– sortByKey([ascending])
– join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
– union(otherDataset), substractByKey(otherDataset)
SPECIAL OP‘S FOR PAIR RDD‘S
18.06.2015 25
codecentric AG
val pairRDD = sc.cassandraTable("movie","director")
.map(r => (r.getString("country"),r))
// Directors / Country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_+_).sortBy(-
_._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All Countries
pairRDD.keys()
CASSANDRA EXAMPLE
18.06.2015 26
director
name text K
country text
codecentric AG
pairRDD.groupByCountry()
RDD[(String,Iterable[CassandraRow])]
val directors = sc.cassandraTable(..).map(r => r.getString("name"),r))
val movies = sc.cassandraTable().map(r => r.getString("director"),r))
directors.cogroup(movies)
RDD[(String,
(Iterable[CassandraRow], Iterable[CassandraRow]))]
CASSANDRA EXAMPLE
18.06.2015 27
director
name text K
country text
movie
title text K
director text
Director Movies
codecentric AG
– Joins could be expensive
 partitions for same keys in different
tables on different nodes
 requires shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r))
val movies = sc.cassandraTable().map(r => (r.getString("director"),r))
movies.join(directors)
 RDD[(String, (CassandraRow, CassandraRow))]
CASSANDRA EXAMPLE - JOINS
18.06.2015 28
director
name text K
country text
movie
title text K
director text
DirectorMovie
codecentric AG
USE CASES
18.06.2015 30
DataLoading
Validation& Normalization
Analyses (Joins, Transformations,..)
Schema Migration
DataConversion
codecentric AG
– In particular for huge amounts of external data
– Support for CSV, TSV, XML, JSON und other
– Example:
case class User (id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
.repartition(2*sc.defaultParallelism)
.map(line => line.split(",") match { case Array(id,name) => User(java.util.UUID.fromString(id), name)})
users.saveToCassandra("keyspace","users")
DATA LOADING
18.06.2015 31
codecentric AG
– Validate consistency in a Cassandra Database
 syntactic
 Uniqueness (only relevant for columns not in the PK)
 Referential integrity
 Integrity of the duplicates
 semantic
 Business- or Application constraints
 e.g.: At least one genre per movies, a maximum of 10 tags per blog post
VALIDATION
18.06.2015 32
codecentric AG
– Modelling, Mining, Transforming, ....
– Use Cases
 Recommendation
 Fraud Detection
 Link Analysis (Social Networks, Web)
 Advertising
 Data Stream Analytics ( Spark Streaming)
 Machine Learning ( Spark ML)
ANALYSES
18.06.2015 33
codecentric AG
– Changes on existing tables
 New table required when changing primary key
 Otherwise changes could be performed in-place
– Creating new tables
 data derived from existing tables
 Support new queries
– Use the CassandraConnectors in Spark
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo{session => session.execute(ALTER TABLE movie.movies ADD
year timestamp}
SCHEMA MIGRATION & DATA CONVERSION
18.06.2015 34
codecentric AG
STREAM PROCESSING
18.06.2015 35
.. with Spark Streaming
– Real Time Processing
– Supported sources: TCP, HDFS, S3, Kafka, Twitter,..
– Data as Discretized Stream (DStream)
– All Operations of the GenericRDD
– Stateful Operations
– Sliding Windows
codecentric AG
val ssc = new StreamingContext(sc,Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1",9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
SPARK STREAMING - BEISPIEL
18.06.2015 36
codecentric AG
– SQL Queries with Spark (SQL & HiveQL)
 On structured data
 On SchemaRDD‘s
 Every result of Spark SQL are SchemaRDD‘s
 All operations of the GenericRDD‘s available
– Unterstützt, auch auf nicht PK Spalten,...
 Joins
 Union
 Group By
 Having
 Order By
SPARK SQL
18.06.2015 37
codecentric AG
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) as anzahl" +
"FROM artists GROUP BY country" +
"ORDER BY anzahl DESC");
result.collect.foreach(println);
SPARK SQL – CASSANDRA BEISPIEL
18.06.2015 38
performer
name text K
style text
country text
type text
codecentric AG
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// Show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene =
sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
SPARK SQL – RDD / JSON BEISPIEL
18.06.2015 39
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}
codecentric AG
– Fully integrated in Spark
 Scalable
 Scala, Java & Python APIs
 Use with Spark Streaming & Spark SQL
– Packages various algorithms for machine learning
– Includes
 Clustering
 Classification
 Prediction
 Collaborative Filtering
– Still under development
 performance, algorithms
SPARK MLLIB
18.06.2015 40
codecentric AG
age
EXAMPLE – CLUSTERING
18.06.2015 41
set of data points meaningful clusters
income
codecentric AG
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(
s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 3 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate clustering by computing Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + WSSSE)
EXAMPLE – CLUSTERING (K-MEANS)
18.06.2015 42
codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 43
codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 44
codecentric AG
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 45
codecentric AG
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 46
codecentric AG
COLLABORATIVE FILTERING
18.06.2015 47
codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
18.06.2015 48
codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
18.06.2015 49
codecentric AG
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate)
=> (user, product) }
val predictions = model.predict(usersProducts).map {
case Rating(user, product, rate) => ((user, product), rate) }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2); err * err }.mean()
println("Mean Squared Error = " + MSE)
COLLABORATIVE FILTERING (ALS)
18.06.2015 50
codecentric AG
– Setup, Codes & Examples on Github
 https://github.com/mniehoff/spark-cassandra-playground
– Install Cassandra
 For local experiments: Cassandra Cluster Manager
https://github.com/pcmanus/ccm
 ccm create test -v binary:2.0.5 -n 3 –s
 ccm status
 ccm node1 status
– Install Spark
 Tar Ball (Pre Built for Hadoop 2.6)
 HomeBrew
 MacPorts
CURIOUS? GETTING STARTED!
18.06.2015 51
codecentric AG
– Spark Cassandra Connector
 https://github.com/datastax/spark-cassandra-connector
 Clone, ./sbt/sbt assembly
or
 Download Jar @ Maven Central
– Start Spark shell
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-
SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
– Import Cassandra classes
scala> import com.datastax.spark.connector._,
CURIOUS? GETTING STARTED!
18.06.2015 52
codecentric AG
– Production ready Cassandra
– Integrates
 Spark
 Hadoop
 Solr
– OpsCenter & DevCenter
– Bundled and ready to go
– Free for development and test
– http://www.datastax.com/download
(One time registration required)
DATASTAX ENTERPRISE EDITION
18.06.2015 53
codecentric AG
QUESTIONS?
18.06.2015 57
codecentric AG
Matthias Niehoff
codecentric AG
Zeppelinstraße 2
76185 Karlsruhe
tel +49 (0) 721 – 95 95 681
matthias.niehoff@codecentric.de
KONTAKT
18.06.2015 58

Contenu connexe

Tendances

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandranickmbailey
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionPatrick McFadin
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityRussell Spitzer
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraPiotr Kolaczkowski
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisDuyhai Doan
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousRussell Spitzer
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team ApachePatrick McFadin
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelinesPatrick McFadin
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 

Tendances (20)

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 

Similaire à Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraDenis Dus
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkTim Vincent
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonDatabricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developersChristopher Batey
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Erik-Berndt Scheper
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsMatt Stubbs
 

Similaire à Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day (20)

Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 

Dernier

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 

Dernier (20)

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

  • 1. codecentric AG Matthias Niehoff Big Data Analytics with Cassandra, Spark & MLLib
  • 2. codecentric AG – Spark  Basics  In A Cluster  Cassandra Spark Connector  Use Cases – Spark Streaming – Spark SQL – Spark MLLib – Live Demo AGENDA
  • 3. codecentric AG SELECT * FROM performer WHERE name = 'ACDC’ ok SELECT * FROM performer WHERE name = 'ACDC’ and country = ’Australia’ not ok SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER BY quantity DESC not possible CQL – QUERYING LANGUAGE WITH LIMITATIONS 18.06.2015 5 performer name text K style text country text type text
  • 4. codecentric AG – Open Source & Apache project since 2010 – Data processing framework  Batch processing  Stream processing WHAT IS APACHE SPARK? 18/06/2015 6
  • 5. codecentric AG – Fast  up to 100 times faster than Hadoop  a lot of in-memory processing  scalable about nodes – Easy  Scala, Java and Python API  Clean Code (e.g. with lambdas in Java 8)  expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count .. – Fault-Tolerant  easily reproducible WHY USE SPARK? 18.06.2015 7
  • 6. codecentric AG – RDD‘s – Resilient Distributed Dataset  Read – Only Collection of Objects  Distributed among the Cluster (on memory or disk)  Determined through transformations  Automatically rebuild on failure – Operations  Transformations (map,filter,reduce...)  new RDD  Actions (count, collect, save) – Only Actions start processing! EASILY REPRODUCIBLE? 18.06.2015 8
  • 7. codecentric AG scala> val textFile = sc.textFile("README.md") textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 scala> linesWithSpark.count() res0: Long = 126 RDD EXAMPLE 18.06.2015 9
  • 8. codecentric AG REPRODUCE RDD USING A TREE 18.06.2015 10 Datenquelle rdd1 rdd3 val1 rdd5 rdd2 rdd4 val2 rdd6 val3 map(..)filter(..) union(..) count() count() count() sample(..) cache()
  • 9. codecentric AG – Spark Cassandra Connector by Datastax  https://github.com/datastax/spark-cassandra-connector – Cassandra tables as Spark RDD (read & write) – Mapping of C* tables and rows onto Java/Scala objects – Server-Side filtering („where“) – Compatible with  Spark ≥ 0.9  Cassandra ≥ 2.0 – Clone & Compile with SBT or download at Maven Central SPARK ON CASSANDRA 18.06.2015 16
  • 10. codecentric AG – Start Spark Shell bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost --driver-memory 2g --executor-memory 3g – Import Cassandra Classes scala> import com.datastax.spark.connector._, USE SPARK CONNECTOR 18.06.2015 17
  • 11. codecentric AG – Read complete table val movies = sc.cassandraTable("movie", "movies") // Return type: CassandraRDD[CassandraRow] – Read selected columns val movies = sc.cassandraTable("movie", "movies").select("title","year") – Filter Rows val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'") READ TABLE 18.06.2015 18
  • 12. codecentric AG – Access columns in ResultSet movies.collect.foreach(r => println(r.get[String]("title"))) – As Scala tuple sc.cassandraTable[(String,Int)]("movie","movies") .select("title","year") or sc.cassandraTable("movie","movies").select("title","year") .as((_: String, _:Int)) CassandraRDD[(String,Int)] READ TABLE 18.06.2015 19 getString, getStringOption, getOrElse, getMap,
  • 13. codecentric AG case class Movie(name: String, year: Int) sc.cassandraTable[Movie]("movie","movies") .select("title","year") sc.cassandraTable("movie","movies").select("title","year") .as(Movie) CassandraRDD[Movie] READ TABLE – CASE CLASS 18.06.2015 20
  • 14. codecentric AG – Every RDD can be saved (not only CassandraRDD) – Tuple val tuples = sc.parallelize(Seq(("Hobbit",2012), ("96 Hours",2008))) tuples.saveToCassandra("movie","movies", SomeColumns("title","year") - Case Class case class Movie (title:String, year: int) val objects = sc.parallelize(Seq( Movie("Hobbit",2012), Movie("96 Hours",2008))) objects.saveToCassandra("movie","movies") WRITE TABLE 18.06.2015 21
  • 15. codecentric AG ( [atomic,collection,object] , [atomic,collection,object]) val fluege= List( ("Thomas", "Berlin") ,("Mark", "Paris"), ("Thomas", "Madrid") ) val pairRDD = sc.parallelize(fluege) pairRDD.filter(_._1 == "Thomas") .collect .foreach(t => println(t._1 + " flog nach " + t._2)) PAIR RDD‘S 18.06.2015 23 key – not unique value
  • 16. codecentric AG – Parallelization!  keys are use for partitioning  pairs with different keys are distributed across the cluster – Efficient processing of  aggregate by key  group by key  sort by key  joins, union based on keys WHY USE PAIR RDD'S? 18.06.2015 24
  • 17. codecentric AG – keys(), values() – mapValues(func), flatMapValues(func) – lookup(key), collectAsMap(), countByKey() – reduceByKey(func), foldByKey(zeroValue)(func) – groupByKey(), cogroup(otherDataset) – sortByKey([ascending]) – join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset) – union(otherDataset), substractByKey(otherDataset) SPECIAL OP‘S FOR PAIR RDD‘S 18.06.2015 25
  • 18. codecentric AG val pairRDD = sc.cassandraTable("movie","director") .map(r => (r.getString("country"),r)) // Directors / Country, sorted pairRDD.mapValues(v => 1).reduceByKey(_+_).sortBy(- _._2).collect.foreach(println) // or, unsorted pairRDD.countByKey().foreach(println) // All Countries pairRDD.keys() CASSANDRA EXAMPLE 18.06.2015 26 director name text K country text
  • 19. codecentric AG pairRDD.groupByCountry() RDD[(String,Iterable[CassandraRow])] val directors = sc.cassandraTable(..).map(r => r.getString("name"),r)) val movies = sc.cassandraTable().map(r => r.getString("director"),r)) directors.cogroup(movies) RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))] CASSANDRA EXAMPLE 18.06.2015 27 director name text K country text movie title text K director text Director Movies
  • 20. codecentric AG – Joins could be expensive  partitions for same keys in different tables on different nodes  requires shuffling val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r)) val movies = sc.cassandraTable().map(r => (r.getString("director"),r)) movies.join(directors)  RDD[(String, (CassandraRow, CassandraRow))] CASSANDRA EXAMPLE - JOINS 18.06.2015 28 director name text K country text movie title text K director text DirectorMovie
  • 21. codecentric AG USE CASES 18.06.2015 30 DataLoading Validation& Normalization Analyses (Joins, Transformations,..) Schema Migration DataConversion
  • 22. codecentric AG – In particular for huge amounts of external data – Support for CSV, TSV, XML, JSON und other – Example: case class User (id: java.util.UUID, name: String) val users = sc.textFile("users.csv") .repartition(2*sc.defaultParallelism) .map(line => line.split(",") match { case Array(id,name) => User(java.util.UUID.fromString(id), name)}) users.saveToCassandra("keyspace","users") DATA LOADING 18.06.2015 31
  • 23. codecentric AG – Validate consistency in a Cassandra Database  syntactic  Uniqueness (only relevant for columns not in the PK)  Referential integrity  Integrity of the duplicates  semantic  Business- or Application constraints  e.g.: At least one genre per movies, a maximum of 10 tags per blog post VALIDATION 18.06.2015 32
  • 24. codecentric AG – Modelling, Mining, Transforming, .... – Use Cases  Recommendation  Fraud Detection  Link Analysis (Social Networks, Web)  Advertising  Data Stream Analytics ( Spark Streaming)  Machine Learning ( Spark ML) ANALYSES 18.06.2015 33
  • 25. codecentric AG – Changes on existing tables  New table required when changing primary key  Otherwise changes could be performed in-place – Creating new tables  data derived from existing tables  Support new queries – Use the CassandraConnectors in Spark val cc = CassandraConnector(sc.getConf) cc.withSessionDo{session => session.execute(ALTER TABLE movie.movies ADD year timestamp} SCHEMA MIGRATION & DATA CONVERSION 18.06.2015 34
  • 26. codecentric AG STREAM PROCESSING 18.06.2015 35 .. with Spark Streaming – Real Time Processing – Supported sources: TCP, HDFS, S3, Kafka, Twitter,.. – Data as Discretized Stream (DStream) – All Operations of the GenericRDD – Stateful Operations – Sliding Windows
  • 27. codecentric AG val ssc = new StreamingContext(sc,Seconds(1)) val stream = ssc.socketTextStream("127.0.0.1",9999) stream.map(x => 1).reduce(_ + _).print() ssc.start() // await manual termination or error ssc.awaitTermination() // manual termination ssc.stop() SPARK STREAMING - BEISPIEL 18.06.2015 36
  • 28. codecentric AG – SQL Queries with Spark (SQL & HiveQL)  On structured data  On SchemaRDD‘s  Every result of Spark SQL are SchemaRDD‘s  All operations of the GenericRDD‘s available – Unterstützt, auch auf nicht PK Spalten,...  Joins  Union  Group By  Having  Order By SPARK SQL 18.06.2015 37
  • 29. codecentric AG val csc = new CassandraSQLContext(sc) csc.setKeyspace("musicdb") val result = csc.sql("SELECT country, COUNT(*) as anzahl" + "FROM artists GROUP BY country" + "ORDER BY anzahl DESC"); result.collect.foreach(println); SPARK SQL – CASSANDRA BEISPIEL 18.06.2015 38 performer name text K style text country text type text
  • 30. codecentric AG val sqlContext = new SQLContext(sc) val personen = sqlContext.jsonFile(path) // Show the schema personen.printSchema() personen.registerTempTable("personen") val erwachsene = sqlContext.sql("SELECT name FROM personen WHERE alter > 18") erwachsene.collect.foreach(println) SPARK SQL – RDD / JSON BEISPIEL 18.06.2015 39 {"name":"Michael"} {"name":"Jan", "alter":30} {"name":"Tim", "alter":17}
  • 31. codecentric AG – Fully integrated in Spark  Scalable  Scala, Java & Python APIs  Use with Spark Streaming & Spark SQL – Packages various algorithms for machine learning – Includes  Clustering  Classification  Prediction  Collaborative Filtering – Still under development  performance, algorithms SPARK MLLIB 18.06.2015 40
  • 32. codecentric AG age EXAMPLE – CLUSTERING 18.06.2015 41 set of data points meaningful clusters income
  • 33. codecentric AG // Load and parse data val data = sc.textFile("data/mllib/kmeans_data.txt") val parsedData = data.map(s => Vectors.dense( s.split(' ').map(_.toDouble))).cache() // Cluster the data into 3 classes using KMeans with 20 iterations val clusters = KMeans.train(parsedData, 2, 20) // Evaluate clustering by computing Sum of Squared Errors val SSE = clusters.computeCost(parsedData) println("Sum of Squared Errors = " + WSSSE) EXAMPLE – CLUSTERING (K-MEANS) 18.06.2015 42
  • 34. codecentric AG EXAMPLE – CLASSIFICATION 18.06.2015 43
  • 35. codecentric AG EXAMPLE – CLASSIFICATION 18.06.2015 44
  • 36. codecentric AG // Load training data in LIBSVM format. val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt") // Split data into training (60%) and test (40%). val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) val training = splits(0).cache() val test = splits(1) // Run training algorithm to build the model val numIterations = 100 val model = SVMWithSGD.train(training, numIterations) EXAMPLE – CLASSIFICATION (LINEAR SVM) 18.06.2015 45
  • 37. codecentric AG // Compute raw scores on the test set. val scoreAndLabels = test.map { point => val score = model.predict(point.features) (score, point.label) } // Get evaluation metrics. val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auROC = metrics.areaUnderROC() println("Area under ROC = " + auROC) EXAMPLE – CLASSIFICATION (LINEAR SVM) 18.06.2015 46
  • 39. codecentric AG // Load and parse the data (userid,itemid,rating) val data = sc.textFile("data/mllib/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) // Build the recommendation model using ALS val rank = 10 val numIterations = 20 val model = ALS.train(ratings, rank, numIterations, 0.01) COLLABORATIVE FILTERING (ALS) 18.06.2015 48
  • 40. codecentric AG // Load and parse the data (userid,itemid,rating) val data = sc.textFile("data/mllib/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) // Build the recommendation model using ALS val rank = 10 val numIterations = 20 val model = ALS.train(ratings, rank, numIterations, 0.01) COLLABORATIVE FILTERING (ALS) 18.06.2015 49
  • 41. codecentric AG // Evaluate the model on rating data val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) } val predictions = model.predict(usersProducts).map { case Rating(user, product, rate) => ((user, product), rate) } val ratesAndPreds = ratings.map { case Rating(user, product, rate) => ((user, product), rate)}.join(predictions) val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) => val err = (r1 - r2); err * err }.mean() println("Mean Squared Error = " + MSE) COLLABORATIVE FILTERING (ALS) 18.06.2015 50
  • 42. codecentric AG – Setup, Codes & Examples on Github  https://github.com/mniehoff/spark-cassandra-playground – Install Cassandra  For local experiments: Cassandra Cluster Manager https://github.com/pcmanus/ccm  ccm create test -v binary:2.0.5 -n 3 –s  ccm status  ccm node1 status – Install Spark  Tar Ball (Pre Built for Hadoop 2.6)  HomeBrew  MacPorts CURIOUS? GETTING STARTED! 18.06.2015 51
  • 43. codecentric AG – Spark Cassandra Connector  https://github.com/datastax/spark-cassandra-connector  Clone, ./sbt/sbt assembly or  Download Jar @ Maven Central – Start Spark shell spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0- SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost – Import Cassandra classes scala> import com.datastax.spark.connector._, CURIOUS? GETTING STARTED! 18.06.2015 52
  • 44. codecentric AG – Production ready Cassandra – Integrates  Spark  Hadoop  Solr – OpsCenter & DevCenter – Bundled and ready to go – Free for development and test – http://www.datastax.com/download (One time registration required) DATASTAX ENTERPRISE EDITION 18.06.2015 53
  • 46. codecentric AG Matthias Niehoff codecentric AG Zeppelinstraße 2 76185 Karlsruhe tel +49 (0) 721 – 95 95 681 matthias.niehoff@codecentric.de KONTAKT 18.06.2015 58

Notes de l'éditeur

  1. token = murmur3(partition key)
  2. speichert Liste der Partitionen
  3. tasks from different applications run in different JVMs Driver programm near to workers, e.g. same LAN
  4. GroupBy, ReduceBy, SQL Joins
  5. - Cassandra as Datasotre Spark for Data Modelling & Data Analytics Data Locality
  6. Data Locality Workload Isolation
  7. Master: Worker, Running / Completed Apps. Worker: Executors
  8. Anwendung: Jar hinzufügen, Context erzeugen
  9. Filter rows possible on primary key and indexed columns
  10. Alternativ: keyBy (rdd => key)
  11. toSeq.sortBy(-_._2) für Sortierung
  12. Join n keys -> n rows Cogroup -> Same keys are grouped
  13. Data Locality: Partitions aus den Daten des lokalen C* Nodes
  14. defaultParallelism: Total Number of Cores on all Nodes http://icons.iconarchive.com/icons/custom-icon-design/office/256/import-icon.png
  15. http://usacanadaregion.org/sites/usacanadaregion.org/files/Images/checkmark.jpg
  16. http://www.webhostingreviewjam.com/wp-content/uploads/2013/02/icon_analytics.png
  17. Intervall: 500ms bis einige Sekunden https://spark.apache.org/docs/latest/img/streaming-flow.png
  18. Schema RDD: Schema Informationen, besteht aus Zeilen.
  19. Tabelle mit Sänger, PK nur der Name
  20. Tabelle mit Sänger, PK nur der Name
  21. unsupervised
  22. clusters.clusterCenters
  23. supervised
  24. supervised
  25. SVM = Support Vector Machine