USING MULTIPLE PERSISTENCE LAYERS IN SPARK TO BUILD A SCALABLE PREDICTION ENGINE
Richard Williamson | @SuperHadooper
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
AGENDA
• Some Background
• Model Building/Scoring Use Case
• Spark Persistence to Parquet via Kafka
• Exploratory Analysis with Impala
• Spark Persistence to Cassandra
• Operational Queries via CQL
• Impala Model Building/Scoring
• Spark Model Building/Scoring
• Questions
THE EXPERIMENTAL ENTERPRISE
We need to both support investigative work and build a solid layer for production.

Data science allows us to observe our experiments and respond to the changing environment.

The foundation of the experimental enterprise focuses on making infrastructure readily accessible.
HOW HAS "BIG DATA" CHANGED OUR ABILITY TO PREDICT THE FUTURE?
• VOLUME: Storing longer periods of history
• VELOCITY: Stream processing allows for on-the-fly forecast adjustments
• VARIETY: Data scientists can use additional features to improve model accuracy
HOW HAS "BIG DATA TECHNOLOGY" CHANGED OUR ABILITY TO PREDICT THE FUTURE?
• Lowered cost of storing and processing data
• Improved data scientist efficiency through a common repository
• Improved modeling performance by bringing models to the data rather than data to the models
EXAMPLE USE CASE
Can we predict future RSVPs by consuming Meetup.com streaming API data?
http://stream.meetup.com/2/rsvps
https://github.com/silicon-valley-data-science/strataca-2015/tree/master/Spark-to-Build-a-Scalable-Prediction-Engine
WHERE DO WE START?
BY TAKING A LOOK AT THE DATA…
{
  "response": "yes",
  "member": {
    "member_name": "Richard Williamson",
    "photo": "http://photos3.meetupstatic.com/photos/member/d/a/4/0/thumb_231595872.jpeg",
    "member_id": 29193652
  },
  "visibility": "public",
  "event": {
    "time": 1424223000000,
    "event_url": "http://www.meetup.com/Big-Data-Science/events/217322312/",
    "event_id": "fbtgdlytdbbc",
    "event_name": "Big Data Science @Strata Conference, 2015"
  },
  "guests": 0,
  "mtime": 1424020205391,
  "rsvp_id": 1536654666,
  "group": {
    "group_name": "Big Data Science",
    "group_state": "CA",
    "group_city": "Fremont",
    "group_lat": 37.52,
    "group_urlname": "Big-Data-Science",
    "group_id": 3168962,
    "group_country": "us",
    "group_topics": [
      {"urlkey": "data-visualization", "topic_name": "Data Visualization"},
      {"urlkey": "data-mining", "topic_name": "Data Mining"},
      {"urlkey": "businessintell", "topic_name": "Business Intelligence"},
      {"urlkey": "mapreduce", "topic_name": "MapReduce"},
      {"urlkey": "hadoop", "topic_name": "Hadoop"},
      {"urlkey": "opensource", "topic_name": "Open Source"},
      {"urlkey": "r-project-for-statistical-computing", "topic_name": "R Project for Statistical Computing"},
      {"urlkey": "predictive-analytics", "topic_name": "Predictive Analytics"},
      {"urlkey": "cloud-computing", "topic_name": "Cloud Computing"},
      {"urlkey": "big-data", "topic_name": "Big Data"},
      {"urlkey": "data-science", "topic_name": "Data Science"},
      {"urlkey": "data-analytics", "topic_name": "Data Analytics"},
      {"urlkey": "hbase", "topic_name": "HBase"},
      {"urlkey": "hive", "topic_name": "Hive"}
    ],
    "group_lon": -121.93
  },
  "venue": {
    "lon": -121.889122,
    "venue_name": "San Jose Convention Center, Room 210AE",
    "venue_id": 21805972,
    "lat": 37.330341
  }
}
FROM ONE TO MANY
Now that we know what one record looks like, let's capture a large dataset.
• First we capture the stream to Kafka by curling the endpoint to a file and tailing the file into a Kafka producer; this builds up some history for us to look at.
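The capture step above (curl the streaming endpoint into a file, tail the file into Kafka) can be sketched in Python. This is a minimal sketch: the `send` callback stands in for a real Kafka producer publish on a topic such as `meetupstream`, which is an assumption, not code from the deck.

```python
import os
import time

def follow(path, send, poll_interval=0.05, max_polls=None):
    """Tail a growing file, handing each complete line to `send`.

    `send` stands in for a Kafka publish call; in the real pipeline it
    would publish each JSON record to a topic (assumed: "meetupstream").
    """
    polls = 0
    with open(path, "rb") as f:
        while max_polls is None or polls < max_polls:
            line = f.readline()
            if line.endswith(b"\n"):
                send(line.decode("utf-8").rstrip("\n"))
            else:
                # Nothing new, or a partial line: rewind past the fragment
                # and wait for the curl process to append more data.
                f.seek(-len(line), os.SEEK_CUR)
                polls += 1
                time.sleep(poll_interval)
```

In production the tailing would run indefinitely (`max_polls=None`); the cutoff parameter is only there to make the sketch easy to exercise.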
FROM ONE TO MANY
• Then use Spark to load the data into Parquet for interactive exploratory analysis in Impala:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val meetup = sqlContext.jsonFile("/home/odroid/meetupstream")
meetup.registerTempTable("meetup")
val meetup2 = sqlContext.sql("""
  select event.event_id, event.event_name, event.event_url, event.time, guests,
         member.member_id, member.member_name,
         member.other_services.facebook.identifier as facebook_identifier,
         member.other_services.linkedin.identifier as linkedin_identifier,
         member.other_services.twitter.identifier as twitter_identifier,
         member.photo, mtime, response, rsvp_id,
         venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility
  from meetup""")
meetup2.saveAsParquetFile("/home/odroid/meetup.parquet")
INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA
• We could have done basic exploratory analysis in Spark SQL, but date logic is not fully supported.
• It's very easy to point Impala at the data and do the analysis there.
INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA
• Create an external Impala table pointing to the data we just created in Spark:

CREATE EXTERNAL TABLE meetup
LIKE PARQUET '/user/odroid/meetup/_metadata'
STORED AS PARQUET
LOCATION '/user/odroid/meetup';
INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA
• Now we can query:

select from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH") as data_hr
      ,count(*) as rsvp_cnt
from meetup
group by 1
order by 1;
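The rollup this query performs (epoch-millisecond `mtime` bucketed to an hour, then counted) can be sketched in plain Python. One assumption: Impala's `from_unixtime` formats in the server's timezone, while this sketch uses UTC.

```python
from collections import Counter
from datetime import datetime, timezone

def rsvps_by_hour(mtimes_ms):
    """Bucket epoch-millisecond RSVP timestamps by hour, mirroring
    from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH")."""
    hours = (
        datetime.fromtimestamp(ms // 1000, tz=timezone.utc).strftime("%Y-%m-%d %H")
        for ms in mtimes_ms
    )
    return Counter(hours)
```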
SPARK STREAMING: Load Cassandra from Kafka

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
  val lines2 = sqlContext.jsonRDD(rdd)
  lines2.registerTempTable("lines2")
  val lines3 = sqlContext.sql("""
    select event.event_id, event.event_name, event.event_url, event.time, guests,
           member.member_id, member.member_name,
           member.other_services.facebook.identifier as facebook_identifier,
           member.other_services.linkedin.identifier as linkedin_identifier,
           member.other_services.twitter.identifier as twitter_identifier,
           member.photo, mtime, response, rsvp_id,
           venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility
    from lines2""")
  val lines4 = lines3.map(line => (line(0), line(5), line(13), line(1), line(2), line(3), line(4),
    line(6), line(7), line(8), line(9), line(10), line(11), line(12), line(14), line(15),
    line(16), line(17), line(18)))
  lines4.saveToCassandra("meetup", "rsvps", SomeColumns("event_id", "member_id", "rsvp_id",
    "event_name", "event_url", "time", "guests", "member_name", "facebook_identifier",
    "linkedin_identifier", "twitter_identifier", "photo", "mtime", "response", "lat", "lon",
    "venue_id", "venue_name", "visibility"))
})
ssc.start()
OPERATIONAL QUERIES IN CASSANDRA
• When low-millisecond queries are needed, Cassandra provides favorable storage.
• Now we can perform queries like this one:

select * from meetup.rsvps
where event_id='fbtgdlytdbbc' and member_id=29193652 and rsvp_id=1536654666;
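A point lookup like this only works if the table's primary key covers the predicate columns. The deck does not show the table definition; a schema along these lines (assumed) would support it, with event_id as the partition key and member_id, rsvp_id as clustering columns:

```sql
-- Assumed CQL schema; not shown in the deck. The remaining columns from
-- the SomeColumns list (event_url, time, guests, member_name, ...) would
-- follow the same pattern.
CREATE TABLE meetup.rsvps (
  event_id   text,
  member_id  bigint,
  rsvp_id    bigint,
  event_name text,
  mtime      bigint,
  response   text,
  PRIMARY KEY ((event_id), member_id, rsvp_id)
);
```

With this key, the equality query on (event_id, member_id, rsvp_id) is a single-partition, single-row read.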
Why do we want to store the same data in two places?
• Different access patterns prefer different physical storage.
• Large table scans needed for analytical queries require fast sequential access to data, making Parquet format in HDFS a perfect fit. It can then be queried by Impala, Spark, or even Hive.
• Fast record lookups require a carefully designed key that allows direct access to the specific row of data, making Cassandra (or HBase) a very good fit.
START BUILDING MODELS
• A quick way to do some exploratory predictions is with Impala MADlib or Spark MLlib.
• First we need to build some features from the table we created in Impala.
START BUILDING MODELS
• A simple way to do this for hourly RSVP counts is to create a boolean indicator variable for each hour, set to true if the given observation was for that hour.
• We can do the same to indicate whether the observation was taken on a weekend or a weekday.
START BUILDING MODELS
• The SQL to do this looks something like this:

create table rsvps_by_hr_training as (
  select
    case when mhour=0 then 1 else 0 end as hr0
    ,case when mhour=1 then 1 else 0 end as hr1
    …
    ,case when mhour=22 then 1 else 0 end as hr22
    ,case when mhour=23 then 1 else 0 end as hr23
    ,case when mdate in ("2015-02-14","2015-02-15") then 1 else 0 end as weekend_day
    ,mdate,mhour,rsvp_cnt
  from rsvps_by_hour);
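The same feature construction (24 hour indicators plus a weekend flag) can be sketched outside SQL. The weekend dates below follow the slide; for longer histories they would be computed from the calendar rather than hard-coded.

```python
def hour_features(mdate, mhour, weekend_dates=("2015-02-14", "2015-02-15")):
    """Build the 25-element feature row used for training:
    24 one-hot hour indicators (hr0..hr23) followed by a weekend_day flag."""
    hours = [1 if mhour == h else 0 for h in range(24)]
    weekend_day = 1 if mdate in weekend_dates else 0
    return hours + [weekend_day]
```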
Train/score regression model in Impala
• Install MADlib for Impala: http://blog.cloudera.com/blog/2013/10/how-to-use-madlib-pre-built-analytic-functions-with-impala/
• Train regression model in SQL:

SELECT printarray(linr(toarray(hr0,hr1,hr2,hr3,hr4,hr5,hr6,hr7,hr8,hr9,hr10,hr11,hr12,hr13,hr14,hr15,hr16,hr17,hr18,hr19,hr20,hr21,hr22,hr23,weekend_day), rsvp_cnt))
from meetup.rsvps_by_hr_training;

• Which gives us the regression coefficients:

<8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73>

• Which we can then apply to new data:

select mdate, mhour,
       cast(linrpredict(toarray(8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73),
                        toarray(hr0, hr1, hr2, hr3, hr4, hr5, hr6, hr7, hr8, hr9, hr10, hr11, hr12, hr13, hr14, hr15, hr16, hr17, hr18, hr19, hr20, hr21, hr22, hr23, weekend_day)) as int) as rsvp_cnt_pred,
       rsvp_cnt
from meetup.rsvps_by_hr_testing;
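The linrpredict scoring step is just an inner product of the fitted coefficients with the feature vector, cast to an integer. A minimal sketch, assuming (as the training call's toarray order implies) that the 25th coefficient is the weekend_day weight:

```python
# Coefficients as reported by the MADlib linr training query above.
COEFFS = [8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58,
          4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24,
          8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12,
          9538.87, 9984.37, 9115.12, -2323.73]

def predict_rsvps(features, coeffs=COEFFS):
    """Score one observation: dot(coeffs, features), cast to int
    as in the Impala query."""
    return int(sum(c * f for c, f in zip(coeffs, features)))
```

For a weekday observation at hour 17 (one-hot on hr17, weekend_day = 0), the prediction is simply the hr17 coefficient; a weekend observation additionally picks up the negative weekend_day weight.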
PREDICTED HOURLY RSVPS FROM IMPALA
[Chart: predicted (rsvp_cnt_prediction) vs. actual (rsvp_cnt) hourly RSVP counts over roughly 100 hours]
Prepare data for regression in Spark

val meetup3 = sqlContext.sql("select mtime from meetup2")
val meetup4 = meetup3.map(x => new DateTime(x(0).toString().toLong))
  .map(x => (x.toString("yyyy-MM-dd HH"), x.toString("yyyy-MM-dd"), x.toString("HH")))
case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")
val trainingData = meetup7.map { row =>
  val features = Array[Double](row(24).toString().toDouble, row(0).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
Train and score regression model in Spark

val trainingData = meetup7.map { row =>
  val features = Array[Double](1.0, row(0).toString().toDouble, row(1).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
  val features = Vectors.dense(Array[Double](1.0, row(0).toString().toDouble, …
    row(23).toString().toDouble))
  (row(25), row(26), row(27), model.predict(features))
}
scores.foreach(println)
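RidgeRegressionWithSGD fits least squares with an L2 penalty on the weights. Stripped of the Spark machinery, the core update can be sketched in plain Python; the single feature, learning rate, and penalty strength below are illustrative assumptions, not values from the deck.

```python
def ridge_fit_1d(xs, ys, lam=0.001, lr=0.05, iters=200):
    """Gradient descent on mean squared error plus an L2 penalty lam * w^2.

    A toy stand-in for RidgeRegressionWithSGD: one weight, full-batch
    gradients, no intercept.
    """
    n = len(xs)
    w = 0.0
    for _ in range(iters):
        # d/dw [ (1/n) * sum (w*x - y)^2 + lam * w^2 ]
        grad = (2.0 / n) * sum(x * (w * x - y) for x, y in zip(xs, ys)) + 2.0 * lam * w
        w -= lr * grad
    return w
```

On data generated as y = 2x, the fitted weight lands just under 2, pulled slightly toward zero by the penalty, which is exactly the shrinkage behavior ridge regression trades for variance reduction.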
PREDICTED HOURLY RSVPS FROM SPARK
[Chart: predicted (rsvp_cnt_prediction) vs. actual (rsvp_cnt) hourly RSVP counts over roughly 100 hours]
NEXT STEPS
• Include additional features (holidays, growth trend, etc.)
• Predict at lower granularity (event level, topic level, etc.)
• Convert the Spark regression model to utilize streaming regression
• Build ensemble models to leverage the strengths of multiple algorithms
• Move to pilot, then production
THANK YOU
Richard Williamson
richard@svds.com
@superhadooper
Yes, we’re hiring!
info@svds.com
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Dernier (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 

Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon 2015

  • 1. USING MULTIPLE PERSISTENCE LAYERS IN SPARK TO BUILD A SCALABLE PREDICTION ENGINE Richard Williamson | @SuperHadooper
  • 2. 2 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • Some Background • Model Building/Scoring Use Case • Spark Persistence to Parquet via Kafka • Exploratory Analysis with Impala • Spark Persistence to Cassandra • Operational queries via CQL • Impala Model Building/Scoring • Spark Model Building/Scoring • Questions AGENDA
  • 3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
  • 4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. THE EXPERIMENTAL ENTERPRISE We need to both support investigative work and build a solid layer for production. Data science allows us to observe our experiments and respond to the changing environment. The foundation of the experimental enterprise focuses on making infrastructure readily accessible.
  • 5. 5 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
  • 6. 6 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • VOLUME: Storing longer periods of history • VELOCITY: Stream processing allows for on the fly forecast adjustments • VARIETY: Data Scientists can use additional features to improve model accuracy HOW HAS “BIG DATA” CHANGED OUR ABILITY TO BETTER PREDICT THE FUTURE?
  • 7. 7 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • Lowered Cost of Storing and Processing Data • Improved data scientist efficiency by having common repository • Improved modeling performance by bringing models to the data versus bringing data to the models HOW HAS “BIG DATA TECHNOLOGY” CHANGED OUR ABILITY TO BETTER PREDICT THE FUTURE?
  • 8. 8 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Can we predict future RSVPs by consuming Meetup.com streaming API data? http://stream.meetup.com/2/rsvps https://github.com/silicon-valley-data-science/strataca-2015/tree/master/Spark-to-Build-a-Scalable-Prediction-Engine EXAMPLE USE CASE
  • 10. 10 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. BY TAKING A LOOK AT THE DATA…

{
  "response": "yes",
  "member": {
    "member_name": "Richard Williamson",
    "photo": "http://photos3.meetupstatic.com/photos/member/d/a/4/0/thumb_231595872.jpeg",
    "member_id": 29193652
  },
  "visibility": "public",
  "event": {
    "time": 1424223000000,
    "event_url": "http://www.meetup.com/Big-Data-Science/events/217322312/",
    "event_id": "fbtgdlytdbbc",
    "event_name": "Big Data Science @Strata Conference, 2015"
  },
  "guests": 0,
  "mtime": 1424020205391,
  "rsvp_id": 1536654666,
  "group": {
    "group_name": "Big Data Science",
    "group_state": "CA",
    "group_city": "Fremont",
    "group_lat": 37.52,
    "group_urlname": "Big-Data-Science",
    "group_id": 3168962,
    "group_country": "us",
    "group_topics": [
      {"urlkey": "data-visualization", "topic_name": "Data Visualization"},
      {"urlkey": "data-mining", "topic_name": "Data Mining"},
      {"urlkey": "businessintell", "topic_name": "Business Intelligence"},
      {"urlkey": "mapreduce", "topic_name": "MapReduce"},
      {"urlkey": "hadoop", "topic_name": "Hadoop"},
      {"urlkey": "opensource", "topic_name": "Open Source"},
      {"urlkey": "r-project-for-statistical-computing", "topic_name": "R Project for Statistical Computing"},
      {"urlkey": "predictive-analytics", "topic_name": "Predictive Analytics"},
      {"urlkey": "cloud-computing", "topic_name": "Cloud Computing"},
      {"urlkey": "big-data", "topic_name": "Big Data"},
      {"urlkey": "data-science", "topic_name": "Data Science"},
      {"urlkey": "data-analytics", "topic_name": "Data Analytics"},
      {"urlkey": "hbase", "topic_name": "HBase"},
      {"urlkey": "hive", "topic_name": "Hive"}
    ],
    "group_lon": -121.93
  },
  "venue": {
    "lon": -121.889122,
    "venue_name": "San Jose Convention Center, Room 210AE",
    "venue_id": 21805972,
    "lat": 37.330341
  }
}
  • 11. 11 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. FROM ONE TO MANY Now that we know what one record looks like, let’s capture a large dataset. • First we capture stream to Kafka by curling to file and tailing file to Kafka consumer — this will build up some history for us to look at.
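The capture step described above might look like the following command sketch (the file path and broker address are illustrative; the topic name `meetupstream` matches the one used later in the deck, and `kafka-console-producer.sh` ships with the Kafka distribution):

```shell
# Stream the Meetup RSVP firehose to a local file (runs until killed).
curl -s http://stream.meetup.com/2/rsvps >> /home/odroid/meetupstream.json &

# Tail the growing file into a Kafka topic via the console producer.
tail -F /home/odroid/meetupstream.json | \
  kafka-console-producer.sh --broker-list localhost:9092 --topic meetupstream
```

Buffering through a file like this also leaves a raw archive on disk, which is handy for replaying the stream during development.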
  • 12. 12 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. FROM ONE TO MANY • Then use Spark to load to Parquet for interactive exploratory analysis in Impala:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val meetup = sqlContext.jsonFile("/home/odroid/meetupstream")
meetup.registerTempTable("meetup")
val meetup2 = sqlContext.sql("""
  select event.event_id, event.event_name, event.event_url, event.time, guests,
         member.member_id, member.member_name,
         member.other_services.facebook.identifier as facebook_identifier,
         member.other_services.linkedin.identifier as linkedin_identifier,
         member.other_services.twitter.identifier as twitter_identifier,
         member.photo, mtime, response, rsvp_id,
         venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility
  from meetup""")
meetup2.saveAsParquetFile("/home/odroid/meetup.parquet")
  • 13. 13 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • We could have done basic exploratory analysis in Spark SQL, but date logic is not fully supported. • It’s very easy to point Impala to the data and do it there.
  • 14. 14 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • Create external Impala table pointing to the data we just created in Spark:

CREATE EXTERNAL TABLE meetup
LIKE PARQUET '/user/odroid/meetup/_metadata'
STORED AS PARQUET
LOCATION '/user/odroid/meetup';
  • 15. 15 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • Now we can query:

select from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH") as data_hr
      ,count(*) as rsvp_cnt
from meetup
group by 1
order by 1;
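The hourly bucketing that Impala query performs — dividing the millisecond `mtime` by 1000 and formatting it as `yyyy-MM-dd HH` — can be sketched in plain Python (illustrative helper names, not from the deck; UTC assumed):

```python
# Replicate from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH")
# plus the group-by count, for epoch-millisecond RSVP timestamps.
from collections import Counter
from datetime import datetime, timezone

def hour_bucket(mtime_ms):
    """Millisecond epoch timestamp -> 'yyyy-MM-dd HH' string (UTC)."""
    return datetime.fromtimestamp(mtime_ms // 1000, tz=timezone.utc).strftime("%Y-%m-%d %H")

def rsvps_per_hour(mtimes_ms):
    """Count RSVPs per hour bucket, like the Impala group-by."""
    return Counter(hour_bucket(t) for t in mtimes_ms)

bucket = hour_bucket(1424020205391)  # mtime from the sample RSVP record
```
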
  • 16. 16 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. SPARK STREAMING: Load Cassandra from Kafka

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream",
  Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
  val lines2 = sqlContext.jsonRDD(rdd)
  lines2.registerTempTable("lines2")
  val lines3 = sqlContext.sql("""
    select event.event_id, event.event_name, event.event_url, event.time, guests,
           member.member_id, member.member_name,
           member.other_services.facebook.identifier as facebook_identifier,
           member.other_services.linkedin.identifier as linkedin_identifier,
           member.other_services.twitter.identifier as twitter_identifier,
           member.photo, mtime, response, rsvp_id,
           venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility
    from lines2""")
  val lines4 = lines3.map(line => (line(0), line(5), line(13), line(1), line(2),
    line(3), line(4), line(6), line(7), line(8), line(9), line(10), line(11),
    line(12), line(14), line(15), line(16), line(17), line(18)))
  lines4.saveToCassandra("meetup", "rsvps", SomeColumns("event_id", "member_id",
    "rsvp_id", "event_name", "event_url", "time", "guests", "member_name",
    "facebook_identifier", "linkedin_identifier", "twitter_identifier", "photo",
    "mtime", "response", "lat", "lon", "venue_id", "venue_name", "visibility"))
})
ssc.start()
  • 17. 17 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. OPERATIONAL QUERIES IN CASSANDRA • When low-millisecond queries are needed, Cassandra provides favorable storage. • Now we can perform queries like this one:

select * from meetup.rsvps
where event_id='fbtgdlytdbbc'
  and member_id=29193652
  and rsvp_id=1536654666;
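For that query pattern to hit a single partition, the table's primary key must lead with `event_id` and cluster on `member_id` and `rsvp_id`. The deck doesn't show the schema; a hypothetical sketch consistent with the query and the Spark job's column list might look like:

```sql
-- Hypothetical CQL schema (not shown in the deck): partition on
-- event_id, cluster by member_id and rsvp_id for direct row lookup.
CREATE TABLE meetup.rsvps (
  event_id   text,
  member_id  bigint,
  rsvp_id    bigint,
  event_name text,
  response   text,
  -- ... remaining columns written by the Spark streaming job ...
  PRIMARY KEY (event_id, (member_id, rsvp_id))
);
```

Note the design choice: the partition key determines data placement, so all RSVPs for an event live together, while the clustering columns make the exact-row lookup a single seek.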
  • 18. 18 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Why do we want to store the same data in two places? • Different access patterns prefer different physical storage. • Large table scans needed for analytical queries require fast sequential access to data, making Parquet format in HDFS a perfect fit. It can then be queried by Impala, Spark, or even Hive. • Fast record lookups require a carefully designed key to allow for direct access to the specific row of data, making Cassandra (or HBase) a very good fit.
  • 19. 19 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • A quick way to do some exploratory predictions is with Impala MADlib or Spark MLlib. • First we need to build some features from the table we created in Impala.
  • 20. 20 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • A simple way to do this for hourly RSVP counts is to create a boolean indicator variable for each hour, and set it to true if the given observation was for the given hour. • We can do the same to indicate whether the observation was taken on a weekend or weekday.
  • 21. 21 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • The SQL to do this looks something like this:

create table rsvps_by_hr_training as (
  select case when mhour=0 then 1 else 0 end as hr0
        ,case when mhour=1 then 1 else 0 end as hr1
        …
        ,case when mhour=22 then 1 else 0 end as hr22
        ,case when mhour=23 then 1 else 0 end as hr23
        ,case when mdate in ("2015-02-14","2015-02-15") then 1 else 0 end as weekend_day
        ,mdate, mhour, rsvp_cnt
  from rsvps_by_hour);
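The same indicator encoding can be sketched in plain Python (hypothetical helper, not from the deck): 24 one-hot hour dummies plus a weekend flag, computed from the calendar rather than the hard-coded dates the demo SQL uses.

```python
# Build the feature vector used by the hourly RSVP regression:
# hr0..hr23 indicator variables plus a weekend_day flag.
from datetime import date

def hour_features(mdate, mhour):
    """mdate: datetime.date, mhour: 0-23 -> list of 24 hour dummies + weekend flag."""
    hrs = [1 if h == mhour else 0 for h in range(24)]
    weekend_day = 1 if mdate.weekday() >= 5 else 0  # Saturday=5, Sunday=6
    return hrs + [weekend_day]

feats = hour_features(date(2015, 2, 14), 17)  # 2015-02-14 was a Saturday
```
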
  • 22. 22 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Train/score regression model in Impala
• Install MADlib for Impala: http://blog.cloudera.com/blog/2013/10/how-to-use-madlib-pre-built-analytic-functions-with-impala/
• Train regression model in SQL:

SELECT printarray(linr(toarray(hr0,hr1,hr2,hr3,hr4,hr5,hr6,hr7,hr8,hr9,hr10,hr11,hr12,
  hr13,hr14,hr15,hr16,hr17,hr18,hr19,hr20,hr21,hr22,hr23,weekend_day), rsvp_cnt))
from meetup.rsvps_by_hr_training;

• Which gives us the regression coefficients:

<8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73>

• Which we can then apply to new data:

select mdate, mhour,
       cast(linrpredict(toarray(8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24,
                                4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58,
                                5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9,
                                10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12,
                                -2323.73),
                        toarray(hr0, hr1, hr2, hr3, hr4, hr5, hr6, hr7, hr8, hr9, hr10,
                                hr11, hr12, hr13, hr14, hr15, hr16, hr17, hr18, hr19,
                                hr20, hr21, hr22, hr23, weekend_day)) as int) as rsvp_cnt_pred,
       rsvp_cnt
from meetup.rsvps_by_hr_testing;
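Because every feature is a 0/1 indicator, applying the MADlib coefficients above reduces to a dot product: the prediction is the coefficient for the active hour, plus the weekend coefficient when the flag is set. A plain-Python sketch (illustrative, not the deck's SQL):

```python
# Score new observations with the regression coefficients produced by
# MADlib's linr() on the hour dummies + weekend flag.
COEFFS = [8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58,
          4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24,
          8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12,
          9538.87, 9984.37, 9115.12, -2323.73]

def predict_rsvps(mhour, weekend_day):
    """Predicted RSVP count for an hour-of-day (0-23) and weekend flag (0/1)."""
    features = [1 if h == mhour else 0 for h in range(24)] + [weekend_day]
    return sum(c * x for c, x in zip(COEFFS, features))

weekday_5pm = predict_rsvps(17, 0)       # coefficient for hr17 alone
weekend_midnight = predict_rsvps(0, 1)   # hr0 coefficient + weekend adjustment
```

The negative weekend coefficient (-2323.73) shows the model learned that weekend hours see markedly fewer RSVPs than the same weekday hour.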
  • 23. 23 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. PREDICTED HOURLY RSVPS FROM IMPALA [Chart: predicted (rsvp_cnt_prediction) vs. actual (rsvp_cnt) RSVPs per hour over roughly 100 hours; y-axis 0–12,000]
  • 24. 24 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Prepare data for regression in Spark

val meetup3 = sqlContext.sql("select mtime from meetup2")
val meetup4 = meetup3.map(x => new DateTime(x(0).toString().toLong))
  .map(x => (x.toString("yyyy-MM-dd HH"), x.toString("yyyy-MM-dd"), x.toString("HH")))
case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 " +
  "where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")
val trainingData = meetup7.map { row =>
  val features = Array[Double](row(24).toString().toDouble, row(0).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
  • 25. 25 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Train and Score regression model in Spark

val trainingData = meetup7.map { row =>
  val features = Array[Double](1.0, row(0).toString().toDouble, row(1).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
  val features = Vectors.dense(Array[Double](1.0, row(0).toString().toDouble, …
    row(23).toString().toDouble))
  (row(25), row(26), row(27), model.predict(features))
}
scores.foreach(println)
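RidgeRegressionWithSGD fits an L2-penalized linear model by stochastic gradient descent. To make the idea concrete outside Spark, here is a toy single-feature sketch in plain Python (illustrative only; MLlib's implementation distributes this across the cluster and handles step sizes and mini-batching for you):

```python
# Toy ridge regression trained with SGD: per-example gradient of
# 0.5*(w*x - y)^2 + 0.5*reg*w^2, applied with a fixed learning rate.
import random

def ridge_sgd(data, lr=0.01, reg=0.001, epochs=200, seed=42):
    """data: list of (x, y) pairs; returns learned weight w for y ~ w*x."""
    rng = random.Random(seed)
    data = list(data)  # avoid mutating the caller's list
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = w * x - y
            w -= lr * (err * x + reg * w)  # gradient step with L2 penalty
    return w

points = [(x / 10.0, 2.0 * (x / 10.0)) for x in range(1, 21)]  # y = 2x exactly
w = ridge_sgd(points)  # converges near 2.0 (slightly shrunk by the penalty)
```
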
  • 26. 26 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. PREDICTED HOURLY RSVPS FROM SPARK [Chart: predicted (rsvp_cnt_prediction) vs. actual (rsvp_cnt) RSVPs per hour over roughly 100 hours; y-axis 0–12,000]
  • 27. 27 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. NEXT STEPS • Include additional features (Holidays, Growth Trend, etc.) • Predict at lower granularity (Event Level, Topic Level, etc.) • Convert Spark regression model to utilize streaming regression • Build ensemble models to leverage strengths of multiple algorithms • Move to pilot, then production