Fast Data Intelligence in the IoT
Real-time Data Analytics with Spark Streaming and MLlib
Bas Geerdink
#iottechday
ABOUT ME
• Chapter Lead in Analytics area at ING
• Academic background in Artificial
Intelligence and Informatics
• Working in IT since 2004, previously as
developer and software architect
• Spark Certified Developer
• Twitter: @bgeerdink
• Github: geerdink
WHAT’S NEW IN THE IOT?
• More data
– Streaming data from multiple sources
• New use cases
– Combining data streams
• New technology
– Fast processing and scalability
[Diagram labels: Front End, Back End, Data]
PATTERNS & PRACTICES
FOR FAST DATA ANALYTICS
• Lambda Architecture
• Reactive Principles
• Pipes & filters
• Event Sourcing
• REST, HATEOAS
• …
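Of the patterns listed above, event sourcing is perhaps the easiest to show in a few lines. The following is a minimal sketch, not taken from the talk: the `ScoreEvent` type and the `totalScores` function are hypothetical names chosen for illustration. It assumes that state is never stored directly and is instead rebuilt by folding over an append-only list of events.

```scala
object EventSourcingSketch {
  // Hypothetical event type: a user gave a score to a product.
  final case class ScoreEvent(userName: String, productName: String, score: Int)

  // The current state (total score per product) is derived by folding over the event log.
  def totalScores(events: Seq[ScoreEvent]): Map[String, Int] =
    events.foldLeft(Map.empty[String, Int]) { (state, e) =>
      state.updated(e.productName, state.getOrElse(e.productName, 0) + e.score)
    }
}
```

Because the log is the source of truth, replaying the same events always yields the same state.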
LAMBDA ARCHITECTURE
Source: Nathan Marz & James Warren (2015)
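In the Lambda Architecture, a query is answered by combining a precomputed batch view with a real-time (speed layer) view that covers data not yet absorbed by the batch layer. The sketch below illustrates that merge for simple counts. The `query` function and the map-based views are assumptions made for illustration, not code from the talk.

```scala
object LambdaSketch {
  // batchView: counts precomputed over all data up to the last batch run.
  // speedView: counts for events that arrived after that batch run.
  // The serving side answers a query by merging both views.
  def query(batchView: Map[String, Long], speedView: Map[String, Long], key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)
}
```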
REACTIVE PRINCIPLES
Source: Reactive Manifesto (2014)
USE CASE
nest
WWW
FAST DATA ARCHITECTURE
[Diagram labels: Products, Users; API; App, Web, …; Batch (Machine Learning); Streaming (Business Logic); Social Media, Search History, GPS Data, …; Message Broker; Events; Visualize; Processing; Database]
A SHIFT IN TECHNOLOGY PARADIGMS
Disk → In-memory
Database → Stream
Objects → Functions
Centralized → Distributed
Shared Memory/CPU/Disk → Shared Nothing
TOOLS FOR THE JOB
• Apache Kafka
• Apache Cassandra
• Apache Spark
• Apache Zeppelin
• Akka
• Scala
FAST DATA ARCHITECTURE
[Diagram labels: Products, Users; API; App, Web, …; Batch (Machine Learning); Streaming (Business Logic); Social Media, Search History, GPS Data; Message Broker; Events; Visualize; Processing; Database]
KAFKA
• Distributed Message broker
• Built for speed, scalability, fault-tolerance
• Works with topics, producers, consumers
• Created at LinkedIn, now open source
• Written in Scala
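To make the topic, producer, and consumer vocabulary concrete, here is a toy in-memory model of a single-partition topic. It is only a sketch under stated assumptions: the `Topic` class and its `send` and `poll` methods are hypothetical and are not the Kafka client API. It illustrates one idea, that a topic is an append-only log and that consumers read from an offset they track themselves.

```scala
object TopicSketch {
  // Toy stand-in for a single-partition topic: an append-only log of messages.
  final class Topic[A] {
    private var log = Vector.empty[A]

    // Producer side: append a message and return its offset in the log.
    def send(msg: A): Long = {
      log = log :+ msg
      log.size - 1L
    }

    // Consumer side: read up to `max` messages starting at `fromOffset`.
    // Reading does not remove messages, so several consumers can read the same log.
    def poll(fromOffset: Long, max: Int): Vector[A] =
      log.slice(fromOffset.toInt, fromOffset.toInt + max)
  }
}
```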
CODE: KAFKA
• build.sbt:
"org.apache.kafka" %% "kafka" % kafkaVersion
• Application.conf:
kafka { producer … consumer }
• KafkaConnection.scala:
def producer, def consumer
• KafkaProducerActor.scala:
producer.send(msg)
• KafkaConsumerActor.scala:
val kafkaStream =
  connection.createMessageStreams(Map(topic -> 1))(topic)(0)
CASSANDRA
• NoSQL database
• Built for speed, scalability, fault-tolerance
• Works with CQL, consistency levels, replication factors
• Created at Facebook, now open source
• Written in Java
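Consistency levels and replication factors interact through simple arithmetic. The sketch below shows the usual rules of thumb: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The function names are hypothetical and chosen for illustration.

```scala
object ConsistencySketch {
  // QUORUM needs a majority of the replicas: floor(RF / 2) + 1.
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  // Reads overlap the latest acknowledged write when R + W > RF.
  def readsSeeLatestWrite(readReplicas: Int, writeReplicas: Int, replicationFactor: Int): Boolean =
    readReplicas + writeReplicas > replicationFactor
}
```

With a replication factor of 3, reading and writing at quorum (2 replicas each) satisfies the overlap rule, while reading and writing at a single replica does not.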
CODE: CASSANDRA
CREATE TABLE products (user_name text, product_category text, product_name text,
score int, insertion_time timeuuid, PRIMARY KEY (user_name, product_category,
product_name));
val cluster = new Cluster.Builder()
  .addContactPoints(uri.hosts.toArray: _*)
  .withPort(uri.port)
  .withQueryOptions(new QueryOptions().setConsistencyLevel(defaultConsistencyLevel))
  .build
val session = cluster.connect
session.execute(s"USE ${uri.keyspace}")

def insertScore(productScore: ProductScore): Unit = {
  // Values are interpolated directly for brevity; prefer bound parameters for untrusted input
  val query =
    s"""INSERT INTO products (user_name, product_category, product_name, score, insertion_time)
       |VALUES ('${productScore.userName}', '${productScore.productCategory}',
       |'${productScore.productName}', ${productScore.score}, now())""".stripMargin
  session.execute(query)
}
SPARK
• Fast, parallel, in-memory, general-purpose data
processing engine
• Winner of Daytona Gray Sort benchmark 2014
• Runs on Hadoop YARN, Mesos, cloud, or standalone
• Created at AMPLab UC Berkeley, now open source
• Written in Scala
CODE: SPARK BASICS
val l = List(1,2,3,4,5)
val p = sc.parallelize(l) // create RDD
p.count() // action
def fun1(x: Int): Int = x * 2
p.map(fun1).collect() // map is a transformation (lazy); collect is an action
p.map(i => i * 2).filter(_ < 6).collect() // same pipeline with lambdas
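The split between transformations and actions can be tried without a Spark cluster. The sketch below is an analogy only, using a plain Scala collection view as a stand-in for an RDD: nothing is computed when `map` is declared, and the work happens once `toList` forces the result, much as `collect()` does in Spark. The object and method names are hypothetical.

```scala
object LazySketch {
  // Returns how many elements had been mapped before the "action" ran, and the final result.
  def run(): (Int, List[Int]) = {
    var calls = 0
    // "Transformation": declared on a lazy view, so the function is not applied yet.
    val doubled = List(1, 2, 3, 4, 5).view.map { i => calls += 1; i * 2 }
    val callsBeforeAction = calls
    // "Action": toList forces the whole pipeline to be evaluated.
    val result = doubled.filter(_ < 6).toList
    (callsBeforeAction, result)
  }
}
```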
SPARK
SPARK STREAMING
CODE: SPARK STREAMING
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("search_history"))
kafkaDirectStream
  .map(record => ProductScoreHelper.createProductScore(record._2)) // each record is a (key, value) pair
  .filter(_.productCategory != "Sneakers")
  .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore))
ssc.start() // explicitly tell the StreamingContext to start receiving data
ssc.awaitTermination() // block until the streaming job is stopped or fails
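The batch interval passed to the `StreamingContext` means that incoming records are cut into small batches, each of which is then processed as an RDD. The sketch below mimics that grouping with plain Scala collections. It is a simplification under stated assumptions: the `Event` type and `batches` function are hypothetical, and each event is assumed to carry its arrival time in milliseconds.

```scala
object MicroBatchSketch {
  // Hypothetical event with its arrival time in milliseconds.
  final case class Event(timestampMs: Long, payload: String)

  // Group events into fixed-size batch intervals, keyed by the interval start time.
  def batches(events: Seq[Event], intervalMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .map { case (start, es) => start -> es.map(_.payload) }
}
```

With a 2000 ms interval, events arriving at 100 ms and 1900 ms fall into the batch starting at 0, and an event at 2000 ms opens the next batch.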
CODE: SPARK MLLIB
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LinearRegressionModel

// initialize Spark MLlib
val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]")
val sc = new SparkContext(conf)
// load machine learning model from disk
val model = LinearRegressionModel.load(sc, "/home/social_media.model")

def processEvent(sme: SocialMediaEvent): Unit = {
  // feature vector extraction
  // note: DenseVector takes an Array[Double], so userName and message
  // must already be encoded as numeric features at this point
  val vector = new DenseVector(Array(sme.userName, sme.message))
  // get a new prediction for the top user category
  val value = model.predict(vector)
  // store the predicted category value
  val user = new User(sme.userName, UserHelper.getCategory(value))
  CassandraHelper.updateUserCategory(user)
}
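What `predict` computes for a linear regression model can be written out by hand: the dot product of the model weights and the feature vector, plus an intercept. The sketch below uses plain arrays instead of MLlib types; the `predict` function shown here is a hypothetical stand-in for illustration, not the MLlib implementation.

```scala
object LinearPredictSketch {
  // Linear regression prediction: weights . features + intercept.
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "feature vector must match the model dimension")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}
```

The `require` check reflects a practical constraint: the feature vector built for each event must have the same length as the vectors the model was trained on.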
THREE KEY TAKEAWAYS
• The IoT comes with a new architecture: reactive and
scalable are the new normal
• Be aware of the paradigm shift: in-memory,
streaming, distributed, shared nothing
• Open source tooling such as Kafka, Cassandra, and
Spark can help to process the fast data flows
Thank You!
“please rate my talk in the official IoT Tech Day app”
@bgeerdink
#iottechday

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 

Fast Data Intelligence in the IoT - real-time data analytics with Spark

  • 1. Fast Data Intelligence in the IoT Real-time Data Analytics with Spark Streaming and MLlib Bas Geerdink #iottechday
  • 2. ABOUT ME • Chapter Lead in Analytics area at ING • Academic background in Artificial Intelligence and Informatics • Working in IT since 2004, previously as developer and software architect • Spark Certified Developer • Twitter: @bgeerdink • Github: geerdink
  • 3.
  • 4. WHAT’S NEW IN THE IOT? • More data – Streaming data from multiple sources • New use cases – Combining data streams • New technology – Fast processing and scalability Front End Back End Data
  • 5. PATTERNS & PRACTICES FOR FAST DATA ANALYTICS • Lambda Architecture • Reactive Principles • Pipes & filters • Event Sourcing • REST, HATEOAS • …
  • 6. LAMBDA ARCHITECTURE Source: Nathan Marz & James Warren (2015)
  • 9. FAST DATA ARCHITECTURE Products Users API App Web … Batch (Machine Learning) Social Media Search History GPS Data … Message Broker Events Streaming (Business Logic) Visualize Processing Database
  • 10. A SHIFT IN TECHNOLOGY PARADIGMS Disk → In-memory; Database → Stream; Objects → Functions; Centralized → Distributed; Shared Memory/CPU/Disk → Shared Nothing
  • 11. TOOLS FOR THE JOB • Apache Kafka • Apache Cassandra • Apache Spark • Apache Zeppelin • Akka • Scala
  • 12. FAST DATA ARCHITECTURE Products Users API App Web … Batch Machine Learning Social Media Search History GPS Data GPS Data Message Broker Streaming Business Logic Events Visualize Processing Database
  • 13. KAFKA • Distributed Message broker • Built for speed, scalability, fault-tolerance • Works with topics, producers, consumers • Created at LinkedIn, now open source • Written in Scala
  • 14. CODE: KAFKA • build.sbt: "org.apache.kafka" %% "kafka" % kafkaVersion • Application.conf: kafka { producer … consumer } • KafkaConnection.scala: def producer, def consumer • KafkaProducerActor.scala: producer.send(msg) • KafkaConsumerActor.scala: val kafkaStream = connection.createMessageStreams(Map(topic -> 1))(topic)(0)
  • 15. CASSANDRA • NoSQL database • Built for speed, scalability, fault-tolerance • Works with CQL, consistency levels, replication factors • Created at Facebook, now open source • Written in Java
  • 16. CODE: CASSANDRA
      CREATE TABLE products (
        user_name text, product_category text, product_name text,
        score int, insertion_time timeuuid,
        PRIMARY KEY (user_name, product_category, product_name));
      val cluster = new Cluster.Builder()
        .addContactPoints(uri.hosts.toArray: _*)
        .withPort(uri.port)
        .withQueryOptions(new QueryOptions().setConsistencyLevel(defaultConsistencyLevel))
        .build
      val session = cluster.connect
      session.execute(s"USE ${uri.keyspace}")
      def insertScore(productScore: ProductScore): Unit = {
        val query = s"INSERT INTO products (user_name, product_category, product_name, score, insertion_time) VALUES ('${productScore.userName}', '${productScore.productCategory}', '${productScore.productName}', ${productScore.score}, now())"
        session.execute(query)
      }
  • 17. SPARK • Fast, parallel, in-memory, general-purpose data processing engine • Winner of Daytona Gray Sort benchmark 2014 • Runs on Hadoop YARN, Mesos, cloud, or standalone • Created at AMPLab UC Berkeley, now open source • Written in Scala
  • 18. CODE: SPARK BASICS
      val l = List(1,2,3,4,5)
      val p = sc.parallelize(l)                   // create RDD
      p.count()                                   // action
      def fun1(x: Int): Int = x * 2
      p.map(fun1).collect()                       // transformation (map) followed by an action (collect)
      p.map(i => i * 2).filter(_ < 6).collect()   // lambda
  • 19. SPARK
  • 21. CODE: SPARK STREAMING
      val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(2))   // batch interval = 2 sec
      val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
      val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("search_history"))
      kafkaDirectStream
        .map(record => ProductScoreHelper.createProductScore(record._2))   // each record is a (key, value) pair; use the value
        .filter(_.productCategory != "Sneakers")
        .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore))
      ssc.start()              // it's necessary to explicitly tell the StreamingContext to start receiving data
      ssc.awaitTermination()   // wait for the job to finish
  • 22. CODE: SPARK MLLIB
      // initialize Spark MLlib
      val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]")
      val sc = new SparkContext(conf)
      // load machine learning model from disk
      val model = LinearRegressionModel.load(sc, "/home/social_media.model")
      def processEvent(sme: SocialMediaEvent): Unit = {
        // feature vector extraction
        // NOTE: DenseVector expects an Array[Double]; userName and message must be encoded as numbers first
        val vector = new DenseVector(Array(sme.userName, sme.message))
        // get a new prediction for the top user category
        val value = model.predict(vector)
        // store the predicted category value
        val user = new User(sme.userName, UserHelper.getCategory(value))
        CassandraHelper.updateUserCategory(user)
      }
  • 23. THREE KEY TAKEAWAYS • The IoT comes with new architecture: reactive and scalable are the new normal • Be aware of the paradigm shift: in-memory, streaming, distributed, shared nothing • Open source tooling such as Kafka, Cassandra, and Spark can help to process the fast data flows
  • 24. Thank You! “please rate my talk in the official IoT Tech Day app” @bgeerdink #iottechday
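
Slide 22 builds its feature vector directly from the event's `userName` and `message` strings, while MLlib dense vectors hold doubles. The talk does not show how the strings are encoded, so the following is only a minimal sketch of one common option, the hashing trick. The `SocialMediaEvent` case class, the `featurize` helper, and the bucket count of 16 are assumptions made for illustration and are not taken from the slides.

```scala
// Hypothetical event type mirroring the two fields used on slide 22.
final case class SocialMediaEvent(userName: String, message: String)

object FeatureSketch {
  // Number of hash buckets; an arbitrary choice for this illustration.
  val NumFeatures = 16

  // Hashing trick: every lowercase token increments the bucket selected by its hash code.
  def featurize(sme: SocialMediaEvent): Array[Double] = {
    val features = Array.fill(NumFeatures)(0.0)
    val tokens = (sme.userName +: sme.message.split("\\s+").toSeq)
      .map(_.toLowerCase)
      .filter(_.nonEmpty)
    tokens.foreach { token =>
      // floorMod keeps the bucket index non-negative even for negative hash codes
      val bucket = Math.floorMod(token.hashCode, NumFeatures)
      features(bucket) += 1.0
    }
    features
  }
}
```

The resulting array could then be wrapped with `Vectors.dense(...)` from `org.apache.spark.mllib.linalg` before calling `model.predict`. This only makes sense if the model was trained on features produced by the same encoding.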

Editor's notes

  1. In this session, streaming data from IoT sources (sensors) will be pulled into an analytics engine to make predictions about the future. We use Spark as the technology of choice, since this framework is well suited for combining streaming data with machine learning techniques. Join this session to get an overview of a (nearly) full-blown analytics application, and to get inspired to set up your own predictive API for the IoT!
  2. This is a dream for engineers…
  3. Who is now actually working on an IoT application in production? Compare to a conference on Content Management Systems, ERP, … Big data vs. fast data: the 3 Vs (Volume, Variety, Velocity). Storage is not an issue anymore… Hadoop is 10 years old! Speed and responsiveness are the new challenges. Same as with big data: you have to do something with the data. Machine learning works best with lots of data, e.g. historical events.
  4. Reusable solutions to common problems. Building blocks, guidelines, blueprints of architecture. I’m going to tell a little about the first two.
  5. 1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views. 3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way. 4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 5. Any incoming query can be answered by merging results from batch views and real-time views.
  6. Elastic = scalable on demand, up and down. The system stays responsive under varying workload. Resilient = the system stays responsive in the face of failure. Responsive = the system should respond in a timely manner if at all possible, even if parts of it are failing. Deal with problems quickly. Message-driven = rely on asynchronous message passing to ensure loose coupling, isolation, non-blocking behavior, and back-pressure. Back-pressure = the ability to communicate that a component is under stress. This feedback is used by upstream components to reduce the load, thereby ensuring the system as a whole doesn’t fail.
  7. We have this nice guy. He is a little strange, because he has a social network. Even worse: he is on the internet, buying stuff, searching for items. He is even connected to the IoT: car, house, fridge, phone, etc. Scary! Now, meet an evil guy. He wants to take advantage of all this nice data! He sets up a company that combines all these data flows, and does something very clever: he shows Mr. Nice Guy ads in banners. He wants to give him an offer he can’t refuse! Obviously nobody in the audience would click on such advertisement spam, but please consider that there are people on this planet who might do that. So, I am a developer in this company; how should I build my system? It has to be scalable: we start small, but what if this becomes a success? I’ve heard something about fast data and the lambda architecture, let’s give that a try…
  8. Batch: Based on historical behavior and user profile, predict (recommend) the product category that a user is interested in. Algorithm on daily/hourly basis. Speed: Based on current, actual data, score the products of a category. Store events for API: - Select data from tables, define order/priority of products within a category. What do we need to set up such a system nowadays?
  9. Parallelism. Fault tolerance (stateless, immutable).
  10. All open source, reason: not because it’s free, but because we want to contribute to the community. All running on commodity hardware and cloud. I will discuss the top three…
  11. Batch: Based on historical behavior and user profile, predict the product category that a user is interested in. Algorithm on daily/hourly basis. Speed: Based on current, actual data, score the products of a category. Store events for API: - Select data from tables, define order/priority of products within a category. What do we need to set up such a system nowadays?
  12. Allows SOA and microservices architectures, but it’s not an ESB (too little functionality). Elastic: one instance can serve a large organization. One broker can handle 100s of megabytes per second from 1000s of clients. Runs on ZooKeeper, a high-performance coordination service. Publish-subscribe mechanism. Too fast? (Precision in real time can lead to misses.)
  13. Consistency level: tradeoff between speed and data quality (1 = fast, may not read the last written value; quorum = strict majority w.r.t. the replication factor; all = slow, guaranteed reads). CAP theorem: it’s impossible to provide all three guarantees of Consistency (= quality; all nodes see the same data at the same time), Availability, and Partition Tolerance. ACID vs. BASE consistency model: relational/’safe’ vs. scalable/resilient/’eventually consistent’. Commercialized by DataStax.
  14. Spark = data processing framework with built-in parallel distribution and in-memory computing. Biggest ‘big data’ project at Apache. Daytona Sort: 2009: Hadoop, 100 TB in 173 minutes, 3452 nodes x 4 cores; 2013: Hadoop, 100 TB in 4,328 seconds, 2100 nodes x 8 cores; 2014: Spark, 100 TB in 1,406 seconds, 207 nodes x 32 cores. Commercialized by Databricks, Cloudera, Hortonworks, Amazon, IBM, … StorageLevel can be chosen: memory and/or disk, optionally serialized. Number and size of partitions is configurable.
  15. RDD = resilient distributed dataset. Transformations, actions, accumulators, broadcast variables.
  16. History: General batch processing: MapReduce. Specialized systems: Dremel, Drill, Impala, Storm, S4, … Unified platform: Spark. Spark SQL = query structured data. GraphX = for graph structures, e.g. hyperlinks, communities, …
  17. RDD = Resilient Distributed Dataset For true streaming: Apache Flink
  18. Also show CassandraWriterActor
  19. Show Zeppelin. ML variations: classification, regression, clustering
  20. Fourth one: maybe don’t use social media??
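
Note 5 ends with the query step of the Lambda Architecture: a query is answered by merging the batch views with the real-time views. A minimal sketch of that merge follows, assuming both views map a product name to an additive score. The `View` type, the `merge` function, and the additive-score assumption are illustrative and do not come from the talk.

```scala
object LambdaQuerySketch {
  // Hypothetical view shape: product name -> score.
  type View = Map[String, Int]

  // The batch view covers history up to the last batch run;
  // the real-time view covers only the events that arrived since then.
  // Assuming scores are additive, a query sums the two per product.
  def merge(batchView: View, realtimeView: View): View =
    realtimeView.foldLeft(batchView) { case (acc, (product, score)) =>
      acc.updated(product, acc.getOrElse(product, 0) + score)
    }
}
```

For example, merging a batch view `Map("boots" -> 10, "sandals" -> 3)` with a real-time view `Map("boots" -> 2, "hats" -> 1)` yields `Map("boots" -> 12, "sandals" -> 3, "hats" -> 1)`. Once the next batch run has absorbed the recent events, the real-time view can be discarded and rebuilt from newer events.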