SlideShare une entreprise Scribd logo
1  sur  31
Perfecting Your Streaming
Skills with Spark and Real
World IoT Data
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
Bob’s Background
• IT professional 16 years
• Currently working as a Data Engineer
• Education
• BS Business Admin (MIS) from KState
• MBA (finance concentration) from KU
• Coursework in Mathematics at Washburn
• Graduate certificate Data Science from Rockhurst
•Addicted to everything data
Follow Me!
•Personal Twitter: @BobLovesData
•Company Twitter: @MassStreet
•Blog: DataDrivenPerspectives.com
•Website: www.MassStreet.net
•Facebook: @MassStreetAnalyticsLLC
KC Learn Big Data Objectives
•Educate people about what you
can do with all the new technology
surrounding data.
•Grow the big data career field.
•Teach skills not products
ACM Kansas City
We’re looking for a speaker
willing to talk in deep detail
about data engineering
challenges their organization is
experiencing.
This Evening’s Learning
Objectives
Learn how to practice
your IoT skills with
real world IoT data.
Motivations For This Evenings Discussion
• Getting ready for the opportunities that IoT
presents.
• Tired of working with Sandboxes
• Tired of playing with human generated data
Disclaimer
Tonight’s presentation is based on a
personal hack-a-thon.
The Original Plan
The Original Plan
• Azure HDInsight
• Push button cluster
• Kafka
• Distributed pub/sub messaging system
• Spark
• Stream Processing framework
• Druid
• Opensource OLAP NoSQL real time database
• Grafana
• IoT Dashboard
The New Plan
All Material Can Be Downloaded
from GitHub
Azure HDInsight
• Button push Hadoop cluster
• Cloud version of Hortonworks Data Platform
• You can spin up different types of clusters
• Plain Hadoop
• Spark
• Hbase
• R Server
• Storm
• Real Time Hive
• Kafka
Azure HDInsight
• Each cluster type comes with the following
• Ambari
• Avro
• Hive and Hcat
• Mahout
• MapReduce
• Oozie
• Phoenix (?) (I don’t normally play with Hbase)
• Pig
• Sqoop
• Tez
• Yarn
• ZooKeeper
Azure HDInsight
Let’s stand up a cluster!
Azure HDInsight
Make sure you blow away the
resource group!
The Case for Learning IoT
• IoT represents a significant business opportunity
• Still on the wrong side of the hype cycle
• Not sure why
• Might be due to hardware requirements
• Just a matter of time
• IoT opens up an new world of applications
• IoT use cases
• Has the potential for ubiquitousness
An IoT Case Study
Life Alert
Fatal flaw (no pun
intended) .
Patient has to be
conscious to
activate.
An IoT Case Study
Software is easy.
Challenge: Creating a
device with a form factor
that provides function but
doesn’t get in the way of
the patients life.
Satori
• www.satori.com
• Open Streaming Data Platform
• Appears to be in tech preview
• So far appears to be free on the sub side
• Allows people to pub/sub to data streams
• A lot of municipal and science streams
• Not just IoT data
Satori
• You need an SDK to build stuff with Satori
• You can use your favorite build tool to import the
necessary classes
Satori
Let’s look at some code!
Spark Structured Streaming….
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
Spark Structured Streaming
• Represents a VAST improvement over previous
stream processing frameworks
• Storm
• Flume
• Flink
• Samza
• Spark Streaming
• Allows you to create stream processes without having
to think much about it.
Spark Structured Streaming
Spark Structured Streaming
• If you can make batch jobs with Spark SQL, you can
make Structured Streaming Jobs.
• So new there is little to no literature on the topic.
• Structured Streaming Programming Guide is your best bet.
• Exactly once delivery semantics out of the box.
• Limited number of built in sources
• Socket
• Kafka
• File
Spark Structured Streaming
• Fairly unlimited on sinks
• JSON
• ORC
• Parquet
• CSV
• Database table
Spark Structured Streaming
• Things you can do with Structured Streaming
• SQL Operations
• Grouping/aggregations
• Filtering
• Joins (one DF has to be static)
• Things you can’t do make no sense in a streaming context
• Limit
• First N rows
• Operations over sliding windows
• Handling late data with watermarking
Spark Structured Streaming
Let’s look at some code!
Code Challenge
•Convert the socket server into a Kafka
Producer
•Find a streaming source that’s simpler
in structure than the weather data and
use it
•Attempt to parse the weather data and
load it into a case class.
•Try and build the rest of the

Contenu connexe

Tendances

The State of Stream Processing
The State of Stream ProcessingThe State of Stream Processing
The State of Stream Processing
confluent
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
DataWorks Summit
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 

Tendances (20)

Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
The State of Stream Processing
The State of Stream ProcessingThe State of Stream Processing
The State of Stream Processing
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 Presentations
 
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
Druid
DruidDruid
Druid
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
Case Study: Realtime Analytics with Druid
Case Study: Realtime Analytics with DruidCase Study: Realtime Analytics with Druid
Case Study: Realtime Analytics with Druid
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
 

En vedette

En vedette (9)

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
 
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
 
Build your First IoT Application with IBM Watson IoT
Build your First IoT Application with IBM Watson IoTBuild your First IoT Application with IBM Watson IoT
Build your First IoT Application with IBM Watson IoT
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
 

Similaire à Perfecting Your Streaming Skills with Spark and Real World IoT Data

Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
Stephen Borg
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
Christopher Whitaker
 
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling SoftwareJAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
jazoon13
 

Similaire à Perfecting Your Streaming Skills with Spark and Real World IoT Data (20)

Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
 
Iot meets Serverless
Iot meets ServerlessIot meets Serverless
Iot meets Serverless
 
Data science and Artificial Intelligence
Data science and Artificial IntelligenceData science and Artificial Intelligence
Data science and Artificial Intelligence
 
Code PaLOUsa Azure IoT Workshop
Code PaLOUsa Azure IoT WorkshopCode PaLOUsa Azure IoT Workshop
Code PaLOUsa Azure IoT Workshop
 
NDC Minnesota 2019 - Fundamentals of Azure IoT
NDC Minnesota 2019 - Fundamentals of Azure IoTNDC Minnesota 2019 - Fundamentals of Azure IoT
NDC Minnesota 2019 - Fundamentals of Azure IoT
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Genji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelinesGenji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelines
 
IoT Workshop Cincinnati
IoT Workshop CincinnatiIoT Workshop Cincinnati
IoT Workshop Cincinnati
 
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling SoftwareJAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
JAZOON'13 - Abdelmonaim Remani - The Economies of Scaling Software
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
Dapper: the microORM that will change your life
Dapper: the microORM that will change your lifeDapper: the microORM that will change your life
Dapper: the microORM that will change your life
 

Plus de Adaryl "Bob" Wakefield, MBA

Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Adaryl "Bob" Wakefield, MBA
 

Plus de Adaryl "Bob" Wakefield, MBA (8)

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
Getting Started With Sparklyr
Getting Started With SparklyrGetting Started With Sparklyr
Getting Started With Sparklyr
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
Hands On: Linux survival skills
Hands On: Linux survival skillsHands On: Linux survival skills
Hands On: Linux survival skills
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 

Dernier

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Dernier (20)

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Perfecting Your Streaming Skills with Spark and Real World IoT Data

  • 1. Perfecting Your Streaming Skills with Spark and Real World IoT Data Bob Wakefield Principal bob@MassStreet.net Twitter: @BobLovesData
  • 2. Bob’s Background • IT professional 16 years • Currently working as a Data Engineer • Education • BS Business Admin (MIS) from KState • MBA (finance concentration) from KU • Coursework in Mathematics at Washburn • Graduate certificate Data Science from Rockhurst •Addicted to everything data
  • 3. Follow Me! •Personal Twitter: @BobLovesData •Company Twitter: @MassStreet •Blog: DataDrivenPerspectives.com •Website: www.MassStreet.net •Facebook: @MassStreetAnalyticsLLC
  • 4. KC Learn Big Data Objectives •Educate people about what you can do with all the new technology surrounding data. •Grow the big data career field. •Teach skills not products
  • 5. ACM Kansas City We’re looking for a speaker willing to talk in deep detail about data engineering challenges their organization is experiencing.
  • 6. This Evening’s Learning Objectives Learn how to practice your IoT skills with real world IoT data.
  • 7. Motivations For This Evenings Discussion • Getting ready for the opportunities that IoT presents. • Tired of working with Sandboxes • Tired of playing with human generated data
  • 8. Disclaimer Tonight’s presentation is based on a personal hack-a-thon.
  • 10. The Original Plan • Azure HDInsight • Push button cluster • Kafka • Distributed pub/sub messaging system • Spark • Stream Processing framework • Druid • Opensource OLAP NoSQL real time database • Grafana • IoT Dashboard
  • 12. All Material Can Be Downloaded from GitHub
  • 13. Azure HDInsight • Button push Hadoop cluster • Cloud version of Hortonworks Data Platform • You can spin up different types of clusters • Plain Hadoop • Spark • Hbase • R Server • Storm • Real Time Hive • Kafka
  • 14. Azure HDInsight • Each cluster type comes with the following • Ambari • Avro • Hive and Hcat • Mahout • MapReduce • Oozie • Phoenix (?) (I don’t normally play with Hbase) • Pig • Sqoop • Tez • Yarn • ZooKeeper
  • 16. Azure HDInsight Make sure you blow away the resource group!
  • 17. The Case for Learning IoT • IoT represents a significant business opportunity • Still on the wrong side of the hype cycle • Not sure why • Might be due to hardware requirements • Just a matter of time • IoT opens up an new world of applications • IoT use cases • Has the potential for ubiquitousness
  • 18. An IoT Case Study Life Alert Fatal flaw (no pun intended) . Patient has to be conscious to activate.
  • 19. An IoT Case Study Software is easy. Challenge: Creating a device with a form factor that provides function but doesn’t get in the way of the patients life.
  • 20. Satori • www.satori.com • Open Streaming Data Platform • Appears to be in tech preview • So far appears to be free on the sub side • Allows people to pub/sub to data streams • A lot of municipal and science streams • Not just IoT data
  • 21. Satori • You need an SDK to build stuff with Satori • You can use your favorite build tool to import the necessary classes
  • 24. import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate() import spark.implicits._ // Create DataFrame representing the stream of input lines from connection to localhost:9999 val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load() // Split the lines into words val words = lines.as[String].flatMap(_.split(" ")) // Generate running word count val wordCounts = words.groupBy("value").count() // Start running the query that prints the running counts to the console val query = wordCounts.writeStream.outputMode("complete").format("console").start() query.awaitTermination()
  • 25. Spark Structured Streaming • Represents a VAST improvement over previous stream processing frameworks • Storm • Flume • Flink • Samza • Spark Streaming • Allows you to create stream processes without having to think much about it.
  • 27. Spark Structured Streaming • If you can make batch jobs with Spark SQL, you can make Structured Streaming Jobs. • So new there is little to no literature on the topic. • Structured Streaming Programming Guide is your best bet. • Exactly once delivery semantics out of the box. • Limited number of built in sources • Socket • Kafka • File
  • 28. Spark Structured Streaming • Fairly unlimited on sinks • JSON • ORC • Parquet • CSV • Database table
  • 29. Spark Structured Streaming • Things you can do with Structured Streaming • SQL Operations • Grouping/aggregations • Filtering • Joins (one DF has to be static) • Things you can’t do make no sense in a streaming context • Limit • First N rows • Operations over sliding windows • Handling late data with watermarking
  • 31. Code Challenge •Convert the socket server into a Kafka Producer •Find a streaming source that’s simpler in structure than the weather data and use it •Attempt to parse the weather data and load it into a case class. •Try and build the rest of the