The Internet of Things (IoT) is all about connected devices that produce and exchange data, and producing insights from these high volumes of data is challenging. In this session, we will start with a quick introduction to the MQTT protocol, and then focus on using AI and machine learning techniques to provide insights from data collected from IoT devices. We will present some common AI concepts and techniques used by the industry to deploy state-of-the-art smart IoT systems. These techniques allow systems to determine patterns from the data, predict and prevent failures, and suggest actions that minimize or avoid IoT device breakdowns in an intelligent way, going beyond rule-based and database-search approaches. We will finish with a demo that puts together all the techniques discussed in an application that uses Apache Spark and Apache Bahir support for MQTT.
Getting insights from IoT data with Apache Spark and Apache Bahir
1. Getting Insights from IoT data
with Apache Spark &
Apache Bahir
Luciano Resende
June 20th, 2018
2. About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
4. IBM’s history of strong AI leadership
1997: Deep Blue
• Deep Blue became the first machine to beat a world chess
champion in tournament play
2011: Jeopardy!
• Watson beat two top
Jeopardy! champions
1968, 2001: A Space Odyssey
• IBM was a technical
advisor
• HAL is “the latest in
machine intelligence”
2018: Open Tech, AI & emerging
standards
• New IBM centers of gravity for AI
• Open source projects increasing exponentially
• Emerging global standards in AI
5. Center for Open Source
Data and AI Technologies
CODAIT
codait.org
codait (French)
= coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
6. Agenda
Internet of Things
IoT Use Cases
IoT Design Patterns
Apache Spark & Apache Bahir
Live Demo – Anomaly Detection
Summary
References
Q&A
8. What is THE INTERNET OF THINGS (IoT) ?
THE TERM IoT WAS COINED BY
KEVIN ASHTON IN 1999. It's more than
just machine to machine communications.
It’s about ecosystems of devices that form
relevant connections to people and other
devices to exchange data. It’s a useful term
to describe the world of connected and
wearable devices that’s emerging.
9. IoT - INTERACTIONS BETWEEN multiple ENTITIES
Diagram: PEOPLE, THINGS, and SOFTWARE form an ecosystem, exchanging interactions such as observe, control, inform, command, and actuate.
10. Some IoT EXAMPLES
- SMART HOMES: from thermostats to smart switches and remote controls to security systems
- TRANSPORT: self-driving cars, drones for delivering goods, etc.
- WEARABLES: smartwatches and other devices enabling control and providing monitoring capabilities
- INDUSTRY: robotics, sensors for predicting quality, failures, etc.
- DISPLAYS: not only VR, but many new displays enabling gesture controls, haptic interfaces, etc.
- HEALTH: connected health, in partnership with wearables for monitoring health metrics, among other examples
11. Some IoT PATTERNS
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
13. The Weather Company
The Weather Company data:
- surface observations
- precipitation
- radar
- satellite
- personal weather stations
- lightning sources
- data collected from planes every day
- etc.
14. Home Automation & Security
- Multiple connected or
standalone devices
- Controlled by Voice
- Amazon Echo (Alexa)
- Google Home
- Apple HomePod (Siri)
15. TESLA connected cars
CONNECTED VEHICLES ARE ONE EXAMPLE OF THE IoT. It's not just about Google Maps in cars. When Tesla finds a software fault with a vehicle, rather than issuing an expensive and damaging recall, they simply update the car's operating system over the air.
[http://www.wired.com/2014/02/teslas-air-fix-best-example-yet-internet-things/]
17. Industrial Internet of Things
- Smart factory
- Predictive and remote
maintenance
- Smart metering and
smart grid
- Industrial security
- Industrial heating,
ventilation and air
conditioning
- Asset tracking and
smart logistics
Reference: https://www.iiconsortium.org/
19. LAMBDA Architecture
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
Images: https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
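The batch/speed/serving split described above can be sketched in a few lines. This is a minimal illustration, not a real deployment: the event names and the Counter-based "views" are made-up stand-ins for a batch store and an incremental stream state.

```python
from collections import Counter

# Master dataset (immutable, append-only) and a stream of new events.
master = ["sensor-a", "sensor-b", "sensor-a"]
stream = ["sensor-b", "sensor-c"]

def batch_view(events):
    """Batch layer: recompute a complete, accurate view from all data."""
    return Counter(events)

def speed_view(events):
    """Speed layer: incrementally count only the events not yet
    absorbed into the batch view."""
    return Counter(events)

def serve(batch, speed):
    """Serving layer: merge both views to answer queries with low latency."""
    return batch + speed

view = serve(batch_view(master), speed_view(stream))
print(view["sensor-a"], view["sensor-b"], view["sensor-c"])  # 2 2 1
```

Periodically the batch layer would re-run over the grown master dataset and the speed layer's state would be reset, which is exactly the latency/accuracy trade the paragraph above describes.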
21. REAL-TIME Data Processing best practice
Pub/Sub Component → Data Processor → Data Storage
One recommendation when processing massive quantities of streaming data is to add a queue component to front-end the processing of the data. This enables a more fault-tolerant solution where, in conjunction with state management, the runtime application can have failures and subsequently continue processing from the same data point.
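A minimal sketch of this resume-from-checkpoint behavior, using an in-memory list as a stand-in for the pub/sub component and a dict as the persisted state (both are illustrative assumptions, not any particular broker's API):

```python
# Stand-in for a retained pub/sub log (e.g. a broker topic): messages are
# addressed by offset, and the processor checkpoints the last offset handled.
queue = [f"reading-{i}" for i in range(6)]
checkpoint = {"offset": 0}   # would be persisted storage in a real system
processed = []

def process(upto):
    """Consume messages starting from the last checkpointed offset."""
    for offset in range(checkpoint["offset"], upto):
        processed.append(queue[offset])
        checkpoint["offset"] = offset + 1  # save state after each message

process(upto=3)           # handle the first three messages...
# ...simulate a crash and restart: the checkpoint says where to resume,
# so nothing is lost and nothing is reprocessed.
process(upto=len(queue))
print(processed)
```

Because the queue retains the data and the processor owns its offset, a failed run simply restarts and continues from the same data point.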
24. MQTT – IoT Connectivity Protocol
Connect + Publish + Subscribe
History:
- ~1990: created at IBM / Eurotech
- 2010: specification published
- 2011: Eclipse M2M / Paho
- 2014: OASIS standard
- May 2018: v5
Open spec, 40+ client implementations
Minimal overhead: 2-4 byte header (publish), 14 bytes (connect)
Tiny clients (Java 170KB)
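The "minimal overhead" claim can be made concrete. Below is a sketch of how an MQTT 3.1.1 PUBLISH packet is encoded at QoS 0: a 1-byte packet type, a variable-length remaining-length field, the topic, and the payload. The topic "t" and payload are made-up example values.

```python
def mqtt_publish_packet(topic: str, payload: bytes) -> bytes:
    """Encode a minimal MQTT 3.1.1 PUBLISH packet (QoS 0, no DUP/RETAIN)."""
    t = topic.encode("utf-8")
    variable = len(t).to_bytes(2, "big") + t  # 2-byte topic length + topic
    remaining = len(variable) + len(payload)
    # Remaining-length field: 7 bits per byte, high bit = continuation.
    rl = b""
    while True:
        byte, remaining = remaining % 128, remaining // 128
        rl += bytes([byte | 0x80]) if remaining else bytes([byte])
        if not remaining:
            break
    return bytes([0x30]) + rl + variable + payload  # 0x30 = PUBLISH, QoS 0

pkt = mqtt_publish_packet("t", b"1")
print(len(pkt))  # fixed header is only 2 bytes here; whole packet is 6
```

For small messages the fixed header really is just 2 bytes, which is why the protocol suits constrained IoT devices.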
25. MQTT – Quality of Service
MQTT Broker quality-of-service levels:
- QoS0, at most once: no connection failover, never duplicates
- QoS1, at least once: has connection failover, can duplicate
- QoS2, exactly once: has connection failover, never duplicates
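What "can duplicate" means in practice: under at-least-once delivery (QoS1), a message may be redelivered when an acknowledgement is lost, and a receiver can filter the repeats by packet id to get effectively exactly-once processing. A minimal sketch with made-up packet ids, not actual broker behavior:

```python
# Simulated QoS1 delivery: packet id 2 arrives twice (its ack was "lost").
delivered = [(1, "temp=20"), (2, "temp=21"), (2, "temp=21"), (3, "temp=19")]

seen = set()
applied = []
for packet_id, message in delivered:
    if packet_id in seen:
        continue  # drop the redelivered duplicate
    seen.add(packet_id)
    applied.append(message)

print(applied)  # duplicates removed: ['temp=20', 'temp=21', 'temp=19']
```

QoS2 performs a similar deduplication handshake inside the protocol itself, at the cost of extra round trips.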
26. MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
28. Apache Spark Introduction
- Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
- Spark SQL: executes SQL statements
- Spark Streaming: performs streaming analytics using micro-batches
- Spark ML: common machine learning and statistical algorithms
- Spark GraphX: distributed graph processing framework
A large variety of data sources and formats can be supported, both on-premise or cloud, e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB.
30. Apache Spark – Spark SQL
Spark SQL
- Unified data access APIs: query structured data sets with SQL or Dataset/DataFrame APIs
- Fast, familiar query language across all of your enterprise data
- Sources include RDBMS data sources and structured streaming data sources
31. Apache Spark – Spark SQL
You can run SQL statements with the SparkSession.sql(…) interface:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")

You can further transform the resultant dataset:

val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")

The result is a DataFrame / Dataset[Row]; ds.show() displays the rows.
32. Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…):

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]

// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]

// select a column value from the Dataset
bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
33. Apache Spark – Spark SQL
You can also configure a specific data source with specific options:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true")      // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]

bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
34. Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
Considers the data stream as an unbounded table.
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
35. Apache Spark – Spark SQL Structured Streaming
SQL regular APIs:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")

val result = input
  .select("age")
  .where("age > 18")

result.write
  .format("json")
  .save("dest-path")
Structured Streaming APIs:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")

val result = input
  .select("age")
  .where("age > 18")

result.writeStream
  .format("json")
  .option("path", "dest-path")
  .start()
36. Apache Spark – Spark Streaming
Spark
Streaming
Micro-batch event processing for
near-real time analytics
e.g. Internet of Things (IoT) devices,
Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel-processing code required
37. Apache Spark – Spark Streaming
Also known as discretized stream or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
38. Apache Spark – Spark Streaming
// Spark Streaming word count over an MQTT topic (Apache Bahir connector)
val sparkConf = new SparkConf().setAppName("MQTTWordCount")
// micro-batches every 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
// DStream of messages received from the MQTT broker topic
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
40. Apache Bahir Project
Apache Bahir provides
extensions to multiple
distributed analytic platforms,
extending their reach with a
diversity of streaming
connectors and SQL data
sources.
Currently, Bahir provides
extensions for Apache
Spark and Apache Flink.
41. Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT Servers using Spark Streaming or Structured streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
42. Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub.
43. Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-mqtt_2.11</artifactId>
<version>2.2.0</version>
</dependency>
44. Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- Spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
- Spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
46. IoT Analytics – Anomaly Detection
The demo environment
https://github.com/lresende/bahir-iot-demo
Demo architecture:
- A Node.js web app simulates elevator IoT devices reporting the metrics weight, speed, power, temperature, and system
- The simulator publishes the metrics via an MQTT JavaScript client to a Mosquitto MQTT broker
- Spark consumes the data through the Bahir MQTT streaming connector
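As a rough illustration of what such a simulator might publish per reading, here is a hypothetical sketch; the field names, value ranges, and JSON shape are assumptions for illustration only (see the repository above for the demo's real payload format):

```python
import json
import random

def elevator_reading(device_id: int) -> str:
    """Hypothetical elevator telemetry message, one JSON document per reading."""
    random.seed(device_id)  # deterministic values for this example
    return json.dumps({
        "device": device_id,
        "weight": round(random.uniform(0, 1200), 1),       # kg, assumed range
        "speed": round(random.uniform(0, 8), 1),           # m/s, assumed range
        "power": round(random.uniform(0, 50), 1),          # kW, assumed range
        "temperature": round(random.uniform(15, 90), 1),   # °C, assumed range
        "system": "ok",                                    # assumed status field
    })

msg = elevator_reading(1)
print(sorted(json.loads(msg).keys()))
```

Each such message would be published to an MQTT topic on the Mosquitto broker, where the Spark application picks it up through the Bahir connector.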
47. IoT Analytics – Anomaly Detection
The Moving Z-Score model scores
anomalies in a univariate sequential dataset,
often a time series.
The moving Z-score is a very simple model
for measuring the anomalousness of each
point in a sequential dataset like a time
series. Given a window size w, the moving
Z-score is the number of standard
deviations each observation is away from
the mean, where the mean and standard
deviation are computed only over the
previous w observations.
Reference: https://turi.com/learn/userguide/anomaly_detection/moving_zscore.html
Reference: https://www.isixsigma.com/tools-templates/statistical-analysis/improved-forecasting-moving-averages-and-z-scores/
Z-score: z_i = (x_i − μ) / σ
Moving Z-score: z_i = (x_i − μ_i) / σ_i, where the moving mean μ_i and moving standard deviation σ_i are computed over the previous w observations.
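The model described above fits in a few lines. This sketch uses Python's statistics module; the window size and the readings (steady values followed by a spike) are made-up example data:

```python
import statistics

def moving_zscores(xs, w):
    """Moving Z-score: for each point after the first w, measure how many
    standard deviations it lies from the mean of the previous w observations."""
    scores = []
    for i in range(w, len(xs)):
        window = xs[i - w:i]                  # only the previous w points
        mean = statistics.mean(window)
        stdev = statistics.pstdev(window)     # population std over the window
        scores.append((xs[i] - mean) / stdev if stdev else 0.0)
    return scores

# Steady readings, then a spike: the spike stands out with a huge score.
readings = [10, 10, 11, 10, 11, 10, 30]
print([round(z, 1) for z in moving_zscores(readings, w=4)])  # [1.7, -1.0, 39.0]
```

Flagging points whose absolute score exceeds a threshold (say 3) is then a simple, model-free anomaly detector for streaming sensor data.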
49. Summary – Take away points
IoT Design Patterns
- Lambda and Kappa Architecture
- Real Time data processing best practice
Apache Spark
- IoT Analytics Runtime with support for "Continuous Applications"
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Using Spark and Bahir to start processing IoT data in near real time using Spark Streaming
- Anomaly detection using the Moving Z-score model