1. © 2015 IBM Corporation
How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases
Paula Ta-Shma
IBM Research
paula@il.ibm.com
Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS
2.
The Evolution of Data Collection
Internet of Things
3.
The IoT market will grow to $1.7 trillion in 2020 (IDC)
By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population
IoT: The Biggest Big Data
[Chart: global data volume in exabytes, 2005, 2012, 2017]
4.
EMT Madrid Bus Company Needs to Make Decisions
According to Current and Predicted Future Traffic State
The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.
Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-time traffic problems
Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data
[Figure: Today vs. Tomorrow]
5.
Generic IoT Architecture – Data Flow
1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition
[Diagram components: IoT, Secor, Swift]
6.
Generic IoT Architecture – Data Flow
2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers, etc.
[Diagram components: Secor, Swift]
7.
Generic IoT Architecture – Data Flow
3. Apply what was learned to the real-time data stream
– Take action
[Diagram components: IoT, Secor, CEP, Swift]
8.
Generic IoT Architecture – Data Flow
[Diagram components: IoT, CEP, Secor, Swift]
Green Flows: Real time
Purple Flows: Batch
9.
IoT Architecture – Madrid Traffic – Ingestion Flow
Aim: Collect historical time series data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
- Data includes traffic speeds and intensities, updated every 5 minutes
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
- According to policy, e.g., every 60 minutes
- Possibly partition the data, e.g., according to date
- Convert to Parquet format
- Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using Elasticsearch
[Diagram components: IoT, Secor, Swift]
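The aggregation and annotation steps of the ingestion flow can be sketched in a few lines of plain Python. This is a simplified in-memory illustration, not Secor's implementation: the `Reading` type, the `aggregate` function, and the metadata key names are hypothetical, and the Kafka, Parquet, and Swift steps are left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Reading:
    """One sensor message (hypothetical field names)."""
    sensor_id: str
    ts: int          # epoch seconds
    speed: float
    intensity: int


def aggregate(readings: List[Reading], window_secs: int = 3600) -> List[dict]:
    """Group readings into fixed time windows and annotate each group with metadata."""
    buckets: Dict[int, List[Reading]] = {}
    for r in readings:
        window_start = r.ts - r.ts % window_secs
        buckets.setdefault(window_start, []).append(r)

    objects = []
    for window_start in sorted(buckets):
        batch = buckets[window_start]
        objects.append({
            # Metadata that an index could later search on.
            "metadata": {
                "start_ts": min(r.ts for r in batch),
                "end_ts": max(r.ts for r in batch),
                "min_speed": min(r.speed for r in batch),
                "max_speed": max(r.speed for r in batch),
            },
            "records": batch,
        })
    return objects
```

With the default 60 minute window, readings at 0 s and 300 s land in one object and a reading at 3600 s starts the next one.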
10.
IoT Architecture – Madrid Traffic – Data Access
Aim: Access data efficiently and cost effectively
– Store IoT data in OpenStack Swift object storage
- Open source, low cost deployment, and highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
- Custom Spark SQL external data source driver
- Uses object metadata indexes
- Searches for Swift objects whose min/max values overlap requested ranges
Get all data for morning traffic:
SELECT codigo, intensidad, velocidad FROM madridtraffic
WHERE tf >= '08:00:00' AND tf <= '12:00:00'
Brute force method: 13245 Swift requests
Optimized predicate pushdown: 616 Swift requests
21.5 times improvement
[Diagram component: Swift]
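The min/max overlap test behind this kind of pushdown can be illustrated as follows. This is a minimal sketch in plain Python, not the custom Spark SQL driver itself: the in-memory `index` dictionary stands in for the metadata index, and the object names and the `min_tf`/`max_tf` keys are hypothetical. Zero-padded `HH:MM:SS` strings are compared lexicographically, which matches time order.

```python
from typing import Dict, List


def overlaps(meta: Dict[str, str], lo: str, hi: str) -> bool:
    """True if the object's [min_tf, max_tf] range intersects the requested [lo, hi]."""
    return meta["min_tf"] <= hi and meta["max_tf"] >= lo


def select_objects(index: Dict[str, Dict[str, str]], lo: str, hi: str) -> List[str]:
    """Return only the object names that may contain rows matching the predicate."""
    return sorted(name for name, meta in index.items() if overlaps(meta, lo, hi))


# Hypothetical metadata index: one entry per stored object.
index = {
    "obj-a": {"min_tf": "06:00:00", "max_tf": "06:55:00"},
    "obj-b": {"min_tf": "07:30:00", "max_tf": "08:25:00"},
    "obj-c": {"min_tf": "09:00:00", "max_tf": "09:55:00"},
    "obj-d": {"min_tf": "12:00:00", "max_tf": "12:55:00"},
    "obj-e": {"min_tf": "13:00:00", "max_tf": "13:55:00"},
}

morning = select_objects(index, "08:00:00", "12:00:00")
```

Only the selected objects need to be fetched and scanned; the others cannot contain matching rows. The request counts quoted on the slide correspond to 13245 / 616 ≈ 21.5.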
11.
IoT Architecture – Madrid Traffic – Machine Learning
Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic
– Depends on context
- Time (morning/evening), Day (weekday/weekend)
- Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
- Can use silhouette index to measure quality
[Diagram component: Swift]
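The silhouette index mentioned above can be computed as in the sketch below. This is plain Python for illustration rather than a Spark MLlib call; the `needs_retraining` helper and its `floor` value of 0.5 are assumptions, not values from the talk.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # e.g. (speed, intensity)


def _mean_dist(p: Point, others: Sequence[Point]) -> float:
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette(clusters: List[List[Point]]) -> float:
    """Mean silhouette coefficient over all points (expects >= 2 non-empty clusters)."""
    scores = []
    for i, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            rest = cluster[:j] + cluster[j + 1:]
            a = _mean_dist(p, rest)  # cohesion: mean distance within own cluster
            b = min(_mean_dist(p, other)  # separation: nearest other cluster
                    for k, other in enumerate(clusters) if k != i and other)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def needs_retraining(score: float, floor: float = 0.5) -> bool:
    """Hypothetical policy: re-run clustering when quality drops below `floor`."""
    return score < floor
```

Scores close to 1 indicate compact, well separated clusters; scores near or below 0 suggest the clusters no longer fit the data.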
12.
IoT Architecture – Madrid Traffic – Machine Learning
Event Detection:
• Use Spark MLlib k-means clustering to separate data into 2 clusters
• Find the midpoint between the 2 cluster centres
• Use this midpoint to generate the thresholds
• Repeat for each context, e.g. time period (morning, afternoon, evening, night)
Anomaly Detection:
• Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre
[Figure: Morning Traffic on Weekdays]
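The threshold and anomaly logic described above can be sketched on a single feature. The code below is a simplified stand-in, assuming one-dimensional data (for example speed only) and a hand-written two-cluster k-means instead of Spark MLlib; all function names and the sample values in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def two_means_1d(values: List[float], max_iter: int = 50) -> Tuple[float, float]:
    """Lloyd's algorithm with k=2 on scalars, seeded with the min and max values."""
    lo, hi = min(values), max(values)
    for _ in range(max_iter):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi


def threshold(values: List[float]) -> float:
    """Midpoint between the two cluster centres."""
    lo, hi = two_means_1d(values)
    return (lo + hi) / 2


def thresholds_by_context(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """One threshold per context, e.g. per time period."""
    return {context: threshold(values) for context, values in samples.items()}


def is_anomaly(value: float, values: List[float], max_distance: float) -> bool:
    """Single-cluster check: is `value` further than `max_distance` from the centre?"""
    centre = sum(values) / len(values)
    return abs(value - centre) > max_distance
```

For instance, `threshold([10, 12, 11, 50, 52, 51])` finds centres at 11 and 51 and returns their midpoint, 31.0.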
13.
IoT Architecture – Madrid Traffic – Real Time Decision Making
Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
- Rule based
- Process events record by record
- CEP rules are typically defined manually, but in many cases it is difficult to get them right
- We automate this process and make it smart
- uCEP has a small footprint and can be run at the edge
[Diagram components: CEP, IoT]
Work in Progress
Proactive approach:
• Use Spark Streaming linear regression to predict traffic behavior (e.g. speed, intensity) for the near future
• Apply CEP on predicted data
• Respond proactively to predicted events such as traffic congestion
– e.g. EMT can proactively reroute buses
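The proactive approach can be illustrated with a small online regressor followed by a rule applied to its predictions. This is a plain Python sketch under stated assumptions, not the Spark Streaming API: the `OnlineLinearRegression` class, the `congestion_alerts` rule, the single input feature (assumed scaled to roughly [0, 1]), and the learning rate are all hypothetical.

```python
from typing import Iterable, List


class OnlineLinearRegression:
    """Fits y ≈ w * x + b, updated one observation at a time with SGD."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def update(self, x: float, y: float) -> None:
        # Gradient step on the squared error of this single observation.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def congestion_alerts(model: OnlineLinearRegression,
                      inputs: Iterable[float],
                      speed_threshold: float) -> List[float]:
    """Rule applied to predictions: flag inputs whose predicted speed is below the threshold."""
    return [x for x in inputs if model.predict(x) < speed_threshold]
```

In use, `update` would be called as each new observation arrives, and the rule would be evaluated on predictions for upcoming time steps so that action can be taken before the event occurs.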
15.
Our Architecture Applies to Many IoT Use Cases
Energy/utilities
– Anomaly detection
- Pipe leakage
- Appliance malfunction
– Occupancy detection
Healthcare
– Patient monitoring/alert/response
Insurance
– Driver behavior and location monitoring
Transportation
– Connected vehicles, engine diagnostics, automated service scheduling
Logistics
– Goods tracking, sensitive goods management
16.
The Madrid Traffic Use Case on IBM Bluemix
[Architecture diagram. Stages: Data Sources, Message Bus, Data Storage, Data Analytics, Data Visualization. Components: Madrid Traffic Sensors, MQTT, Node-RED, Secor, Object Storage, Apache Spark, Freeboard Dashboard]
Joint work with Naeem Altaf and team
19.
COSMOS
Funding: EU FP7 at level of 2PY x 3 years
Started: Sept 2013
Coordinator: ATOS
Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases
Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation, etc.
20.
IBM Bluemix Data Analytics for IoT Architecture
21.
Kafka + Secor
What is it?
– Apache Kafka is a high-throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves them as S3 objects.
What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet format and for annotating objects with metadata to support indexing and search.
What is the value of integration with Swift?
– Enables bringing new data and applications to Swift, which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.
Status
– We contributed OpenStack Swift support to the Secor community and it is now part of Secor.
22.
Parquet
What is it?
– A column-based, semi-structured, schema-based storage format supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.
What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.
Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage systems such as Swift.
23.
Elasticsearch
What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.
What integration is needed?
– Index object metadata, allowing search for objects by attributes.
What is the value of integration with Swift?
– Use search to select objects for further processing, e.g., relevant objects for analytics.
- Note that S3 does not yet have native search according to metadata.
Status
– The IBM SoftLayer object service includes a basic implementation of metadata search; at IBM Research, we added extensions such as data type support and range searches.