SlideShare une entreprise Scribd logo
1  sur  24
© 2015 IBM Corporation
How Spark Enables the Internet of Things:
Efficient Integration of Multiple Spark
Components for Smart City Use Cases
Paula Ta-Shma
IBM Research
paula@il.ibm.com
Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS
© 2015 IBM Corporation2
The Evolution of Data Collection
Internet of
Things
© 2015 IBM Corporation3
2005 2012 2017
The IoT market will grow to
$1.7 trillion in 2020 (IDC)
By 2020 the number of networked devices
will be 30 billion (IDC), more than 4 times
the entire global population
IoT : The Biggest Big Data
GlobalDataVolumeinExabytes
2005 2012 2017
© 2015 IBM Corporation4
EMT Madrid Bus Company Needs to Make Decisions
According to Current and Predicted Future Traffic State
 The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output.
This can be slow and costly.
 Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-
time traffic problems
 Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based
upon knowledge derived from historical data
Today Tomorrow
© 2015 IBM Corporation5
1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition
Generic IoT Architecture – Data Flow
Secor
IoT
Swift
© 2015 IBM Corporation6
2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers etc.
Generic IoT Architecture – Data Flow
Secor
Swift
© 2015 IBM Corporation7
IoT
3. Apply what was learned on
real time data stream
– Take action
Generic IoT Architecture – Data Flow
Secor
CEP
Swift
© 2015 IBM Corporation8
How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark
Components for Smart City Use Cases
IoT
Generic IoT Architecture – Data Flow
CEP
Secor
Swift
Green Flows: Real time
Purple Flows: Batch
© 2015 IBM Corporation9
Aim: Collect historical timeseries data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
- Data includes traffic speeds and intensities, updated every 5 mins
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
- According to policy, e.g., every 60 mins
- Possibly partition the data, e.g. according to date
- Convert to Parquet format
- Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using ElasticSearch
Secor
Swift
IoT Architecture – Madrid Traffic – Ingestion Flow
IoT
© 2015 IBM Corporation10
IoT Architecture – Madrid Traffic – Data Access
Aim: Access data efficiently and cost
effectively
– Store IoT data in OpenStack Swift object
storage
- Open source, low cost deployment, and
highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
- Custom Spark SQL external data source
driver
- Uses object metadata indexes
- Searches for Swift objects whose min/max
values overlap requested ranges
Get all data for morning traffic:
SELECT codigo, intensidad, velocidad FROM
madridtraffic
WHERE tf >= '08:00:00' AND tf <= '12:00:00'
Brute force method
13245 Swift requests
Optimized predicate pushdown
616 Swift requests
21.5 times improvement
Swift
© 2015 IBM Corporation11
IoT Architecture – Madrid Traffic – Machine Learning
Aim: Learn to differentiate between ‘good’ and
‘bad’ traffic
– Depends on context
- Time (morning/evening), Day (weekday/weekend)
- Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
- Can use silhouette index to measure quality
Swift
© 2015 IBM Corporation12
IoT Architecture – Madrid Traffic – Machine Learning
Event Detection:
• Use Spark MLlib k-means
clustering to separate
data into 2 clusters
• Find the midpoint between
the 2 cluster centres
• Use this midpoint to
generate the thresholds
• Repeat for each context
e.g. time period (morning,
afternoon, evening, night)
Anomaly Detection:
• Use a single cluster and
define an anomaly to be
further than a certain
distance from the cluster
centre
Morning Traffic on Weekdays
© 2015 IBM Corporation13
IoT Architecture – Madrid Traffic –
Real Time Decision Making
Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
- Rule based
- Process events record by record
- CEP rules are typically defined manually but in many
cases it is difficult to get them right
- We automate this process and make it smart
- uCEP has a small footprint, can be run at the edge
CEP
IoT
Work in Progress
Proactive approach:
• Use Spark streaming
linear regression to
predict traffic behavior
(e.g. speed, intensity)
for near future
• Apply CEP on
predicted data
• Respond pro-actively
to predicted events
such as traffic
congestion
– e.g. EMT can
proactively re-
route buses
© 2015 IBM Corporation14
Demo
© 2015 IBM Corporation15
Our Architecture Applies to Many IoT Use Cases
 Energy/utilities
– Anomaly detection
- Pipe leakage
- Appliance malfunction
– Occupancy detection
 Healthcare
– Healthcare patient
monitoring/alert/response
 Insurance
– Driver behavior and location
monitoring
 Transportation
– Connected vehicles, engine
diagnostics, automated service
scheduling
 Logistics
– Goods tracking, sensitive
goods management
© 2015 IBM Corporation
Data
Sources
Apache
Spark
Node-RED
Secor
Message
Bus
Data
Storage
Data
Analytics
Data
Visualization
Freeboard Dashboard
Object
Storage
16
MQTT
The Madrid Traffic Use Case on IBM Bluemix
Madrid Traffic Sensors
Joint work with Naeem Altaf and team
© 2015 IBM Corporation17
Thank You !
© 2015 IBM Corporation18
Backup
© 2015 IBM Corporation19
COSMOS
 Funding: EU FP7 at level of 2PY x 3 years
 Started: Sept 2013
 Coordinator: ATOS
 Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
 Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid
Council, III Taiwan – Smart Cities use cases
 Project Vision: Enable ‘things’ to interact with each other based on shared
experience, trust, reputation etc.
© 2015 IBM Corporation20
IBM Bluemix Data Analytics for IoT Architecture
© 2015 IBM Corporation21
 What is it?
– Apache Kafka is a high throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages
and saves as an S3 object.
 What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet
format and annotating objects with metadata search to support indexing.
 What is the value of integration with Swift?
– Enables bringing new data and applications to Swift which is an open source solution.
Parquet and metadata search enable improved performance for batch analytics.
 Status
– We contributed OpenStack Swift support to the Secor community and it is now part of
Secor.
Secor
Kafka + Secor
© 2015 IBM Corporation22
Parquet
 What is it?
– A column based semi-structured, schema-based storage format supported by Hadoop
and Spark. Enables column-wise compression and projection pushdown.
 What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed.
Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.
 Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage
systems such as Swift.
© 2015 IBM Corporation23
elasticsearch
 What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.
 What integration is needed?
– Index object metadata allowing search for objects by attributes.
 What is the value of integration with Swift
– Use search to select objects for further processing, e.g., relevant objects for analytics.
- Note that S3 does not yet have native search according to metadata.
 Status
– The IBM SoftLayer object service includes a basic implementation of metadata search;
At IBM Research, we added extensions such as data type support and range searches.
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
For up-to-date information and news
about the Spark and the Spark Technology Center,
Sign up for our newsletter
at www.spark.tc

Contenu connexe

Tendances

Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Databricks
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 

Tendances (20)

Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
Power Your Delta Lake with Streaming Transactional Changes
 Power Your Delta Lake with Streaming Transactional Changes Power Your Delta Lake with Streaming Transactional Changes
Power Your Delta Lake with Streaming Transactional Changes
 
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Zero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using HadoopZero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using Hadoop
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
Event Driven Architecture: Mistakes, I've made a few...
Event Driven Architecture: Mistakes, I've made a few...Event Driven Architecture: Mistakes, I've made a few...
Event Driven Architecture: Mistakes, I've made a few...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture
 

En vedette

SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStreamSSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
Jean-Paul Calbimonte
 
iris magazine
iris magazineiris magazine
iris magazine
Monisha
 

En vedette (20)

SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStreamSSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
 
Atos ecarga brochure EN
Atos ecarga brochure ENAtos ecarga brochure EN
Atos ecarga brochure EN
 
CV - P.A. Shenton
CV - P.A.  ShentonCV - P.A.  Shenton
CV - P.A. Shenton
 
Implementing a Smart City through a stepwise approach
Implementing a Smart City through a stepwise approachImplementing a Smart City through a stepwise approach
Implementing a Smart City through a stepwise approach
 
iris magazine
iris magazineiris magazine
iris magazine
 
OpenStack Summit Austin 2016 v1.3
OpenStack Summit Austin 2016 v1.3 OpenStack Summit Austin 2016 v1.3
OpenStack Summit Austin 2016 v1.3
 
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
 
Ubiwhere's Annual Report 2016 - Volume 1
Ubiwhere's Annual Report 2016 - Volume 1Ubiwhere's Annual Report 2016 - Volume 1
Ubiwhere's Annual Report 2016 - Volume 1
 
FIWARE for Smart Industry
FIWARE for Smart IndustryFIWARE for Smart Industry
FIWARE for Smart Industry
 
Spark Streaming and Expert Systems
Spark Streaming and Expert SystemsSpark Streaming and Expert Systems
Spark Streaming and Expert Systems
 
Data Pioneers - Roland Haeve (Atos Nederland) - Big data in organisaties
Data Pioneers - Roland Haeve (Atos Nederland) - Big data in organisatiesData Pioneers - Roland Haeve (Atos Nederland) - Big data in organisaties
Data Pioneers - Roland Haeve (Atos Nederland) - Big data in organisaties
 
Effective IoT System on Openstack
Effective IoT System on OpenstackEffective IoT System on Openstack
Effective IoT System on Openstack
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conferenceLuciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conference
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
 
Smart City Technologies in Beijing
Smart City Technologies in BeijingSmart City Technologies in Beijing
Smart City Technologies in Beijing
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningSystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
 
IoT in agri-food
IoT in agri-foodIoT in agri-food
IoT in agri-food
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 

Similaire à How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases

IBM CDS Overview
IBM CDS OverviewIBM CDS Overview
IBM CDS Overview
Jean Tan
 
2019 punter data voor slimme systemen dvc 17okt-pdf
2019 punter data voor slimme systemen dvc 17okt-pdf2019 punter data voor slimme systemen dvc 17okt-pdf
2019 punter data voor slimme systemen dvc 17okt-pdf
DVCSI
 
Confluent Cloud inside the Digital Transformation of Autostrade per l’Italia
Confluent Cloud inside the Digital Transformation of Autostrade per l’ItaliaConfluent Cloud inside the Digital Transformation of Autostrade per l’Italia
Confluent Cloud inside the Digital Transformation of Autostrade per l’Italia
confluent
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
vithakur
 

Similaire à How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases (20)

Fin fest 2014 - Internet of Things and APIs
Fin fest 2014 - Internet of Things and APIsFin fest 2014 - Internet of Things and APIs
Fin fest 2014 - Internet of Things and APIs
 
[IoT Tech Expo] Smart Cities – Leveraging Messaging from Project to City to ...
[IoT Tech Expo] Smart Cities – Leveraging Messaging from Project to City to ...[IoT Tech Expo] Smart Cities – Leveraging Messaging from Project to City to ...
[IoT Tech Expo] Smart Cities – Leveraging Messaging from Project to City to ...
 
Ibm iot overview
Ibm   iot overviewIbm   iot overview
Ibm iot overview
 
Session 1908 connecting devices to the IBM IoT Cloud
Session 1908   connecting devices to the  IBM IoT CloudSession 1908   connecting devices to the  IBM IoT Cloud
Session 1908 connecting devices to the IBM IoT Cloud
 
IBM CDS Overview
IBM CDS OverviewIBM CDS Overview
IBM CDS Overview
 
Informix MQTT Streaming
Informix MQTT StreamingInformix MQTT Streaming
Informix MQTT Streaming
 
2019 punter data voor slimme systemen dvc 17okt-pdf
2019 punter data voor slimme systemen dvc 17okt-pdf2019 punter data voor slimme systemen dvc 17okt-pdf
2019 punter data voor slimme systemen dvc 17okt-pdf
 
Smart Cities, IoT, SDN, 5G Networks, Cloud Computing… Managing Complexity wit...
Smart Cities, IoT, SDN, 5G Networks, Cloud Computing… Managing Complexity wit...Smart Cities, IoT, SDN, 5G Networks, Cloud Computing… Managing Complexity wit...
Smart Cities, IoT, SDN, 5G Networks, Cloud Computing… Managing Complexity wit...
 
Confluent Cloud inside the Digital Transformation of Autostrade per l’Italia
Confluent Cloud inside the Digital Transformation of Autostrade per l’ItaliaConfluent Cloud inside the Digital Transformation of Autostrade per l’Italia
Confluent Cloud inside the Digital Transformation of Autostrade per l’Italia
 
IRJET- Smart Parking System in Multi-Storey Buildings
IRJET- Smart Parking System in Multi-Storey BuildingsIRJET- Smart Parking System in Multi-Storey Buildings
IRJET- Smart Parking System in Multi-Storey Buildings
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
 
Connectivité temps réel et bi-directionnelle ​ pour solutions IOT
Connectivité temps réel et bi-directionnelle ​ pour solutions IOTConnectivité temps réel et bi-directionnelle ​ pour solutions IOT
Connectivité temps réel et bi-directionnelle ​ pour solutions IOT
 
IoT and the Oil & Gas industry at M2M Oil & Gas 2014 in London
IoT and the Oil & Gas industry at M2M Oil & Gas 2014 in LondonIoT and the Oil & Gas industry at M2M Oil & Gas 2014 in London
IoT and the Oil & Gas industry at M2M Oil & Gas 2014 in London
 
Digital Reinvention by NRB
Digital Reinvention by NRBDigital Reinvention by NRB
Digital Reinvention by NRB
 
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
 
Transport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital TwinTransport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital Twin
 
Get Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog ComputingGet Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog Computing
 
10 years of microservices at finn.no - why is that dragon still here (ndc o...
10 years of microservices at finn.no  - why is that dragon still here  (ndc o...10 years of microservices at finn.no  - why is that dragon still here  (ndc o...
10 years of microservices at finn.no - why is that dragon still here (ndc o...
 
Creating Microservices Application with IBM Cloud Private (ICP) - introductio...
Creating Microservices Application with IBM Cloud Private (ICP) - introductio...Creating Microservices Application with IBM Cloud Private (ICP) - introductio...
Creating Microservices Application with IBM Cloud Private (ICP) - introductio...
 
Transform Your Enterprise Faster with Seamless Hybrid Cloud from Netapp
Transform Your Enterprise Faster with Seamless Hybrid Cloud from NetappTransform Your Enterprise Faster with Seamless Hybrid Cloud from Netapp
Transform Your Enterprise Faster with Seamless Hybrid Cloud from Netapp
 

Plus de sparktc

Plus de sparktc (13)

Apache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre BorckmansApache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre Borckmans
 
Hyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven HafenegerHyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven Hafeneger
 
Data Science Hub & the Data Science Community - Philippe Van Impe
Data Science Hub & the Data Science Community - Philippe Van ImpeData Science Hub & the Data Science Community - Philippe Van Impe
Data Science Hub & the Data Science Community - Philippe Van Impe
 
Data Science and Beer - Kris peeters
Data Science and Beer - Kris peetersData Science and Beer - Kris peeters
Data Science and Beer - Kris peeters
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
Building Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemMLBuilding Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemML
 
The Internet of Everywhere — How The Weather Company Scales
The Internet of Everywhere — How The Weather Company ScalesThe Internet of Everywhere — How The Weather Company Scales
The Internet of Everywhere — How The Weather Company Scales
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 
STC Design - Engage
STC Design - EngageSTC Design - Engage
STC Design - Engage
 
Spark Summit EU: IBM Keynote
Spark Summit EU: IBM KeynoteSpark Summit EU: IBM Keynote
Spark Summit EU: IBM Keynote
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases

  • 1. © 2015 IBM Corporation How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases Paula Ta-Shma IBM Research paula@il.ibm.com Joint work with: Adnan Akbar, University of Surrey Michael Factor, IBM Research Guy Hadash, IBM Research Juan Sancho, ATOS
  • 2. © 2015 IBM Corporation2 The Evolution of Data Collection Internet of Things
  • 3. © 2015 IBM Corporation3 2005 2012 2017 The IoT market will grow to $1.7 trillion in 2020 (IDC) By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population IoT : The Biggest Big Data GlobalDataVolumeinExabytes 2005 2012 2017
  • 4. © 2015 IBM Corporation4 EMT Madrid Bus Company Needs to Make Decisions According to Current and Predicted Future Traffic State  The Problem – EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.  Objective – Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real- time traffic problems  Approach – Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data Today Tomorrow
  • 5. © 2015 IBM Corporation5 1. Collect historical time series data – Collect data from devices – Aggregate into objects – Index and/or partition Generic IoT Architecture – Data Flow Secor IoT Swift
  • 6. © 2015 IBM Corporation6 2. Learn patterns in data – May be time/location dependent – Generate thresholds, classifiers etc. Generic IoT Architecture – Data Flow Secor Swift
  • 7. © 2015 IBM Corporation7 IoT 3. Apply what was learned on real time data stream – Take action Generic IoT Architecture – Data Flow Secor CEP Swift
  • 8. © 2015 IBM Corporation8 How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases IoT Generic IoT Architecture – Data Flow CEP Secor Swift Green Flows: Real time Purple Flows: Batch
  • 9. © 2015 IBM Corporation9 Aim: Collect historical timeseries data for analysis – Continuously collect data from up to 3000 Madrid council traffic sensors via web service - Data includes traffic speeds and intensities, updated every 5 mins – Push the messages to Kafka – Use Secor to aggregate multiple messages into a single Swift object - According to policy, e.g., every 60 mins - Possibly partition the data, e.g. according to date - Convert to Parquet format - Annotate with metadata, e.g., min/max speed, start/end time – Index Swift objects according to their metadata using ElasticSearch Secor Swift IoT Architecture – Madrid Traffic – Ingestion Flow IoT
  • 10. © 2015 IBM Corporation10 IoT Architecture – Madrid Traffic – Data Access Aim: Access data efficiently and cost effectively – Store IoT data in OpenStack Swift object storage - Open source, low cost deployment, and highly scalable – Parquet data is accessible via Spark SQL – Optimized predicate pushdown - Custom Spark SQL external data source driver - Uses object metadata indexes - Searches for Swift objects whose min/max values overlap requested ranges Get all data for morning traffic: SELECT codigo, intensidad, velocidad FROM madridtraffic WHERE tf >= '08:00:00' AND tf <= '12:00:00' Brute force method 13245 Swift requests Optimized predicate pushdown 616 Swift requests 21.5 times improvement Swift
  • 11. © 2015 IBM Corporation11 IoT Architecture – Madrid Traffic – Machine Learning Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic – Depends on context - Time (morning/evening), Day (weekday/weekend) - Location – Use Spark MLlib k-means clustering – Produce threshold values for real-time decision making – Re-run algorithm when quality of clusters decreases - Can use silhouette index to measure quality Swift
  • 12. © 2015 IBM Corporation12 IoT Architecture – Madrid Traffic – Machine Learning Event Detection: • Use Spark MLlib k-means clustering to separate data into 2 clusters • Find the midpoint between the 2 cluster centres • Use this midpoint to generate the thresholds • Repeat for each context e.g. time period (morning, afternoon, evening, night) Anomaly Detection: • Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre Morning Traffic on Weekdays
  • 13. © 2015 IBM Corporation13 IoT Architecture – Madrid Traffic – Real Time Decision Making Aim: Respond in real time to traffic conditions – Use Complex Event Processing (CEP) approach - Rule based - Process events record by record - CEP rules are typically defined manually but in many cases it is difficult to get them right - We automate this process and make it smart - uCEP has a small footprint, can be run at the edge CEP IoT Work in Progress Proactive approach: • Use Spark streaming linear regression to predict traffic behavior (e.g. speed, intensity) for near future • Apply CEP on predicted data • Respond pro-actively to predicted events such as traffic congestion – e.g. EMT can proactively re- route buses
  • 14. © 2015 IBM Corporation14 Demo
  • 15. © 2015 IBM Corporation15 Our Architecture Applies to Many IoT Use Cases  Energy/utilities – Anomaly detection - Pipe leakage - Appliance malfunction – Occupancy detection  Healthcare – Healthcare patient monitoring/alert/response  Insurance – Driver behavior and location monitoring  Transportation – Connected vehicles, engine diagnostics, automated service scheduling  Logistics – Goods tracking, sensitive goods management
  • 16. © 2015 IBM Corporation Data Sources Apache Spark Node-RED Secor Message Bus Data Storage Data Analytics Data Visualization Freeboard Dashboard Object Storage 16 MQTT The Madrid Traffic Use Case on IBM Bluemix Madrid Traffic Sensors Joint work with Naeem Altaf and team
  • 17. © 2015 IBM Corporation17 Thank You !
  • 18. © 2015 IBM Corporation18 Backup
  • 19. © 2015 IBM Corporation19 COSMOS  Funding: EU FP7 at level of 2PY x 3 years  Started: Sept 2013  Coordinator: ATOS  Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS  Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases  Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation etc.
  • 20. © 2015 IBM Corporation20 IBM Bluemix Data Analytics for IoT Architecture
  • 21. © 2015 IBM Corporation21  What is it? – Apache Kafka is a high throughput distributed publish/subscribe messaging system. – Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves as an S3 object.  What extensions were needed? – Support for OpenStack Swift as a Secor target. We also added support for Parquet format and annotating objects with metadata search to support indexing.  What is the value of integration with Swift? – Enables bringing new data and applications to Swift which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.  Status – We contributed OpenStack Swift support to the Secor community and it is now part of Secor. Secor Kafka + Secor
  • 22. © 2015 IBM Corporation22 Parquet  What is it? – A column based semi-structured, schema-based storage format supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.  What integration is needed? – Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.  Status – Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage systems such as Swift.
  • 23. © 2015 IBM Corporation23 elasticsearch  What is it? – A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.  What integration is needed? – Index object metadata allowing search for objects by attributes.  What is the value of integration with Swift – Use search to select objects for further processing, e.g., relevant objects for analytics. - Note that S3 does not yet have native search according to metadata.  Status – The IBM SoftLayer object service includes a basic implementation of metadata search; At IBM Research, we added extensions such as data type support and range searches.
  • 24. Power of data. Simplicity of design. Speed of innovation. IBM Spark For up-to-date information and news about the Spark and the Spark Technology Center, Sign up for our newsletter at www.spark.tc