Applying Geospatial Analytics
Using Apache Spark
C. Adam Mollenkopf
Real-Time GIS Capability Lead, Esri
amollenkopf@esri.com
@amollenkopf
What we do
Geographic Information System (GIS)
•  Founded in 1969
•  Esri develops GIS software
•  Global company with over 350,000 user organizations worldwide
•  Headquarters in Redlands, CA
•  80 Esri distributors worldwide
Hortonworks Certified Partner
http://hortonworks.com/partner/esri/
Continuous & Batch Analytics
on high velocity & volume spatiotemporal data
[Diagram: Apps (Desktop, Web, Device) access Services on ArcGIS Server, including the GeoEvent Extension and the GeoAnalytics Extension]
•  Ingesting real-time spatiotemporal data
•  Performing continuous processing and real-time analytics
•  Sending updates and alerts to those who need them, where they need them
[Diagram: Ingestion, Storage, Continuous Analytics, Batch Analytics, Visualization]
Ingestion
of high velocity spatiotemporal data
High Velocity Ingestion
Requirements
•  Sustain a single node throughput of tens of thousands of events per second
•  Achieve near-linear scalability of throughput when adding machines
•  Gracefully handle bursty data
Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log
•  Fast
-  a single broker can handle hundreds of megabytes of reads and writes per second
•  Scalable
-  data streams are partitioned and spread over a cluster of machines
•  Durable
-  messages are persisted to disk and replicated within the cluster
•  Distributed
-  cluster-centric design that offers strong durability and fault-tolerance guarantees
Apache Spark
A fast and general engine for large-scale data processing
•  Unified big data processing
-  write streaming jobs the same way you write batch jobs
-  can combine streaming with batch and interactive queries
•  Spark apps can be written in Java, Scala, Python, and R
Ingestion Throughput: 1 Node and 2 Node Linear Scalability
Cluster benchmark on c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS
Ingest                      1 node    2 node
Spark Streaming w/ Kafka    132k      282k
Gracefully Handle Bursty Data
Direct API for Kafka + Back-pressure
•  Direct API for Kafka (introduced in Spark 1.3)
-  Enables exactly-once semantics and exposes the Kafka offset ranges of each batch
•  Back-pressure (planned feature of Spark 1.5, see SPARK-7398)
-  Signaling between a fast publisher and a slow subscriber
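The fast-publisher, slow-subscriber signaling above can be illustrated with a small rate calculation. This is a minimal sketch, not Spark's implementation: the object `BackPressureSketch`, the method `nextRateLimit`, and its parameters are hypothetical names, and the rule used (cap the next batch's ingest rate at the rate at which the previous batch was actually processed) is a deliberate simplification.

```scala
object BackPressureSketch {
  /**
   * Returns the ingest rate limit (records per second) for the next batch.
   * Assumption: the limit is the rate observed while processing the previous
   * batch, never exceeding the configured maximum rate.
   */
  def nextRateLimit(processedRecords: Long, processingMillis: Long, maxRate: Double): Double = {
    require(processingMillis > 0, "processingMillis must be positive")
    // Records per second the subscriber actually sustained in the last batch.
    val observedRate = processedRecords * 1000.0 / processingMillis
    math.min(observedRate, maxRate)
  }
}
```

A subscriber that needed 2 seconds for 100,000 records would be throttled to 50,000 records per second, so a burst from the publisher queues up in Kafka instead of overwhelming the streaming job.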
Analytics
of high velocity & volume spatiotemporal data
GIS Tools for Hadoop
http://esri.github.io/gis-tools-for-hadoop/
•  Esri Geometry API for Java:
-  Geometry objects: points, lines, polygons
-  Spatial relations: intersects, touches, overlaps, …
-  Spatial operations: buffer, cut, union, …
•  Spatial Framework for Hadoop
-  Includes Spatial UDFs (User Defined Functions) that extend Hive
•  GeoProcessing Tools for Hadoop
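To make the idea of a spatial relation concrete, here is a minimal sketch of a "contains" test between a polygon and a point. It does not use the Esri Geometry API for Java; it is a plain ray-casting stand-in, and `SpatialRelationSketch`, `Point`, and `contains` are hypothetical names chosen for illustration. Coordinates are treated as planar.

```scala
object SpatialRelationSketch {
  final case class Point(x: Double, y: Double)

  /** Ray-casting test: true when `p` lies inside the ring given by `vertices`. */
  def contains(vertices: Vector[Point], p: Point): Boolean = {
    val n = vertices.length
    var inside = false
    var j = n - 1
    for (i <- 0 until n) {
      val a = vertices(i)
      val b = vertices(j)
      // Does the edge a-b straddle the horizontal line through p?
      val straddles = (a.y > p.y) != (b.y > p.y)
      // If so, toggle when the edge crosses that line to the right of p.
      if (straddles && p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x) inside = !inside
      j = i
    }
    inside
  }
}
```

A production job would call the library's own relation operators instead; the sketch only shows the kind of predicate that a spatial filter evaluates per event.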
Ch. 8 Geospatial & Temporal Data Analysis
High Velocity Geospatial Analytics
Continuous analytics on data-in-motion
•  A GeoEvent Service configures the flow of events,
-  the Filtering and Processing steps to perform,
-  what ingestion stream(s) to apply them to,
-  and where to send the results.
=> DAG (Directed Acyclic Graph)
KafkaUtils.createStream(ssc, …)
  .map( event => FieldEnricher.enrich(event, …) )
  .filter( event => IncidentDetector.evaluate(event, …) )
  .map( event => FieldEnricher.enrich(event, …) )
  .map( event => FieldMapper(event, …))
  .saveTo…
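The chain of map and filter steps above can be tried without a cluster. The following is a minimal sketch that runs the same shape of pipeline over an ordinary Scala collection rather than a Spark stream. Everything in it is an assumption made for illustration: the `Event` fields, the `enrich` and `isIncident` helpers (hypothetical stand-ins for the slide's FieldEnricher and IncidentDetector), and the speed-limit rule.

```scala
object PipelineSketch {
  final case class Event(id: String, lon: Double, lat: Double, speed: Double,
                         fields: Map[String, String] = Map.empty)

  // Hypothetical enrichment step: attach one extra attribute to the event.
  def enrich(e: Event, key: String, value: String): Event =
    e.copy(fields = e.fields + (key -> value))

  // Hypothetical incident rule: an event is an incident when it exceeds the limit.
  def isIncident(e: Event, speedLimit: Double): Boolean = e.speed > speedLimit

  // Same enrich -> filter -> enrich shape as the slide, on an in-memory Seq.
  def run(events: Seq[Event], speedLimit: Double): Seq[Event] =
    events
      .map(e => enrich(e, "source", "sketch"))
      .filter(e => isIncident(e, speedLimit))
      .map(e => enrich(e, "incident", "speeding"))
}
```

Because each step is a pure function of the event, the same functions could be passed to the corresponding stream operators, which is what lets a configured flow be translated into a graph of transformations.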
High Velocity Geospatial Analytics
Continuous analytics on data-in-motion
Demo
New York Taxi Cab Location Density Monitoring
Storage
of high velocity & volume spatiotemporal data
High Velocity & Volume Storage
Requirements
•  Sustain a write throughput of tens of thousands of events per second
•  Achieve growth in volume capacity & write throughput when adding machines
•  Efficiently access and query a large volume of data
-  Query by any combination of id, time, space, and attributes
Elasticsearch
Store and Search Data in Real-Time
•  Distributed, Scalable, and Highly Available
-  Detect new or failed nodes, and reorganize and rebalance data automatically
•  Near real-time
-  Data is made available for search and analytics almost immediately after it is indexed
•  Spatial and Full Text Search
-  Comes with GeoPoint and GeoShape (polygon and polyline)
•  RESTful API
•  Spark Elasticsearch Connector
-  https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/core/main/scala/org/elasticsearch/spark/rdd
High velocity & volume storage: 1 to 5 Node Write Throughput
Benchmark on c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS
                           1 node    2 node    3 node    4 node    5 node
Storage (Elasticsearch)    106k      143k      192k      224k      249k
Ingest (Spark + Kafka)     132k      282k
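The benchmark figures can be turned into speedup and scaling-efficiency numbers to check them against the scalability requirements. This is a minimal sketch of that arithmetic; `ScalingSketch` and its members are hypothetical names, and the inputs are the throughput figures reported on the slides, in thousands.

```scala
object ScalingSketch {
  /** Speedup of each cluster size relative to the single-node throughput. */
  def speedups(throughput: Seq[Double]): Seq[Double] =
    throughput.map(_ / throughput.head)

  /** Scaling efficiency: speedup divided by node count (1.0 means perfectly linear). */
  def efficiencies(throughput: Seq[Double]): Seq[Double] =
    speedups(throughput).zipWithIndex.map { case (s, i) => s / (i + 1) }

  // Figures as reported on the benchmark slides, in thousands.
  val ingest: Seq[Double]  = Seq(132.0, 282.0)
  val storage: Seq[Double] = Seq(106.0, 143.0, 192.0, 224.0, 249.0)
}
```

With these inputs, ingestion reaches a speedup of about 2.14 on 2 nodes (282 / 132), slightly better than linear. Storage reaches about 2.35 on 5 nodes (249 / 106), an efficiency of roughly 0.47: write throughput keeps growing as machines are added, but well short of linearly.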
Visualization
of high velocity & volume spatiotemporal data
High Velocity & Volume Visualization
Requirements
•  Render a map service that has the ability to do aggregation-on-the-fly
-  aggregations are calculated at various levels of detail and are specific to each user session
-  when zoomed in far enough, raw features are returned and rendered
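The aggregation-on-the-fly requirement above can be sketched as a function of the requested level of detail. This is a minimal illustration under stated assumptions, not the map service's implementation: `AggregationSketch`, the square degree grid whose cell size halves per level, and the `rawFromLevel` threshold are all hypothetical.

```scala
object AggregationSketch {
  final case class Feature(id: String, lon: Double, lat: Double)

  /** Either per-cell counts (zoomed out) or the raw features (zoomed in). */
  sealed trait Result
  final case class Aggregated(counts: Map[(Int, Int), Int]) extends Result
  final case class Raw(features: Seq[Feature]) extends Result

  /** Assumed grid: cell size in degrees halves with every level of detail. */
  def cellSize(level: Int): Double = 360.0 / (1 << level)

  /** Column and row of the grid cell that contains the feature. */
  def cellOf(f: Feature, level: Int): (Int, Int) = {
    val size = cellSize(level)
    (math.floor((f.lon + 180.0) / size).toInt, math.floor((f.lat + 90.0) / size).toInt)
  }

  /** Counts features per cell, or returns raw features once zoomed in far enough. */
  def query(features: Seq[Feature], level: Int, rawFromLevel: Int): Result =
    if (level >= rawFromLevel) Raw(features)
    else Aggregated(features.groupBy(cellOf(_, level)).map { case (cell, fs) => cell -> fs.size })
}
```

Because the counts are computed per request, two users viewing the same data at different zoom levels receive different aggregations, which matches the per-session behavior the requirement describes.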
ArcGIS API for JavaScript
https://developers.arcgis.com/javascript/
•  A lightweight way to embed maps and tasks in web apps
•  Connects to any Map Service or Feature Service compliant source
High Velocity & Volume Visualization
Aggregation-on-the-fly
Demo
Ingestion, Storage, Continuous Analytics, and Visualization
High Velocity & Volume
High Velocity & Volume Analytics
Continuous and Batch Analytics
Customer Example
of applying geospatial analytics on big data
Port of Rotterdam
Vessel and Port Usage Behavioral Analytics
•  8th largest port in the world
•  Largest port in Europe
Polyline Track Tool
Speed Tool
Line Crosses Tool
Density Tool
Port of Rotterdam
Polyline Track Analytics
Port of Rotterdam
Density Analytics
Port of Rotterdam
Line Crosses Analytics
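One way to picture a line-crosses analysis is to count how many segments of a vessel track cross a reference line. The following is a minimal sketch under that assumption; it is not the Line Crosses Tool itself. `LineCrossSketch` and its members are hypothetical names, coordinates are treated as planar, and a segment that only touches the line at an endpoint is not counted.

```scala
object LineCrossSketch {
  final case class Point(x: Double, y: Double)

  // Sign of the cross product (b - a) x (c - a): which side of line a->b the point c lies on.
  private def side(a: Point, b: Point, c: Point): Double =
    math.signum((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x))

  /** True when segment p1-p2 properly crosses segment q1-q2. */
  def crosses(p1: Point, p2: Point, q1: Point, q2: Point): Boolean =
    side(p1, p2, q1) * side(p1, p2, q2) < 0 && side(q1, q2, p1) * side(q1, q2, p2) < 0

  /** Counts how many consecutive-point segments of a track cross the line q1-q2. */
  def countCrossings(track: Seq[Point], q1: Point, q2: Point): Int =
    track.zip(track.drop(1)).count { case (a, b) => crosses(a, b, q1, q2) }
}
```

A track that sails up across the line and back down again counts as two crossings, so the same function can serve as a per-vessel usage counter for a harbor entrance or a berth boundary.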
The challenge of counting
[Figure: a diagram labeled D, d, Δ, and (Lat,lon)]
Where is Δ ≃ 0?
Port of Rotterdam
Dredging Prioritization
Applying Geospatial Analytics Using Apache Spark
Summary
•  When working with high velocity & volume spatiotemporal data, we have found the best technology selections to be:
-  Ingestion = Spark Streaming + Kafka
-  Storage = Elasticsearch + Spark Elasticsearch Connector
-  Visualization = ArcGIS API for JavaScript + on-the-fly aggregations in Elasticsearch
-  Continuous Analytics = Spark Streaming + GIS Tools for Hadoop
-  Batch Analytics = Spark Core +/- Spark SQL + GIS Tools for Hadoop
•  GIS Tools for Hadoop
-  Can be used as a basis to add spatial geometries, relations, and operators to Spark
-  http://esri.github.io/gis-tools-for-hadoop/
Questions / Feedback?
C. Adam Mollenkopf
Real-Time GIS Capability Lead, Esri
amollenkopf@esri.com
@amollenkopf

 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Big Data Day LA 2015 - Applying GeoSpatial Analytics using Apache Spark by Adam Mollenkopf of ESRI

  • 1. Applying Geospatial Analytics Using Apache Spark C. Adam Mollenkopf Real-Time GIS Capability Lead, Esri amollenkopf@esri.com @amollenkopf
  • 2. What we do Geographic Information System (GIS) •  Founded in 1969 •  Esri develops GIS software •  Global Company with over 350,000 user organizations worldwide •  Headquarters in Redlands, CA •  80 Esri distributors worldwide
  • 4. Continuous & Batch Analytics on high velocity & volume spatiotemporal data •  Ingesting real-time spatiotemporal data •  Performing continuous processing and real-time analytics •  Sending updates and alerts to those who need it where they need it [Diagram labels: Apps, Access, Desktop, Web, Device, Services, GeoEvent Extension, GeoAnalytics Extension, ArcGIS Server; Ingestion, Storage, Continuous Analytics, Batch Analytics, Visualization]
  • 5. Ingestion of high velocity spatiotemporal data
  • 6. High Velocity Ingestion Requirements •  Sustain a single node throughput of tens of thousands of events per second •  Achieve near linear scalability of throughput when adding additional machines •  Gracefully handle bursty data
  • 7. Apache Kafka Publish-subscribe messaging rethought as a distributed commit log •  Fast -  single broker can handle hundreds of MBs of reads and writes per second •  Scalable -  data streams are partitioned and spread over a cluster of machines •  Durable -  messages are persisted to disk and replicated within the cluster •  Distributed -  cluster-centric design that offers strong durability and fault-tolerance guarantees
  • 8. Apache Spark A fast and general engine for large-scale data processing •  Unified big data processing -  write streaming jobs the same way you write batch jobs -  can combine streaming with batch and interactive queries •  Spark apps can be written in Java, Scala, Python, and R
  • 9. 1 Node Throughput. 1 node cluster benchmark, c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Ingest (Spark Streaming w/ Kafka): 1 node 132k.
  • 10. 2 Node Linear Scalability of Throughput. 2 node cluster benchmark, c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Ingest (Spark Streaming w/ Kafka): 1 node 132k, 2 node 282k.
  • 11. Gracefully Handle Bursty Data Direct API for Kafka + Back-pressure •  Direct API for Kafka (Introduced at Spark 1.3) -  Provides exactly-once semantics and offset ranges •  Back-pressure (Planned feature of Spark 1.5, see SPARK-7398) -  Fast Publisher, Slow Subscriber signaling
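One way to picture how offset ranges support exactly-once results is sketched below in plain Scala, with no Spark or Kafka dependency. Everything here is illustrative: `OffsetRange` only mirrors the idea of a per-partition range `[fromOffset, untilOffset)`, and `IdempotentSink` is a hypothetical sink, not part of any library. The assumption is that results are stored together with the range that produced them, so a replayed batch can be recognised and skipped.

```scala
object ExactlyOnceSketch {
  // One batch's slice of a partition: offsets in [fromOffset, untilOffset).
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // Hypothetical sink that remembers which ranges it has already applied.
  final class IdempotentSink {
    private var applied = Set.empty[OffsetRange]
    private var total = 0L

    // Returns true if the batch was applied, false if it was a replay.
    def write(range: OffsetRange): Boolean =
      if (applied.contains(range)) false
      else {
        applied += range
        total += range.count
        true
      }

    def eventsStored: Long = total
  }
}
```

Writing the same range twice leaves the stored event count unchanged, which is the property a replay after failure relies on.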
  • 12. Analytics of high velocity & volume spatiotemporal data
  • 13. GIS Tools for Hadoop http://esri.github.io/gis-tools-for-hadoop/ •  Esri Geometry API for Java: -  Geometry objects: points, lines, polygons -  Spatial relations: intersects, touches, overlaps, … -  Spatial operations: buffer, cut, union, … •  Spatial Framework for Hadoop -  Includes Spatial UDFs (User Defined Functions) that extend Hive •  GeoProcessing Tools for Hadoop Ch. 8 Geospatial & Temporal Data Analysis
  • 14. High Velocity Geospatial Analytics Continuous analytics on data-in-motion •  A GeoEvent Service configures the flow of events, -  the Filtering and Processing steps to perform, -  what ingestion stream(s) to apply them to, -  and where to send the results.
  • 15. High Velocity Geospatial Analytics Continuous analytics on data-in-motion •  A GeoEvent Service configures the flow of events, -  the Filtering and Processing steps to perform, -  what ingestion stream(s) to apply them to, -  and where to send the results.
  • 16. High Velocity Geospatial Analytics Continuous analytics on data-in-motion •  A GeoEvent Service configures the flow of events, -  the Filtering and Processing steps to perform, -  what ingestion stream(s) to apply them to, -  and where to send the results. => DAG (Directed Acyclic Graph):
        KafkaUtils.createStream(ssc, …)
          .map( event => FieldEnricher.enrich(event, …) )
          .filter( event => IncidentDetector.evaluate(event, …) )
          .map( event => FieldEnricher.enrich(event, …) )
          .map( event => FieldMapper(event, …))
          .saveTo…
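The operator chain on slide 16 can be tried without a cluster by running the same map/filter/map/map order over an ordinary Scala collection. The sketch below is only an approximation: the `Event` shape, the bodies of `FieldEnricher`, `IncidentDetector`, and `FieldMapper`, and the 80 km/h threshold are all assumptions made for illustration, since the slides elide their arguments and implementations.

```scala
object GeoEventPipelineSketch {
  // Hypothetical event shape; the slides do not define one.
  final case class Event(id: String, lon: Double, lat: Double, speedKph: Double,
                         fields: Map[String, String] = Map.empty)

  // Stand-ins for the slide's FieldEnricher, IncidentDetector and FieldMapper.
  object FieldEnricher {
    def enrich(e: Event, key: String, value: String): Event =
      e.copy(fields = e.fields + (key -> value))
  }
  object IncidentDetector {
    def evaluate(e: Event, maxSpeedKph: Double): Boolean = e.speedKph > maxSpeedKph
  }
  object FieldMapper {
    def apply(e: Event): Map[String, String] =
      e.fields + ("id" -> e.id) + ("lon" -> e.lon.toString) + ("lat" -> e.lat.toString)
  }

  // Same operator order as the slide: map, filter, map, map.
  def run(events: Seq[Event]): Seq[Map[String, String]] =
    events
      .map(e => FieldEnricher.enrich(e, "source", "demo"))
      .filter(e => IncidentDetector.evaluate(e, 80.0))
      .map(e => FieldEnricher.enrich(e, "incident", "speeding"))
      .map(e => FieldMapper(e))
}
```

Because each step is a pure function from event to event (or to Boolean), the same chain should carry over to a stream abstraction that offers `map` and `filter`.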
  • 17. High Velocity Geospatial Analytics Continuous analytics on data-in-motion
  • 18. Demo New York Taxi Cab Location Density Monitoring High Velocity Geospatial Analytics
  • 19. Storage of high velocity & volume spatiotemporal data
  • 20. High Velocity & Volume Storage Requirements •  Sustain a write throughput of tens of thousands of events per second •  Achieve growth in volume capacity & write throughput when adding additional machines •  Efficiently access and query a large volume of data -  Query by any combination of id, time, space, and attributes
  • 21. Elasticsearch Store and Search Data in Real-Time •  Distributed, Scalable, and Highly Available -  Detect new or failed nodes, and reorganize and rebalance data automatically •  Near real-time -  All data is immediately made available for search and analytics •  Spatial and Full Text Search -  Comes with GeoPoint and GeoShape (polygon and polyline) •  RESTful API •  Spark Elasticsearch Connector -  https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/core/main/scala/org/elasticsearch/spark/rdd
  • 22. High velocity & volume storage: 1 Node Write Throughput. c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Storage ({es}): 1 node 106k. Ingest (Spark + Kafka): 1 node 132k.
  • 23. High velocity & volume storage: 2 Node Write Throughput. c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Storage ({es}): 1 node 106k, 2 node 143k. Ingest (Spark + Kafka): 1 node 132k, 2 node 282k.
  • 24. High velocity & volume storage: 3 Node Write Throughput. c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Storage ({es}): 1 node 106k, 2 node 143k, 3 node 192k. Ingest (Spark + Kafka): 1 node 132k, 2 node 282k.
  • 25. High velocity & volume storage: 4 Node Write Throughput. c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Storage ({es}): 1 node 106k, 2 node 143k, 3 node 192k, 4 node 224k. Ingest (Spark + Kafka): 1 node 132k, 2 node 282k.
  • 26. High velocity & volume storage: 5 Node Write Throughput. c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS. Storage ({es}): 1 node 106k, 2 node 143k, 3 node 192k, 4 node 224k, 5 node 249k. Ingest (Spark + Kafka): 1 node 132k, 2 node 282k.
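To compare these benchmark figures against the "near linear scalability" requirement, it can help to compute speedup (throughput on n nodes divided by throughput on one node) and efficiency (speedup divided by n). The sketch below does that arithmetic in plain Scala. The numbers are copied from the slides in thousands; the slides do not state the unit, and the object and method names are illustrative only.

```scala
object ScalingMath {
  // Speedup relative to one node, and efficiency = speedup / node count.
  def speedup(oneNode: Double, nNode: Double): Double = nNode / oneNode
  def efficiency(oneNode: Double, nNode: Double, nodes: Int): Double =
    speedup(oneNode, nNode) / nodes

  // Figures as shown on the slides, in thousands.
  val ingest: Map[Int, Double]  = Map(1 -> 132.0, 2 -> 282.0)
  val storage: Map[Int, Double] = Map(1 -> 106.0, 2 -> 143.0, 3 -> 192.0, 4 -> 224.0, 5 -> 249.0)

  // One formatted line per node count, ordered by node count.
  def report(name: String, series: Map[Int, Double]): Seq[String] =
    series.toSeq.sortBy(_._1).map { case (n, v) =>
      f"$name%-8s $n node(s): speedup ${speedup(series(1), v)}%.2f, efficiency ${efficiency(series(1), v, n)}%.2f"
    }
}
```

With these figures, ingestion reaches an efficiency of about 1.07 at two nodes, while storage reaches a speedup of about 2.35 on five nodes, an efficiency of about 0.47.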
  • 27. Visualization of high velocity & volume spatiotemporal data
  • 28. High Velocity & Volume Visualization Requirements •  Render a map service that has the ability to do aggregation-on-the-fly -  aggregations are calculated at various levels of detail and are specific to each user session -  when zoomed in far enough raw features are returned and rendered
  • 29. High Velocity & Volume Visualization Requirements •  Render a map service that has the ability to do aggregation-on-the-fly -  aggregations are calculated at various levels of detail and are specific to each user session -  when zoomed in far enough raw features are returned and rendered
  • 30. High Velocity & Volume Visualization Requirements •  Render a map service that has the ability to do aggregation-on-the-fly -  aggregations are calculated at various levels of detail and are specific to each user session -  when zoomed in far enough raw features are returned and rendered
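A minimal sketch of aggregation-on-the-fly, assuming a simple square grid in degrees whose cell size halves at each level of detail. This is not how the deck's map service or Elasticsearch computes its aggregations; the grid scheme, the `rawFromLevel` threshold, and all names are assumptions chosen to illustrate the two behaviours on the slide: binned counts at coarse levels, raw features once zoomed in far enough.

```scala
object AggregationOnTheFly {
  final case class Point(lon: Double, lat: Double)
  final case class Bin(col: Long, row: Long, count: Int)

  // Cell size in degrees halves with each level of detail (an assumed scheme).
  def cellSize(level: Int): Double = 360.0 / (1L << level)

  // Count points per grid cell at the given level of detail.
  def aggregate(points: Seq[Point], level: Int): Seq[Bin] = {
    val size = cellSize(level)
    points
      .groupBy(p => (math.floor((p.lon + 180.0) / size).toLong,
                     math.floor((p.lat + 90.0) / size).toLong))
      .map { case ((c, r), ps) => Bin(c, r, ps.size) }
      .toSeq
  }

  // Bins (Left) below the threshold level, raw features (Right) at or above it.
  def render(points: Seq[Point], level: Int, rawFromLevel: Int): Either[Seq[Bin], Seq[Point]] =
    if (level >= rawFromLevel) Right(points) else Left(aggregate(points, level))
}
```

In a per-session setting, `points` would be whatever falls inside the current map extent, so each user's pan and zoom would produce its own bins.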
  • 31. ArcGIS API for JavaScript https://developers.arcgis.com/javascript/ •  A lightweight way to embed maps and tasks in web apps •  Connects to any Map Service or Feature Service compliant source
  • 32. High Velocity & Volume Visualization Aggregation-on-the-fly
  • 33. Demo Ingestion, Storage, Continuous Analytics, and Visualization High Velocity & Volume
  • 34. High Velocity & Volume Analytics Continuous and Batch Analytics
  • 35. High Velocity & Volume Analytics Continuous and Batch Analytics
  • 36. Customer Example of applying geospatial analytics on big data
  • 37. Port of Rotterdam Vessel and Port Usage Behavioral Analytics •  8th largest port in the world •  Largest port in Europe
  • 38. Polyline Track Tool Speed Tool Line Crosses Tool Density Tool Port of Rotterdam Vessel and Port Usage Behavioral Analytics
  • 39. Port of Rotterdam Polyline Track Analytics
  • 40. Port of Rotterdam Polyline Track Analytics
  • 42. Port of Rotterdam Line Crosses Analytics
  • 43. Port of Rotterdam Line Crosses Analytics
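The slides do not show how the Line Crosses Tool is implemented. As a rough illustration of the underlying idea, the sketch below counts how many segments of a vessel track cross a fixed "gate" line, using the standard orientation (cross-product sign) test for segment intersection. It assumes planar coordinates and counts only proper crossings; the names `Pt`, `crosses`, and `countCrossings` are hypothetical.

```scala
object LineCrosses {
  final case class Pt(x: Double, y: Double)

  // Cross product (b - a) x (c - a): > 0 left turn, < 0 right turn, 0 collinear.
  private def orient(a: Pt, b: Pt, c: Pt): Double =
    (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)

  // True when segment p1-p2 properly crosses segment q1-q2
  // (touching endpoints and collinear overlaps do not count).
  def crosses(p1: Pt, p2: Pt, q1: Pt, q2: Pt): Boolean = {
    val d1 = orient(q1, q2, p1)
    val d2 = orient(q1, q2, p2)
    val d3 = orient(p1, p2, q1)
    val d4 = orient(p1, p2, q2)
    d1 * d2 < 0 && d3 * d4 < 0
  }

  // Count how many consecutive track segments cross the gate line.
  def countCrossings(track: Seq[Pt], gateA: Pt, gateB: Pt): Int =
    track.sliding(2).count {
      case Seq(a, b) => crosses(a, b, gateA, gateB)
      case _         => false
    }
}
```

For longitude/latitude data, the planar test is only an approximation; a geometry library with proper spatial relations would be the more robust choice.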
  • 44. The challenge of counting
  • 45. Port of Rotterdam Dredging Prioritization: Where is Δ ≃ 0? [Diagram labels: D, d, Δ, (lat, lon)]
  • 46. Port of Rotterdam Dredging Prioritization
  • 47. Applying Geospatial Analytics Using Apache Spark: Summary •  When working with high velocity & volume spatiotemporal data we have found the best technology selections are as follows: -  Ingestion = Spark Streaming + Kafka -  Storage = Elasticsearch + Spark Elasticsearch Connector -  Visualization = ArcGIS API for JavaScript + on-the-fly-aggregations in Elasticsearch -  Continuous Analytics = Spark Streaming + GIS Tools for Hadoop -  Batch Analytics = Spark Core +/- Spark SQL + GIS Tools for Hadoop •  GIS Tools for Hadoop -  Can be used as a basis to add spatial geometries, relations, and operators to Spark -  http://esri.github.io/gis-tools-for-hadoop/
  • 48. Questions / Feedback? C. Adam Mollenkopf Real-Time GIS Capability Lead, Esri amollenkopf@esri.com @amollenkopf