This session explores how to apply geospatial analytics with Apache Spark to high-velocity streaming data (data-in-motion) and high-volume batch data (data-at-rest). Demonstrations throughout the session reinforce these concepts.
2. What we do
Geographic Information System (GIS)
• Founded in 1969
• Esri develops GIS software
• Global Company with over 350,000 user organizations worldwide
• Headquarters in Redlands, CA
• 80 Esri distributors worldwide
4. Continuous & Batch Analytics
on high velocity & volume spatiotemporal data
[Diagram labels: Apps (Desktop, Web, Device); Access; Services; ArcGIS Server with GeoEvent Extension and GeoAnalytics Extension]
• Ingesting real-time spatiotemporal data
• Performing continuous processing and real-time analytics
• Sending updates and alerts to those who need it where they need it
[Diagram labels: Ingestion; Storage; Continuous Analytics; Batch Analytics; Visualization]
6. High Velocity Ingestion
Requirements
• Sustain a single node throughput of tens of thousands of events per second
• Achieve near-linear scalability of throughput when adding machines
• Gracefully handle bursty data
7. Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log
• Fast
- single broker can handle hundreds of MBs of reads and writes per second
• Scalable
- data streams are partitioned and spread over a cluster of machines
• Durable
- messages are persisted to disk and replicated within the cluster
• Distributed
- cluster-centric design that offers strong durability and fault-tolerance guarantees
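The "Scalable" point above depends on each record being assigned to one partition of a stream. A minimal sketch of keyed partition assignment follows, assuming a plain `hashCode` in place of Kafka's own hash function. `PartitionSketch`, `partitionFor`, and `assign` are hypothetical names, not Kafka APIs.

```scala
object PartitionSketch {
  // Hypothetical stand-in for a partitioner: map a record key to one of
  // `numPartitions` partitions. Kafka's default partitioner uses a different
  // hash function; hashCode is used here only for illustration.
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    // floorMod keeps the result non-negative even for negative hash codes.
    java.lang.Math.floorMod(key.hashCode, numPartitions)
  }

  // Group keys by the partition they would be written to.
  def assign(keys: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    keys.groupBy(k => partitionFor(k, numPartitions))
}
```

Because the same key always maps to the same partition, events for one track or vehicle id stay in order relative to each other, while different keys spread across the cluster.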
8. Apache Spark
A fast and general engine for large-scale data processing
• Unified big data processing
- write streaming jobs the same way you write batch jobs
- can combine streaming with batch and interactive queries
• Spark apps can be written in Java, Scala, Python, and R
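The "write streaming jobs the same way you write batch jobs" idea can be illustrated without Spark itself. The sketch below uses plain Scala collections as a stand-in: the same function runs once over a whole data set and once over each micro-batch. The `Event` type, its fields, and all function names are assumptions made for illustration.

```scala
object UnifiedSketch {
  // Hypothetical event type; field names are assumptions for illustration.
  final case class Event(trackId: String, speedKph: Double)

  // One piece of business logic, written once.
  def speeding(events: Seq[Event], limitKph: Double): Seq[Event] =
    events.filter(_.speedKph > limitKph)

  // "Batch": apply the logic to the whole data set at once.
  def runBatch(all: Seq[Event], limitKph: Double): Seq[Event] =
    speeding(all, limitKph)

  // "Streaming": apply the same logic to each micro-batch as it arrives.
  def runMicroBatches(all: Seq[Event], batchSize: Int, limitKph: Double): Seq[Event] =
    all.grouped(batchSize).flatMap(b => speeding(b, limitKph)).toList
}
```

For a stateless transformation like this filter, both paths produce the same result, which is what makes sharing code between the two modes practical.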
11. Gracefully Handle Bursty Data
Direct API for Kafka + Back-pressure
• Direct API for Kafka (Introduced at Spark 1.3)
- Provides exactly-once semantics and offset ranges
• Back-pressure (Planned feature of Spark 1.5, see SPARK-7398)
- Fast Publisher, Slow Subscriber signaling
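"Fast Publisher, Slow Subscriber signaling" can be pictured with a bounded buffer that tells the publisher when it is full. This is a simplified illustration of the idea only, not Spark's mechanism (in released Spark versions, back-pressure is switched on with the `spark.streaming.backpressure.enabled` setting). `BoundedBuffer` and its methods are hypothetical names.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  // A bounded buffer between a fast publisher and a slow subscriber.
  final class BoundedBuffer[A](capacity: Int) {
    private val queue = new ArrayBlockingQueue[A](capacity)

    // Returns false when the buffer is full: the signal to the publisher
    // to slow down or retry later instead of overwhelming the subscriber.
    def publish(item: A): Boolean = queue.offer(item)

    // Returns the next item, or None if nothing is waiting.
    def consume(): Option[A] = Option(queue.poll())

    def size: Int = queue.size
  }
}
```

A publisher that respects the `false` return value is throttled to the subscriber's pace during a burst, rather than growing an unbounded backlog.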
13. GIS Tools for Hadoop
http://esri.github.io/gis-tools-for-hadoop/
• Esri Geometry API for Java:
- Geometry objects: points, lines, polygons
- Spatial relations: intersects, touches, overlaps, …
- Spatial operations: buffer, cut, union, …
• Spatial Framework for Hadoop
- Includes Spatial UDFs (User Defined Functions) that extend Hive
• GeoProcessing Tools for Hadoop
Ch. 8 Geospatial & Temporal Data Analysis
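To make the "spatial relations" bullet concrete, here is a dependency-free sketch of one relation, point-in-polygon, using ray casting. It is not the Esri Geometry API; with that library the relation would be evaluated by the library itself. `Pt` and `contains` are hypothetical names, and points lying exactly on the boundary are not handled specially.

```scala
object SpatialRelationSketch {
  final case class Pt(x: Double, y: Double)

  // Ray casting: count how many polygon edges a horizontal ray from `p`
  // crosses; an odd count means the point is inside.
  // `ring` lists the polygon's vertices in order, without repeating the first.
  def contains(ring: Seq[Pt], p: Pt): Boolean = {
    val n = ring.length
    var inside = false
    var j = n - 1
    for (i <- 0 until n) {
      val a = ring(i)
      val b = ring(j)
      // The edge a-b crosses the ray's y level only if its ends straddle it.
      val straddles = (a.y > p.y) != (b.y > p.y)
      if (straddles) {
        // x coordinate where the edge meets the horizontal line through p.
        val xAtY = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
        if (p.x < xAtY) inside = !inside
      }
      j = i
    }
    inside
  }
}
```

A test like this is the building block for geofencing: filter a stream of positions to those falling inside an area of interest.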
14. High Velocity Geospatial Analytics
Continuous analytics on data-in-motion
• A GeoEvent Service configures the flow of events,
- the Filtering and Processing steps to perform,
- what ingestion stream(s) to apply them to,
- and where to send the results.
=> DAG (Directed Acyclic Graph)
KafkaUtils.createStream(ssc, …)
  .map(event => FieldEnricher.enrich(event, …))
  .filter(event => IncidentDetector.evaluate(event, …))
  .map(event => FieldEnricher.enrich(event, …))
  .map(event => FieldMapper(event, …))
  .saveTo…
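The chain of map and filter steps above can be tried out on an in-memory collection. The sketch below is a runnable stand-in, assuming events are flat maps of field names to values. The bodies of `FieldEnricher`, `IncidentDetector`, and `FieldMapper` are invented for illustration, since the slide elides their arguments and behavior.

```scala
object PipelineSketch {
  // Hypothetical event model: a flat map of field names to values.
  type Event = Map[String, String]

  // Hypothetical stand-ins for the processors named on the slide.
  object FieldEnricher {
    // Adds the given fields to the event.
    def enrich(event: Event, extra: Map[String, String]): Event = event ++ extra
  }
  object IncidentDetector {
    // True when the event has a numeric "speed" above the limit.
    def evaluate(event: Event, speedLimit: Double): Boolean =
      event.get("speed")
        .flatMap(s => scala.util.Try(s.toDouble).toOption)
        .exists(_ > speedLimit)
  }
  object FieldMapper {
    // Renames fields according to `renames`, leaving others unchanged.
    def apply(event: Event, renames: Map[String, String]): Event =
      event.map { case (k, v) => renames.getOrElse(k, k) -> v }
  }

  // Same shape as the slide's chain: enrich, filter, enrich, map.
  def run(events: Seq[Event]): Seq[Event] =
    events
      .map(e => FieldEnricher.enrich(e, Map("source" -> "sensor")))
      .filter(e => IncidentDetector.evaluate(e, 100.0))
      .map(e => FieldEnricher.enrich(e, Map("alert" -> "speeding")))
      .map(e => FieldMapper(e, Map("speed" -> "speed_kph")))
}
```

Each step depends only on the output of the step before it, which is why the configured flow of a service can be expressed as a directed acyclic graph.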
20. High Velocity & Volume Storage
Requirements
• Sustain a write throughput of tens of thousands of events per second
• Achieve growth in volume capacity & write throughput when adding machines
• Efficiently access and query a large volume of data
- Query by any combination of id, time, space, and attributes
21. Elasticsearch
Store and Search Data in Real-Time
• Distributed, Scalable, and Highly Available
- Detect new or failed nodes, and reorganize and rebalance data automatically
• Near real-time
- Indexed data becomes available for search and analytics within about a second (the default refresh interval)
• Spatial and Full Text Search
- Comes with GeoPoint and GeoShape (polygon and polyline)
• RESTful API
• Spark Elasticsearch Connector
- https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/core/main/scala/org/elasticsearch/spark/rdd
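The storage requirement of querying "by any combination of id, time, space, and attributes" maps onto an Elasticsearch bool query with several filters. The sketch below only builds the request body as a string. The field names `track_id`, `timestamp`, and `location` are assumptions, `location` is assumed to be mapped as a geo point, and the `bool`/`filter` form shown is the syntax of Elasticsearch 2.x and later, so older clusters would need a different wrapper.

```scala
object EsQuerySketch {
  // Bounding box in degrees: top/bottom are latitudes, left/right longitudes.
  final case class BBox(top: Double, left: Double, bottom: Double, right: Double)

  // Builds a bool query whose filters must all match: an exact id,
  // a time window, and a spatial bounding box.
  // Note: values are interpolated as-is, with no JSON escaping.
  def query(trackId: String, fromIso: String, toIso: String, box: BBox): String =
    s"""{
       |  "query": { "bool": { "filter": [
       |    { "term":  { "track_id": "$trackId" } },
       |    { "range": { "timestamp": { "gte": "$fromIso", "lte": "$toIso" } } },
       |    { "geo_bounding_box": { "location": {
       |        "top_left":     { "lat": ${box.top},    "lon": ${box.left} },
       |        "bottom_right": { "lat": ${box.bottom}, "lon": ${box.right} }
       |    } } }
       |  ] } }
       |}""".stripMargin
}
```

Dropping any one of the three filter clauses still leaves a valid query, which is how "any combination" of criteria can be supported.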
28. High Velocity & Volume Visualization
Requirements
• Render a map service that has the ability to do aggregation-on-the-fly
- aggregations are calculated at various levels of detail and are specific to each user session
- when zoomed in far enough raw features are returned and rendered
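Aggregation-on-the-fly can be sketched as binning features into a grid whose cell size depends on the zoom level, and falling back to raw features when few enough remain in view. The grid scheme, the `rawThreshold` rule, and all names below are assumptions made for illustration. In the deck's architecture the aggregation runs inside Elasticsearch, which offers a geohash grid aggregation for this purpose.

```scala
object GridAggregationSketch {
  final case class Feature(lon: Double, lat: Double)

  // Cell size in degrees halves with each zoom level (an assumed scheme).
  def cellSize(zoom: Int): Double = 360.0 / (1L << zoom)

  // Count features per grid cell at the given zoom level.
  def aggregate(features: Seq[Feature], zoom: Int): Map[(Long, Long), Int] = {
    val size = cellSize(zoom)
    features
      .groupBy { f =>
        (math.floor((f.lon + 180.0) / size).toLong,
         math.floor((f.lat + 90.0) / size).toLong)
      }
      .map { case (cell, fs) => cell -> fs.size }
  }

  // Right: raw features when few enough to draw; Left: per-cell counts.
  def render(features: Seq[Feature], zoom: Int, rawThreshold: Int)
      : Either[Map[(Long, Long), Int], Seq[Feature]] =
    if (features.size <= rawThreshold) Right(features)
    else Left(aggregate(features, zoom))
}
```

Because the counts are computed per request, each user session sees aggregates matching its own extent and level of detail.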
31. ArcGIS API for JavaScript
https://developers.arcgis.com/javascript/
• A lightweight way to embed maps and tasks in web apps
• Connects to any Map Service or Feature Service compliant source
47. Summary
Applying Geospatial Analytics Using Apache Spark
• When working with high velocity & volume spatiotemporal data, we have found the best technology selections to be:
- Ingestion = Spark Streaming + Kafka
- Storage = Elasticsearch + Spark Elasticsearch Connector
- Visualization = ArcGIS API for JavaScript + on-the-fly aggregations in Elasticsearch
- Continuous Analytics = Spark Streaming + GIS Tools for Hadoop
- Batch Analytics = Spark Core +/- Spark SQL + GIS Tools for Hadoop
• GIS Tools for Hadoop
- Can be used as a basis to add spatial geometries, relations, and operators to Spark
- http://esri.github.io/gis-tools-for-hadoop/
48. Questions / Feedback?
C. Adam Mollenkopf
Real-Time GIS Capability Lead, Esri
amollenkopf@esri.com
@amollenkopf