This document discusses geospatial analytics and spatial capabilities on big data systems. It covers analyzing movement data through techniques like trajectory analysis and discretization. It discusses operational requirements for analyzing telematics data at large scales. It proposes using Apache Spark and geospatial libraries on Hadoop for distributed processing and storage. Key analytical challenges discussed include snap-to-road matching, trajectory clustering, and traffic event detection. Machine learning techniques like kernel methods and sequence analysis are proposed for solving these challenges.
2. Agenda
• Why Analyze Telematics
• Analysis of Movement Data
• Analytical Assets for Telematics
• Operational Requirements on Telematics
• Data flow on big data platforms
• Analytical Challenges and Applications Solved through
Machine Learning
• Snap to road
• Unifying trajectories to patterns of movements and routines
• Traffic event detection
3. Why Analyze Telematics
• We are being recorded everywhere
• Provides great insights into the customer routines and
movement
• Key players competing in the market
4. Analyzing Movement Data
• Trajectory: an object in motion (time–space)
• Raw trajectories: coordinate-based recording
• Symbolic trajectories: obtained by discretization into streets, locations, or events
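A minimal sketch of the discretization step, assuming a regular latitude/longitude grid as the symbol alphabet. The slides do not prescribe an alphabet; streets or named locations would be handled the same way. The function name `to_symbols` and the cell size are illustrative choices, not part of the original material.

```python
import math


def to_symbols(points, cell_deg=0.01):
    """Map (lat, lon) fixes to grid-cell symbols, collapsing consecutive repeats."""
    symbols = []
    for lat, lon in points:
        # Each cell is identified by its integer row/column in the grid.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        if not symbols or symbols[-1] != cell:
            symbols.append(cell)
    return symbols


# Hypothetical raw trajectory: four GPS fixes.
raw = [(52.5201, 13.4049), (52.5204, 13.4052), (52.5312, 13.4051), (52.5318, 13.4155)]
print(to_symbols(raw))
```

The first two fixes fall into the same cell, so the symbolic trajectory is shorter than the raw one.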
5. Traditional Operational Requirements In The World Of
Geographic Information Systems (GIS)
• Traditional use cases: cartography, geo-algebra (display of statistical
events, hotspots, co-locations on the map)
• Databases used: PostgreSQL, SQL Server
• Mostly static data sources
• Relatively small data sets
• Moderate geometric accuracy
• Offline processing acceptable
• Support for complex geometric data types
6. Operational Requirements and
Design Considerations for
Telematics
• Real-time ingestion and analytics on sensor data, distance queries,
snap-to-road
• Data at 100 TB to petabyte scale
• High variation in geospatial queries (range queries, etc.) and in the
throughput of CRUD operations: insertion/deletion/read
• The processing flow, the map applications, and the nature of the relationships in the
data all bear on the choice of storage technology, as do the indexing techniques and
their implications.
7. Telematics and Geospatial Data
Types
• Spatial data structures:
• Raster: geographically-referenced matrix of uniform size
• Vector: features on the earth’s surface are represented as
geographically-referenced vector objects
• Hierarchical nature of objects
• Points: different types : Entity, label, area, node
• Lines: lines, polylines, arc, link, etc.
• Polygons: area, polygon, complex polygon
• Requirements: the ability to manipulate geospatial data.
• Databases and libraries are required to manipulate these objects at
distributed scale (Spark and Scala, MongoDB, or any other NoSQL
database)
8. Analytical Assets for Telematics
• The analytical assets for telematics fall broadly into:
• Snap-to-road
• Analysis of user activities (clustering)
• Traffic event detection (classification)
• Real-time location search
• Set operations on geometry objects and geo-algebra (layering of
geospatial information atop each other and algebraic operations on
them)
9. Conceptual dataflow and geospatial processing in Telematics
[Diagram: conceptual data flow. Components: PDA (event capture); Kafka; stream processing engine (event processing and delivery decision); persistence layer on top of Hadoop (MongoDB / HBase / Cassandra / Elasticsearch) holding PDA geodata and critical events; risk area; data feed client (preloads risk areas and traffic info); Tomcat app (optional raster processing with GeoTrellis); REST API and WebSocket push; client (D3 / Ajax / Leaflet). Connections are labeled push, pull, or stream.]
The persistence layer should be scalable and support storage and querying of spatiotemporal objects (points, polygons, lines, line strings; for reference, see MongoDB's 2dsphere indexing and geospatial
querying). The following low-level queries shall be supported: (1) Nearest-neighbor query: given a point (lat, long), find all the line strings within a radius of x meters. (2) Containment query: find all
the points within a polygon or, given a point, find all the polygons containing it.
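The two low-level queries can be sketched in plain Python to make their semantics concrete. This is an illustration under simplifying assumptions, not how a spatial database implements them: the radius query only checks line-string vertices (a point lying near the middle of a long segment would be missed), and the containment test uses planar ray casting, which is reasonable only for small polygons away from the poles and the antimeridian. All function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def linestrings_within(point, linestrings, radius_m):
    """Names of line strings with at least one vertex within radius_m of point."""
    lat, lon = point
    return [name for name, coords in linestrings.items()
            if any(haversine_m(lat, lon, la, lo) <= radius_m for la, lo in coords)]


def point_in_polygon(point, polygon):
    """Ray casting on (lat, lon) vertices, treating coordinates as planar."""
    lat, lon = point
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (lat1 > lat) != (lat2 > lat):
            lon_cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < lon_cross:
                inside = not inside
    return inside


streets = {
    "a": [(52.5200, 13.4050), (52.5210, 13.4060)],
    "b": [(52.6000, 13.5000), (52.6010, 13.5010)],
}
square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(linestrings_within((52.5201, 13.4051), streets, 100))
print(point_in_polygon((5, 5), square), point_in_polygon((15, 5), square))
```

At the data volumes discussed here, a linear scan like this would be replaced by an indexed query in the persistence layer.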
Client browser, e.g. for a fleet manager. In the current scheme, we have deferred all the intelligence to the client, i.e. the raster processing, displaying the map, and the different layers along with map algebra will be done on the client side. One such example can be Leaflet. An alternate strategy can be to use geotrellis.io as a geoprocessing engine to do the raster operations and use the client only for the display of the map.
Stream processing queries: (1) Instantaneous speed / angular momentum of the PDA. (2) Distance to a traffic event pulled from Bing. (3) Running aggregates, e.g. how long the vehicle has spent at the current location.
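Two of these stream queries, instantaneous speed and the running "time at current location" aggregate, can be sketched as a small per-vehicle state object that is updated one fix at a time. This is a single-process illustration assuming fixes arrive in time order as (seconds, lat, lon); the class name `VehicleState` and the 50 m dwell radius are hypothetical, and a real stream engine would keep this state per key.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class VehicleState:
    """Keeps just enough state to answer speed and dwell-time queries."""

    def __init__(self, dwell_radius_m=50.0):
        self.dwell_radius_m = dwell_radius_m
        self.last = None    # previous fix (t, lat, lon)
        self.anchor = None  # fix at which the current stay began

    def update(self, t, lat, lon):
        """Consume one fix; return (speed in m/s, seconds at current location)."""
        speed = 0.0
        if self.last is not None:
            t0, lat0, lon0 = self.last
            if t > t0:
                speed = haversine_m(lat0, lon0, lat, lon) / (t - t0)
        # Leaving the dwell radius starts a new stay at the current fix.
        if self.anchor is None or haversine_m(
                self.anchor[1], self.anchor[2], lat, lon) > self.dwell_radius_m:
            self.anchor = (t, lat, lon)
        self.last = (t, lat, lon)
        return speed, t - self.anchor[0]


state = VehicleState()
for fix in [(0, 52.520, 13.405), (10, 52.520, 13.405), (20, 52.521, 13.405)]:
    print(state.update(*fix))
```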
[Diagram, external services: geocoding service; OSM / real-time traffic API]
10. Analytics Cluster GIS capabilities
[Diagram: Hadoop cluster. Boxes, in order: NoSQL database (MongoDB / HBase / Elasticsearch), labeled data storage / provisioning layer; Spark with Scala plus RStudio Server and RMR, labeled processing layer.]
11. Data Storage - Persistence layer
Name | Index strategy | Geometry | Query types | Ease of use/integration | Scalability/speed | Comments
--- | --- | --- | --- | --- | --- | ---
Elasticsearch | Geohash | Point | Bbox, radius | Good, 3 stars | 10s of TBs | Average writes; reads and search extremely fast
Neo4j | R-tree | Point / line / polygon | Bbox, radius | Moderately good, 2 stars | 10s of TBs | Too granular
HBase | Build your own index | - | - | Moderately good, 2 stars | Petabytes | Writes are fast, reads as well; needs specialization
Cassandra | Build your own index | - | - | Good, 3 stars | Petabytes | Same as HBase
MongoDB / Couchbase | Geohash | Point / line / polygon | 1) geo-within 2) near 3) intersect | Excellent, 5 stars; GeoJSON / Leaflet / OSM | 10s of TBs, average throughput | Best integration with GeoJSON in all cases
Proposed solutions:
• Short term: MongoDB
• Long term: Elasticsearch as the indexing engine and HBase/Cassandra as the storage
technology on top of Hadoop
12. Analytical Services on Telematics
Cluster
1) Geocoding and reverse geocoding service on the
cluster
2) Weather and traffic API (real-time and historical) to
support the use cases for weather- and traffic-related
analytics
3) Street maps (OpenStreetMap at the start, then
better map providers in the longer run)
• Required for the following analytics: regular trips, snap-to-road,
mode of transport, identification of risky roads, impact of
POIs (e.g. schools) on events, enables location-based
13. Analytical Operations/Procedures Useful For Spatial Analysis
(RStudio Server With R Packages)
• Having an RStudio Server on the cluster would be useful.
• GitHub repository (already established)
• R packages for dealing with vector data (rgdal, rgeos, geojson_io,
SpatialTransforms)
• Point pattern analysis: dbscan, glm, gbm
• Describing and analyzing fields, statistical analysis of
fields/spatial interpolation: kriging, tps
• Network analysis, snap-to-road, frequent routes, etc.
(igraph, sna)
• Visualization of the data: leaflet, shiny
14. Geospatial processing layer on top
of persistence
• The geospatial processing layer integrates map geometry
and map algebra to display the information on the map. On a small scale,
this can be performed in JavaScript (Leaflet / D3).
• The following operations are required:
1) Vector operations
2) Map algebra
• On a larger scale, a software engineering layer for
distributed geospatial processing is required, for example Scala,
Spark and GeoTrellis.
• http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/allixender/5676830073
15. Analytical Challenges in Movement Data
• Basic challenges in movement data
• Matching (Snap-to-road, street network matching)
• Similarity measures
• Trajectory clustering
• Event detection (classification)
16. Example Applications Solved through Machine Learning
• For raw trajectories
• Snap-to-road
• For symbolic trajectories
• Analysis of user activities
• Traffic event detection
17. Snap-to-road
• Given a trajectory T and a street network G
• Find the path in G that best matches T, i.e. its real (ground
truth) path
18. Snap-to-road: Analytical Modeling
• Multi-regression view:
• Task = estimate a noise-free function f from T
that preserves the structural information
• Preserving structural correlations in the output:
• Try a kernelized embedding with a kernel for raw trajectories
19. Snap-to-road
• An important problem in organizations like Here, IBM and
Microsoft.
• Positioning error between 10 and 100 meters (Wi-Fi, vehicle navigation,
mobile devices)
• Degraded sampling rates and sparse GPS data
• Difficult at roundabouts and tunnels
20. Solution:
Basic steps:
1. Embed the trajectory by kernel methods but
ignore map constraints
Benefits:
• Noise reduction
• Captures multi-output, non-linear
dependencies
2. ‘Round’ the resulting ‘relaxed
assignment’ to the street map
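The rounding step can be illustrated with a plain geometric baseline: project each (already denoised) point onto its nearest street segment. This is not the kernelized embedding described in the slides, only a sketch of what "rounding to the street map" means for a single point. It assumes planar coordinates in meters, and the names `project_on_segment` and `snap` are hypothetical.

```python
def project_on_segment(p, a, b):
    """Closest point to p on the segment from a to b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return (ax, ay)  # degenerate segment
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap(p, segments):
    """Return (segment id, snapped point) for the nearest segment."""
    best = None
    for seg_id, (a, b) in segments.items():
        q = project_on_segment(p, a, b)
        d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
        if best is None or d2 < best[0]:
            best = (d2, seg_id, q)
    return best[1], best[2]


# Hypothetical street network with two segments.
segments = {"main": ((0, 0), (100, 0)), "side": ((50, 20), (50, 100))}
print(snap((25, 5), segments))
```

Snapping points independently ignores the connectivity of the network, which is one reason the problem is hard at roundabouts and with sparse data.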
23. Grouping Of Trajectories/Stops In Similar Routines
Grouping requires similarity measures for trajectories.
Unroll a trajectory by defining a mapping
24. Similarity Measures For Trajectories -- Symbolic Trajectories
• Formed by discretization of the curve through
measurement process or algorithms.
• Snap-to-road
• Stay points
• Regional division
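Stay-point extraction can be sketched with a common distance/time-threshold heuristic: a run of consecutive fixes that stays within a given distance of its first fix for at least a minimum duration becomes one stay point. The slides do not state which algorithm was used, so the thresholds and the function name `stay_points` are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stay_points(fixes, dist_m=200.0, min_secs=600.0):
    """fixes: time-ordered (t_seconds, lat, lon). Returns (lat, lon, arrive, leave)."""
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the run while fixes stay within dist_m of the first fix.
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) <= dist_m:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= min_secs:
            group = fixes[i:j]
            lat = sum(f[1] for f in group) / len(group)
            lon = sum(f[2] for f in group) / len(group)
            stays.append((lat, lon, fixes[i][0], fixes[j - 1][0]))
            i = j
        else:
            i += 1
    return stays


# Hypothetical trace: about 12 minutes near one spot, then driving away.
fixes = [(0, 52.5200, 13.4050), (300, 52.5201, 13.4051), (700, 52.5200, 13.4052),
         (800, 52.5300, 13.4200), (900, 52.5400, 13.4400)]
print(stay_points(fixes))
```

The resulting sequence of stay points is one form of symbolic trajectory and can feed the clustering described on the following slides.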
25. Clustering of Staypoints to find Homezones
26. Applications for Symbolic Trajectories Clustering and Event
Detection
• Trajectory clustering
• User activity analysis
• Traffic event detection
• Classification of events from non-event data
• Rerouting of traffic during baseball games
• Detection of conference in auditoriums
27. Applications for Symbolic Trajectories
• Exploit sequence analysis (in particular biological
sequence analysis)
1. Discretize the raw trajectories with an appropriate alphabet
2. Use an alignment kernel with traffic symbol similarity in order to
translate traffic invariances to the biological domain
3. Exploit sequence analysis to find discrete sequential patterns
(Where Traffic Meets DNA, Best Poster Award, ACM GIS
2011, Ahmed Jawad)
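To make the sequence-analysis idea concrete, here is a sketch of a global alignment score (Needleman-Wunsch dynamic programming) over symbolic trajectories with a custom symbol similarity. This is a plain alignment score, not the alignment kernel from the cited poster, and the similarity values, the gap penalty and the `PARALLEL` table of interchangeable streets are made-up assumptions.

```python
# Hypothetical domain knowledge: street pairs that count as near-equivalent.
PARALLEL = {("B", "B2")}


def street_sim(x, y):
    """Symbol similarity: exact match > parallel street > unrelated."""
    if x == y:
        return 2.0
    if (x, y) in PARALLEL or (y, x) in PARALLEL:
        return 1.0
    return -1.0


def align_score(a, b, sim=street_sim, gap=-1.0):
    """Best global alignment score of symbol sequences a and b."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align two symbols
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    return score[n][m]


route1 = ["A", "B", "C", "D"]
route2 = ["A", "B2", "C", "D"]
print(align_score(route1, route1), align_score(route1, route2))
```

A route that swaps one street for a parallel one scores almost as high as an identical route, which is the kind of invariance the symbol similarity is meant to encode.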
29. Trajectory Clustering :
Analysis of User Activities
• Analysis of user activities
• Frequent routes in trajectories
• Clustering at map-matched level
• Frequent routines in trajectories
• Clustering at stay point level
• Visualization of variability in routines (sequence logos)
33. Application for Symbolic Trajectories:
Traffic Event Detection
Using biological sequence methods to model event persistence
• Analysis of Dodgers baseball games from highway sensor
data
• Detecting Presence of Baseball Game
• Visualization
• Analysis of events at Caltech auditorium Entrance
• Detecting conferences in the auditorium
36. Summary and Conclusions
• Structural information analysis is the connection
between machine learning and GIS
• Still, a lot of data engineering and task-specific tricks are
needed, e.g. regularization and normalization
37. Active Directions being pursued
• In Snap-to-road
• Fisher kernels for Sparse GPS data
• Testing KMM with real world system
• In clustering and event detection
• User profiles and diaries
• Label sequence graph kernels
• In structural information
• Can doing away with the latitude/longitude pairs and keeping only
the structural information help with privacy issues?
42. Appendix: persistence options
• Neo4j Spatial:
• Utilities for importing from ESRI Shapefile as well as OpenStreetMap
files
• Support for all the common geometry types
• An RTree index for fast searches on geometries
• Support for topology operations during the search (contains, within,
intersects, covers, disjoint, etc.)
• The possibility to enable spatial operations on any graph of data,
regardless of the way the spatial data is stored, as long as an adapter is
provided to map from the graph to the geometries.
• Ability to split a single layer or dataset into multiple sub-layers or views
with pre-configured filters
43. Appendix: persistence options
HBase/Cassandra: build your own index.
• Perform geohashing yourself or use Elasticsearch
as a hashing / search engine
• Libraries are available to connect Elasticsearch with
Cassandra / HBase
• Besides, geohashing is easy to program
• http://thenewstack.io/building-streaming-data-hub-elasticsearch-kafka-cassandra/
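As a sketch of how little code geohashing needs, the standard encoding interleaves longitude and latitude bisection bits and emits one base-32 character per five bits. The function name `geohash_encode` and the example row-key layout are illustrative assumptions, not a recommendation from the slides.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a (lat, lon) pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6, 5))
# Hypothetical row key: geohash prefix first, so nearby points sort together.
print(geohash_encode(52.5200, 13.4050, 7) + "|vehicle-42|1700000000")
```

Because nearby points usually share a prefix, a prefix scan over such keys approximates a bounding-box query, although points on either side of a cell boundary can have very different hashes.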
44. Appendix: persistence options
MongoDB Geospatial
• Store your location data as GeoJSON objects with this
coordinate-axis order: longitude, latitude. The
coordinate reference system for GeoJSON uses the
WGS84 datum.
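The axis order is easy to get wrong, so a small helper that takes latitude and longitude by name and emits them in GeoJSON order can help. The sketch below only builds plain dictionaries; the field names `vehicle_id` and `location` are hypothetical, and the `$near` filter assumes a 2dsphere index exists on the queried field.

```python
import json


def geojson_point(lat, lon):
    """Build a GeoJSON Point; coordinates are ordered [longitude, latitude]."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    return {"type": "Point", "coordinates": [lon, lat]}


def near_filter(field, lat, lon, max_distance_m):
    """Query filter for documents within max_distance_m of a point."""
    return {field: {"$near": {"$geometry": geojson_point(lat, lon),
                              "$maxDistance": max_distance_m}}}


doc = {"vehicle_id": "demo-1", "location": geojson_point(52.52, 13.405)}
print(json.dumps(doc))
print(json.dumps(near_filter("location", 52.52, 13.405, 500)))
```

Passing a latitude above 90 raises an error, which catches many swapped-argument mistakes, though not those where both values happen to be valid latitudes.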