Thermopylae Sciences & Technology chose to customize MongoDB's spatial indexing capabilities to better support their needs for indexing multi-dimensional and geospatial data. They developed a custom R-tree spatial index that leverages existing MongoDB data structures and provides improved performance over MongoDB's existing geohash-based approach. Their custom index supports complex queries on multidimensional geometric shapes and scales to large geospatial datasets through potential sharding and distribution techniques. They have contributed their work back to the MongoDB open source project and collaborate with MongoDB to further integrate their contributions.
Why We Chose MongoDB for Big Spatial Data
1. WHY WE CHOSE MONGODB TO
PUT BIG-DATA ‘ON THE MAP’
JUNE 2012
@nknize
+Nicholas Knize
2. “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one
location…this capability allows for unprecedented situational awareness and information sharing”
-Gen. Doug Fraser
TST PRODUCTS
ACCOMPLISHING THE IMPOSSIBLE
3. • Expose enterprise data in a geo-temporal user defined
environment
• Provide a flexible and scalable spatial indexing framework
for heterogeneous data
• Visualize spatially referenced data on 3D globe & 2D maps
• Manage real-time data feeds and mobile messaging
• View data over geo-rectified imagery with 3D terrain
• Support mission planning and simulation
• Provide real-time collaboration and sharing
ISPATIAL OVERVIEW
4. Desired Data Store Characteristics for ‘Big Data’
• Horizontally scalable – Large volume / elastic
• Vertically scalable – Heterogeneous data types (“Data Stack”)
• Smartly Distributed – Reduce the distance bits must travel
• Fault Tolerant – Replication Strategy and Consistency model
• High Availability – Node recovery
• Fast – Reads or writes (can’t always have both)
BIG DATA STORAGE CHARACTERISTICS
5. Subset of Evaluated NoSQL Options
• Cassandra
– Nice Bring Your Own Index (BYOI) design
– … but Java, Java, Java… Memory management can be an issue
– Adding new nodes can be a pain (Token Changes, nodetool)
– Key-Value store…good for simple data models
• HBase
– Nice BigTable model
– Theory grounded heavily in the CAP theorem, inflexible trade-offs
– Complicated setup and maintenance
• CouchDB
– Provides some GeoSpatial functionality (Currently being rewritten)
– HEAVILY dependent on Map-Reduce model (complicated design)
– Erlang based – poor multi-threaded heap management
NOSQL OPTIONS
6. Why MongoDB for Thermopylae?
• Documents based on JSON – A GEOJSON match made in heaven!
• C++ - No Garbage Collection Overhead! Efficient memory management
design reduces disk swapping and paging
• Disk storage is memory mapped, enabling fast swapping when necessary
• Built in auto-failover with replica sets and fast recovery with journaling
• Tunable Consistency – Consistency defined at application layer
• Schema Flexible – friendly properties of SQL enable easy port
• Provides initial spatial indexing support – limited to points!
WHY TST LIKES MONGODB
7. ... The Spatial Indexer wasn’t quite right
• MongoDB (like nearly all relational DBs) uses a B-tree
– Data structure that keeps data sorted and searchable in logarithmic time
– Great for indexing numerical and text documents (1D attribute data)
– Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY
MONGODB SPATIAL INDEXER
8. How does MongoDB solve the dimensionality problem?
• Space Filling (Z) Curve
– A continuous line that intersects every point in a two-dimensional plane
• Use Geohash to represent lat/lon values
– Interleave the bits of a lat/long pair
– Base32 encode the result
DIMENSIONALITY REDUCTION
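The two Geohash steps above (interleave the bits, then Base32 encode) can be sketched as follows. This is a minimal illustration of the public Geohash scheme, not MongoDB's internal code. The function name `geohash_encode` and its `length` parameter are assumptions made for this example.

```python
# Standard Geohash alphabet (digits plus lowercase letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, length: int = 8) -> str:
    """Interleave longitude/latitude bits and Base32 encode them."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # accumulates 5 bits per output character
    nbits = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:
            chars.append(BASE32[bits])
            bits = 0
            nbits = 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6, 5))  # ezs42
```

Because each extra character only refines the same cell, a longer hash of the same point always starts with the shorter one, which is what lets a B-tree range scan over the string stand in for a 2D lookup.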
9. Issues with the Geohash b-Tree approach
• Neighbors aren’t so close!
– Neighboring points on the Geoid may end up on opposite ends of the plane
– Impacts search efficiency
• What about Geometry?
– Doesn’t support > 2D
– Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
GEOHASH BTREE ISSUES
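The "neighbors aren't so close" issue can be seen with a plain Z-order (Morton) code on an integer grid. This is a small sketch, assuming a hypothetical helper `morton` that interleaves the bits of two cell coordinates. It is the same interleaving idea as Geohash, without the Base32 step.

```python
def morton(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Cells (6, 7) and (7, 7) touch, and so do (7, 7) and (8, 7).
left = morton(6, 7)    # 62
middle = morton(7, 7)  # 63
right = morton(8, 7)   # 106

# The first pair is adjacent on the curve; the second pair is 43 keys apart,
# because x = 8 crosses a power-of-two boundary of the grid.
print(middle - left, right - middle)  # 1 43
```

In a sorted 1D index, that gap is filled with keys for cells that are not spatially near either point, so a range scan has to skip or filter them.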
10. Mongo Multi-location Document Clipping Issues
($within search doesn’t always work w/ multi-location)
[Figure: four cases of a multi-location document (aka polygon) against a search polygon. Cases 1 and 2: success. Cases 3 and 4: fail.]
MULTI-LOCATION CLIPPING
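The four cases on this slide are not recoverable from the text, so the sketch below shows one assumed scenario in which a vertex-only index misses a match. The search box overlaps the polygon but contains none of its vertices. All names here are hypothetical, and axis-aligned boxes are used to keep the geometry simple.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def point_in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def any_vertex_in_box(vertices: List[Point], box: Box) -> bool:
    """What a multi-point index can answer: is any stored point in the box?"""
    return any(point_in_box(v, box) for v in vertices)


def mbr(vertices: List[Point]) -> Box:
    """Minimum bounding rectangle of the shape."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


polygon = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
search = (4.0, 4.0, 6.0, 6.0)  # lies inside the polygon, far from its corners

vertex_hit = any_vertex_in_box(polygon, search)  # False: no vertex is in the box
mbr_hit = boxes_intersect(mbr(polygon), search)  # True: the shapes do overlap
print(vertex_hit, mbr_hit)
```

An index that stores the shape's bounding box rather than its individual vertices catches this overlap as a candidate, which can then be checked against the exact geometry.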
11. Potential Solutions
• Constrain the system to single point searches
– Multi-dimension support will be exponentially complex (won’t scale)
• Interpolate points along the edge of the shape
– Multi-dimension support will be exponentially complex (won’t scale)
• Customize the spatial indexer
– Selected approach
SOLUTIONS TO GEOHASH PROBLEM
12. Thermopylae Custom Tuned MongoDB for Geo
TST Leverages Guttman’s 1984 Research in R-Trees (and the later R*-Tree variant)
• R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
• Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
• Splits and merges optimized by minimizing overlaps
• The leaves point to the actual objects (probably stored on disk)
• Height balanced – tree height is O(log n), so search is O(log n) when MBR overlap is low
CUSTOM TUNED SPATIAL INDEXER
13. Spatial Indexing at Scale with R-Trees
Spatial data represented as minimum bounding rectangles (2-dimension),
cubes (3-dimension), hyperrectangles (4-dimension)
Index represented as: <I, DiskLoc> where:
I = (I0, I1, … In-1) : n = number of dimensions
Each Ii is a closed interval [min, max] describing the MBR range along dimension i
RTREE THEORY
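The <I, DiskLoc> entry above can be sketched as one [min, max] interval per dimension plus a record pointer. This is an illustration only. `IndexEntry`, `mbr_of`, and `intersects` are hypothetical names, and a plain integer stands in for MongoDB's DiskLoc.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (min, max) along one dimension


@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # I = (I0, ..., In-1), one per dimension
    disk_loc: int                    # assumed stand-in for DiskLoc


def mbr_of(points: Sequence[Sequence[float]]) -> Tuple[Interval, ...]:
    """Minimum bounding region of n-dimensional points, one interval per axis."""
    dims = len(points[0])
    return tuple(
        (min(p[d] for p in points), max(p[d] for p in points))
        for d in range(dims)
    )


def intersects(a: Tuple[Interval, ...], b: Tuple[Interval, ...]) -> bool:
    """Two regions overlap only if their intervals overlap on every axis."""
    return all(
        lo_a <= hi_b and lo_b <= hi_a
        for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b)
    )


# A 3D shape (x, y, altitude) reduced to three intervals.
shape = [(0.0, 0.0, 0.0), (2.0, 5.0, 1.0), (1.0, 3.0, 4.0)]
entry = IndexEntry(intervals=mbr_of(shape), disk_loc=42)
print(entry.intervals)  # ((0.0, 2.0), (0.0, 5.0), (0.0, 4.0))
```

The same two functions work unchanged for 2, 3, or 4 dimensions, since the dimension count only changes the length of the interval tuple.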
14. R*-Tree Spatial Index Example
• Sample insertion result for 4th order
tree
• Objectives:
1. Minimize area
2. Minimize overlaps
3. Minimize margins
4. Maximize inner node utilization
[Figure: tree with directory entries m, n, o, p over leaf entries a through l]
R*-TREE INDEX OBJECTIVES
15. Insert
• Similar to insertion into B+-tree but may insert
into any leaf; leaf splits in case capacity exceeded.
– Which leaf to insert into?
– How to split a node?
R*-TREE INSERT EXAMPLE
16. Insert—Leaf Selection
• Follow a path from root to leaf.
• At each node move into subtree whose MBR area
increases least with addition of new rectangle.
[Figure: candidate subtree MBRs m, n, o, p]
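The leaf-selection rule can be sketched for 2D rectangles as below. This is a minimal version of the least-enlargement rule from Guttman's R-tree, with ties broken by smaller area. The function names are hypothetical, and the R*-tree's additional overlap criterion is not modeled here.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr: Rect, new: Rect) -> float:
    """How much the MBR's area grows if `new` is added under it."""
    return area(union(mbr, new)) - area(mbr)


def choose_subtree(child_mbrs: List[Rect], new: Rect) -> int:
    """Index of the child whose MBR grows least; ties go to the smaller MBR."""
    return min(
        range(len(child_mbrs)),
        key=lambda i: (enlargement(child_mbrs[i], new), area(child_mbrs[i])),
    )


children = [(0.0, 0.0, 4.0, 4.0), (5.0, 5.0, 9.0, 9.0)]
print(choose_subtree(children, (6.0, 6.0, 7.0, 7.0)))  # 1: already covered
print(choose_subtree(children, (1.0, 1.0, 4.5, 4.5)))  # 0: grows by 4.25 vs 48
```

Applying `choose_subtree` at each level from the root down gives the root-to-leaf path described on the slide.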
21. Query
• Start at root
• Find all overlapping MBRs
• Search subtrees recursively
[Figure: tree with directory entries m, n, o, p over leaf entries a through l, alongside the spatial layout (labels m, n, o, p, a, x)]
22. Query
• Search m.
[Figure: same tree, with the spatial layout showing labels m, n, o, p, a, b, c, d, e, g, x]
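The query steps (start at the root, find overlapping MBRs, recurse) can be sketched as a short recursive search. The `Node` class and the sample tree are hypothetical and only loosely follow the m/n labels in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None  # set only on leaf entries


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: Node, query: Rect, out: Optional[List[str]] = None) -> List[str]:
    """Collect every object whose MBR overlaps the query rectangle."""
    if out is None:
        out = []
    if not intersects(node.mbr, query):
        return out  # prune: nothing below this node can match
    if node.obj is not None:
        out.append(node.obj)
        return out
    for child in node.children:
        search(child, query, out)
    return out


a = Node((0.0, 0.0, 1.0, 1.0), obj="a")
b = Node((2.0, 2.0, 3.0, 3.0), obj="b")
c = Node((10.0, 10.0, 11.0, 11.0), obj="c")
m = Node((0.0, 0.0, 3.0, 3.0), children=[a, b])
n = Node((10.0, 10.0, 11.0, 11.0), children=[c])
root = Node((0.0, 0.0, 11.0, 11.0), children=[m, n])

print(search(root, (0.5, 0.5, 2.5, 2.5)))  # ['a', 'b']
print(search(root, (5.0, 5.0, 6.0, 6.0)))  # []
```

In the first query only subtree m is visited, because n's MBR does not overlap the query rectangle.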
23. R*-Tree Leverages B-Tree Base Data Structures (buckets)
R*-TREE MONGODB IMPLEMENTATION
24. Geo-Sharding – (in work)
Scalable Distributed R* Tree (SD-r*Tree)
“Balanced” binary tree, with nodes distributed on a set of servers:
• Each internal node has exactly two children
• Each leaf node stores a subset of the indexed dataset
• At each node, the heights of the subtrees differ by at most one
• mongos “routing” node maintains binary tree
GEO-SHARDING
25. SD-r*Tree Data Structure Illustration
[Figure: SD-r*Tree illustration with data nodes d0, d1, d2, coverage nodes r1, r2, and the spatial coverage of each data node over objects a through e]
• di = Data Node (Chunk)
• ri = Coverage Node
Leveraged work from Litwin, Mouza, Rigaux 2007
SD-r*Tree DATA STRUCTURE
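The binary tree of coverage nodes (ri) over data nodes (di) can be sketched as follows. This is an illustrative model only, not the in-work geo-sharding code. The class names, the rectangles, the tree shape r1 = (d0, r2), r2 = (d1, d2), and the shard assignment are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class DataNode:  # d_i: a leaf holding a chunk of the dataset
    name: str
    coverage: Rect
    server: str


@dataclass
class CoverageNode:  # r_i: an internal node with exactly two children
    name: str
    left: "Tree"
    right: "Tree"

    @property
    def coverage(self) -> Rect:
        return union(self.left.coverage, self.right.coverage)


Tree = Union[DataNode, CoverageNode]


def height(node: Tree) -> int:
    if isinstance(node, DataNode):
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node: Tree) -> bool:
    """Subtree heights differ by at most one at every node."""
    if isinstance(node, DataNode):
        return True
    return (abs(height(node.left) - height(node.right)) <= 1
            and is_balanced(node.left) and is_balanced(node.right))


def route(node: Tree, query: Rect) -> List[str]:
    """Servers whose data nodes may hold objects overlapping the query."""
    if not intersects(node.coverage, query):
        return []
    if isinstance(node, DataNode):
        return [node.server]
    return route(node.left, query) + route(node.right, query)


d0 = DataNode("d0", (0.0, 0.0, 5.0, 10.0), "GeoShard 1")
d1 = DataNode("d1", (5.0, 0.0, 10.0, 5.0), "GeoShard 2")
d2 = DataNode("d2", (5.0, 5.0, 10.0, 10.0), "GeoShard 3")
r2 = CoverageNode("r2", d1, d2)
r1 = CoverageNode("r1", d0, r2)

print(route(r1, (6.0, 1.0, 7.0, 2.0)))  # ['GeoShard 2']
```

A small query touches a single shard, while a query that straddles the coverage boundaries fans out to every shard whose rectangle it overlaps.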
26. SD-r*Tree Structure Distribution
[Figure: SD-r*Tree distributed across servers: mongos holds coverage nodes r1 and r2; GeoShard 1 holds d0, GeoShard 2 holds d1, GeoShard 3 holds d2]
SD-r*TREE STRUCTURE DISTRIBUTION
27. GeoSharding Alternative – 3D / 4D Hilbert Scanning Order
GEO-SHARDING ALTERNATIVE
28. Next Steps: Beyond 4-Dimensions - X-Tree
(Berchtold, Keim, Kriegel – 1996)
[Figure legend: Normal Internal Nodes, Supernodes, Data Nodes]
• Avoid MBR overlaps
• Avoid node splits (main cause for high overlap)
• Introduce new node structure: Supernodes – Large Directory nodes of variable size
BEYOND 4-DIMENSIONS
30. T-Sciences Custom Tuned Spatial Indexer
• Optimized Spatial Search – Finds intersecting MBR and recurses into
those nodes
• Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to
guide search
– 28% reduction in number of nodes touched
• Optimized Deletes – Leverages R* split/merge approach for rebalancing
tree when nodes become over/under-full
• Low maintenance – Leverages MongoDB’s automatic data compaction
and partitioning
CONCLUSION
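The "Hilbert Value of MBR centroid" used to guide inserts can be sketched in 2D as below. This uses the well-known iterative grid-to-Hilbert-index conversion. The names `hilbert_index` and `centroid_hilbert`, the grid size, and the world bounds are assumptions for this example, and the actual indexer may compute the value differently.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_hilbert(mbr: Rect, world: Rect, n: int = 1024) -> int:
    """Hilbert value of an MBR's centroid, quantized onto an n x n grid."""
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    gx = min(n - 1, int((cx - world[0]) / (world[2] - world[0]) * n))
    gy = min(n - 1, int((cy - world[1]) / (world[3] - world[1]) * n))
    return hilbert_index(n, gx, gy)


WORLD = (-180.0, -90.0, 180.0, 90.0)
print(hilbert_index(4, 3, 0))  # 15: last cell of the 4 x 4 curve
print(centroid_hilbert((-77.1, 38.8, -76.9, 39.0), WORLD))
```

Unlike the Z-order curve, consecutive Hilbert values always belong to touching grid cells, so ordering entries by this value tends to keep spatially close MBRs in the same node.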
31. Example Use Case – OSINT (Foursquare Data)
• Sample Foursquare data set mashed with Government Intel Data (poly reports)
• 100 million Geo Document test (3D points and polys)
• 4 server replica set
• ~350ms query response
• ~300% improvement over PostGIS
EXAMPLE
32. Community Support
• Thermopylae contributes fixes to the codebase
– http://github.com/mongodb
• TST will work with 10gen to fold into the baseline
• Active developer collaboration
– IRC: #mongodb freenode.net
FIND US
33. THANK YOU
Questions?
Nicholas Knize
nknize@t-sciences.com
THANK YOU
35. Thermopylae Sciences & Technology – Who are we?
• Advanced technology w/ 160+ employees
• Core customers in national security, venues and
events, military and police, and city planning
• Partnered with Google and imagery providers
• Long term relationship focused – TS/SCI Staff
TST + 10gen + Google = Game-changing approach
ENTERPRISE
PARTNER
WHO ARE THESE GUYS?
36. Key Customers - Government
• US Dept of State Bureau of Diplomatic Security
– Build and support 30 TB Google Earth Globe with multi-
terabytes of individual globes sent to embassies throughout
the world. Integrated Google Earth and iSpatial framework.
• US Army Intelligence Security Command
– Provide expertise in managing technology integration –
prime contractor providing operations, intelligence, and IT
support worldwide. Partners include IBM, Lockheed Martin,
Google, MIT, Carnegie Mellon. Integrated Google Earth and
iSpatial framework.
• US Southern Command
– Coordinate Intelligence management systems spatial data
collection, indexing, and distribution. Integrated Google
Earth, iSpatial, and iHarvest.
– Index large volume imagery and expose it for different
services (Air Force, Navy, Army, Marines, Coast Guard)
GOVERNMENT CUSTOMERS
37. Key Customers - Commercial
• Cleveland Cavaliers
• USGIF
• Las Vegas Motor Speedway
• Baltimore Grand Prix
iSpatial framework serves thousands of mobile devices
COMMERCIAL CUSTOMERS
Editor's notes
Screen shot of UDOP…blow-out of key features (sharing, presentation builder, etc)