WHY WE CHOSE MONGODB TO
 PUT BIG-DATA ‘ON THE MAP’
          JUNE 2012




           @nknize
        +Nicholas Knize
“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates’ information in one
location…this capability allows for unprecedented situational awareness and information sharing”
                                                                -Gen. Doug Fraser




  TST PRODUCTS
  ACCOMPLISHING THE IMPOSSIBLE
• Expose enterprise data in a geo-temporal user defined
  environment
• Provide a flexible and scalable spatial indexing framework
  for heterogeneous data
• Visualize spatially referenced data on 3D globe & 2D maps
• Manage real-time data feeds and mobile messaging
• View data over geo-rectified imagery with 3D terrain
• Support mission planning and simulation
• Provide real-time collaboration and sharing
  ISPATIAL OVERVIEW
Desired Data Store Characteristics for ‘Big Data’

• Horizontally scalable – Large volume / elastic

• Vertically scalable – Heterogeneous data types (“Data Stack”)

• Smartly Distributed – Reduce the distance bits must travel

• Fault Tolerant – Replication Strategy and Consistency model

• High Availability – Node recovery

• Fast – Reads or writes (can’t always have both)
   BIG DATA STORAGE CHARACTERISTICS
Subset of Evaluated NoSQL Options
           • Cassandra
                 –   Nice Bring Your Own Index (BYOI) design
                 –   … but Java, Java, Java… Memory management can be an issue
                 –   Adding new nodes can be a pain (Token Changes, nodetool)
                 –   Key-Value store…good for simple data models

           • HBase
                 – Nice BigTable model
                 – Theory grounded heavily in the CAP theorem, inflexible trade-offs
                 – Complicated setup and maintenance

           • CouchDB
                 – Provides some GeoSpatial functionality (Currently being rewritten)
                 – HEAVILY dependent on Map-Reduce model (complicated design)
                 – Erlang based – poor multi-threaded heap management

NOSQL OPTIONS
Why MongoDB for Thermopylae?
• Documents based on JSON – A GEOJSON match made in heaven!

• C++ - No Garbage Collection Overhead! Efficient memory management
  design reduces disk swapping and paging

• Disk storage is memory mapped, enabling fast swapping when necessary

• Built in auto-failover with replica sets and fast recovery with journaling

• Tunable Consistency – Consistency defined at application layer

• Schema Flexible – SQL-friendly properties make porting easy

• Provides initial spatial indexing support – but limited to points!
  WHY TST LIKES MONGODB
... The Spatial Indexer wasn’t quite right
• MongoDB (like nearly all relational DBs) uses a B-tree
     – Data structure that keeps data sorted, with logarithmic-time search and insert
     – Great for indexing numerical and text documents (1D attribute data)
     – Cannot natively index multi-dimensional (2D and higher) data – NOT COMPLEX GEOMETRY FRIENDLY




 MONGODB SPATIAL INDEXER
How does MongoDB solve the dimensionality problem?

• Space Filling (Z) Curve
     – A continuous line that
       intersects every point in a
       two-dimensional plane


• Use Geohash to
  represent lat/lon values
     – Interleave the bits of a
       lat/long pair
     – Base32 encode the result
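The two Geohash steps above can be sketched in a few lines of Python. This follows the public Geohash scheme (alternating longitude and latitude bits, then the Geohash base32 alphabet); it is an illustration only, not MongoDB's internal indexer code, and the function name `geohash_encode` is an assumption made for this sketch.

```python
# Geohash alphabet: digits plus lowercase letters without a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # the interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon

    # Pack each group of 5 interleaved bits into one base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 6))
```

Because the hash is a single sortable string, it can be stored in an ordinary B-tree, which is the point of the dimensionality reduction.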




 DIMENSIONALITY REDUCTION
Issues with the Geohash B-tree approach

• Neighbors aren’t so
  close!
     – Neighboring points on the
       Geoid may end up on
       opposite ends of the
       plane
     – Impacts search efficiency


• What about Geometry?
     – Doesn’t support > 2D
     – Mongo uses Multi-
       Location documents
       which really just indexes
       multiple points that link
       back to a single document
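The "neighbors aren't so close" issue shows up on a tiny grid. The sketch below uses a plain Z-order (Morton) code on integer cell coordinates as a stand-in for the Geohash bit interleaving; the function name `morton` and the 16x16 grid are assumptions made for the illustration.

```python
def morton(x, y, bits=4):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Three cells in the same row, each one step apart in space.
left, middle, right = morton(6, 0), morton(7, 0), morton(8, 0)

# Cells 6 and 7 are adjacent in space and adjacent on the curve.
print(middle - left)
# Cells 7 and 8 are also adjacent in space, but far apart on the curve.
print(right - middle)
```

A range scan over the 1D key therefore has to visit several disjoint key ranges to cover one small spatial window, which is what hurts search efficiency.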

 GEOHASH BTREE ISSUES
Mongo Multi-location Document Clipping Issues
                         ($within search doesn’t always work w/ multi-location)
[Figure: four search cases. Case 1: Success! Case 2: Success! Case 3: Fail! Case 4: Fail!]
            Multi-Location Document (aka. Polygon)                          Search Polygon
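The four cases in the figure are not recoverable from the transcript, so the sketch below shows only one plausible failure mode, under the assumption that a multi-location document matches when at least one of its indexed vertices falls inside the search area. A search box that crosses an edge of the polygon without containing a vertex is then missed, while a bounding-box test still finds it. Axis-aligned boxes stand in for general search polygons, and all function names are hypothetical.

```python
def point_in_box(point, box):
    """True if a 2D point lies inside (xmin, ymin, xmax, ymax)."""
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax


def any_vertex_in_box(vertices, box):
    """Vertex-only matching: what indexing individual points can offer."""
    return any(point_in_box(v, box) for v in vertices)


def mbr(vertices):
    """Minimum bounding rectangle of a list of 2D vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]  # stored as four points
corner_search = (-1, -1, 1, 1)   # contains the vertex (0, 0)
edge_search = (8, 4, 12, 6)      # crosses the right edge, contains no vertex

print(any_vertex_in_box(polygon, corner_search))      # vertex test succeeds
print(any_vertex_in_box(polygon, edge_search))        # vertex test misses it
print(boxes_intersect(mbr(polygon), edge_search))     # bounding-box test finds it
```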
MULTI-LOCATION CLIPPING
Potential Solutions

 • Constrain the system to single point searches
       – Multi-dimension support will be exponentially complex (won’t scale)



 • Interpolate points along the edge of the shape
       – Multi-dimension support will be exponentially complex (won’t scale)



 • Customize the spatial indexer
       – Selected approach



SOLUTIONS TO GEOHASH PROBLEM
Thermopylae Custom Tuned MongoDB for Geo

TST leverages Guttman’s 1984 research on R-trees and the later R*-tree refinement
• R-trees organize data of any dimensionality by representing
  each object as a minimum bounding box.
• Each node bounds its children. A node can hold many
  objects (max: m, min: ceil(m/2))
• Splits and merges are optimized by minimizing overlaps
• The leaves point to the actual objects (probably stored on
  disk)
• Height balanced – all leaves sit at the same depth, so tree height is O(log n)
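As a minimal sketch of the first two bullets, the snippet below computes a minimum bounding box for points of any dimensionality and the fill bounds of a node with capacity m. The helper names `mbr` and `fill_bounds` are assumptions for illustration and are not taken from the TST indexer.

```python
from math import ceil


def mbr(points):
    """Minimum bounding box of same-dimension points, as (lows, highs)."""
    dims = range(len(points[0]))
    lows = tuple(min(p[d] for p in points) for d in dims)
    highs = tuple(max(p[d] for p in points) for d in dims)
    return lows, highs


def fill_bounds(m):
    """(minimum, maximum) number of entries a node of capacity m may hold."""
    return ceil(m / 2), m


# A 2D polygon and a 3D track collapse to one box each.
print(mbr([(1, 5), (3, 2), (2, 9)]))
print(mbr([(0, 0, 10), (4, 1, 30), (2, 7, 20)]))
print(fill_bounds(4))
```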


 CUSTOM TUNED SPATIAL INDEXER
Spatial Indexing at Scale with R-Trees

Spatial data represented as minimum bounding rectangles (2 dimensions),
boxes (3 dimensions), or hyperrectangles (4 dimensions)




Index represented as: <I, DiskLoc> where:

    I = (I0, I1, …, In-1), where n = number of dimensions
    Each Ii is a closed interval [min, max] describing the MBR's extent along dimension i
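The index entry described above maps naturally onto a small record type. In this sketch the class name `IndexEntry` and the integer `disk_loc` field are stand-ins (assumptions) for the real entry and MongoDB's DiskLoc; the overlap test is the standard per-dimension interval check.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]  # closed interval [min, max]


@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # one interval per dimension
    disk_loc: int                    # hypothetical stand-in for DiskLoc

    def intersects(self, other: "IndexEntry") -> bool:
        """Two MBRs overlap only if their intervals overlap in every dimension."""
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for (a_lo, a_hi), (b_lo, b_hi) in zip(self.intervals, other.intervals)
        )


# Three dimensions: x, y, and altitude.
track = IndexEntry(((0, 10), (0, 10), (100, 200)), disk_loc=1)
low_query = IndexEntry(((5, 15), (5, 15), (0, 50)), disk_loc=0)
high_query = IndexEntry(((5, 15), (5, 15), (150, 250)), disk_loc=0)

print(track.intersects(low_query))   # overlaps in x and y but not in altitude
print(track.intersects(high_query))  # overlaps in all three dimensions
```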




  RTREE THEORY
    R*-Tree Spatial Index Example
• Sample insertion result for 4th order tree
  [Figure: root node with entries m, n, o, p; leaf nodes (a b c d), (e f), (g h i), (j k l)]
• Objectives:

    1.   Minimize area
    2.   Minimize overlaps
    3.   Minimize margins
    4.   Maximize inner node utilization




   R*-TREE INDEX OBJECTIVES
Insert
 • Similar to insertion into B+-tree but may insert
   into any leaf; leaf splits in case capacity exceeded.
       – Which leaf to insert into?
       – How to split a node?




R*-TREE INSERT EXAMPLE
Insert—Leaf Selection
• Follow a path from root to leaf.
• At each node move into subtree whose MBR area
  increases least with addition of new rectangle.
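The least-enlargement rule can be sketched as follows for 2D rectangles. The candidate names m, n, o, p follow the slide, but their coordinates are invented for the example, and ties are broken by smaller area (a common convention, assumed here). Full R*-tree implementations also consider overlap enlargement at the leaf level, which this sketch leaves out.

```python
def area(rect):
    """Area of an (xmin, ymin, xmax, ymax) rectangle."""
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr, rect):
    """How much the MBR's area grows if rect is added to it."""
    return area(union(mbr, rect)) - area(mbr)


def choose_subtree(children, rect):
    """Pick the child whose MBR grows least; break ties by smaller area."""
    return min(children, key=lambda name: (enlargement(children[name], rect),
                                           area(children[name])))


# Hypothetical MBRs for the four subtrees on the slide.
children = {
    "m": (0, 0, 4, 4),
    "n": (5, 0, 9, 4),
    "o": (0, 5, 4, 9),
    "p": (5, 5, 9, 9),
}

print(choose_subtree(children, (6, 1, 7, 2)))    # already inside n
print(choose_subtree(children, (9, 9, 11, 11)))  # closest to p
```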
[Figure sequence: MBRs of the candidate subtrees m, n, o, p; four follow-on slides highlight inserting into m, n, o, and p in turn]
Query
  • Start at root
  • Find all overlapping MBRs
  • Search subtrees recursively
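The three query steps translate into a short recursive function. The `Node` class and the sample coordinates below are assumptions for illustration; leaf entries carry a name, internal nodes carry children.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    name: Optional[str] = None  # set only on leaf entries


def search(node: Node, query: Rect, hits: Optional[List[str]] = None) -> List[str]:
    """Collect the names of all leaf entries whose MBR overlaps the query."""
    if hits is None:
        hits = []
    if not intersects(node.mbr, query):
        return hits  # prune: nothing below this node can match
    if node.name is not None:
        hits.append(node.name)
    for child in node.children:
        search(child, query, hits)
    return hits


# A two-level tree: root -> (m, n) -> leaf entries.
m = Node((0, 0, 3, 3), [Node((0, 0, 1, 1), name="a"), Node((2, 2, 3, 3), name="b")])
n = Node((6, 0, 9, 3), [Node((6, 0, 7, 1), name="e"), Node((8, 2, 9, 3), name="f")])
root = Node((0, 0, 9, 3), [m, n])

print(search(root, (0.5, 0.5, 2.5, 2.5)))
print(search(root, (2.5, 0, 6.5, 0.5)))
```

The second query overlaps the MBRs of both m and n, so both subtrees are visited even though only one entry matches; overlapping MBRs are why the R*-tree objectives try to minimize overlap.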


                                                   n
                    m
                                                       a


                                               o                                     p
                                  a                                      x
     a
mn o p




Query
                                 a b cd           e f            g h i   jk l
• Search m.


                             e                n
         a           m
                             a
     a
 b
         a               g   a
                                          o                                     p
 c
             d   x
     x
R*-Tree Leverages B-Tree Base Data Structures (buckets)




 R*-TREE MONGODB IMPLEMENTATION
Geo-Sharding – (in work)
     Scalable Distributed R* Tree (SD-r*Tree)
“Balanced” binary tree, with
nodes distributed on a set of
servers:
• Each internal node has
  exactly two children

• Each leaf node stores a
  subset of the indexed
  dataset

• At each node, the heights
  of the subtrees differ by
  at most one

• mongos “routing” node
  maintains binary tree
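The structure described above can be modeled as a binary tree of coverage nodes over data nodes. The sketch below is a simplification under stated assumptions: it runs in one process, the class names and shard labels are hypothetical, and it does not reproduce the published SD-Rtree protocol or any mongos API. It only shows the balance rule and how a router could pick the shards a query rectangle touches.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class DataNode:
    name: str
    mbr: Rect
    shard: str  # hypothetical label of the server holding this chunk


@dataclass
class CoverageNode:
    name: str
    left: "Node"
    right: "Node"

    @property
    def mbr(self) -> Rect:
        a, b = self.left.mbr, self.right.mbr
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


Node = Union[DataNode, CoverageNode]


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def height(node: Node) -> int:
    if isinstance(node, DataNode):
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node: Node) -> bool:
    """Subtree heights differ by at most one at every internal node."""
    if isinstance(node, DataNode):
        return True
    return (abs(height(node.left) - height(node.right)) <= 1
            and is_balanced(node.left) and is_balanced(node.right))


def route(node: Node, query: Rect) -> List[str]:
    """Shards whose data node coverage overlaps the query rectangle."""
    if not intersects(node.mbr, query):
        return []
    if isinstance(node, DataNode):
        return [node.shard]
    return route(node.left, query) + route(node.right, query)


d0 = DataNode("d0", (0, 0, 4, 10), "GeoShard 1")
d1 = DataNode("d1", (5, 0, 9, 4), "GeoShard 2")
d2 = DataNode("d2", (5, 5, 9, 10), "GeoShard 3")
r2 = CoverageNode("r2", d1, d2)
r1 = CoverageNode("r1", d0, r2)

print(route(r1, (6, 1, 7, 2)))
print(is_balanced(r1))
```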

   GEO-SHARDING
SD-r*Tree Data Structure Illustration

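[Figure: three growth stages, each shown as a tree and its spatial coverage (regions a to e): a single data node d0; coverage node r1 over d0 and d1; coverage node r1 over d0 and r2, with r2 over d1 and d2]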


           • di = Data Node (Chunk)
           • ri = Coverage Node
Leveraged work from Litwin, Mouza, Rigaux 2007


           SD-r*Tree DATA STRUCTURE
SD-r*Tree Structure Distribution

[Figure: mongos holds the coverage nodes r1 and r2; data nodes d0, d1, d2 are placed on GeoShard 1, GeoShard 2, GeoShard 3 respectively]




SD-r*TREE STRUCTURE DISTRIBUTION
GeoSharding Alternative – 3D / 4D Hilbert Scanning Order




  GEO-SHARDING ALTERNATIVE
Next Steps: Beyond 4-Dimensions - X-Tree
                                  (Berchtold, Keim, Kriegel – 1996)




                        Normal Internal Nodes                  Supernodes   Data Nodes


• Avoid MBR overlaps

• Avoid node splits (main cause for high overlap)

• Introduce new node structure: Supernodes – Large Directory nodes of variable size

 BEYOND 4-DIMENSIONS
X-Tree Performance Results
                               (Berchtold, Keim, Kriegel – 1996)




X-TREE PERFORMANCE
T-Sciences Custom Tuned Spatial Indexer
• Optimized Spatial Search – Finds intersecting MBR and recurses into
  those nodes

• Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to
  guide search
   – 28% reduction in number of nodes touched

• Optimized Deletes – Leverages R* split/merge approach for rebalancing
  tree when nodes become over/under-full

• Low maintenance – Leverages MongoDB’s automatic data compaction
  and partitioning
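The Hilbert value of an MBR centroid can be computed with the well-known iterative grid-to-curve conversion. The sketch below works in 2D on a power-of-two grid; the grid size, the world extent, and the function names are assumptions, and it does not show how the TST indexer uses the value inside the tree.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two; x and y are integers in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_hilbert(mbr, n=16, extent=(0.0, 0.0, 100.0, 100.0)):
    """Hilbert value of the centroid of an (xmin, ymin, xmax, ymax) MBR."""
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    ex0, ey0, ex1, ey1 = extent
    gx = min(n - 1, int((cx - ex0) / (ex1 - ex0) * n))
    gy = min(n - 1, int((cy - ey0) / (ey1 - ey0) * n))
    return hilbert_index(n, gx, gy)


# Sorting MBRs by centroid Hilbert value keeps spatial neighbors close together.
rects = [(80, 80, 90, 90), (0, 0, 10, 10), (5, 12, 15, 20)]
print(sorted(rects, key=centroid_hilbert))
```

Unlike the Z-curve, consecutive Hilbert values always belong to adjacent grid cells, which is one reason it is a popular ordering for guiding inserts.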

  CONCLUSION
Example Use Case – OSINT (Foursquare Data)

• Sample Foursquare
  data set mashed with
  Government Intel
  Data (poly reports)

• 100 million Geo
  Document test (3D
  points and polys)

• 4 server replica set

• ~350ms query
  response

• ~300%
  improvement over
  PostGIS

   EXAMPLE
Community Support

• Thermopylae contributes fixes to the codebase
      – http://github.com/mongodb


• TST will work with 10gen to fold into the baseline

• Active developer collaboration
      – IRC: #mongodb freenode.net




FIND US
THANK YOU
                                 Questions?

                                   Nicholas Knize
                               nknize@t-sciences.com

THANK YOU
Backup
Thermopylae Sciences & Technology – Who are we?
• Advanced technology w/ 160+ employees
• Core customers in national security, venues and
  events, military and police, and city planning
• Partnered with Google and imagery providers
• Long term relationship focused – TS/SCI Staff
        TST + 10gen + Google = Game-changing approach


ENTERPRISE
 PARTNER




WHO ARE THESE GUYS?
Key Customers - Government
        • US Dept of State Bureau of Diplomatic Security
              – Build and support 30 TB Google Earth Globe with multi-
                terabytes of individual globes sent to embassies throughout
                the world. Integrated Google Earth and iSpatial framework.
        • US Army Intelligence Security Command
              – Provide expertise in managing technology integration –
                prime contractor providing operations, intelligence, and IT
                support worldwide. Partners include IBM, Lockheed Martin,
                Google, MIT, Carnegie Mellon. Integrated Google Earth and
                iSpatial framework.
        • US Southern Command
              – Coordinate Intelligence management systems spatial data
                collection, indexing, and distribution. Integrated Google
                Earth, iSpatial, and iHarvest.
              – Index large volume imagery and expose it for different
                services (Air Force, Navy, Army, Marines, Coast Guard)
GOVERNMENT CUSTOMERS
Key Customers - Commercial




     Cleveland Cavaliers, USGIF, Las Vegas Motor Speedway, Baltimore Grand Prix


iSpatial framework serves thousands of mobile devices
COMMERCIAL CUSTOMERS

 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Why We Chose MongoDB for Big Spatial Data

  • 1. WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’ JUNE 2012 @nknize +Nicholas Knize
  • 2. “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing” -Gen. Doug Frasier TST PRODUCTS ACCOMPLISHING THE IMPOSSIBLE
  • 3. • Expose enterprise data in a geo-temporal user defined environment • Provide a flexible and scalable spatial indexing framework for heterogeneous data • Visualize spatially referenced data on 3D globe & 2D maps • Manage real-time data feeds and mobile messaging • View data over geo-rectified imagery with 3D terrain • Support mission planning and simulation • Provide real-time collaboration and sharing ISPATIAL OVERVIEW ACCOMPLISHING THE IMPOSSIBLE
  • 4. Desired Data Store Characteristic for ‘Big Data’ • Horizontally scalable – Large volume / elastic • Vertically scalable – Heterogeneous data types (“Data Stack”) • Smartly Distributed – Reduce the distance bits must travel • Fault Tolerant – Replication Strategy and Consistency model • High Availability – Node recovery • Fast – Reads or writes (can’t always have both) BIG DATA STORAGE CHARACTERISTICS ACCOMPLISHING THE IMPOSSIBLE
  • 5. Subset of Evaluated NoSQL Options • Cassandra – Nice Bring Your Own Index (BYOI) design – … but Java, Java, Java… Memory management can be an issue – Adding new nodes can be a pain (Token Changes, nodetool) – Key-Value store…good for simple data models • HBase – Nice BigTable model – Theory grounded heavily in CAP, inflexible trade-offs – Complicated setup and maintenance • CouchDB – Provides some GeoSpatial functionality (Currently being rewritten) – HEAVILY dependent on Map-Reduce model (complicated design) – Erlang based – poor multi-threaded heap management NOSQL OPTIONS ACCOMPLISHING THE IMPOSSIBLE
  • 6. Why MongoDB for Thermopylae? • Documents based on JSON – A GEOJSON match made in heaven! • C++ - No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging • Disk storage is memory mapped, enabling fast swapping when necessary • Built in auto-failover with replica sets and fast recovery with journaling • Tunable Consistency – Consistency defined at application layer • Schema Flexible – friendly properties of SQL enable easy port • Provided initial spatial indexing support – Point based limited! WHY TST LIKES MONGODB ACCOMPLISHING THE IMPOSSIBLE
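The JSON/GeoJSON fit mentioned on slide 6 can be illustrated with a small sketch. The attribute names (`name`, `observed`), the coordinates, and the `bounding_box` helper are made up for illustration; this is not Thermopylae's actual schema or code.

```python
import json

# A hypothetical document: a GeoJSON geometry stored next to ordinary attributes.
doc = {
    "name": "sample report",              # illustrative attribute
    "observed": "2012-06-01T12:00:00Z",   # illustrative attribute
    "geometry": {
        "type": "Polygon",
        # GeoJSON order is [longitude, latitude]; a ring repeats its first point.
        "coordinates": [[[-77.1, 38.8], [-77.0, 38.8], [-77.0, 38.9],
                         [-77.1, 38.9], [-77.1, 38.8]]],
    },
}


def bounding_box(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) of a Polygon's outer ring."""
    ring = geometry["coordinates"][0]
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return (min(lons), min(lats), max(lons), max(lats))


# The document survives a JSON round trip unchanged, geometry included.
serialized = json.dumps(doc)
restored = json.loads(serialized)
```

Because the geometry is plain JSON, the same structure can be handed to a document store and to a GeoJSON-aware client without a translation layer.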
  • 7. ... The Spatial Indexer wasn’t quite right • MongoDB (like nearly all relational DBs) uses a B-tree – Data structure for storing sorted data with log-time access – Great for indexing numerical and text documents (1D attribute data) – Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY MONGODB SPATIAL INDEXER ACCOMPLISHING THE IMPOSSIBLE
  • 8. How does MongoDB solve the dimensionality problem? • Space Filling (Z) Curve – A continuous line that intersects every point in a two-dimensional plane • Use Geohash to represent lat/lon values – Interleave the bits of a lat/long pair – Base32 encode the result DIMENSIONALITY REDUCTION ACCOMPLISHING THE IMPOSSIBLE
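The two Geohash steps named on slide 8 (interleave the bits of the lat/long pair, then Base32 encode) can be sketched as follows. This follows the public Geohash scheme, not MongoDB's internal code; the function name `geohash_encode` is an assumption for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (omits a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a lat/lon pair by interleaving bisection bits, then Base32."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # Geohash starts with a longitude bit, then alternates
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 interleaved bits become one Base32 character
            chars.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)
```

The resulting string is a one-dimensional key, which is what lets a B-tree index two-dimensional points.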
  • 9. Issues with the Geohash B-tree approach • Neighbors aren’t so close! – Neighboring points on the Geoid may end up on opposite ends of the plane – Impacts search efficiency • What about Geometry? – Doesn’t support > 2D – Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document GEOHASH BTREE ISSUES ACCOMPLISHING THE IMPOSSIBLE
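The "neighbors aren't so close" issue on slide 9 can be seen with a plain Z-order (Morton) code on a small integer grid. This is a simplified stand-in for Geohash, with a hypothetical `interleave` helper; the grid size and cell coordinates are chosen only to show the effect.

```python
def interleave(x, y, bits=4):
    """Z-order (Morton) code: x bits in even positions, y bits in odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Cells (7, 0) and (8, 0) touch on the grid but straddle a quadrant boundary.
left = interleave(7, 0)
right = interleave(8, 0)
gap = right - left

# Cells (0, 0) and (1, 0) touch and also sit next to each other on the curve.
near_gap = interleave(1, 0) - interleave(0, 0)
```

A range scan over the sorted keys therefore cannot assume that spatially adjacent cells are adjacent in the index, which is the search-efficiency cost the slide refers to.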
  • 10. Mongo Multi-location Document Clipping Issues ($within search doesn’t always work w/ multi-location) Case 1: Success! Case 3: Fail! Case 2: Success! Case 4: Fail! Multi-Location Document (aka. Polygon) Search Polygon MULTI-LOCATION CLIPPING ACCOMPLISHING THE IMPOSSIBLE
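One way a point-based index can miss a polygon is sketched below. Slide 10's four cases are only shown as a figure, so this is a plausible failure mode under that reading, not necessarily Case 3 or Case 4. All helper names and coordinates are illustrative.

```python
def point_in_box(pt, box):
    """True if point (x, y) lies inside box (min_x, min_y, max_x, max_y)."""
    x, y = pt
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def vertex_match(vertices, box):
    """Point-based test: matches only if some stored vertex falls in the box."""
    return any(point_in_box(v, box) for v in vertices)


def mbr_of(vertices):
    """Minimum bounding rectangle of a list of (x, y) vertices."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A square stored as its four corner points, and a search box inside it.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
search = (4, 4, 6, 6)

missed_by_points = not vertex_match(square, search)
found_by_mbr = boxes_intersect(mbr_of(square), search)
```

The search box overlaps the square's interior but contains none of its vertices, so only the bounding-rectangle test reports a hit.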
  • 11. Potential Solutions • Constrain the system to single point searches – Multi-dimension support will be exponentially complex (won’t scale) • Interpolate points along the edge of the shape – Multi-dimension support will be exponentially complex (won’t scale) • Customize the spatial indexer – Selected approach SOLUTIONS TO GEOHASH PROBLEM ACCOMPLISHING THE IMPOSSIBLE
  • 12. Thermopylae Custom Tuned MongoDB for Geo TST Leverages Guttman’s 1984 Research in R/R* Trees • R-Trees organize any-dimensional data by representing the data as a minimum bounding box. • Each node bounds its children. A node can have many objects in it (max: m min: ceil(m/2) ) • Splits and merges optimized by minimizing overlaps • The leaves point to the actual objects (stored on disk probably) • Height balanced – search is always O(log n) CUSTOM TUNED SPATIAL INDEXER ACCOMPLISHING THE IMPOSSIBLE
  • 13. Spatial Indexing at Scale with R-Trees Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension) Index represented as: <I, DiskLoc> where: I = (I0, I1, … In) : n = number of dimensions Each I is a set in the form of [min,max] describing MBR range along a dimension RTREE THEORY ACCOMPLISHING THE IMPOSSIBLE
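The `<I, DiskLoc>` index entry from slide 13 can be sketched as a small data type with one `[min, max]` interval per dimension. The class and method names are assumptions, and `disk_loc` is a placeholder string rather than MongoDB's real DiskLoc type.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MBR:
    """Minimum bounding region: one (min, max) interval per dimension."""
    intervals: Tuple[Tuple[float, float], ...]

    def intersects(self, other: "MBR") -> bool:
        # Two regions overlap only if their intervals overlap in every dimension.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for (a_lo, a_hi), (b_lo, b_hi)
                   in zip(self.intervals, other.intervals))

    def union(self, other: "MBR") -> "MBR":
        # Smallest region that bounds both inputs.
        return MBR(tuple((min(a_lo, b_lo), max(a_hi, b_hi))
                         for (a_lo, a_hi), (b_lo, b_hi)
                         in zip(self.intervals, other.intervals)))

    def volume(self) -> float:
        # Area in 2D, volume in 3D, and so on.
        result = 1.0
        for lo, hi in self.intervals:
            result *= hi - lo
        return result


@dataclass(frozen=True)
class IndexEntry:
    region: MBR
    disk_loc: str  # placeholder for a pointer to the stored document


rect = MBR(((0.0, 2.0), (0.0, 3.0)))               # 2D rectangle
cube = MBR(((0.0, 1.0), (0.0, 1.0), (0.0, 1.0)))   # 3D cube
entry = IndexEntry(rect, "hypothetical-location-0")
```

Keeping the interval list generic is what lets the same structure describe rectangles, cubes, and higher-dimensional boxes.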
  • 14. R*-Tree Spatial Index Example • Sample insertion result for 4th order tree [figure: leaf rectangles a through l grouped under nodes m, n, o, p] • Objectives: 1. Minimize area 2. Minimize overlaps 3. Minimize margins 4. Maximize inner node utilization R*-TREE INDEX OBJECTIVES ACCOMPLISHING THE IMPOSSIBLE
  • 15. Insert • Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded. – Which leaf to insert into? – How to split a node? R*-TREE INSERT EXAMPLE ACCOMPLISHING THE IMPOSSIBLE
  • 16. Insert—Leaf Selection • Follow a path from root to leaf. • At each node move into subtree whose MBR area increases least with addition of new rectangle. [figure: nodes m, n, o, p]
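The leaf-selection rule on slide 16 (descend into the subtree whose MBR area grows least) can be sketched for 2D boxes as follows. The function names are illustrative, and the tie-break on smaller existing area is an assumption borrowed from Guttman's original R-tree description; the slide does not state one.

```python
def area(box):
    """Area of a (min_x, min_y, max_x, max_y) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def enlarge(box, rect):
    """Smallest box that bounds both `box` and `rect`."""
    return (min(box[0], rect[0]), min(box[1], rect[1]),
            max(box[2], rect[2]), max(box[3], rect[3]))


def choose_subtree(children, rect):
    """Pick the child MBR needing the least area growth to absorb `rect`.

    Ties are broken by the smaller existing area (an assumption, see above).
    """
    return min(children,
               key=lambda box: (area(enlarge(box, rect)) - area(box), area(box)))


children = [(0, 0, 4, 4), (10, 10, 12, 12)]
picked_low = choose_subtree(children, (3, 3, 5, 5))
picked_high = choose_subtree(children, (11, 11, 13, 13))
```

Applying `choose_subtree` at each level from the root down yields the leaf that receives the new rectangle.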
  • 21. Query • Start at root • Find all overlapping MBRs • Search subtrees recursively [figure: example tree with nodes m, n, o, p and leaf rectangles a through l]
  • 22. Query • Search m. [figure: example tree with nodes m, n, o, p and leaf rectangles a through l]
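The query steps on slides 21 and 22 (start at the root, find overlapping MBRs, recurse into those subtrees) can be sketched on a tiny hand-built tree. The `Node` type, the tree contents, and the function names are all illustrative and do not reflect the MongoDB bucket layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Node:
    mbr: Box
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None  # set only on leaf entries


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: Node, query: Box, hits: List[str]) -> List[str]:
    """Collect every leaf object whose MBR overlaps the query box."""
    if not overlaps(node.mbr, query):
        return hits  # prune: nothing below this node can match
    if node.obj is not None:
        hits.append(node.obj)
    for child in node.children:
        search(child, query, hits)
    return hits


# A two-level tree: the root bounds m and n, which bound the leaf entries.
a = Node((0, 0, 2, 2), obj="a")
b = Node((3, 3, 5, 5), obj="b")
c = Node((8, 8, 9, 9), obj="c")
m = Node((0, 0, 5, 5), children=[a, b])
n = Node((8, 8, 9, 9), children=[c])
root = Node((0, 0, 9, 9), children=[m, n])

found = search(root, (1, 1, 4, 4), [])
empty = search(root, (6, 6, 7, 7), [])
```

In the first query the subtree under `n` is never visited, which is the pruning that keeps the search close to logarithmic when overlap between sibling MBRs is low.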
  • 23. R*-Tree Leverages B-Tree Base Data Structures (buckets) R*-TREE MONGODB IMPLEMENTATION ACCOMPLISHING THE IMPOSSIBLE
  • 24. Geo-Sharding – (in work) Scalable Distributed R* Tree (SD-r*Tree) “Balanced” binary tree, with nodes distributed on a set of servers: • Each internal node has exactly two children • Each leaf node stores a subset of the indexed dataset • At each node, the height of the subtrees differ by at most one • mongos “routing” node maintains binary tree GEO-SHARDING ACCOMPLISHING THE IMPOSSIBLE
  • 25. SD-r*Tree Data Structure Illustration [figure: data nodes d0, d1, d2 and coverage nodes r1, r2 shown alongside their spatial coverage] • di = Data Node (Chunk) • ri = Coverage Node Leveraged work from Litwin, Mouza, Rigaux 2007 SD-r*Tree DATA STRUCTURE ACCOMPLISHING THE IMPOSSIBLE
  • 26. SD-r*Tree Structure Distribution [figure: coverage nodes r1, r2 and data nodes d0, d1, d2 distributed across mongos and GeoShard 1, GeoShard 2, GeoShard 3] SD-r*TREE STRUCTURE DISTRIBUTION ACCOMPLISHING THE IMPOSSIBLE
  • 27. GeoSharding Alternative – 3D / 4D Hilbert Scanning Order GEO-SHARDING ALTERNATIVE ACCOMPLISHING THE IMPOSSIBLE
  • 28. Next Steps: Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996) Normal Internal Nodes Supernodes Data Nodes • Avoid MBR overlaps • Avoid node splits (main cause for high overlap) • Introduce new node structure: Supernodes – Large Directory nodes of variable size BEYOND 4-DIMENSIONS ACCOMPLISHING THE IMPOSSIBLE
  • 29. X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996) X-TREE PERFORMANCE ACCOMPLISHING THE IMPOSSIBLE
  • 30. T-Sciences Custom Tuned Spatial Indexer • Optimized Spatial Search – Finds intersecting MBR and recurses into those nodes • Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to guide search – 28% reduction in number of nodes touched • Optimize Deletes – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full • Low maintenance – Leverages MongoDB’s automatic data compaction and partitioning CONCLUSION ACCOMPLISHING THE IMPOSSIBLE
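The "Hilbert value of MBR centroid" idea from slide 30 can be sketched in two dimensions using the standard iterative Hilbert-curve mapping. The deck discusses 3D and 4D ordering; this 2D version, the grid size, and the `extent` parameter are simplifying assumptions, and the function names are illustrative.

```python
def hilbert_index(n, x, y):
    """Distance along a Hilbert curve of cell (x, y) in an n x n grid.

    n must be a power of two; x and y are integers in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_of_centroid(box, n, extent):
    """Map the centroid of box (min_x, min_y, max_x, max_y) to a Hilbert index.

    `extent` is the bounding box of the whole indexed space.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    gx = min(n - 1, int((cx - extent[0]) / (extent[2] - extent[0]) * n))
    gy = min(n - 1, int((cy - extent[1]) / (extent[3] - extent[1]) * n))
    return hilbert_index(n, gx, gy)


value = hilbert_of_centroid((0, 0, 2, 2), 4, (0, 0, 4, 4))
```

Unlike the Z-order curve, consecutive Hilbert indices always belong to touching cells, which is why the value is a useful hint for where a new MBR should be inserted.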
  • 31. Example Use Case – OSINT (Foursquare Data) • Sample Foursquare data set mashed with Government Intel Data (poly reports) • 100 million Geo Document test (3D points and polys) • 4 server replica set • ~350ms query response • ~300% improvement over PostGIS EXAMPLE ACCOMPLISHING THE IMPOSSIBLE
  • 32. Community Support • Thermopylae contributes fixes to the codebase – http://github.com/mongodb • TST will work with 10gen to fold into the baseline • Active developer collaboration – IRC: #mongodb freenode.net FIND US ACCOMPLISHING THE IMPOSSIBLE
  • 33. THANK YOU Questions? Nicholas Knize nknize@t-sciences.com THANK YOU ACCOMPLISHING THE IMPOSSIBLE
  • 35. Thermopylae Sciences & Technology – Who are we? • Advanced technology w/ 160+ employees • Core customers in national security, venues and events, military and police, and city planning • Partnered with Google and imagery providers • Long term relationship focused – TS/SCI Staff TST + 10gen + Google = Game-changing approach ENTERPRISE PARTNER WHO ARE THESE GUYS? ACCOMPLISHING THE IMPOSSIBLE
  • 36. Key Customers - Government • US Dept of State Bureau of Diplomatic Security – Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework. • US Army Intelligence Security Command – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework. • US Southern Command – Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest. – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard) GOVERNMENT CUSTOMERS ACCOMPLISHING THE IMPOSSIBLE
  • 37. Key Customers - Commercial • Cleveland Cavaliers • USGIF • Las Vegas Motor Speedway • Baltimore Grand Prix iSpatial framework serves thousands of mobile devices COMMERCIAL CUSTOMERS ACCOMPLISHING THE IMPOSSIBLE

Editor's Notes

  1. Screen shot of UDOP…blow-out of key features (sharing, presentation builder, etc)