By CCRi's Chris Eichelberger: The query is slow. Ah, create an index! The query is fast. Now another query is slow... Quickly, the complexity of the query-planner, in addition to the volume of stored data, balloons. Is it worth it? That depends. It depends upon your data distribution, your per-field (and per-field-group) selectivities, your query planner fu, your query distribution, your support responsibilities... it depends on a lot of things. In this talk, we summarize our experience with GeoMesa, first as a geo-temporal data store, and then as a more general purpose data management platform; we cover the benefits and costs of adding exciting, new indexes every time someone's query is slow.
4. to answer the hard questions... which comes first?
● Virginia
● Massachusetts
this is the entire purpose of an index: given a data element, tell me the
bin (disk page, tablet, ...) in which it will be found, if it exists
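A minimal sketch of that idea, assuming keys are plain strings kept in sorted order and bins are defined by a hypothetical list of split points (`SPLITS` and `bin_for` are illustrative names, not part of any real store):

```python
from bisect import bisect_right

# Hypothetical split points: bin i holds keys >= SPLITS[i] and < SPLITS[i + 1].
SPLITS = ["", "g", "n", "t"]

def bin_for(key: str) -> int:
    """Return the number of the only bin that could contain `key`."""
    # bisect_right counts the split points <= key; the last of those owns the key.
    return bisect_right(SPLITS, key) - 1

# "massachusetts" sorts before "virginia", so it lands in an earlier bin.
print(bin_for("massachusetts"), bin_for("virginia"))
```

The lookup never touches the data itself: the ordering of keys alone decides which bin to open.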
6. indexing properties
DATA-SPECIFIC
● think: Japanese street addresses
● lot numbers depend on building order
● good when: cross-index joins are cheap (RDBMS)
SPACE-SPECIFIC
● think: US street addresses
● lots are aligned to block ranges
● good when: cross-index joins are expensive (NoSQL)
23. life is good
● we have a (POTENTIALLY SHARDED) index
● we can ingest geo-temporal data
● we can query with geometric bounds and a time span
● we don't suffer from hot-spotting
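One way to picture a sharded spatial key is sketched below. This is an illustration under stated assumptions, not GeoMesa's actual key layout: the shard count, the CRC32 hash, the 16-bit resolution, and the names `interleave` and `row_key` are all hypothetical, and time is left out for brevity.

```python
import zlib
from typing import Tuple

NUM_SHARDS = 4  # hypothetical shard count

def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def row_key(feature_id: str, lon: float, lat: float, bits: int = 16) -> Tuple[int, int]:
    """Return (shard, z-value); the shard prefix spreads nearby points across servers."""
    shard = zlib.crc32(feature_id.encode("utf-8")) % NUM_SHARDS
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)  # scale longitude onto the grid
    y = int((lat + 90.0) / 180.0 * cells)   # scale latitude onto the grid
    return shard, interleave(x, y, bits)
```

Two features at the same location share a z-value but may land in different shards, which is what keeps a burst of writes in one area from hitting a single server. The price is that a query must scan the same z-ranges once per shard.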
24. part 3: the pain
"My heart entreats, just hear those savage beats / and go put on your cleats / and come and trample me"
25. more than geo-temporal attributes?
https://www.britannica.com/technology/airplane/Types-of-aircraft
26. add more indexes!
● add an ATTRIBUTE index
○ tied to the SimpleFeatureType (in user data)
○ each indexed attribute has all values recorded
○ contains a complete copy of every simple feature
● add a RECORD ID index
○ automatically created, populated
○ values are assumed to be unique to the SimpleFeature
○ contains a complete copy of every simple feature
27. index selection
● simple cases
○ if you only filter on an indexed attribute, use the attribute index
○ if you only filter on a record ID, use the record-ID index
○ if you only filter on location and time, use the geo-temporal index
● all other cases
○ this is a geo-temporal store... use the geo-temporal index
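The rules above can be sketched as a small function. The field name `id` and the function name `choose_index` are assumptions made for illustration, not GeoMesa identifiers:

```python
def choose_index(filter_fields, indexed_attributes):
    """Pick an index using the simple rules; everything else falls through."""
    fields = set(filter_fields)
    if fields == {"id"}:
        return "record-id"        # only a record ID is filtered
    if len(fields) == 1 and fields <= set(indexed_attributes):
        return "attribute"        # only one indexed attribute is filtered
    return "geo-temporal"         # location and time, and all other cases

print(choose_index(["tail_number", "geom"], ["tail_number"]))
```

The final line of the function is where the weakness shows: a filter that mixes a highly selective attribute with a loose bounding box is still sent to the geo-temporal index.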
28. life is good
● we have some indexes
● we can ingest geo-temporal data
● we can query with geometric bounds and a time span
● we don't suffer from hot-spotting
● we have per-attribute indexes and a record-ID index
● we have the option of querying by any one attribute OR record ID or geo-time
29. part 4: the pain
"Your heart is hard as stone or mahogany / that's why I'm in such exquisite agony"
31. handling non-point geometries
Christian Böhm, Gerald Klump and Hans-Peter Kriegel. "XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension".
6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China
32. add more indexes!
● add an XZ3 index
○ indexes longitude, latitude, and time
○ contains a complete copy of every simple feature
● add an XZ2 index, just to be sure
○ indexes longitude and latitude alone
○ contains a complete copy of every simple feature
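A simplified reading of the XZ idea in two dimensions is sketched below: each geometry is assigned to exactly one cell, chosen as the deepest quadtree level at which a cell doubled in width and height still covers the geometry's bounding box. This is an illustrative search loop under that assumption, not the paper's closed-form computation and not GeoMesa's code; `xz_cell` and `max_level` are hypothetical names.

```python
import math

def xz_cell(xmin, ymin, xmax, ymax, max_level=12):
    """Return (level, ix, iy) of the single enlarged cell that covers the box.

    Coordinates are normalized to [0, 1). At level L a cell is 2**-L wide;
    the enlarged cell starts at the cell holding (xmin, ymin) and spans two
    cell-widths in each dimension.
    """
    for level in range(max_level, -1, -1):
        w = 1.0 / (1 << level)
        ix, iy = math.floor(xmin / w), math.floor(ymin / w)
        if xmax <= (ix + 2) * w and ymax <= (iy + 2) * w:
            return level, ix, iy
    return 0, 0, 0  # not reached for boxes inside [0, 1)

print(xz_cell(0.10, 0.10, 0.11, 0.11, max_level=4))
print(xz_cell(0.10, 0.10, 0.90, 0.90, max_level=4))
```

Small geometries settle at deep levels and large ones at shallow levels, so a geometry is written once rather than once per grid cell it overlaps.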
33. life is good
● we have some indexes
● we can ingest geo-temporal data
● we can query with geometric bounds and a time span
● we don't suffer from hot-spotting
● we have per-attribute indexes and a record-ID index
● we have the option of querying by any one attribute
● we have non-duplicative indexes for non-point geometries, even those that cross the anti-meridian
34. part 5: the pain
"My soul is on fire; it's aflame with desire / which is why I perspire when we tango"
35. an embarrassment of riches
http://i.ebayimg.com/00/s/NTY2WDg0OQ==/z/U~IAAOSw-jhUBFhb/$_32.JPG?set_id=880000500F
37. cost-based optimizer... oh, and summary statistics
● CBO
○ rewrite query using DNF... or CNF
○ estimate cost of using a particular index
■ at least whether a full-table scan is required
○ requires knowing something about cardinalities
○ ought to be able to explain why it made its choice
● statistics collection
○ responsible for providing some estimates of cardinalities (HyperLogLog, count-min sketch, etc.)
this is really just a fancy version of the board game Guess Who?
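A toy version of the cost estimate, in the Guess Who? spirit of asking the question that eliminates the most candidates. It assumes equality predicates only and uses exact `Counter` tallies as a stand-in for approximate structures such as a count-min sketch; `estimate_rows`, `plan`, and the sample data are hypothetical:

```python
from collections import Counter

def estimate_rows(stats, field, value, total_rows):
    """Estimated matches for field == value; no stats means a full-table scan."""
    counts = stats.get(field)
    return counts[value] if counts is not None else total_rows

def plan(predicates, stats, total_rows):
    """Return (field, estimated_rows, explanation) for the cheapest predicate."""
    costs = {f: estimate_rows(stats, f, v, total_rows) for f, v in predicates.items()}
    best = min(costs, key=costs.get)
    why = f"use index on {best!r}: ~{costs[best]} of {total_rows} rows; all estimates: {costs}"
    return best, costs[best], why

# Made-up statistics for illustration only.
stats = {"airline": Counter({"AA": 5000, "ZZ": 3}), "type": Counter({"jet": 9000})}
print(plan({"airline": "ZZ", "type": "jet"}, stats, total_rows=10000)[2])
```

The explanation string is the cheap part of "ought to be able to explain why it made its choice": the planner already holds every estimate it compared, so it only has to report them.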
38. life is good
● we have some indexes
● we can ingest geo-temporal data
● we can query with geometric bounds and a time span
● we don't suffer from hot-spotting
● we have per-attribute indexes and a record-ID index
● we have the option of querying by any one attribute
● we have non-duplicative indexes for non-point geometries, even those that cross the anti-meridian
39. part 6: the pain
"You caught my nose in your left castanet, love / I can feel the pain yet, love / every time I hear drums"
42. "who knew [geo data] could be so complicated?"
● there exist simpler solutions
○ D4M works very well, albeit not specifically for geo-time data
○ Elasticsearch has geographic, temporal indexes
● do you have a simpler problem?
○ do you need low-latency, high-velocity streaming data ingest, processing?
○ does even your streaming, in-memory geo-time data store require secondary indexing?
○ do your clients require access via OGC services, the GeoTools API?
○ must you support multiple flavors of NoSQL?
43. if it doesn't hurt, you're doing it wrong
"Fracture my spine / and swear that you're mine / as we dance to the Masochism Tango"