Indexes in geo-temporal data sets... How much is enough?

Strictly (Ordered) Ballroom
(AKA, "geo indexing and sufficiency")
Chris Eichelberger
FOSS4G NA 2017

part 1: the pain
"You can raise welts like nobody else / as we dance to the Masochism Tango"
with sincere apologies to the great Tom Lehrer

searching for a NoSQL analog...

to answer the hard questions... which comes first?
● Virginia
● Massachusetts
this is the entire purpose of an index: given a data element, tell me the
bin (disk page, tablet, ...) in which it will be found if it exists
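
A minimal sketch of that lookup, assuming the bins are defined by sorted split points (the split points and bin numbering here are hypothetical, not taken from any particular store):

```python
from bisect import bisect_right

# Hypothetical split points: bin i holds every key that sorts at or after
# SPLIT_POINTS[i] and before SPLIT_POINTS[i + 1].
SPLIT_POINTS = ["", "G", "N", "T"]

def bin_for(key: str) -> int:
    """Return the number of the bin that would hold `key` if it exists."""
    return bisect_right(SPLIT_POINTS, key) - 1

print(bin_for("Massachusetts"))  # 1
print(bin_for("Virginia"))       # 3
```

Strings already have a total order, so "which comes first?" is easy to answer. The hard part for geo-temporal data is inventing an ordering that does the same job.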

indexing properties
DATA-SPECIFIC
● think: Japanese street addresses
● lot numbers depend on building order
● good when: cross-index joins are cheap
(RDBMS)
SPACE-SPECIFIC
● think: US street addresses
● lots are aligned to block ranges
● good when: cross-index joins are
expensive (NoSQL)

what does an SFC look like, do?

Z-order curve, 4 bits (2x2), 16 cells

Z-order curve, 6 bits (3x3), 64 cells
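
The two curves above can be reproduced with a short bit-interleaving sketch. Which dimension takes the low-order bit is a convention; this version assumes x does, and the function name is illustrative rather than the SFCurve API:

```python
def z_index(x: int, y: int, bits_per_dim: int) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# 2 bits per dimension: 4 x 4 = 16 cells, one printed row per y value
for y in range(4):
    print([z_index(x, y, 2) for x in range(4)])
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

With 3 bits per dimension the same function enumerates the 64 cells of the 6-bit curve.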

SFCurve... a LocationTech project is born

but have you tried the...
API with the help of

this solution would become FOSS
with the help of

life is good
● we have AN INDEX
● we can ingest geo-temporal data
● we can query with geometric bounds and a time span
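
A rough sketch of how a bounding-box query might turn into index scans, assuming a Z-order curve over grid cells. It brute-forces every cell in the box and merges consecutive curve values into ranges; a real planner would decompose the box recursively instead, and a time span would add a third interleaved dimension. All names are illustrative.

```python
def z_index(x: int, y: int, bits_per_dim: int) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_ranges(xmin, xmax, ymin, ymax, bits_per_dim):
    """Return inclusive (lo, hi) curve ranges covering the cell box."""
    zs = sorted(
        z_index(x, y, bits_per_dim)
        for x in range(xmin, xmax + 1)
        for y in range(ymin, ymax + 1)
    )
    ranges = []
    for z in zs:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current contiguous run
        else:
            ranges.append([z, z])      # start a new run
    return [tuple(r) for r in ranges]

print(z_ranges(0, 1, 0, 1, 2))  # [(0, 3)]
print(z_ranges(1, 2, 1, 2, 2))  # [(3, 3), (6, 6), (9, 9), (12, 12)]
```

The second query covers the same number of cells as the first but needs four scans instead of one, which is why range planning matters.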

part 2: the pain
"Blacken my eye, set fire to my tie / as we dance to the Masochism Tango"

real data are often non-uniformly distributed

how real data are often distributed

how SFC indexes might be distributed (gridded)

how real data tend to map to SFC indexes (bins)

how to trade density for uniformity

life is good
● we have a (POTENTIALLY SHARDED) index
● we don't suffer from hot-spotting
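
One common way to get that effect is to prefix each key with a shard number derived from a hash. The slides do not say how the shard is chosen, so the hash, the shard count, and the function names below are assumptions for illustration only:

```python
import zlib
from typing import List, Tuple

NUM_SHARDS = 4  # hypothetical shard count

def sharded_key(feature_id: str, z: int) -> Tuple[int, int]:
    """Prefix the curve value with a shard chosen by hashing the feature ID."""
    shard = zlib.crc32(feature_id.encode("utf-8")) % NUM_SHARDS
    return (shard, z)

def shard_scans(z_lo: int, z_hi: int) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """A query for one curve range must now scan that range in every shard."""
    return [((s, z_lo), (s, z_hi)) for s in range(NUM_SHARDS)]
```

This is the trade: features that share a dense cell are spread across shards, so writes stay uniform, but every query range fans out into `NUM_SHARDS` scans.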

part 3: the pain
"My heart entreats, just hear those savage beats / and go put on your cleats / and come and trample me"

more than geo-temporal attributes?
https://www.britannica.com/technology/airplane/Types-of-aircraft

add more indexes!
● add an ATTRIBUTE index
○ tied to the SimpleFeatureType (in user data)
○ each indexed attribute has all of its values recorded
○ contains a complete copy of every simple feature
● add a RECORD ID index
○ automatically created, populated
○ values are assumed to be unique to the SimpleFeature
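
A toy version of the attribute index, assuming keys of the form (attribute, value, feature ID) kept in sorted order, with a full copy of the feature stored under each key. The class and its methods are hypothetical, not the actual implementation:

```python
from bisect import bisect_left, insort

class AttributeIndex:
    """Toy attribute index over sorted (attribute, value, feature_id) keys."""

    def __init__(self, indexed_attributes):
        self.indexed_attributes = list(indexed_attributes)
        self._keys = []   # kept sorted so equal values sit next to each other
        self._rows = {}   # key -> complete copy of the feature

    def put(self, feature_id, feature):
        """Record one entry per indexed attribute, each with a full copy."""
        for attr in self.indexed_attributes:
            key = (attr, feature[attr], feature_id)
            insort(self._keys, key)
            self._rows[key] = dict(feature)

    def lookup(self, attr, value):
        """Return (feature_id, feature) pairs whose attribute equals value."""
        start = bisect_left(self._keys, (attr, value, ""))
        hits = []
        for key in self._keys[start:]:
            if key[:2] != (attr, value):
                break
            hits.append((key[2], self._rows[key]))
        return hits

index = AttributeIndex(["type"])
index.put("f1", {"type": "jet", "speed": 500})
index.put("f2", {"type": "glider", "speed": 60})
index.put("f3", {"type": "jet", "speed": 480})
print([fid for fid, _ in index.lookup("type", "jet")])  # ['f1', 'f3']
```

Storing the complete feature in every index costs space but avoids a join back to another table, which matches the earlier point that cross-index joins are expensive in NoSQL stores.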

index selection
● simple cases
○ if you only filter on an indexed attribute, use the attribute index
○ if you only filter on a record ID, use the record-ID index
○ if you only filter on location and time, use the geo-temporal index
● all other cases
○ this is a geo-temporal store... use the geo-temporal index
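
The rules above are simple enough to write down directly. In this sketch the field name `id` and the returned index names are placeholders chosen for illustration:

```python
def choose_index(filter_fields, indexed_attributes):
    """Pick an index using only the rule-of-thumb cases from the slide."""
    fields = set(filter_fields)
    if fields == {"id"}:
        return "record-id"
    if len(fields) == 1 and fields <= set(indexed_attributes):
        return "attribute"
    # location and time only, or anything more complicated
    return "geo-temporal"

print(choose_index(["id"], ["type"]))            # record-id
print(choose_index(["type"], ["type"]))          # attribute
print(choose_index(["type", "geom"], ["type"]))  # geo-temporal
```

Note that the last case falls back to the geo-temporal index even when the attribute filter might be far more selective.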

life is good
● we have some indexes
● we have per-attribute indexes and a record-ID index
● we have the option of querying by any one attribute OR record ID or geo-time

part 4: the pain
"Your heart is hard as stone or mahogany / that's why I'm in such exquisite agony"

pointedly...
● the world is not flat
● it (the world) contains non-point geometries

handling non-point geometries
Christian Böhm, Gerald Klump and Hans-Peter Kriegel. "XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension".
6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China

add more indexes!
● add an XZ3 index
○ indexes longitude, latitude, and time
● add an XZ2 index, just to be sure
○ indexes longitude and latitude alone
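
The core idea of XZ-ordering is that each quadtree cell is enlarged to twice its width and height, and a geometry is stored once, in the deepest cell whose enlarged extent contains its bounding box. The sketch below shows only that cell-selection step for two dimensions, assuming coordinates normalized to the unit square. It does not compute the curve's sequence code, and the function name is hypothetical.

```python
def xz_cell(xmin, ymin, xmax, ymax, max_level=8):
    """Return (level, ix, iy) of the deepest cell whose doubled extent
    contains the box; coordinates are assumed normalized to [0, 1)."""
    for level in range(max_level, -1, -1):
        w = 1.0 / (1 << level)   # cell width at this level
        ix = int(xmin / w)       # cell holding the box's lower-left corner
        iy = int(ymin / w)
        if xmax <= (ix + 2) * w and ymax <= (iy + 2) * w:
            return level, ix, iy
    raise ValueError("box must lie within the unit square")

print(xz_cell(0.1, 0.1, 0.1, 0.1))    # (8, 25, 25)
print(xz_cell(0.1, 0.1, 0.15, 0.15))  # (5, 3, 3)
```

A point lands at the finest level, while larger geometries settle at coarser levels, so nothing is duplicated across cells.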

life is good
● we have the option of querying by any one attribute
● we have non-duplicative indexes for non-point geometries, even those that
cross the anti-meridian

part 5: the pain
"My soul is on fire; it's aflame with desire / which is why I perspire when we tango"

an embarrassment of riches
http://i.ebayimg.com/00/s/NTY2WDg0OQ==/z/U~IAAOSw-jhUBFhb/$_32.JPG?set_id=880000500F

for once!
● what we need is NOT another index... exactly

cost-based optimizer... oh, and summary statistics
● CBO
○ rewrite query using DNF... or CNF
○ estimate cost of using a particular index
■ at least whether a full-table scan is required
○ requires knowing something about cardinalities
○ ought to be able to explain why it made its choice
● statistics collection
○ responsible for providing some estimates of cardinalities (HyperLogLog, count-min sketch,
etc.)
this is really just a fancy version of the board game Guess Who?
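
A small sketch of the two halves working together, under stated assumptions: a count-min sketch supplies approximate value counts (it can overestimate but never underestimates), and the optimizer picks whichever index has the fewest estimated rows and says why. The class, the hashing scheme, and the function names are illustrative, not the actual implementation.

```python
import hashlib

class CountMinSketch:
    """Approximate per-value counts in fixed memory."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-looking hash per row, derived by salting with the row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._columns(item))

def cheapest_index(estimated_rows):
    """estimated_rows maps index name -> estimated rows that would be scanned."""
    name = min(estimated_rows, key=estimated_rows.get)
    reason = f"{name}: fewest estimated rows ({estimated_rows[name]})"
    return name, reason

stats = CountMinSketch()
stats.add("type=jet", 1000)
stats.add("type=glider", 5)
print(stats.estimate("type=glider") >= 5)  # True
print(cheapest_index({"attribute": 5, "geo-temporal": 1000}))
# ('attribute', 'attribute: fewest estimated rows (5)')
```

As in Guess Who?, the best first question is the one expected to eliminate the most candidates.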

life is good
● we have the option of querying by any one attribute
● we have non-duplicative indexes for non-point geometries, even those that
cross the anti-meridian

part 6: the pain
"You caught my nose in your left castanet, love / I can feel the pain yet, love / every time I hear drums"

serious fun requires serious thought

analytics, streaming, and cross-platform support
Apache Arrow

"who knew [geo data] could be so complicated?"
● there exist simpler solutions
○ D4M works very well, albeit not specifically for geo-time data
○ Elasticsearch has geographic, temporal indexes
● do you have a simpler problem?
○ do you need low-latency, high-velocity streaming data ingest, processing?
○ does even your streaming, in-memory geo-time data store require secondary indexing?
○ do your clients require access via OGC services, the GeoTools API?
○ must you support multiple flavors of NoSQL?

if it doesn't hurt, you're doing it wrong
"Fracture my spine / and swear that you're mine / as we dance to the Masochism Tango"

for additional questions...
