The document is a presentation titled "Scaling GIS in 3 Acts - Lightning Edition" presented by Nick Dimiduk on May 22, 2012. It discusses scaling geographic information systems (GIS) in three acts: Act I defines what GIS is and that it involves data on maps; Act II discusses what can be done with GIS such as geospatial queries and non-Euclidean geometry; Act III covers implementing GIS on HBase including spatial partitioning and indices.
Introductions. Who am I: Nick Dimiduk, Data Platform team. What I do: help growers manage risk; sell insurance. 3 bad years = lose your farm. "What will the weather be like this spring in Jasper County, IL?" "How many consecutive days above 74 degrees?" "How similar is the weather in Sioux Falls, SD vs. Dayton, OH?" The data: *get data stats from zimmer*
On the side. Coauthor: Amandeep Khurana, Cloudera. In print this fall. "Generous" discount code.
USGS (US Geological Survey) has a boring definition: "Software system capable of capturing, storing, analyzing, displaying geographically referenced information."
TN River Gorge, Chattanooga. Maps + Data: actionable insight from data, which happens to make really pretty pictures. Different views for different people. Really not all that different from this whole "Big Data" thing. Map = data. Base layers: terrain information: data; river and lake boundaries: data; cities, roads: data. Really just data + data in a picture (interactive?). Quick poll: "How many of you make pretty pictures in your day-to-day data activities?"
Not all fun and games. Most GIS built by geographers… for bureaucrats. No software engineering experience or motivation. "State of the art" is ArcGIS; this is not a modern technical landscape. The world isn't flat: mostly 3D information, which, btw, often changes over time (4D), reduced to 2D for everyone's convenience, stored in a 1D world (disk platters read linearly). "Here are Dragons": Hunt-Lenox Globe, 1503-07. Unknown areas, stories of lions and dragons.
What does Climate do with geo data? Historically: storing, analyzing. Now and future: capturing, displaying.
Queries against spatial data. Geometry/geography as first-class citizens. Intersperse spatial queries with other attributes (it's all just data, remember?). Geometric queries: example of an "intersection" query; also containment, overlap. Describe the intersection between two geometries with the "Dimensionally Extended 9-Intersection Model (DE-9IM)". Nearest neighbor. No linear measurements (miles, kilometers) involved! GIS visualizing query results from early prototype.
Measurement units become tricky. Angular distance is not linear distance. Know that old joke about physicists? "Assume a spherical horse." Earth is an irregular sphere, approximated as an ideal one. Using planar (2D) coordinates: Coordinate Reference Systems. 180 deg does not a triangle make. Validate your assumptions.
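To make "angular distance is not linear distance" concrete, here is a minimal sketch using the haversine formula. It assumes a perfectly spherical Earth with a mean radius of 6371.0088 km (exactly the "spherical horse" simplification), and the function name `haversine_km` is illustrative, not something from the talk.

```python
import math

# Assumption: Earth treated as a perfect sphere with this mean radius.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # The same 1-degree step in longitude covers very different ground distances.
    print(haversine_km(0.0, 0.0, 0.0, 1.0))    # at the equator, about 111 km
    print(haversine_km(60.0, 0.0, 60.0, 1.0))  # at 60 degrees north, roughly half that
```

One degree of longitude is about 111 km at the equator but only about half that at 60 degrees north, so a query radius expressed in degrees means different things at different latitudes.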
Or "Horizontal scaling of geospatial systems." In the cloud. What's the data? Vector data: 1.5B features (geometry + metadata). Raster data: whole US at 10m resolution. Reference into 30-100+ years' worth of historical time-series. All on AWS.
Performant access to data requires indexing. Linearization via space-filling curve: index 2D data in a single dimension; preserve locality as much as possible. Z curve, Hilbert curve, etc. Geohashing. Far from perfect; edge cases still hurt. Works okay for points but not for arbitrary geometries.
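As an illustration of linearization, here is a small sketch of geohash encoding: longitude and latitude bits are interleaved (a Z-order curve) and written in the geohash base-32 alphabet. The function name `geohash_encode` and the sample points are chosen for the example; this is not code from the talk.

```python
# Standard geohash base-32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 4))
    # Two points a few meters apart, on opposite sides of the equator
    # and the prime meridian, share no common prefix at all.
    print(geohash_encode(0.0001, 0.0001, 6))
    print(geohash_encode(-0.0001, -0.0001, 6))
```

Nearby points usually share a long prefix, which is what makes prefix scans useful, but the last two calls show the edge case: neighbors that straddle a major cell boundary end up far apart in the one-dimensional key space.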
Horizontal scaling requires partitioning. Many (2) ways to slice it (boo). Domain-partitioning: cut the world into chunks. Lon, lat are a fixed domain; simple to split evenly (hemispheres); poor distribution of work. Range-partitioning: split according to the values you have; scale according to data density. Effective partitioning requires knowledge of your data, or a specialized data structure (foreshadowing).
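The difference between the two schemes can be sketched in a few lines. The example below partitions by longitude only and uses a made-up, deliberately skewed data set; the function names and the data are assumptions for illustration, not the system described in the talk.

```python
import bisect


def domain_partition(lon, num_partitions):
    """Domain partitioning: fixed-width longitude bands, independent of the data."""
    width = 360.0 / num_partitions
    idx = int((lon + 180.0) / width)
    return min(idx, num_partitions - 1)  # lon == 180 falls in the last band


def range_split_points(sorted_keys, num_partitions):
    """Range partitioning: choose split points at quantiles of the observed keys."""
    n = len(sorted_keys)
    return [sorted_keys[(i * n) // num_partitions] for i in range(1, num_partitions)]


def range_partition(key, split_points):
    """Index of the range a key falls into, given sorted split points."""
    return bisect.bisect_right(split_points, key)


def count_per_partition(assignments, num_partitions):
    counts = [0] * num_partitions
    for p in assignments:
        counts[p] += 1
    return counts


# Hypothetical skewed data: 90 points clustered in a narrow band, 10 spread out.
lons = [-100 + 0.1 * i for i in range(90)] + [-170 + 34 * i for i in range(10)]

domain_counts = count_per_partition([domain_partition(x, 4) for x in lons], 4)
splits = range_split_points(sorted(lons), 4)
range_counts = count_per_partition([range_partition(x, splits) for x in lons], 4)

if __name__ == "__main__":
    print("domain:", domain_counts)  # one band holds nearly everything
    print("range: ", range_counts)   # even, because splits follow the data
```

Domain partitioning needs no knowledge of the data but leaves one partition doing almost all the work; range partitioning balances the load but only because the split points were computed from the keys themselves.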
Performant access to data requires indexing. Dimensionally aware indices: KD-Trees work great for point data (nearest neighbors); R-Tree variants for arbitrary geometries, but costly to construct. Uniform partitions => uniform trees => uniform access performance. Two approaches to scaling. (1) 2-layer indices: 1st layer is a coarse-grained partition, 2nd layer is a specialized index. This is MD-HBase. Easier to implement, but can potentially miss geometries => incomplete results!!! (2) Persisted spatial indices: implement a persisted R-Tree; custom regions via RegionSplitPolicy (0.94+). Should be more correct… "there are dragons".
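A toy sketch of the 2-layer idea, and of how it can return incomplete results, follows. The first layer is a fixed grid of hypothetical 10-degree cells; the second layer is a plain list scan standing in for a KD-Tree or R-Tree. This is not MD-HBase's actual design, and every name here is an assumption made for illustration.

```python
from collections import defaultdict

CELL_DEG = 10.0  # hypothetical coarse cell size for the first layer


def cell_of(lon, lat):
    return (int((lon + 180.0) // CELL_DEG), int((lat + 90.0) // CELL_DEG))


def cells_for_bbox(bbox):
    """All coarse cells touched by a (min_lon, min_lat, max_lon, max_lat) box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    cx0, cy0 = cell_of(min_lon, min_lat)
    cx1, cy1 = cell_of(max_lon, max_lat)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def bboxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class TwoLayerIndex:
    def __init__(self, register_all_cells=True):
        self.cells = defaultdict(list)  # layer 1: coarse cell -> entries
        self.register_all_cells = register_all_cells

    def insert(self, name, bbox):
        if self.register_all_cells:
            targets = cells_for_bbox(bbox)
        else:
            # Naive variant: file the geometry under the cell of its center only.
            center = ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)
            targets = [cell_of(*center)]
        for cell in targets:
            self.cells[cell].append((name, bbox))

    def query(self, bbox):
        hits = set()
        for cell in cells_for_bbox(bbox):
            # Layer 2 stand-in: linear scan of the cell's entries.
            for name, candidate in self.cells.get(cell, ()):
                if bboxes_intersect(candidate, bbox):
                    hits.add(name)
        return hits


if __name__ == "__main__":
    river = (-95.0, 35.0, -84.0, 36.0)   # spans two coarse cells
    window = (-94.0, 35.2, -92.0, 35.8)  # overlaps the river's western end
    for all_cells in (True, False):
        index = TwoLayerIndex(register_all_cells=all_cells)
        index.insert("river", river)
        print(all_cells, index.query(window))
```

When a geometry that spans several cells is filed under only one of them, a query that touches a different cell never sees it. Registering the geometry in every cell it overlaps avoids the miss at the cost of duplicate entries.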