This document summarizes a presentation on geo data analytics. It discusses why geo data matters, common data formats and libraries for working with spatial data, challenges of working with spatial data at scale, and solutions including dimension reduction techniques and spatial databases. It also provides tips for working with spatial data in tools like Spark, R, and JavaScript libraries.
4. Agenda
● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures
13. GIS Systems
List of the most popular ones -
http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software
QGIS, TileMill, GRASS
15. Problem?
● Non-scalar data types
○ Aggregating
○ Sharding
○ Unordered
● Speed & Accuracy
○ The physical world is non-Euclidean
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
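To make the "non-Euclidean" point concrete, here is a minimal sketch comparing a great-circle (haversine) distance with a naive planar distance computed directly on degrees. The object and method names are hypothetical, and the Earth is assumed to be a sphere of mean radius 6371.0088 km, which is itself an approximation.

```scala
object GeoDistanceSketch {
  // Assumption: spherical Earth with the mean radius in kilometres.
  val EarthRadiusKm = 6371.0088
  // Length of one degree of arc on that sphere.
  val KmPerDegree: Double = EarthRadiusKm * math.Pi / 180.0

  /** Great-circle distance in km between two points given in degrees. */
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(a))
  }

  /** Naive distance: treats latitude/longitude degrees as flat Cartesian units. */
  def naiveKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double =
    math.hypot(lat2 - lat1, lon2 - lon1) * KmPerDegree
}
```

At the equator the two agree for one degree of longitude, while at 60° latitude the naive version overestimates by roughly a factor of two, which is the kind of accuracy trap the slide refers to.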
19. Dimension Reduction
● GeoHash - The mainstream way
○ Linear (non-tangent), up to x5 difference in cell area
○ Same Prefix - Close areas (sort of…)
○ http://geohash.org/
○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc
● S2 - The Google way
○ Quadratic, same level cell ~ similar area
○ Faces of a projected cube - divided by quad-trees into levels - referenced to position on face by a Hilbert curve
○ https://code.google.com/p/s2-geometry-library/
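As an illustration of how GeoHash reduces two dimensions to one key, here is a minimal sketch of the standard encoding: longitude and latitude bits are interleaved by repeated bisection, and every 5 bits become one base-32 character. The object name and the default precision are assumptions; in practice this would come from a library.

```scala
object GeoHashSketch {
  // The GeoHash base-32 alphabet (no a, i, l, o).
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** Encodes a point in degrees to a GeoHash string of `precision` characters. */
  def encode(lat: Double, lon: Double, precision: Int = 8): String = {
    var latLo = -90.0;  var latHi = 90.0
    var lonLo = -180.0; var lonHi = 180.0
    val sb = new StringBuilder
    var useLon = true // bits alternate, starting with longitude
    var bits = 0
    var ch = 0
    while (sb.length < precision) {
      if (useLon) {
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid }
        else { ch = ch << 1; lonHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid }
        else { ch = ch << 1; latHi = mid }
      }
      useLon = !useLon
      bits += 1
      if (bits == 5) { // 5 bits make one base-32 character
        sb.append(Base32(ch))
        bits = 0
        ch = 0
      }
    }
    sb.toString
  }
}
```

The "sort of" in "Same Prefix - Close areas" shows up at cell boundaries: two points a few metres apart on opposite sides of the equator and the prime meridian, such as (0.0001, 0.0001) and (-0.0001, -0.0001), already differ in their first character.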
20. Near Real Time Answers
● MongoDB Geospatial Indexing
● Elastic / Solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)
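The "build your own" option can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory sorted TreeMap stands in for HBase / Cassandra, a simple Z-order (bit-interleaved) key stands in for the reduced key, and all names (ZOrderStoreSketch, key, cellRange, scan) are hypothetical.

```scala
import scala.collection.immutable.TreeMap

object ZOrderStoreSketch {
  val Bits = 16 // bits kept per coordinate (an arbitrary choice for the sketch)

  // Maps a coordinate in [min, max] to an integer in [0, 2^Bits - 1].
  private def quantize(value: Double, min: Double, max: Double): Long = {
    val scaled = ((value - min) / (max - min) * (1L << Bits)).toLong
    math.min(scaled, (1L << Bits) - 1) // clamp the upper edge
  }

  /** Reduced key: longitude and latitude bits interleaved into one Long. */
  def key(lat: Double, lon: Double): Long = {
    val x = quantize(lon, -180, 180)
    val y = quantize(lat, -90, 90)
    var k = 0L
    var i = Bits - 1
    while (i >= 0) {
      k = (k << 2) | (((x >> i) & 1L) << 1) | ((y >> i) & 1L)
      i -= 1
    }
    k
  }

  /** Key range [from, until) of the level-`level` cell containing the point. */
  def cellRange(lat: Double, lon: Double, level: Int): (Long, Long) = {
    val shift = 2 * (Bits - level)
    val prefix = key(lat, lon) >> shift
    (prefix << shift, (prefix + 1) << shift)
  }

  /** All values stored in the same level-`level` cell: one sorted range scan. */
  def scan[V](store: TreeMap[Long, V], lat: Double, lon: Double, level: Int): Iterable[V] = {
    val (from, until) = cellRange(lat, lon, level)
    store.range(from, until).values
  }
}
```

Because keys that share a prefix are contiguous in sort order, "everything in this cell" becomes a single range scan, which is the access pattern sorted key-value stores are fast at. Neighbouring cells still need their own scans.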
21. Big Processing - It’s a UDF World
● ESRI - Hive UDFs -
https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
● Pigeon - Pig UDFs -
https://github.com/aseldawy/pigeon
● Spark -
○ SpatialSpark
○ GeoTrellis
22. Graph Representation
● Use Cases
○ Routing
○ Supply Chains
○ User Networks
● Tools
○ GraphX (Spark!) / Giraph (MR)
○ Dato SGraph (formerly known as GraphLab)
○ Gephi (On small parts for exploration)
● Algorithms
○ Shortest Path - Dijkstra / A*
○ Communities - Triangle Counting
○ Importance - Centrality / PageRank
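For the routing use case, a minimal sketch of Dijkstra's shortest-path algorithm on plain Scala collections might look like this. It is a single-machine illustration, not the GraphX or Giraph API; the graph type, the object name, and the use of whole-number edge weights (for example metres) are assumptions made for the example.

```scala
import scala.collection.mutable

object ShortestPathSketch {
  // Adjacency list: node -> outgoing (neighbour, non-negative weight) edges.
  type Graph = Map[String, List[(String, Long)]]

  /** Shortest distance from `source` to every reachable node. */
  def dijkstra(graph: Graph, source: String): Map[String, Long] = {
    val dist = mutable.Map(source -> 0L)
    // PriorityQueue is a max-heap, so order by negated distance to pop the nearest node first.
    val ord = Ordering.by[(Long, String), Long](p => -p._1)
    val queue = mutable.PriorityQueue.empty[(Long, String)](ord)
    queue.enqueue((0L, source))

    while (queue.nonEmpty) {
      val (d, node) = queue.dequeue()
      // Skip stale queue entries that were superseded by a shorter path.
      if (d <= dist.getOrElse(node, Long.MaxValue)) {
        for ((next, weight) <- graph.getOrElse(node, Nil)) {
          val candidate = d + weight
          if (candidate < dist.getOrElse(next, Long.MaxValue)) {
            dist(next) = candidate
            queue.enqueue((candidate, next))
          }
        }
      }
    }
    dist.toMap
  }
}
```

Nodes that cannot be reached from the source are simply absent from the result. A* follows the same loop but orders the queue by distance plus a heuristic, such as the straight-line distance to the target.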
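28. Good Old Word Count
// Assumes `spark` is a SparkContext (as in the classic Spark examples) and that
// the S2 Java library (com.google.common.geometry) is on the classpath.
import com.google.common.geometry.{S2CellId, S2LatLng}

// Word Count
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

// Take that from a library! Sketched here with the S2 Java library.
// S2 cell IDs are 64-bit, so the result is a Long rather than an Int.
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()

// Modified Word Count: count points per S2 cell instead of words.
// Assumes CSV rows with longitude in column 1 and latitude in column 2.
val pointsFile = spark.textFile("hdfs://...")
val cellCounts = pointsFile.map(line => line.split(","))
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1))
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")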
29. Advanced - Precision is of the Essence
● Density Based Clustering
○ DBSCAN
■ Minimum cluster size (> Noise)
■ Epsilon (Spatial Radius)
○ R - MASS - kde2d
■ RGoogleMaps for the map
■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
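To show how the two DBSCAN parameters interact, here is a minimal, quadratic-time sketch. It assumes the points are already projected to a planar coordinate system so that Euclidean distance is meaningful; for raw latitude/longitude a great-circle distance would be needed instead. The names (DbscanSketch, eps, minPts) are illustrative, and a real workload would use a library with a spatial index.

```scala
import scala.collection.mutable

object DbscanSketch {
  type Point = (Double, Double)
  val Noise = -1
  private val Unvisited = -2

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  /** Returns one label per point: a cluster id starting at 0, or Noise. */
  def dbscan(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.length)(Unvisited)

    // Indices of all points within eps of point i (including i itself).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var cluster = -1
    for (i <- points.indices) {
      if (labels(i) == Unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) labels(i) = Noise // may later become a border point
        else {
          cluster += 1
          labels(i) = cluster
          val queue = mutable.Queue(seeds: _*)
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) labels(j) = cluster // border point joins the cluster
            else if (labels(j) == Unvisited) {
              labels(j) = cluster
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // j is a core point: keep expanding
            }
          }
        }
      }
    }
    labels
  }
}
```

Epsilon sets the spatial radius of a neighbourhood and the minimum cluster size decides when a neighbourhood is dense enough to count; anything that never falls inside a dense neighbourhood is labelled as noise.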
30. rJava
● Wrap geospatial functions of your choice
● Call them from R
● Use apply on an entire data frame!
● Use as features!
● Visualize??? (in 5 minutes)