Geo data analytics

@dmarcous
● DBA (@IDF)
● Big Data Professional (@IDF)
● Data Wizard - Magic with Data (@Google - Waze)

● Pure professional
● Best practices
● Tools
● Tips & Tricks
● Free Advice!

Agenda
● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures

● C/C++, GEOS: http://trac.osgeo.org/geos
● C#, NTS: http://code.google.com/p/nettopologysuite/
● Java, JTS:
○ http://tsusiatsoftware.net/jts/main.html
○ http://www.vividsolutions.com/jts/JTSHome.htm
● Python, shapely: https://github.com/Toblerity/Shapely
● Ruby, ffi-geos: https://github.com/dark-panda/ffi-geos
● Javascript, JSTS: http://github.com/bjornharrtell/jsts

Formats
● WKT / WKB - Well-Known Text / Well-Known Binary, text and binary encodings of geometries
○ POLYGON((34.807841777801514 32.164333053441936,34.81168270111084 32.164859820966136,34.81337785720825 32.1613540349589,34.80865716934204 32.16046394346568,34.807841777801514 32.164333053441936))
○ http://arthur-e.github.io/Wicket/sandbox-gmaps3.html
● GeoJSON
○ { "type": "FeatureCollection", "features": [{ "type": "Feature", "properties": { "Name": "Verint", "Guest": "dmarcous", "Accommodations": "Beer; Pizza" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 34.807841777801514, 32.164333053441936 ], [ 34.81168270111084, 32.164859820966136 ], [ 34.81337785720825, 32.1613540349589 ], [ 34.80865716934204, 32.16046394346568 ], [ 34.807841777801514, 32.164333053441936 ]]]}}]}
○ http://geojson.io/#map=17/32.16267/34.81061
● Shape Files - ESRI vector format
● GML - The Geography Markup Language (GML) is an XML grammar for expressing geographical features.
● Raster - Image-like grid of cells positioned by coordinates
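
WKT and GeoJSON carry the same ring of coordinates in different wrappers. A minimal sketch of that, using only the standard library; the object and function names (GeoFormats, toWkt, toGeoJsonGeometry) are made up for illustration and are not taken from any of the libraries listed above:

```scala
object GeoFormats {
  // A polygon ring as (longitude, latitude) pairs; the first and last point must be equal.
  type Ring = Seq[(Double, Double)]

  // WKT separates the two coordinates with a space and the points with a comma.
  def toWkt(ring: Ring): String =
    ring.map { case (lon, lat) => s"$lon $lat" }.mkString("POLYGON((", ",", "))")

  // GeoJSON nests the same ring as arrays of [lon, lat] pairs.
  def toGeoJsonGeometry(ring: Ring): String = {
    val coords = ring.map { case (lon, lat) => s"[$lon,$lat]" }.mkString("[[", ",", "]]")
    s"""{"type":"Polygon","coordinates":$coords}"""
  }
}
```

For real data, a parser from one of the geometry libraries (JTS, GEOS, shapely) is the safer choice; this only shows how close the two encodings are.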

Databases
● RDBMS
○ Postgres (PostGIS)
○ MS-SQL / DB2 / Oracle
● NoSQL
○ MongoDB
○ IBM Cloudant
○ Lucene spatial module (Elasticsearch / Solr)
● Pure Geospatial Database
○ CartoDB (open source / hosted)
○ GeoMesa (Accumulo)
■ GeoTrellis - Scala framework for processing raster data

GIS Systems
List of most popular ones -
http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software
QGIS, TileMill, GRASS

Problem?
● Non-scalar data types
○ Aggregating
○ Sharding
○ Unordered
● Speed & Accuracy
○ The physical world is non-Euclidean
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
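
To make the non-Euclidean point concrete: one degree of longitude covers about half the ground distance at 60° latitude that it covers at the equator. A minimal sketch using the haversine formula on a spherical Earth (the sphere and the mean radius are simplifying assumptions; the names are illustrative):

```scala
object Distances {
  // Mean Earth radius in km; treating the Earth as a sphere is an approximation.
  val EarthRadiusKm = 6371.0088

  // Great-circle (haversine) distance between two points given in degrees.
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(a))
  }

  // Naive planar distance in "degrees", which ignores latitude entirely.
  def naiveDegrees(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double =
    math.hypot(lat2 - lat1, lon2 - lon1)
}
```

Both pairs (0,0)-(0,1) and (60,0)-(60,1) are exactly 1.0 apart in naive degrees, while haversineKm returns roughly twice the distance for the equatorial pair.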

Data Structures
● R-Tree (PostGIS, implemented on top of GiST)
● Quad Tree (DB2)
● Hyperdimensional Hashing
● Space Filling Curves
○ Z Order Curve (MS-SQL)
○ Hilbert Curve
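
A space-filling curve turns a 2D grid position into a single sortable key. A minimal sketch of the Z-order (Morton) curve by bit interleaving, assuming coordinates were already quantized to 16-bit grid indices; the quantize helper and all names are illustrative, not any database's implementation:

```scala
object ZOrder {
  // Spread the lower 16 bits of x so that a zero bit sits between every two bits.
  private def spread(x: Int): Int = {
    var v = x & 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    v
  }

  // Interleave: x bits land on even positions, y bits on odd positions.
  def encode(x: Int, y: Int): Long =
    spread(x).toLong | (spread(y).toLong << 1)

  // Hypothetical helper: map a value in [min, max] onto a 16-bit grid index.
  def quantize(value: Double, min: Double, max: Double): Int =
    math.min(65535, ((value - min) / (max - min) * 65536).toInt)
}
```

For example, ZOrder.encode(ZOrder.quantize(lon, -180, 180), ZOrder.quantize(lat, -90, 90)) gives a key where nearby cells often, but not always, get nearby values.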

Dimension Reduction
● GeoHash - The mainstream way
○ Linear (non-tangent), up to 5x difference in cell area
○ Same Prefix - Close areas (sort of…)
○ http://geohash.org/
○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc
● S2 - The Google way
○ Quadratic, same level cell ~ similar area
○ Faces of a projected cube - divided by Quad-Trees to levels -
Referenced to position on face by a Hilbert Curve
○ https://code.google.com/p/s2-geometry-library/
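
The "same prefix, close areas" property of GeoHash follows directly from how the hash is built: each bit halves the longitude or latitude range, and every 5 bits become one base-32 character. A minimal sketch of the standard encoding (names are illustrative; a maintained library is preferable in production):

```scala
object GeoHash {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def encode(lat: Double, lon: Double, length: Int): String = {
    var latLo = -90.0
    var latHi = 90.0
    var lonLo = -180.0
    var lonHi = 180.0
    val sb = new StringBuilder
    var bits = 0        // bits collected for the current character
    var value = 0       // value of the current 5-bit group
    var lonTurn = true  // bits alternate, starting with longitude
    while (sb.length < length) {
      if (lonTurn) {
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { value = (value << 1) | 1; lonLo = mid }
        else { value = value << 1; lonHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { value = (value << 1) | 1; latLo = mid }
        else { value = value << 1; latHi = mid }
      }
      lonTurn = !lonTurn
      bits += 1
      if (bits == 5) {
        sb.append(Base32.charAt(value))
        bits = 0
        value = 0
      }
    }
    sb.toString
  }
}
```

A shorter hash of a point is always a prefix of its longer hash, so truncating a key widens the cell. The converse does not hold: two points on opposite sides of a cell boundary can be close and still share no prefix, which is the "sort of" caveat above.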

Near Real Time Answers
● MongoDB Geospatial Indexing
● Elasticsearch / Solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)
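
The "build your own" option boils down to a range scan over sorted, dimension-reduced keys. A minimal sketch with an in-memory TreeMap standing in for an ordered key-value store such as HBase; the geohash-like keys and the prefixScan name are hypothetical:

```scala
import scala.collection.immutable.TreeMap

object CellStore {
  // Return every entry whose key starts with the given cell prefix.
  // Works because keys are sorted: all matches sit in one contiguous range.
  def prefixScan[V](store: TreeMap[String, V], prefix: String): Map[String, V] =
    store.range(prefix, prefix + Char.MaxValue)
}
```

With keys "u4prux", "u4pruy", "u4q000" and "sv8wrq" in the store, CellStore.prefixScan(store, "u4pr") touches only the first two, without scanning the rest.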

Big Processing - It’s a UDF World
● ESRI - Hive UDFs - https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
● Pigeon - Pig UDFs - https://github.com/aseldawy/pigeon
● Spark -
○ SpatialSpark
○ GeoTrellis

Graph Representation
● Use Cases
○ Routing
○ Supply Chains
○ Users Networks
● Tools
○ GraphX (Spark!) / Giraph (MR)
○ Dato SGraph (formerly known as GraphLab)
○ Gephi (On small parts for exploration)
● Algorithms
○ Shortest Path - Dijkstra / A*
○ Communities - Triangle Counting
○ Importance - Centrality / PageRank
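
For the routing use case, a minimal single-machine sketch of Dijkstra's shortest path on a weighted adjacency map. This is plain Scala for small graphs, not the GraphX or Giraph API; the Graph type and names are illustrative:

```scala
import scala.collection.mutable

object Routing {
  // node -> list of (neighbor, non-negative edge weight)
  type Graph = Map[String, List[(String, Double)]]

  def dijkstra(graph: Graph, source: String): Map[String, Double] = {
    val dist = mutable.Map(source -> 0.0)
    // PriorityQueue pops the "largest" element, so treat the smaller distance as larger.
    val closestFirst = Ordering.fromLessThan[(Double, String)](_._1 > _._1)
    val queue = mutable.PriorityQueue((0.0, source))(closestFirst)
    while (queue.nonEmpty) {
      val (d, node) = queue.dequeue()
      // Skip stale queue entries that were superseded by a shorter path.
      if (d <= dist.getOrElse(node, Double.PositiveInfinity)) {
        for ((next, weight) <- graph.getOrElse(node, Nil)) {
          val candidate = d + weight
          if (candidate < dist.getOrElse(next, Double.PositiveInfinity)) {
            dist(next) = candidate
            queue.enqueue((candidate, next))
          }
        }
      }
    }
    dist.toMap
  }
}
```

A* follows the same loop but orders the queue by distance so far plus a heuristic, for example the great-circle distance to the target.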

Timezones
● tz_world
○ http://efele.net/maps/tz/world/
○ What do we do with shapefiles?
● APIs
○ Geonames
○ http://www.earthtools.org/
○ Google Timezone API
● UDFs?
○ Hive - from_utc_timestamp(timestamp, string timezone)
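
Outside Hive, the same UTC-to-local conversion can be sketched on the JVM with java.time. The zone id is assumed to be already known for the point (for example from a tz_world lookup, which is not shown); the object name is illustrative:

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

object TimezoneConversion {
  // Convert a UTC instant to wall-clock time in an IANA zone such as "Asia/Jerusalem".
  def fromUtc(utc: Instant, zone: String): LocalDateTime =
    LocalDateTime.ofInstant(utc, ZoneId.of(zone))
}
```

Daylight saving time is handled by the zone rules, so noon UTC on 2015-06-01 maps to 15:00 in Asia/Jerusalem.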

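Good Old Word Count
// Word Count (spark is the SparkContext)
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

// Modified Word Count: count points per S2 cell instead of words
import com.google.common.geometry.{S2CellId, S2LatLng}

// Take that from a library! (here: the S2 Java library)
// A cell id is 64 bits wide, so it is returned as a Long.
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()

val points = spark.textFile("hdfs://...")
val cellCounts = points.map(line => line.split(","))
  // Assumes column 1 holds the longitude and column 2 the latitude
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1))
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")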

Advanced - Precision is of the Essence
● Density Based Clustering
○ DBSCAN
■ Minimum cluster size (> Noise)
■ Epsilon (Spatial Radius)
○ R - MASS - kde2d
■ RGoogleMaps for the map
■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
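
To show what the two DBSCAN parameters do, here is a minimal brute-force sketch. It assumes planar (Euclidean) distance on already-projected coordinates and scans all points for neighbors, so it is for illustration only; it is not the fpc implementation, and the names are made up:

```scala
import scala.collection.mutable

object Dbscan {
  type Point = (Double, Double)
  val Noise = -1
  private val Unvisited = -2

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per point: a cluster id (0, 1, ...) or Noise.
  // eps is the spatial radius; minPts is the minimum neighborhood size
  // (the point itself included) for a point to count as a core point.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): IndexedSeq[Int] = {
    val labels = Array.fill(points.length)(Unvisited)
    def neighbors(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var clusterId = 0
    for (i <- points.indices if labels(i) == Unvisited) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) {
        labels(i) = Noise // may later be relabeled as a border point
      } else {
        labels(i) = clusterId
        val queue = mutable.Queue.empty[Int]
        queue ++= seeds
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == Noise) {
            labels(j) = clusterId // border point: joins but does not expand
          } else if (labels(j) == Unvisited) {
            labels(j) = clusterId
            val more = neighbors(j)
            if (more.size >= minPts) queue ++= more // core point: keep expanding
          }
        }
        clusterId += 1
      }
    }
    labels.toIndexedSeq
  }
}
```

For raw latitude/longitude input, the distance function would need to be swapped for a great-circle distance, and a spatial index would replace the full scan in neighbors.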

rJava
● Wrap geospatial functions of your choice
● Call them from R
● Use apply on an entire Dataframe!
● Use as features!
● Visualize??? (in 5 minutes)

R Packs for Geospatial Analysis
● geonames
○ Timezone
○ Weather
○ Nearby places
● RGoogleMaps
○ download+paint Maps
○ getGeoCode
● sp / maps / maptools
○ OGC object abstractions
○ Manipulate / display geo data
● rgdal - spTransform
○ Convert formats / coordinates systems
● geosphere - distances / circles / centroids
● fpc - DBSCAN
● Coverage -
○ http://cran.r-project.org/web/views/Spatial.html

Engineered Geo features
● LOCAL
○ time
○ is_early / is_late
○ day of week
○ is_workday / is_weekend
○ is_daylight (sunrise / sunset, via tz_world)
● Weather
○ Temperature
○ is_rain / is_fog / is_hail / is_snow
● Squared (s2cell/ geohash) statistics
○ Probability of users in square to predict X
● Address - is_residence / is_business
● News - GDELT
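
The LOCAL time features above can be derived from a UTC timestamp once the point's timezone is known. A minimal sketch; the hour thresholds for is_early / is_late and the default weekend days are arbitrary assumptions to tune per use case (the weekend differs by country), and all names are illustrative:

```scala
import java.time.{DayOfWeek, Instant, ZoneId}

object GeoFeatures {
  case class TimeFeatures(
    localHour: Int,
    dayOfWeek: DayOfWeek,
    isWeekend: Boolean,
    isEarly: Boolean,
    isLate: Boolean
  )

  // The zone id is assumed to be already resolved from the coordinates (e.g. via tz_world).
  def timeFeatures(
    utc: Instant,
    zone: String,
    weekend: Set[DayOfWeek] = Set(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY)
  ): TimeFeatures = {
    val local = utc.atZone(ZoneId.of(zone))
    val hour = local.getHour
    TimeFeatures(
      localHour = hour,
      dayOfWeek = local.getDayOfWeek,
      isWeekend = weekend(local.getDayOfWeek),
      isEarly = hour < 7,  // assumed threshold
      isLate = hour >= 22  // assumed threshold
    )
  }
}
```

The same function can be mapped over a Spark RDD or wrapped for R via rJava, so the features are computed in local time rather than in UTC.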

Frontend = Javascript?
● Google Maps API
○ https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap
● Leaflet

R for Visualisation
● ggplot2 + geospatial packs
○ http://uce.uniovi.es/mundor/howtoplotashapemap.html
○ http://stackoverflow.com/questions/9558040/ggplot-map-with-l
○ http://spatial.ly/2012/02/great-maps-ggplot2/
● RGoogleMaps
○ http://rforwork.info/tag/rgooglemaps/

R For Interactive
● Shiny
○ Leaflet
■ http://rstudio.github.io/leaflet/
■ http://shiny.rstudio.com/gallery/superzip-example.html
■ http://shiny.rstudio.com/gallery/bus-dashboard.html
○ Globe
■ https://github.com/trestletech/shinyGlobe

R Animation
● http://rmaps.github.io/blog/posts/animated-choropleths/

Keep an Eye Out!
https://locationtech.org/list-of-projects

Contact
● Daniel Marcous
● dmarcous@gmail.com
