This document discusses processing large geospatial data at scale. It provides background on geospatial concepts like raster and vector data. It then discusses big data frameworks like Hadoop, Spark, and Accumulo that can be used to process geospatial data in parallel across large clusters. Finally, it presents several LocationTech projects like GeoTrellis, GeoJinni, and GeoWave that build geospatial capabilities on top of these frameworks to allow distributed processing and querying of large raster and vector maps.
2. What we’ll be covering…
Background on geospatial concepts
What is LocationTech?
Background on big data frameworks
Overview of LocationTech projects for
processing big geo data.
21. Large geospatial data
Landsat 8 on AWS: 311,405 scenes @ ~800 MB
each. That's ~250 TB and counting.
OpenStreetMap: planet.osm is 617 GB.
3 years of geotagged tweets: 3 TB
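A quick check of the Landsat figure above, as a minimal sketch. It assumes decimal units (1 TB = 1,000,000 MB) and takes the slide's approximate 800 MB per scene at face value; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the Landsat 8 archive size quoted on the slide.
SCENES = 311_405        # scene count from the slide
MB_PER_SCENE = 800      # approximate size per scene from the slide
MB_PER_TB = 1_000_000   # assumption: decimal units, not binary (TiB)

landsat_tb = SCENES * MB_PER_SCENE / MB_PER_TB
print(f"Landsat 8 on AWS: ~{landsat_tb:.0f} TB")
```

The result lands just under the rounded 250 TB on the slide, which is consistent with "~800 MB" being an estimate.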
31. After reading the papers, Nutch developers
added a distributed file system and MapReduce
model to Nutch.
In 2006, those portions were spun out of Nutch
to form…
34. Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for
Machine Learning algorithms and other
iterative models.
So in 2009, he created…
43. Apache Accumulo
Created by the NSA in 2008
Donated to the Apache Software Foundation in 2011
Graduated to a top-level project in 2012
Almost defunded by the US government the
same year.
44. (Sec. 929) Prohibits any DOD component from utilizing the
cloud computing database developed by the National Security
Agency (NSA) and known as "Accumulo" after the end of
FY2013, unless the DOD CIO certifies that: (1) there are no
viable commercial open source databases that have such security
features, or (2) Accumulo itself has become a successful open
source database project. Requires DOD and intelligence
community officials to coordinate the use by DOD components
of cloud computing infrastructure and services offered by the
intelligence community for purposes other than intelligence
analysis.
53. 72 Frames × 14 Billion points per frame
Total = 1 Trillion points
Generated in three hours on a 10-node cluster
HEAT MAP FROM 2009 TO 2014 MONTH-BY-MONTH
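The numbers on this slide can be cross-checked with a few lines of arithmetic. This is a rough sketch: it assumes "2009 to 2014" is inclusive with one frame per calendar month, treats "14 billion" as exact, and assumes the work was spread evenly over the three hours and ten nodes.

```python
# Sanity check of the heat-map numbers on the slide.
FRAMES = 72                          # from the slide
POINTS_PER_FRAME = 14_000_000_000    # "14 billion", treated as exact (assumption)
HOURS = 3                            # from the slide
NODES = 10                           # from the slide

# Assumption: 2009 through 2014 inclusive, one frame per month.
months = (2014 - 2009 + 1) * 12

total_points = FRAMES * POINTS_PER_FRAME
cluster_rate = total_points / (HOURS * 3600)  # points per second, whole cluster
node_rate = cluster_rate / NODES              # points per second, per node

print(f"months covered: {months}")
print(f"total points:   {total_points:.3e}")
print(f"throughput:     {cluster_rate:,.0f} points/s cluster, "
      f"{node_rate:,.0f} points/s per node")
```

Six years of monthly frames gives exactly 72, and the total comes out slightly above one trillion points, matching the rounded figure on the slide.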
58. SELECT
tweet.text,
user.name
FROM
tweet, user
WHERE
bbox(tweet.location, -115, 45, -110, 50) AND
tweet.user_id = user.user_id
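To make the semantics of the query concrete, here is a minimal single-machine sketch in plain Python. It is not how any of the distributed frameworks execute it. The `Tweet` and `User` classes, the helper names, and the sample rows are all hypothetical, and the argument order of `bbox` is assumed to be (min longitude, min latitude, max longitude, max latitude).

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    user_id: int
    lon: float
    lat: float


@dataclass
class User:
    user_id: int
    name: str


def in_bbox(lon, lat, min_lon, min_lat, max_lon, max_lat):
    """Return True if the point falls inside the axis-aligned bounding box."""
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def tweets_in_bbox(tweets, users, bbox):
    """Spatial filter on tweets, then an equi-join to users on user_id."""
    names = {u.user_id: u.name for u in users}  # hash side of the join
    return [
        (t.text, names[t.user_id])
        for t in tweets
        if in_bbox(t.lon, t.lat, *bbox) and t.user_id in names
    ]


# Hypothetical sample data, for illustration only.
tweets = [
    Tweet("hello from Montana", 1, -112.0, 47.0),
    Tweet("hello from Boston", 2, -71.06, 42.36),
]
users = [User(1, "alice"), User(2, "bob")]

BBOX = (-115, 45, -110, 50)  # same numbers as the query on the slide
result = tweets_in_bbox(tweets, users, BBOX)
print(result)
```

At scale, the spatial predicate is the expensive part: without a spatial index, every tweet has to be scanned, which is what the projects in this talk aim to avoid.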
60. GeoTrellis
A Scala library for geospatial data types and
operations.
Adds geospatial capabilities to Spark (raster
now, vector soon!).
Stores and queries rasters in HDFS,
Accumulo, and S3.
65. Benchmark Results
439.5 GB of monthly temperature model output data
USA temperature yearly average, 2006 to 2100
40 m3.xlarge instances
(estimated $2.00 USD per hour
on spot market)
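The benchmarked operation, a yearly average computed from monthly rasters, is a cell-wise mean across twelve grids. The sketch below shows that idea for one small tile in plain Python. It is not the GeoTrellis API, the function name `yearly_average` is hypothetical, and it assumes an unweighted mean (months are not weighted by their length) over tiles of identical shape.

```python
def yearly_average(monthly_tiles):
    """Cell-wise mean over a list of equally sized 2D tiles (lists of rows)."""
    n = len(monthly_tiles)
    rows = len(monthly_tiles[0])
    cols = len(monthly_tiles[0][0])
    return [
        [sum(tile[r][c] for tile in monthly_tiles) / n for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical data: twelve 2x2 tiles, where month m holds the value m everywhere.
monthly = [[[float(m)] * 2 for _ in range(2)] for m in range(12)]
average = yearly_average(monthly)
print(average)
```

Because each output cell depends only on the same cell in the twelve inputs, the work splits cleanly by tile and by year, which is what makes this kind of job a good fit for a cluster.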