2017 saw critical outages and interruptions that affected air and maritime traffic: hurricanes, computer outages, physical threats, and more. In this talk, we show how these events can be detected using rapidly developed geo-temporal analytics in Zeppelin and Jupyter notebooks, all backed by GeoMesa (and visualized in Stealth).
Key points to review include:
- identify example outages: the Delta Air Lines outage in January 2017 (also British Airways and others); China Shipping's CSCL Jupiter container ship runs aground just outside Antwerp; maritime traffic rerouted by Hurricane Irma. Both expected and unexpected interruptions are included.
- ingest data
- open the analyst notebook, connected to GeoMesa
- conduct live examples of analytics that highlight both the onset of service outages and the magnitude of their effects
Key takeaway: F/OSS -- including but not limited to packages such as GeoMesa, GeoServer, HBase, Jupyter/Zeppelin, D3, and R -- provides a rich platform for rapidly developing algorithms and queries that can identify, report on, and monitor outages in critical infrastructure.
2. The problem
British Airways has cancelled all flights from Gatwick
and Heathrow as computer problems cause disruption
worldwide. [...] A spokeswoman for the airline said:
"We have experienced a major IT system failure that
is causing very severe disruption to our flight
operations worldwide."
The Independent; May 27, 2017
[...] a massive power outage Sunday afternoon [at
Hartsfield-Jackson Atlanta International Airport] left
planes and passengers stranded for hours, forced
airlines to cancel more than 1,100 flights and created a
logistical nightmare during the already-busy holiday
travel season.
The Atlanta Journal Constitution; December 18, 2017
Then there was the NotPetya cyberattack that hit Danish
shipping giant Maersk in June. The crippling attack
seized the industry’s attention after it cost the company
$200 million to $300 million and led to a temporary
shutdown of the largest cargo terminal in the Port of Los
Angeles.
Risk Management; March 1, 2018
Around half of the flights in Europe on
Tuesday face delays after a computer failure
at the Eurocontrol center in Brussels, Belgium.
CNBC; April 3, 2018
Over 4,400 flights were canceled within, into or out of
the United States on Wednesday, according to aviation
data services company FlightAware. More than 700
flights have been canceled today, and airlines have
canceled over 10,000 flights this month because of the
weather.
ABC; March 22, 2018
Ultra-large container ship CSCL JUPITER ran
aground on Scheldt river bank at around 0700 LT
Aug 14 at Bath, Zeeland, Netherlands, while
proceeding downstream en route from Antwerp to
Hamburg. [...] If hull of the giant ship is breached,
dramatic situation may well turn into a nightmare.
Maritime Bulletin; August 15, 2017
5. How to handle big geo-temporal data?
A suite of tools for persisting,
querying, analyzing, and
streaming spatio-temporal data at
scale... and it can SQL!
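To make "it can SQL" concrete, here is a hedged sketch of the kind of spatio-temporal query GeoMesa's Spark SQL integration supports. The ais view and its mmsi/dtg/geom columns are hypothetical placeholders, and the polygon is an invented box around the Scheldt estuary near Antwerp:

  // Hedged sketch; assumes GeoMesa's Spark SQL extensions (which register the st_*
  // functions) are on the classpath and an "ais" view with mmsi/dtg/geom columns exists.
  val scheldt = spark.sql("""
    SELECT mmsi, dtg, geom
    FROM ais
    WHERE st_contains(
            st_geomFromWKT('POLYGON((3.9 51.2, 4.6 51.2, 4.6 51.5, 3.9 51.5, 3.9 51.2))'),
            geom)
      AND dtg BETWEEN '2017-08-13' AND '2017-08-16'
  """)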
7. Interactive Analysis in Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
8. What's missing?
We have...
● a problem
● big geo-time data
● big geo-time indexing
● interactive analyst notebook
We still need...
● a place to bring these all together
● to do it cheaply
● to do it flexibly and in a scalable way
13. Distributed Databases
Originally:
● Accumulo and HBase used HDFS
to store data.
● HDFS needs 3-5x storage for
replication and extra space.
● Compute and disk scaled
together!
Today:
● Accumulo and HBase can use
Azure Blob storage and S3*
● Cloud storage scales
● Compute and disk scale
separately!
* Accumulo works on Azure blob storage. Some modifications would be required to support S3.
Consult your cloud professional for details, interactions, and additional suggestions.
Take away:
GeoMesa HBase on S3 works great!
Others have used GeoMesa Accumulo on Azure...
Try it today!
14. “Ditching the Database?”
● Blobstores/Filesystems are key-value stores.
● Building a key-value store on top of a key-value store is
kinda' redundant.
● GeoMesa can load and parse GDELT S3 files into Spark.
● GDELT on S3 is organized by date-keys.
● What would happen if we ‘ditched’ HBase and wrote
files by space-time keys in cloud native storage?
● What parts of a ‘database’ do we really need to
store/index non-updating, ascending temporal data
streams?
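To make "files written by space-time keys" concrete, here is a minimal, hypothetical sketch (not GeoMesa's actual implementation) of deriving a day-plus-grid-cell partition path for one AIS record:

  import java.time.LocalDate

  // Hypothetical sketch only: a day + coarse-grid partition key for one record.
  // GeoMesa's real partition schemes use space-filling curves; the 4x4 grid here
  // is purely for illustration.
  def partitionPath(day: LocalDate, lon: Double, lat: Double): String = {
    val col = math.min((((lon + 180.0) / 360.0) * 4).toInt, 3)   // 4 cells of longitude
    val row = math.min((((lat +  90.0) / 180.0) * 4).toInt, 3)   // 4 cells of latitude
    f"${day.getYear}%04d/${day.getMonthValue}%02d/${day.getDayOfMonth}%02d/$row$col"
  }
  // e.g. partitionPath(LocalDate.of(2017, 8, 14), -90.0, 25.0) == "2017/08/14/21"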
15. GeoMesa FileSystem Datastore
● Serverless architecture
○ Standalone files in S3
○ Ephemeral compute for ingest and query
● Configurable partition schemes (geo + time)
● Parquet file format
○ Column-based storage (great for SQL!)
● Works great with Spark (intended for batch analysis)
[Diagram: S3 persistent storage, with ephemeral ingest and ephemeral Spark compute jobs spun up and torn down over time]
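A hedged sketch of reading the FileSystem datastore with Spark; the bucket path and feature type name are hypothetical, and the exact parameter keys ("fs.path", "fs.encoding") may vary by GeoMesa version:

  // Hedged sketch; path, parameter keys, and feature name are assumptions.
  val params = Map(
    "fs.path"     -> "s3a://my-bucket/ais/",   // root of the space-time partitioned Parquet files
    "fs.encoding" -> "parquet"
  )
  val ais = spark.read
    .format("geomesa")                          // GeoMesa's Spark SQL relation provider
    .options(params)
    .option("geomesa.feature", "ais")           // hypothetical feature type name
    .load()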
17. Setup
● Import GeoMesa dependency
● Create dataframe backed by GeoMesa
relation
● Create SQL temporary view so we can
query it
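A hedged sketch of these setup steps: in Zeppelin or Jupyter the GeoMesa Spark runtime jar is typically added to the interpreter or kernel dependencies (the exact artifact depends on your backend and version), and the ais DataFrame is the one from the previous sketch.

  // Hedged sketch; 'ais' is the DataFrame created above, and the GeoMesa Spark
  // runtime jar is assumed to already be on the notebook's classpath.
  ais.createOrReplaceTempView("ais")               // make it queryable with SQL
  spark.sql("SELECT count(*) FROM ais").show()     // quick smoke test of the relation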
18. Subselect Data
● Create rough subselection of data
○ Bound by time
○ Bound by a bounding box roughly
around the Gulf of Mexico
● Create a new temporary view from this
subselection
● Cache the data (pull into memory)
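A hedged sketch of the rough subselection just described; the bounding box is an approximate Gulf of Mexico extent and the date range is an assumption:

  // Hedged sketch; bbox coordinates and dates are approximations.
  val gulf = spark.sql("""
    SELECT *
    FROM ais
    WHERE st_contains(st_makeBBOX(-98.0, 17.0, -80.0, 31.0), geom)
      AND dtg BETWEEN '2017-08-01' AND '2017-10-15'
  """)
  gulf.createOrReplaceTempView("gulf")
  spark.catalog.cacheTable("gulf")   // pull the subselection into memory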
19. Data Exploration
● Query for Tankers in the Gulf
● Get counts for each type of Tanker
● Group the counts by day
● Graph counts to see trends
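A hedged sketch of the exploration queries; 'shiptype' and the 'Tanker%' pattern are hypothetical, since real AIS feeds encode ship type as numeric codes or feed-specific labels:

  // Hedged sketch: daily counts of distinct tankers in the Gulf subselection.
  val tankersByDay = spark.sql("""
    SELECT to_date(dtg) AS day, shiptype, count(DISTINCT mmsi) AS ships
    FROM gulf
    WHERE shiptype LIKE 'Tanker%'
    GROUP BY to_date(dtg), shiptype
    ORDER BY day
  """)
  tankersByDay.createOrReplaceTempView("tankers_by_day")
  // graph 'ships' per day per shiptype with the notebook's built-in charting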
22. Extra Data
● Pull in Gas price data
○ Acquired from EIA.gov
○ Two Gas Price Indexes
■ NYH: New York Harbor
■ GC: Gulf Coast
● Create temporary view so we can
analyze with SQL
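A hedged sketch of loading the EIA series; the file location and the column names (date, nyh_price, gc_price) are hypothetical:

  // Hedged sketch; path and column names are assumptions.
  val gas = spark.read
    .option("header", "true")
    .option("inferSchema", "true")     // let Spark parse the date and price columns
    .csv("s3a://my-bucket/eia/gas_prices.csv")
  gas.createOrReplaceTempView("gas_prices")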
25. Data Exploration
● Backfill the price data with the last
value to give us day-continuous data
● Min/Max Normalize gas and ship
counts
● Graph gas prices and ship counts
together
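A hedged sketch of the backfill and normalization, reusing the hypothetical views above: forward-fill the (sparser) gas price series to daily values, then min/max normalize both series so they can be graphed on one axis.

  // Hedged sketch; assumes the gas_prices 'date' column parsed as a date.
  val combined = spark.sql("""
    WITH daily AS (
      SELECT day, sum(ships) AS ships
      FROM tankers_by_day
      GROUP BY day
    ), filled AS (
      SELECT d.day, d.ships,
             last(g.gc_price, true) OVER (ORDER BY d.day) AS gc_price  -- carry last known price forward
      FROM daily d
      LEFT JOIN gas_prices g ON g.date = d.day
    )
    SELECT day,
           (ships    - min(ships)    OVER ()) / (max(ships)    OVER () - min(ships)    OVER ()) AS ships_norm,
           (gc_price - min(gc_price) OVER ()) / (max(gc_price) OVER () - min(gc_price) OVER ()) AS gc_norm
    FROM filled
    ORDER BY day
  """)
  combined.createOrReplaceTempView("ships_vs_gas")   // graph both normalized series on one chart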
26. Questions?
Find out more at http://geomesa.org
Connect with us on Gitter: https://gitter.im/locationtech/geomesa
See applications at CCRi’s blog: http://www.ccri.com/blog/
Editor's Notes
introduction
name and company
Start and come back to this
bin/geomesa-fs aws -c s3://optix-fsds/exactAIS -z -p gmirad -k CCRi --WorkerCount 10 --WorkerType m3.2xlarge --ec2-attributes AdditionalMasterSecurityGroups=sg-0e5adb7c
bin/geomesa-fs aws -c s3://optix-fsds/exactAIS -z -p gmirad -k CCRi --WorkerCount 10 --WorkerType m3.2xlarge --ec2-attributes AdditionalMasterSecurityGroups=sg-0e5adb7c --no-upload
logistics depending more on core systems
computer systems
products early in supply lines
increased risk of cascading failure
easy to find articles about, hard to observe as situations unfold
identifying not merely the initial disruption but also the knock-on effects quickly becomes critical... that's what the rest of this talk is about
think less of "prediction" per se, and more of analytics:
How can we identify what is happening as first- and second-order effects?
To identify outages and effects
need to identify data sources and feeds.
The problem quickly becomes knowing how to
USE all of the data that exist.
use it quickly
not spend lots of devops and developer time
https://optix.earth/stealth
AIS Explanation (Automatic Identification System)
Requirement
international voyaging ships with 300 or more gross tonnage
any passenger ship
additional regional requirements based on ship length
Update rate
underway and moving fast ("booking it"): every 2-3 s
docked or loitering: every 10-15 min
in practice anything, since it is all self-reported
Satellite to visualization time
10-15s
near real time
about 10TB of data uncompressed
How do we do this?
GeoMesa has been to FOSS4G for years
F/OSS that handles big, streaming, geo-time data.
Uses space filling curves to index data
utilities to handle spatial queries
Most years we present "This is what it's going to do soon..."
This year, we're going to talk about "This is what we've done with it already...",
today: examples of working with data using these tools,
showing rapid prototyping of large-scale detection and analysis.
To make this prototyping more streamlined
GeoMesa integrates with Jupyter and Zeppelin
powered by standard (even cloudy) infrastructure: Hadoop; Spark; Kafka
add a little GeoMesa, and suddenly your Jupyter and Zeppelin notebooks are geo-temporally aware / capable
this is where the real innovation, prototyping, and iteration will happen
without notebooks iteration cycles can be long and difficult
visualization becomes a second step
Austin: I added this slide merely to serve as a more natural transition from the introductory material to the "GeoMesa storage" slides we have next. Please use or ignore as you see fit. Thanks!
There’s a predicament
faster storage is more expensive
about an order of magnitude per tier
We want a way to keep data in slow/cheap storage
want to query it fast
Traditional databases store data on local disks
It would be great to keep data as low down on this list as possible
accumulo and hbase used hdfs to store data
hdfs has high overhead
need 3-5x extra storage space for replication
compute and disk scale together
today
hbase and accumulo can be backed by blob storage
cloud storage handles replication
compute and disk are separate
geomesa hbase works great on s3
we’ll show some of that today
what would happen if we just did away with the datastore altogether?
do we really need it?
what does it gain us?
caching
better range lookups
the database handles file splits for you
without it, you have to know how to partition the data a priori
GeoMesa can store and query files directly in S3
what do we gain from this?
serverless architecture
only stand up servers when you need them
ingest
query
very cheap backend storage
less devops
works great for devops