Rapid Analytic Development on Near Real-Time Data
Austin Heyne, CCRi
aheyne@ccri.com
The problem
British Airways has cancelled all flights from Gatwick
and Heathrow as computer problems cause disruption
worldwide. [...] A spokeswoman for the airline said:
"We have experienced a major IT system failure that
is causing very severe disruption to our flight
operations worldwide."
The Independent; May 27, 2017
[...] a massive power outage Sunday afternoon [at
Hartsfield-Jackson Atlanta International Airport] left
planes and passengers stranded for hours, forced
airlines to cancel more than 1,100 flights and created a
logistical nightmare during the already-busy holiday
travel season.
The Atlanta Journal Constitution; December 18, 2017
Then there was the NotPetya cyberattack that hit Danish
shipping giant Maersk in June. The crippling attack
seized the industry’s attention after it cost the company
$200 million to $300 million and led to a temporary
shutdown of the largest cargo terminal in the Port of Los
Angeles.
Risk Management; March 1, 2018
Around half of the flights in Europe on
Tuesday face delays after a computer failure
at the Eurocontrol center in Brussels, Belgium.
CNBC; April 3, 2018
Over 4,400 flights were canceled within, into or out of
the United States on Wednesday, according to aviation
data services company FlightAware. More than 700
flights have been canceled today, and airlines have
canceled over 10,000 flights this month because of the
weather.
ABC; March 22, 2018
Ultra-large container ship CSCL JUPITER ran
aground on Scheldt river bank at around 0700 LT
Aug 14 at Bath, Zeeland, Netherlands, while
proceeding downstream en route from Antwerp to
Hamburg. [...] If hull of the giant ship is breached,
dramatic situation may well turn into a nightmare.
Maritime Bulletin; August 15, 2017
Are there any data?
Yes. Lots.
Satellite AIS
How to handle big geo-temporal data?
A suite of tools for persisting,
querying, analyzing, and
streaming spatio-temporal data at
scale... and it can SQL!
Analyst notebooks support innovation
Interactive Analysis in Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
What's missing?
We have...
● a problem
● big geo-time data
● big geo-time indexing
● interactive analyst notebook
We still need...
● a place to bring these all together
● to do it cheaply
● to do it flexibly and in a scalable way
Storage Options
● GPU RAM
● RAM
● Attached Disk (SSD, Platter)
● Cloud Blob Storage (S3, WASB)
● Cloud archive storage (Glacier)
(Listed from faster / $$$ at the top to slower / $ at the bottom.)
Distributed Databases
Originally:
● Databases like Accumulo and
HBase used HDFS to store data.
● HDFS needs 3-5x storage for
replication and extra space.
● Compute and disk scaled
together!
Today:
● Accumulo and HBase can use
Azure Blob storage and S3*
● Cloud storage scales
● Compute and disk scale
separately!
* Accumulo works on Azure Blob storage. Some modifications would be required to support S3.
Consult your cloud professional for details, interactions, and additional suggestions.
Take away:
GeoMesa HBase on S3 works great!
Others have used GeoMesa Accumulo on Azure...
Try it today!
“Ditching the Database?”
● Blobstores/Filesystems are key-value stores.
● Building a key-value store on top of a key-value store is
kinda' redundant.
● GeoMesa can load and parse GDELT S3 files into Spark.
● GDELT on S3 is organized by date-keys.
● What would happen if we ‘ditched’ HBase and wrote
files by space-time keys in cloud native storage?
● What parts of a ‘database’ do we really need to
store/index non-updating, ascending temporal data
streams?
GeoMesa FileSystem Datastore
● Serverless architecture
○ Standalone files in S3
○ Ephemeral compute for ingest and query
● Configurable partition schemes (geo + time)
● Parquet file format
○ Column-based storage (great for SQL!)
● Works great with Spark (intended for batch analysis)
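One way to picture writing files by space-time keys is to derive an object-key prefix from each record's date and a coarse spatial cell. The sketch below is illustrative only: the function names, the "yyyy/MM/dd/cell" layout, and the four-quadrant cell are assumptions made for this example, not GeoMesa's actual partition encoding, which is configured through its own partition schemes.

```python
from datetime import datetime, timezone


def z2_quadrant(lon: float, lat: float) -> int:
    """Return a 2-bit cell id (0-3) for the world quadrant containing the point."""
    x_bit = 1 if lon >= 0.0 else 0   # 1 for the eastern half of the world
    y_bit = 1 if lat >= 0.0 else 0   # 1 for the northern half of the world
    return (y_bit << 1) | x_bit


def partition_path(lon: float, lat: float, ts: datetime) -> str:
    """Build a hypothetical 'yyyy/MM/dd/cell' object-key prefix for one record."""
    day = ts.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{day}/{z2_quadrant(lon, lat)}"


# A made-up tanker position near Galveston, Texas on 25 August 2017.
sample = partition_path(
    -94.8, 29.3, datetime(2017, 8, 25, 14, 30, tzinfo=timezone.utc)
)
print(sample)
```

With a layout like this, a query bounded by time and space only has to list and read the files under the matching prefixes, which is what lets ephemeral compute skip most of the data.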
[Figure: S3 Persistent Storage along a Time axis, with Ephemeral Ingest and Ephemeral Spark Compute attached at intervals]
Hurricane Harvey’s
Effect on Fuel Prices
Can oil tanker positions predict prices?
Setup
● Import GeoMesa dependency
● Create dataframe backed by GeoMesa
relation
● Create SQL temporary view so we can
query it
Subselect Data
● Create rough subselection of data
○ Bound by time
○ Bound by bounding box roughly
around the Gulf of Mexico
● Create a new temporary view from this
subselection
● Cache the data (pull into memory)
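The subselection steps above can be sketched in plain Python. This is a stand-in for the Spark SQL query used in the notebook, not a reproduction of it: the Position record, the bounding-box coordinates, and the time window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Position:
    ship_id: str
    lon: float
    lat: float
    ts: datetime


# Assumed rough bounding box around the Gulf of Mexico (west, south, east, north).
GULF_BBOX = (-98.0, 18.0, -80.0, 31.0)
# Assumed window of interest around Hurricane Harvey.
START, END = datetime(2017, 8, 1), datetime(2017, 10, 1)


def in_subselection(p: Position) -> bool:
    """True when the position falls inside both the time window and the box."""
    west, south, east, north = GULF_BBOX
    return START <= p.ts < END and west <= p.lon <= east and south <= p.lat <= north


positions = [
    Position("a", -94.8, 29.3, datetime(2017, 8, 25)),  # in the Gulf, in window
    Position("b", -74.0, 40.6, datetime(2017, 8, 25)),  # New York Harbor: outside box
    Position("c", -90.0, 27.0, datetime(2017, 12, 1)),  # in the Gulf, outside window
]

# Materialising the list once plays the role of caching the subselection.
gulf = [p for p in positions if in_subselection(p)]
print([p.ship_id for p in gulf])
```

Filtering coarsely first and keeping the result in memory means every later, finer query runs against a much smaller set.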
Data Exploration
● Query for Tankers in the Gulf
● Get counts for each type of Tanker
● Group the counts by day
● Graph counts to see trends
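The shape of such a query can be sketched with SQL. SQLite stands in for Spark SQL here only so the example runs anywhere; the table name, column names, ship-type labels, and rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ais (ship_id TEXT, ship_type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO ais VALUES (?, ?, ?)",
    [
        ("a", "Tanker - Hazard A", "2017-08-24 03:00:00"),
        ("b", "Tanker - Hazard A", "2017-08-24 17:00:00"),
        ("c", "Tanker - Hazard B", "2017-08-24 09:00:00"),
        ("d", "Cargo", "2017-08-24 10:00:00"),
        ("a", "Tanker - Hazard A", "2017-08-25 04:00:00"),
    ],
)

# Temporary view holding only tankers, mirroring the temp-view pattern.
conn.execute(
    "CREATE TEMP VIEW tankers AS SELECT * FROM ais WHERE ship_type LIKE 'Tanker%'"
)

# Distinct ships per tanker type per day, ready to plot as a time series.
daily_counts = conn.execute(
    """
    SELECT date(ts) AS day, ship_type, COUNT(DISTINCT ship_id) AS ships
    FROM tankers
    GROUP BY day, ship_type
    ORDER BY day, ship_type
    """
).fetchall()

for row in daily_counts:
    print(row)
```

Counting distinct ship identifiers rather than rows keeps a vessel that reports its position many times a day from being counted more than once.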
Data Exploration
● Restrict our search to just Trinity Bay
Data Exploration
● Create a new temporary view of the
number of ships in Trinity Bay
Extra Data
● Pull in Gas price data
○ Acquired from EIA.gov
○ Two Gas Price Indexes
■ NYH: New York Harbor
■ GC: Gulf Coast
● Create temporary view so we can
analyze with SQL
Data Exploration
● Graph data over time period of Harvey
● Notice we don’t have daily values
Data Exploration
● Create temporary view of gas price
data around our time of interest
Data Exploration
● Backfill the price data with the last
value to give us day-continuous data
● Min/Max Normalize gas and ship
counts
● Graph gas prices and ship counts
together
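Filling gaps with the last known value and min/max normalization can be sketched as follows. The function names, prices, and ship counts are invented for illustration. Carrying the last value forward is what pandas calls a forward fill, and that reading of "backfill with the last value" is an assumption.

```python
from datetime import date, timedelta


def fill_forward(values: dict[date, float], start: date, end: date) -> dict[date, float]:
    """Return one value per day in [start, end], repeating the last known value."""
    filled: dict[date, float] = {}
    last = None
    day = start
    while day <= end:
        last = values.get(day, last)
        if last is not None:  # skip leading days before the first observation
            filled[day] = last
        day += timedelta(days=1)
    return filled


def min_max(series: list[float]) -> list[float]:
    """Rescale a series to the range [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


# Made-up prices with a weekend gap (no values for 26 and 27 August).
prices = {date(2017, 8, 25): 1.60, date(2017, 8, 28): 1.80, date(2017, 8, 29): 2.00}
daily = fill_forward(prices, date(2017, 8, 25), date(2017, 8, 29))
ship_counts = [40, 35, 10, 5, 20]  # made-up daily tanker counts

print(min_max(list(daily.values())))
print(min_max([float(c) for c in ship_counts]))
```

Once both series share the [0, 1] range and have one value per day, they can be drawn on the same axes even though their original units differ.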
Questions?
Find out more at http://geomesa.org
Connect with us on Gitter: https://gitter.im/locationtech/geomesa
See applications at CCRi’s blog: http://www.ccri.com/blog/

Rapid analytic development on near real time data

  • 1. Rapid Analytic Development on Near Real-Time Data Austin Heyne, CCRi aheyne@ccri.com
  • 2. The problem British Airways has cancelled all flights from Gatwick and Heathrow as computer problems cause disruption worldwide. [...] A spokeswoman for the airline said: "We have experienced a major IT system failure that is causing very severe disruption to our flight operations worldwide." The Independent; May 27, 2017 [...] a massive power outage Sunday afternoon [at Hartsfield-Jackson Atlanta International Airport] left planes and passengers stranded for hours, forced airlines to cancel more than 1,100 flights and created a logistical nightmare during the already-busy holiday travel season. The Atlanta Journal Constitution; December 18, 2017 Then there was the NotPetya cyberattack that hit Danish shipping giant Maersk in June. The crippling attack seized the industry’s attention after it cost the company $200 million to $300 million and led to a temporary shutdown of the largest cargo terminal in the Port of Los Angeles. Risk Management; March 1, 2018 Around half of the flights in Europe on Tuesday face delays after a computer failure at the Eurocontrol center in Brussels, Belgium. CNBC; April 3, 2018 Over 4,400 flights were canceled within, into or out of the United States on Wednesday, according to aviation data services company FlightAware. More than 700 flights have been canceled today, and airlines have canceled over 10,000 flights this month because of the weather. ABC; March 22, 2018 Ultra-large container ship CSCL JUPITER ran aground on Scheldt river bank at around 0700 LT Aug 14 at Bath, Zeeland, Netherlands, while proceeding downstream en route from Antwerp to Hamburg. [...] If hull of the giant ship is breached, dramatic situation may well turn into a nightmare. Maritime Bulletin; August 15, 2017
  • 3. Are there any data? Yes. Lots.
  • 5. How to handle big geo-temporal data? A suite of tools for persisting, querying, analyzing, and streaming spatio-temporal data at scale... and it can SQL!
  • 7. Interactive Analysis in Notebooks Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise. A long development cycle for an analytic saps energy and creativity. The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and Jupyter (formerly IPython Notebook).
  • 8. What's missing? We have... ● a problem ● big geo-time data ● big geo-time indexing ● interactive analyst notebook We still need... ● a place to bring these all together ● to do it cheaply ● to do it flexibly and in a scalable way
  • 9. Storage Options ● GPU RAM ● RAM ● Attached Disk (SSD, Platter) ● Cloud Blob Storage (S3, WASB) ● Cloud archive storage (Glacier) (ordered from faster / $$$ at the top to slower / $ at the bottom)
  • 10. Storage Options ● GPU RAM ● RAM ● Attached Disk (SSD, Platter) ● Cloud Blob Storage (S3, WASB) ● Cloud archive storage (Glacier) (ordered from faster / $$$ at the top to slower / $ at the bottom)
  • 11. Distributed Databases Originally: ● Databases like Accumulo and HBase used HDFS to store data. ● HDFS needs 3-5x storage for replication and extra space. ● Compute and disk scaled together!
  • 12. Distributed Databases Originally: ● Accumulo and HBase used HDFS to store data. ● HDFS needs 3-5x storage for replication and extra space. ● Compute and disk scaled together! Today: ● Accumulo and HBase can use Azure Blob storage and S3* ● Cloud storage scales ● Compute and disk scale separately! * Accumulo works on Azure Blob storage. Some modifications would be required to support S3. Consult your cloud professional for details, interactions, and additional suggestions.
  • 13. Distributed Databases Originally: ● Accumulo and HBase used HDFS to store data. ● HDFS needs 3-5x storage for replication and extra space. ● Compute and disk scaled together! Today: ● Accumulo and HBase can use Azure Blob storage and S3* ● Cloud storage scales ● Compute and disk scale separately! * Accumulo works on Azure Blob storage. Some modifications would be required to support S3. Consult your cloud professional for details, interactions, and additional suggestions. Take away: GeoMesa HBase on S3 works great! Others have used GeoMesa Accumulo on Azure... Try it today!
  • 14. “Ditching the Database?” ● Blobstores/Filesystems are key-value stores. ● Building a key-value store on top of a key-value store is kinda' redundant. ● GeoMesa can load and parse GDELT S3 files into Spark. ● GDELT on S3 is organized by date-keys. ● What would happen if we ‘ditched’ HBase and wrote files by space-time keys in cloud native storage? ● What parts of a ‘database’ do we really need to store/index non-updating, ascending temporal data streams?
  • 15. GeoMesa FileSystem Datastore ● Serverless architecture ○ Standalone files in S3 ○ Ephemeral compute for ingest and query ● Configurable partition schemes (geo + time) ● Parquet file format ○ Column-based storage (great for SQL!) ● Works great with Spark (intended for batch analysis) [Diagram: S3 persistent storage along a time axis, with ephemeral ingest and ephemeral Spark compute]
  • 16. Hurricane Harvey’s Effect on Fuel Prices Can oil tanker positions predict prices?
  • 17. Setup ● Import GeoMesa dependency ● Create dataframe backed by GeoMesa relation ● Create SQL temporary view so we can query it
  • 18. Subselect Data ● Create rough subselection of data ○ Bound by time ○ Bound by bounding box roughly around the Gulf of Mexico ● Create a new temporary view from this subselection ● Cache the data (pull into memory)
  • 19. Data Exploration ● Query for Tankers in the Gulf ● Get counts for each type of Tanker ● Group the counts by day ● Graph counts to see trends
  • 20. Data Exploration ● Restrict our search to just Trinity Bay
  • 21. Data Exploration ● Create a new temporary view of the number of ships in Trinity Bay
  • 22. Extra Data ● Pull in Gas price data ○ Acquired from EIA.gov ○ Two Gas Price Indexes ■ NYH: New York Harbor ■ GC: Gulf Coast ● Create temporary view so we can analyze with SQL
  • 23. Data Exploration ● Graph data over time period of Harvey ● Notice we don’t have daily values
  • 24. Data Exploration ● Create temporary view of gas price data around our time of interest
  • 25. Data Exploration ● Backfill the price data with the last value to give us day-continuous data ● Min/Max Normalize gas and ship counts ● Graph gas prices and ship counts together
  • 26. Questions? Find out more at http://geomesa.org Connect with us on Gitter: https://gitter.im/locationtech/geomesa See applications at CCRi’s blog: http://www.ccri.com/blog/
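
The backfill and min/max normalization steps on slide 25 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook code from the talk: the function names `backfill_daily` and `min_max_normalize` and all sample values are hypothetical.

```python
from datetime import date, timedelta


def backfill_daily(prices, start, end):
    """Return one value per day in [start, end], carrying the last seen price forward.

    `prices` maps a date to a price; days before the first observation get None.
    """
    filled = {}
    last = None
    day = start
    while day <= end:
        if day in prices:
            last = prices[day]  # a real observation replaces the carried value
        filled[day] = last
        day += timedelta(days=1)
    return filled


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data: prices are missing on the weekend.
gas_by_day = {date(2017, 8, 25): 160, date(2017, 8, 28): 170}
daily = backfill_daily(gas_by_day, date(2017, 8, 25), date(2017, 8, 29))
normalized_gas = min_max_normalize(list(daily.values()))
```

Normalizing both series to the same 0 to 1 range is what lets gas prices and ship counts share one graph even though their units differ.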

Editor's Notes

  1. introduction name and company Start and come back to this bin/geomesa-fs aws -c s3://optix-fsds/exactAIS -z -p gmirad -k CCRi --WorkerCount 10 --WorkerType m3.2xlarge --ec2-attributes AdditionalMasterSecurityGroups=sg-0e5adb7c bin/geomesa-fs aws -c s3://optix-fsds/exactAIS -z -p gmirad -k CCRi --WorkerCount 10 --WorkerType m3.2xlarge --ec2-attributes AdditionalMasterSecurityGroups=sg-0e5adb7c --no-upload
  2. logistics depending more on core systems computer systems products early in supply lines increased risk of cascading failure easy to find articles about, hard to observe as situations unfold identifying not merely the initial disruption knock-on effects quickly becomes critical... that's what the rest of this talk is about think less of "prediction" per se, and more of analytics: How can we identify what is happening as first- and second-order effects?
  3. To identify outages and effects need to identify data sources and feeds. The problem quickly becomes knowing how to USE all of the data that exist. use it quickly not spend lots of devops and developer time
  4. https://optix.earth/stealth AIS Explanation (Automatic Identification System) Requirement international voyaging ships with 300 or more gross tonnage any passenger ship additional regional requirements ship length Update rate booking it 2-3s docked or loitering 10-15 mins anything, self reported Satellite to visualization time 10-15s near real time about 10TB of data uncompressed How do we do this?
  5. GeoMesa has been to FOSS4G for years F/OSS that handles big, streaming, geo-time data. Uses space filling curves to index data utilities to handle spatial queries Most years we present "This is what it's going to do soon..." This year, we're going to talk about "This is what we've done with it already...", today so example of working with data with these tools showing example of rapid prototyping of large-scale detection and analysis.
  6. To make this prototyping more streamlined, GeoMesa integrates with Jupyter and Zeppelin, powered by standard (even cloudy) infrastructure: Hadoop; Spark; Kafka. Add a little GeoMesa, and suddenly your Jupyter and Zeppelin notebooks are geo-temporally aware / capable. This is where the real innovation, prototyping, and iteration will happen.
  7. without notebooks iteration cycles can be long and difficult visualization becomes a second step
  8. Austin: I added this slide merely to serve as a more natural transition from the introductory material to the "GeoMesa storage" slides we have next. Please use or ignore as you see fit. Thanks!
  9. There’s a predicament faster storage is more expensive about order of magnitude per tier We want a way to keep data in slow/cheap storage want to query it fast
  10. Traditional databases store data on local disks It would be great to keep data as low down on this list as possible
  11. Accumulo and HBase used HDFS to store data; HDFS has high overhead; need 3-5x extra storage space for replication; compute and disk scale together
  12. today HBase and Accumulo can be backed by blob storage; cloud storage handles replication; compute and disk are separate
  13. GeoMesa HBase works great on S3; we’ll show some of that today
  14. what would happen if we just do away with the datastore altogether? do we really need it? what does it gain us? caching; better range lookups; have to know how to partition data; database handles file splits a priori. GeoMesa can store and query files directly in S3. what do we gain from this?
  15. serverless architecture only stand up servers when you need them ingest query very cheap backend storage less devops works great for devops
  16. Backup cluster http://ec2-34-229-152-87.compute-1.amazonaws.com:8890/#/notebook/2DDY4K5JF
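
Note 5 mentions space-filling curves for indexing, and slide 14 asks what would happen if files were written by space-time keys. The idea can be sketched with a Z-order (Morton) curve. This is a simplified illustration under stated assumptions, not GeoMesa's actual implementation: the function names, the 2-bit resolution, and the `YYYY/MM/DD/cell` path layout are all hypothetical.

```python
from datetime import date


def interleave_bits(x, y, bits):
    """Build a Morton code: bits of x go to even positions, bits of y to odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z2_cell(lon, lat, bits=2):
    """Map a lon/lat pair to a cell number on a 2^bits x 2^bits grid of the globe."""
    cells = 1 << bits
    # Scale each coordinate to a grid index, clamping the upper edge into range.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return interleave_bits(x, y, bits)


def partition_path(day, lon, lat, bits=2):
    """Hypothetical space-time key: a date prefix followed by the spatial cell."""
    return f"{day.year:04d}/{day.month:02d}/{day.day:02d}/{z2_cell(lon, lat, bits)}"


# Two nearby points on the same day land in the same partition.
key_a = partition_path(date(2017, 8, 25), -95.0, 29.5)
key_b = partition_path(date(2017, 8, 25), -94.8, 29.7)
```

Because nearby points usually share a cell, a query bounded by time and bounding box only needs to read the files under a few key prefixes, which is the kind of pruning a partition scheme over geo + time provides.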