The Source of Truth for Physical Places
Felix Cheung, VP Eng
Large Scale Geospatial Indexing and Analysis on Apache Spark
About me
- VPE at SafeGraph
- ex-Uber - Data Platform teams
- Apache Software Foundation: Member, part of the PMC for Apache Spark, Apache Zeppelin, Apache Superset, Apache Incubator
- Mentor of Apache Sedona (incubating)
Agenda
- Intro to geospatial data
- Distributed processing
- Use cases
- Overall architecture
Geospatial
We power innovation through open access to geospatial data.
We believe data should be an open platform, not a trade secret.
SafeGraph is just a data company
- Fully remote
- Founded 2016
- Founders have deep experience with data and privacy
- Previous company was LiveRamp (NYSE: RAMP)
- Data Scientists, Data Engineers and Data Business Experts
Our Mission:
The Source of Truth for Physical Places
SafeGraph Patterns Provides a Powerful Window Into Consumer Behavior
● Accurate and aggregated foot-traffic data, derived from panel of MM anonymized devices
● 8+ MM Points-of-Interest
● Easy to use, download as CSVs
Please see the Places schema & summary statistics for a complete list of attributes and coverage.
SafeGraph Products:
The source of truth for physical places
- Core Places: available for 8+ MM POI
- Geometry: available for 8+ MM POI
- Patterns: available for ~4.5MM POI
Join on Placekey
Common Use Cases with SafeGraph Data
- Sectors: Retail & Real Estate; Marketing & Advertising; Mapping & GIS Software; Financial Services & Investment Research
- Use cases: Trade Area, Site Selection, Visit Attribution, Location-Based Ads, Geospatial Analytics, GIS Services, Private Equity Due Diligence, Public Equities
What is geospatial data?
- Geospatial describes data that represents features or
objects on the Earth's surface.
- Records in a dataset have locational information tied to them, such as coordinates, address, city, or postal code
- Often about what or who is where, e.g. demographics
Key challenges
- Earth’s surface area is 196.9 million mi²
- Computing “where is it” can be expensive
- Scaling such computation is a constant challenge
- Lack of truthset
- “The real world”
Processing
Common toolsets and frameworks
Common toolsets and frameworks - Limits
- Single machine
- New approaches:
- Parallel execution
- GPU acceleration
Apache Sedona (incubating) intro
- Started as GeoSpark in 2015 at Arizona State University
- A cluster computing system for processing large-scale spatial data, by extending Apache Spark
- Distributed execution
Apache Sedona (incubating) intro
- Core/RDD
- Spatial SQL - spatial query
- Complex geometries / trajectories
- Spatial Index
- Spatial Partitioning
- Coordinate Reference System
- High resolution map generation
Key advances
- Spatial SQL - spatial query
- Spatial Index
- Spatial Partitioning
- 2x-10x faster, with a 50% reduction in peak memory consumption, than other Spark-based geospatial systems
Spatial SQL
- Ease of Use
- Open standards: SQL/MM Part 3 (Spatial), OGC Simple Features for SQL
- Geometry data types: point, line, multiline, polygon…
- Relationships between geometry data types
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham'
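To make the predicate in the query above concrete, here is a minimal Python sketch of the point-in-polygon test that a call such as ST_Contains(polygon, point) conceptually performs. It uses the standard ray-casting method. The function name, the rectangular "Gotham" and all coordinates are made up for illustration, and this is not Sedona's implementation: a real engine also handles boundaries, holes and multi-part geometries.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means "inside"
    return inside


# Hypothetical data: a rectangular "city" and two "superheroes".
gotham = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
heroes = {"Batman": (1.0, 1.0), "Superman": (9.0, 1.0)}

in_gotham = [name for name, loc in heroes.items() if point_in_polygon(loc, gotham)]
print(in_gotham)  # ['Batman']
```

Running this test for every (city, superhero) pair is the naive join; the indexing and partitioning slides that follow are about avoiding that full pairwise scan.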
Spatial Query Optimization
- Range Query
- Join Query
- KNN
- KNN Join
- Optimized Spatial Join Strategy
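As a point of reference for the query types listed above, the following is a brute-force sketch of a KNN query and a range query in plain Python. It is an assumed baseline for illustration only: the place names and coordinates are hypothetical, and a system like Sedona would use spatial indexes and partitioning rather than scanning every record.

```python
import heapq
import math
from typing import List, Tuple

LatLon = Tuple[float, float]
Place = Tuple[str, LatLon]


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def knn(query: LatLon, places: List[Place], k: int) -> List[Place]:
    """KNN query: the k places closest to the query point (full scan)."""
    return heapq.nsmallest(k, places, key=lambda p: haversine_km(query, p[1]))


def range_query(query: LatLon, places: List[Place], radius_km: float) -> List[Place]:
    """Range query: every place within radius_km of the query point (full scan)."""
    return [p for p in places if haversine_km(query, p[1]) <= radius_km]


# Hypothetical places, given as (name, (lat, lon)).
places = [("A", (0.0, 0.1)), ("B", (0.0, 1.0)), ("C", (10.0, 10.0))]
nearest_two = [name for name, _ in knn((0.0, 0.0), places, k=2)]
within_50km = [name for name, _ in range_query((0.0, 0.0), places, radius_km=50.0)]
print(nearest_two, within_50km)
```

A KNN join repeats the KNN query for every record of a second dataset, which is why the cost of the full scan matters at scale.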
Data format
- Geospatial formats: WKT, WKB, GeoJSON, Shapefile, HDF…
- Geospatial geometries
POLYGON ((-97.019...
POINT (-88.331492 32.324142)
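The WKT strings above are plain text, so a small parser shows what they encode. This is a minimal sketch that handles only POINT and single-ring POLYGON; the function names and the example polygon are hypothetical, and real pipelines would normally rely on a geometry library instead.

```python
import re
from typing import List, Tuple

Coord = Tuple[float, float]
NUMBER = r"[-+0-9.eE]+"


def parse_wkt_point(wkt: str) -> Coord:
    """Parse 'POINT (x y)'. WKT order is x then y, which is usually lon then lat."""
    m = re.fullmatch(rf"\s*POINT\s*\(\s*({NUMBER})\s+({NUMBER})\s*\)\s*", wkt, flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(m.group(1)), float(m.group(2))


def parse_wkt_polygon(wkt: str) -> List[Coord]:
    """Parse a single-ring 'POLYGON ((x y, x y, ...))' into a list of coordinates."""
    m = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt, flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"not a single-ring WKT polygon: {wkt!r}")
    ring = []
    for pair in m.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    return ring


point = parse_wkt_point("POINT (-88.331492 32.324142)")
# Hypothetical polygon; the one on the slide is truncated.
ring = parse_wkt_polygon("POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))")
print(point, ring)
```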
Spatial Indexes
- R-Tree, Quad-Tree
https://en.wikipedia.org/wiki/R-tree
Spatial Indexes
- R-Tree, Quad-Tree
- Local performance in spatial range query, area 1% - 16%
Jia Yu, ApacheCon 2019
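To illustrate why a tree index helps a spatial range query, here is a small point quad-tree sketch in Python. It is an assumed teaching example, not Sedona's index: the class, its parameters (capacity, max_depth) and the test data are all hypothetical. The key idea is the pruning step, where whole subtrees whose box does not intersect the query window are skipped.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def box_contains(box: Box, p: Point) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class QuadTree:
    def __init__(self, box: Box, capacity: int = 4, depth: int = 0, max_depth: int = 12):
        self.box, self.capacity = box, capacity
        self.depth, self.max_depth = depth, max_depth
        self.points: List[Point] = []
        self.children: List["QuadTree"] = []

    def insert(self, p: Point) -> bool:
        if not box_contains(self.box, p):
            return False
        if not self.children:
            # Leaf: store the point unless the leaf is full and may still split.
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append(p)
                return True
            self._split()
        return any(child.insert(p) for child in self.children)

    def _split(self) -> None:
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [QuadTree(q, self.capacity, self.depth + 1, self.max_depth) for q in quadrants]
        for old in self.points:  # push existing points down into the new leaves
            any(child.insert(old) for child in self.children)
        self.points = []

    def range_query(self, window: Box) -> List[Point]:
        if not boxes_intersect(self.box, window):
            return []  # prune: nothing under this node can match
        found = [p for p in self.points if box_contains(window, p)]
        for child in self.children:
            found.extend(child.range_query(window))
        return found


# Hypothetical data: a 10 x 10 grid of points.
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
tree = QuadTree((0.0, 0.0, 10.0, 10.0))
for pt in grid:
    tree.insert(pt)

window = (2.0, 2.0, 4.0, 4.0)
hits = sorted(tree.range_query(window))
print(len(hits))
```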
Spatial Partitioning
- Partitioning - essential to distributed processing
- Strategy: by spatial proximity
- Step 1: random sample
- Step 2: build tree
- Step 3: leaf nodes -> global partitioning
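The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses median splits that alternate between x and y (a KD-tree-style stand-in for the tree types the talk lists), the sample size and max_per_leaf values are arbitrary, and all function names and data are hypothetical rather than Sedona's API.

```python
import random
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def build_leaves(sample: List[Point], box: Box, max_per_leaf: int, depth: int = 0) -> List[Box]:
    """Step 2: split the box at the sample median, alternating between x and y."""
    if len(sample) <= max_per_leaf:
        return [box]
    axis = depth % 2
    ordered = sorted(sample, key=lambda p: p[axis])
    split = ordered[len(ordered) // 2][axis]
    left = [p for p in ordered if p[axis] < split]
    right = [p for p in ordered if p[axis] >= split]
    if not left or not right:  # many identical coordinates: stop splitting
        return [box]
    x0, y0, x1, y1 = box
    if axis == 0:
        left_box, right_box = (x0, y0, split, y1), (split, y0, x1, y1)
    else:
        left_box, right_box = (x0, y0, x1, split), (x0, split, x1, y1)
    return (build_leaves(left, left_box, max_per_leaf, depth + 1)
            + build_leaves(right, right_box, max_per_leaf, depth + 1))


def partition_id(p: Point, leaves: List[Box]) -> int:
    """Step 3: each leaf box is one partition; a record goes to the first leaf containing it."""
    for i, (x0, y0, x1, y1) in enumerate(leaves):
        if x0 <= p[0] <= x1 and y0 <= p[1] <= y1:
            return i
    raise ValueError(f"{p} is outside the partitioned area")


rng = random.Random(42)
# Hypothetical skewed data: a dense cluster in one corner plus sparse background points.
data = [(rng.random() * 10, rng.random() * 10) for _ in range(1500)]
data += [(rng.random() * 100, rng.random() * 100) for _ in range(500)]

sample = rng.sample(data, 200)                                              # Step 1: random sample
leaves = build_leaves(sample, (0.0, 0.0, 100.0, 100.0), max_per_leaf=25)   # Step 2: build tree
sizes = Counter(partition_id(p, leaves) for p in data)                     # Step 3: assign records
print(len(leaves), sorted(sizes.values()))
```

Because the splits follow the sample, dense areas get many small partitions and sparse areas get a few large ones, which keeps nearby records together while evening out the work per partition.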
Spatial Partitioning
- Uniform grids, Quad-Tree, KDB-Tree, R-Tree, Voronoi diagram, Hilbert curve
Xie, Dong, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. "Simba: Efficient in-memory spatial analytics." In Proceedings of the 2016 International Conference on Management of Data, pp. 1071-1085. ACM, 2016.
Spatial Partitioning + Indexing
- Distributed hierarchical spatial indexing
- Global index (driver): same tree as in partitioning, bounding boxes
- Local index (executors)
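The two-level idea can be sketched on a single machine as follows. The global index is just the list of partition bounding boxes and is used to decide which partitions to touch; each partition then answers from its own local index. In this sketch the local index is a list sorted by x and searched with bisect, a deliberately simple stand-in for an R-tree or quad-tree, and the Partition class, function names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Partition:
    """Stand-in for the records and local index held by one executor."""

    def __init__(self, points: List[Point]):
        self.points = sorted(points)              # local index: points sorted by x
        self.xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        self.bbox: Box = (self.xs[0], min(ys), self.xs[-1], max(ys))

    def range_query(self, window: Box) -> List[Point]:
        lo = bisect_left(self.xs, window[0])      # skip points left of the window
        hi = bisect_right(self.xs, window[2])     # stop at points right of the window
        return [p for p in self.points[lo:hi] if window[1] <= p[1] <= window[3]]


def distributed_range_query(partitions: List[Partition], window: Box) -> Tuple[List[Point], int]:
    """Global step: prune by bounding box. Local step: query only the surviving partitions."""
    hits: List[Point] = []
    scanned = 0
    for part in partitions:
        if boxes_intersect(part.bbox, window):
            scanned += 1
            hits.extend(part.range_query(window))
    return hits, scanned


# Hypothetical data: four partitions, one per quadrant of a 100 x 100 area.
partitions = [
    Partition([(float(qx + i), float(qy + j)) for i in range(0, 50, 10) for j in range(0, 50, 10)])
    for qx in (0, 50) for qy in (0, 50)
]
hits, scanned = distributed_range_query(partitions, (5.0, 5.0, 25.0, 25.0))
print(sorted(hits), scanned)
```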
What is H3?
- Geospatial indexing system: a multi-precision hexagonal tiling of the sphere, indexed with hierarchical linear indexes
- Created at Uber, open-sourced
https://h3geo.org/
Why H3?
- Geospatial analysis can be done by bucketing locations
- Equidistant
- Traversal, neighboring, truncation
- Polyfill (region)
- Unidirectional edge
https://eng.uber.com/h3/
Why H3?
- Truncation
- h3ToParent
- kRing
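Truncation works because an H3 cell index is a 64-bit integer that stores the resolution plus one 3-bit digit per resolution level. The sketch below re-implements the parent lookup with plain bit operations, following H3's documented index layout (resolution in bits 52-55, unused digits set to 7). The Python function names are hypothetical; in practice you would call the library's own h3ToParent. kRing, the neighbor traversal, needs the hexagon grid geometry and is not shown.

```python
H3_RES_OFFSET = 52                  # the 4-bit resolution field starts at bit 52
H3_RES_MASK = 0xF << H3_RES_OFFSET
H3_MAX_RES = 15                     # finest resolution
H3_DIGIT_BITS = 3                   # one 3-bit digit per resolution level
H3_DIGIT_UNUSED = 0x7               # digits finer than the cell's resolution are all ones


def get_resolution(cell: int) -> int:
    return (cell & H3_RES_MASK) >> H3_RES_OFFSET


def to_parent(cell: int, parent_res: int) -> int:
    """Truncate a cell index to a coarser resolution."""
    child_res = get_resolution(cell)
    if not 0 <= parent_res <= child_res:
        raise ValueError("parent resolution must be between 0 and the cell's resolution")
    # Overwrite the resolution field with the coarser resolution.
    parent = (cell & ~H3_RES_MASK) | (parent_res << H3_RES_OFFSET)
    # Mark every digit finer than the parent resolution as unused.
    for res in range(parent_res + 1, child_res + 1):
        shift = (H3_MAX_RES - res) * H3_DIGIT_BITS
        parent |= H3_DIGIT_UNUSED << shift
    return parent


cell = int("8928308280fffff", 16)           # a resolution-9 cell
print(get_resolution(cell))                 # 9
print(format(to_parent(cell, 8), "x"))      # 8828308281fffff
```

Since the parent is derived from the child's bits alone, rolling fine-grained buckets up to a coarser resolution needs no geometry computation.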
H3 - basis of Placekey
- Universal identifier for physical places
- e.g. handles address mismatches
https://www.placekey.io/
Use cases
Use Case 1 - Visit Attribution
https://www.safegraph.com/visit-attribution
Use Case 1 - Visit Attribution
1. Clustering
2. Spatial Join
3. Prediction
Use Case 1 - Visit Attribution - Implementation
Spatial Join
Use Case 2 - Geometry Overlap
- Geometry processing - detect overlapping polygons
- Auto QA - automatic analysis at scale
- Analyzing geospatial distributions
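One common first step when detecting overlapping polygons is a cheap bounding-box filter that produces candidate pairs, which are then checked with an exact geometry test. The sketch below shows only that filter step; it is an assumption about how such a check could be structured, not SafeGraph's pipeline, and the footprint names and coordinates are made up.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bounding_box(ring: List[Point]) -> Box:
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the boxes share interior area; boxes that only touch do not count."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlap_candidates(polygons: Dict[str, List[Point]]) -> List[Tuple[str, str]]:
    """Pairs whose bounding boxes overlap. An exact polygon test should confirm each pair."""
    boxes = {name: bounding_box(ring) for name, ring in polygons.items()}
    return [(a, b) for a, b in combinations(sorted(boxes), 2) if boxes_overlap(boxes[a], boxes[b])]


# Hypothetical building footprints.
footprints = {
    "cafe": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "bank": [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)],
    "gym": [(5.0, 5.0), (6.0, 5.0), (6.0, 6.0), (5.0, 6.0)],
}
candidates = overlap_candidates(footprints)
print(candidates)
```

This version compares every pair; at scale the same filter would run as a spatial join over partitioned, indexed data so that only nearby polygons are ever compared.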
Architecture
Overall Architecture
- Training
- HITL (human-in-the-loop) annotation
- Auto QA
- HITL QA
SafeGraph Blog
We are hiring!
safegraph.com/careers
