Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)

#spatialanalytics
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Spatial Analysis

      Analytical techniques to determine the spatial
      distribution of a variable, the relationship between
      the spatial distribution of variables, and the
      association of the variables in an area.
Pattern Analysis
Spatial Analysis Types

     1. Spatial autocorrelation
     2. Spatial interpolation
     3. Spatial interaction
     4. Simulation and modeling
     5. Density mapping
Spatial Autocorrelation

      Spatial autocorrelation statistics measure and analyze
      the degree of dependency among observations in a
      geographic space.


      First law of geography: “everything is related to everything
      else, but near things are more related than distant things.”
        -- Waldo Tobler
Moran’s I - Random Variable: Moran’s I = .012
Moran’s I - Per Capita Income in Monroe County: Moran’s I = .66
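To make the statistic concrete, here is a minimal Python sketch of global Moran's I. The function names, the chain-shaped neighbourhood, and the sample values are illustrative assumptions, not data from the slides; real analyses would build the weights matrix from actual county adjacency.

```python
def morans_i(values, weights):
    """Global Moran's I for values[i] with spatial weights weights[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of areas.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n):
    """Hypothetical neighbourhood: areas on a line, each adjacent to the next."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


w = chain_weights(6)
clustered = [1, 1, 1, 5, 5, 5]      # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]    # neighbours always differ

print(morans_i(clustered, w))       # positive: spatial clustering
print(morans_i(alternating, w))     # negative: spatial dispersion
```

Values near zero, like the random variable on the slide, indicate no spatial pattern; values approaching 1 indicate that neighbouring areas tend to be alike.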
Spatial Interpolation

      Spatial interpolation methods estimate the variables
      at unobserved locations in geographic space based
      on the values at observed locations.
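The slides do not name a specific interpolation method. As one common illustration, the sketch below uses inverse distance weighting (IDW), in which nearby observations count more than distant ones. The function name `idw`, the `power` parameter, and the sample points are assumptions made for this example.

```python
import math


def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples is a list of (sx, sy, value) observations.
    """
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            # Exactly on an observed location: return the observation itself.
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical observations: (x, y, observed value)
samples = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]

print(idw(samples, 5.0, 0.0))   # halfway between the two observations
print(idw(samples, 1.0, 0.0))   # much closer to the first observation
```

Raising `power` makes the estimate follow the nearest observation more closely; lowering it smooths the surface.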
Natural Gas Demand in Response to the February 2003 Alberta Clipper Cold Front

Date                 Chicago    NYC       Henry
February 21, 2003    $14.00     $14.00    $7.55
February 24, 2003    $18.50     $30.00    $16.00
February 25, 2003    $20.00     $37.00    $22.00
Spatial Interaction

      Spatial interaction or “gravity models” estimate
      the flow of people, material, or information
      between locations in geographic space.
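A minimal sketch of the basic gravity model follows: the flow between two places grows with the product of their "masses" (population, supply, demand) and shrinks with the distance between them. The constant `k`, the exponent `beta`, and the input numbers are hypothetical; in practice both parameters are calibrated against observed flows.

```python
def gravity_flow(mass_origin, mass_dest, distance, k=1.0, beta=2.0):
    """Estimated flow between two locations under a simple gravity model.

    flow = k * mass_origin * mass_dest / distance ** beta
    """
    return k * mass_origin * mass_dest / distance ** beta


# Hypothetical masses and distances.
near = gravity_flow(100, 200, 10)
far = gravity_flow(100, 200, 20)

print(near)   # flow between two places 10 units apart
print(far)    # same places 20 units apart: a quarter of the flow when beta=2
```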
Global Oil Supply and Demand Gravity Model
Simulation and Modeling

      Simple interactions among proximal entities can
      lead to intricate, persistent, and functional spatial
      entities at aggregate levels (complex adaptive
      systems).
Spatial Interdependency Analysis of the San Francisco Failure Simulation

Infrastructure                 Total Number of Links   No. Links Congested   % Links Congested   % Volume Delay
Refined Products (National)    3,197                   1                     0.03%               0.05%
Refined Products (MSA)         8                       1                     12.50%              93%
Power Grid (Regional)          1,942                   4                     0%                  N/A
Power Grid (MSA)               16                      2                     13%                 N/A
Density Mapping

     Calculating the proximity and frequency of a
     spatial phenomenon by creating a probabilistic
     surface.
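One way to build such a surface is kernel density estimation over a raster grid. The sketch below assumes a Gaussian kernel, a unit cell size, and a handful of made-up points; the function name `kernel_density` and the `bandwidth` parameter are illustrative choices, not the presenters' implementation.

```python
import math


def kernel_density(points, width, height, bandwidth=1.0):
    """Return a height x width grid of Gaussian kernel density values.

    points is a list of (x, y) locations in grid coordinates.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for row in range(height):
        line = []
        for col in range(width):
            total = 0.0
            for px, py in points:
                squared_distance = (col - px) ** 2 + (row - py) ** 2
                total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
            line.append(norm * total)
        grid.append(line)
    return grid


# Hypothetical point locations: three close together, one isolated.
points = [(2, 2), (2, 3), (3, 2), (7, 7)]
surface = kernel_density(points, width=10, height=10)

# Crude text rendering: denser cells get darker characters.
peak = max(max(line) for line in surface)
for line in surface:
    print("".join(" .:*#"[min(4, int(5 * v / peak))] for v in line))
```

The cluster of three points produces a higher peak than the isolated point, which is what a color-mapped density layer shows.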
New York City Fiber Density Map
Standard GIS Architectures
Distributed Analytics

      Analysis tasks over disparate data sources are placed
      on a queue; agents running across distributed servers
      pick them up, and the results are collated and returned
      to the user as answers.
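The queue-and-agents pattern can be sketched on a single machine with Python's standard `queue` and `threading` modules. Threads stand in for the distributed servers here, and the task names and functions are hypothetical; this shows the shape of the idea, not the presenters' system.

```python
import queue
import threading


def run_agents(tasks, num_agents=3):
    """Run (name, func, arg) tasks on a pool of agent threads.

    Returns a dict mapping each task name to its result.
    """
    requests = queue.Queue()
    results = {}
    lock = threading.Lock()

    def agent():
        while True:
            item = requests.get()
            if item is None:          # sentinel: no more work for this agent
                return
            name, func, arg = item
            value = func(arg)
            with lock:                # collate results safely across agents
                results[name] = value

    workers = [threading.Thread(target=agent) for _ in range(num_agents)]
    for worker in workers:
        worker.start()
    for task in tasks:
        requests.put(task)
    for _ in workers:
        requests.put(None)            # one sentinel per agent
    for worker in workers:
        worker.join()
    return results


# Hypothetical analysis tasks queued by the user.
tasks = [
    ("count", len, [1, 2, 3]),
    ("total", sum, [1, 2, 3]),
    ("largest", max, [1, 2, 3]),
]
answers = run_agents(tasks)
print(answers)
```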
[Diagram: User, Request Queue, Agents, Distributed Servers, Disparate Data, Analysis]
(http://finder.geocommons.com/overlays/20148)

       1. Rasterize
       2. Kernel density calc
       3. Color map

[Diagram: User, Request Queue, Agent, Amazon EC2, Amazon S3]
Vector Density Mapping Demo
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB compressed/day
‣   CERN/LHC: 40 TB/day (15 PB/year!)
‣   And growth is accelerating
‣   Need multiple machines, horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant: handles replication, node failure, etc.
‣   MapReduce-based parallel computation (even harder to process a PB)
‣   Generic key-value based computation interface allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Yahoo! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output
‣   Challenge: how many tweets per county, given tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=county, value=1
‣   Shuffle: sort by county
‣   Reduce: for each county, sum
‣   Output: county, tweet count
‣   With 2x machines, runs close to 2x faster.
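The map, shuffle, and reduce steps above can be simulated in a few lines of single-process Python. This is only a sketch of the data flow: the tweet rows and county names are made up, and a real Hadoop job would run the map and reduce functions in parallel across machines.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: for each (row id, tweet) input pair, emit (county, 1)."""
    for row_id, tweet in rows:
        yield tweet["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key and sort by county."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Reduce: sum the ones for each county."""
    return [(county, sum(ones)) for county, ones in groups]


# Hypothetical tweets table: key=row, value=tweet info.
rows = [
    (0, {"county": "Santa Clara", "text": "hello"}),
    (1, {"county": "Monroe", "text": "where 2.0"}),
    (2, {"county": "Santa Clara", "text": "geo"}),
]

counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```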
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires compilation
‣   analytics in Eclipse? ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
    ‣   5% of the code, 5% of the time
    ‣   Within 50% of the execution time.
‣   Pig + Geo:

    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your EMR.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   Pete used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
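-- Keep tweets that have a location (assumes missing values appear as the literal string 'NULL').
tweets_with_location = FILTER tweets BY user_location != 'NULL';
-- Lowercase locations so that differently cased spellings fall into the same group.
normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) AS user_location;
grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
-- $0 is the group key (the location); $1 is the bag of records for that location.
location_counts = FOREACH grouped_tweets GENERATE $0 AS location, SIZE($1) AS user_count;
sorted_counts = ORDER location_counts BY user_count DESC;
-- Written to the path that the hadoop dfs -cat command reads from.
STORE sorted_counts INTO '/global_location_counts';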
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Pete to the rescue.
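To illustrate how much the counts shift once spellings are merged, here is a small Python sketch that folds aliases into a canonical name. The alias table is a hypothetical, hand-written example and is not necessarily the approach the presenters took; the counts are copied from the output above.

```python
from collections import Counter

# Hypothetical alias table mapping variant spellings to one canonical name.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_locations(counts):
    """Combine counts for locations that are aliases of the same place."""
    merged = Counter()
    for location, count in counts.items():
        key = location.strip().lower()
        merged[ALIASES.get(key, key)] += count
    return merged


# Counts taken from the Pig output above.
raw = {
    "brasil": 37985,
    "brazil": 22432,
    "new york": 13420,
    "nyc": 6456,
    "new york, ny": 6331,
}

merged = merge_locations(raw)
for location, count in merged.most_common():
    print(location, count)
```

A hand-maintained table like this does not scale to free-text locations in general, which is why the raw field needs more systematic cleanup.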
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Questions?   Follow us at
             twitter.com/peteskomoroch
             twitter.com/kevinweil
             twitter.com/seangorman

O'Reilly Where 2.0 Spatial Analytics Workshop

  • 1. Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics
  • 2. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 3. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 4. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 5. Spatial Analysis Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.
  • 7. Spatial Analysis Types 1. Spatial autocorrelation 2. Spatial interpolation 3. Spatial interaction 4. Simulation and modeling 5. Density mapping
  • 8. Spatial Autocorrelation Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space. First law of geography: “everything is related to everything else, but near things are more related than distant things.” -- Waldo Tobler
  • 9. Moran’s I - Per Capita Moran’s I - Random Variable Income in Monroe County Moran’s I = .012 Moran’s I = .66
  • 10. Spatial Interpolation Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
  • 11. $14.00 Chicago $14.00 NYC $7.55 Henry Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
  • 12. $18.50 Chicago $30.00 NYC $16.00 Henry Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
  • 13. $20.00 Chicago $37.00 NYC $22.00 Henry Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
  • 14. Spatial Interaction Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
  • 15. Introduction ‣ Motiviation ‣ Execution ‣ Prototype ‣ Service ‣ API ‣ Operations ‣ UX Global Oil Supply and Demand Gravity Model
  • 16. Simulation and Modeling Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).
  • 17. Spatial Interdependency Analysis of the San Francisco Failure Simulation Total Number of No. Links % Links %Volume Infrastructure Links Congested Congested Delay Refined Products (National) 3,197 1 0.03% 0.05% Refined Products (MSA) 12.50% 8 1 93% Power Grid (Regional) 1,942 4 0% N/A Power Grid (MSA) 16 2 13% N/A
  • 18. Density Mapping Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.
  • 19. New York City Fiber Density Map
  • 21. Distributed Analytics Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
  • 22. Disparate Data Distributed Servers Agents User Request Queue Analysis
  • 23. (http://finder.geocommons.com/overlays/20148) 1. Rasterize 2. Kernel density calc 3. Color map Agent Amazon EC2 User Request Queue Amazon S3
  • 25.
  • 26.
  • 27.
  • 28. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 29. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • 30. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • 31. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
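The map, shuffle, and reduce stages of the tweets-per-county challenge can be simulated on one machine in a few lines of Python. This is a sketch of the data flow only, not Hadoop's API; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: for each (row key, tweet info) input, emit (county, 1)."""
    for _row_key, tweet in rows:
        yield tweet["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key and sort by county."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: for each county, sum the emitted ones."""
    for county, ones in grouped:
        yield county, sum(ones)


# Hypothetical input rows: key=row id, value=tweet info.
tweets = [
    (0, {"county": "Monroe", "text": "geo tweet one"}),
    (1, {"county": "Kings", "text": "geo tweet two"}),
    (2, {"county": "Monroe", "text": "geo tweet three"}),
]
counts = list(reduce_phase(shuffle_phase(map_phase(tweets))))
print(counts)
```

Because each map call looks at one row and each reduce call looks at one county, the work can be split across machines, which is where the near-linear speedup on the slide comes from.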
  • 38. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: hard to manage ‣ Prototyping/exploration requires compilation ‣ Analytics in Eclipse? ur doin it wrong...
  • 39. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 40. Why Pig? ‣ Because I bet you can read the following script.
  • 41. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 43. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  • 44. A Real Example ‣ Fire up your EMR. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ Pete used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  • 46. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • 48. tweets_with_location = FILTER tweets BY user_location != 'NULL';
  • 49. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
  • 50. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  • 51. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
  • 52. sorted_counts = ORDER location_counts BY user_count DESC;
  • 53. STORE sorted_counts INTO '/global_location_counts';
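For readers without a Pig installation, the same filter, normalize, group, count, and sort flow can be approximated in plain Python. This is a single-machine sketch for following the logic, not a replacement for the Pig script; the sample rows are hypothetical and carry only the `user_location` field.

```python
from collections import Counter


def location_counts(rows):
    """Mirror the Pig steps: FILTER out 'NULL', LOWER, GROUP, count, ORDER DESC."""
    counts = Counter(
        row["user_location"].lower()      # FOREACH ... GENERATE LOWER(user_location)
        for row in rows
        if row["user_location"] != "NULL"  # FILTER ... BY user_location != 'NULL'
    )
    # most_common() returns (location, count) pairs sorted by count, descending.
    return counts.most_common()


# Hypothetical sample rows, for illustration only.
rows = [
    {"user_location": "Brasil"},
    {"user_location": "brasil"},
    {"user_location": "NULL"},
    {"user_location": "London"},
]
sorted_counts = location_counts(rows)
print(sorted_counts)
```

Lowercasing merges "Brasil" and "brasil" but leaves spelling variants such as "brazil" as separate keys.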
  • 54. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
        brasil                37985
        indonesia             33777
        brazil                22432
        london                17294
        usa                   14564
        são paulo             14238
        new york              13420
        tokyo                 10967
        singapore             10225
        rio de janeiro        10135
        los angeles            9934
        california             9386
        chicago                9155
        uk                     9095
        jakarta                9086
        germany                8741
        canada                 8201
                               7696
                               7121
        jakarta, indonesia     6480
        nyc                    6456
        new york, ny           6331
  • 55. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Pete to the rescue.
  • 56. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 72. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 74. Questions? Follow us at twitter.com/peteskomoroch twitter.com/kevinweil twitter.com/seangorman