DT Core Analytical Competencies
• Data Engineering
   • Data Architecture Design and Development
   • Large Scale Enterprise Architecture and Design
   • Migrate, Extract, Transform, and Load Data
   • Spatial, Multi-Domain, and Cloud-Based Data Services

• Analytics – Quantitative
   • Data Transformation and Ingestion
   • Dissemination and Reporting Tools
   • Data Mining, Exploitation, and Correlation Tools
   • Spatial Data Mining and Geographic Knowledge Discovery




                                 Data Tactics Corporation Proprietary and Confidential Material
DT Core Analytical Competencies
The Team:

Graduates of top-tier universities, including Stanford, Caltech, and MIT, with
ties to these and to local universities.

Degrees include Mathematics, Computer Science, Aeronautical Engineering,
Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and
Social Sciences.

Competencies include data mining, machine learning, statistics, spatial
statistics, Bayesian statistics, econometrics, computational geometry, spatial
econometrics, applied mathematics, theoretical robotics, dynamical systems, and
control theory.

Foci include unsupervised cross-modal clustering algorithms, principal
component analysis, independent component analysis, regression, spatial
regression, geographically weighted regression, zeroth-order processing,
nonlinear optimization, autoregressive models, time-series analysis, spatial
regime models, and HAC models.

Technical Competencies include
Data Tactics Analytics Cell




Analytics Competencies




• Time Series Analytics (i)
   • Applying the ARIMA model in a parallelized environment to provide anomaly detection
• Correlation Analytics (ii)
   • Brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
• Aggregation Analytics (iii)
   • Aggregate micro-pathing
      • Repurposing data to analyze and display movement patterns
   • Dwell time calculations
      • Analytic to discover areas of interest based on movement activity
• Graph Analytics (iv)
   • Discovering social interaction models and paradigms within network data

[Figure panels (i) to (iv). Panel (i) is a time-series plot with y-axis "ZeroFill" and x-axis "Index".]
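The dwell time calculation named on this slide is not specified further, so the following is only a minimal sketch of one way it could work. It assumes point records of the form (user id, epoch seconds, latitude, longitude), a square grid of 0.01 degrees, and a one-hour cap on the gap between consecutive points; the function names `cell_of` and `dwell_times` and all three parameters are hypothetical.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a grid cell (hypothetical 0.01-degree grid)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def dwell_times(points, cell_deg=0.01, max_gap=3600):
    """Sum the seconds users spend inside each grid cell.

    points: iterable of (user_id, epoch_seconds, lat, lon).
    Time between two consecutive points of the same user counts as dwell
    only if both points fall in the same cell and the gap is at most
    max_gap seconds (an assumed cutoff, not from the slides).
    """
    by_user = defaultdict(list)
    for user, t, lat, lon in points:
        by_user[user].append((t, cell_of(lat, lon, cell_deg)))

    totals = defaultdict(float)
    for track in by_user.values():
        track.sort()  # order each user's points by time
        for (t0, c0), (t1, c1) in zip(track, track[1:]):
            if c0 == c1 and t1 - t0 <= max_gap:
                totals[c0] += t1 - t0
    return dict(totals)
```

Cells with a large accumulated total are candidate areas of interest. Ranking the returned dictionary by value gives a simple first pass.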
Analytics Competencies
• Directional Spatio-Temporal Analytics (i)
  • Compare distributions with a focus on changes in the morphology of the
    distribution and the mobility of individual observations within the
    distribution over the same period of time and over space (Wy)
• Local Classification (ii)
   • Non-self-similarities and self-similarities; within-group and
     between-group correlations
• Ecological Analytics
  • Regression Modeling
     • Spatial Regression
     • Spatial Regime Models
     • HAC Models

[Figure panels (i) and (ii).]
Data Tactics Data Repository




Quantitative Data Competencies
• Proxy problems definition – Different problems lead to different questions, which lead to
  different data sets. The acceptability of a data source is judged against the definition of the proxy problems.
• Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted
  for collection. However, different proxy problems require different key dimensions.
• Capturing scope – The following was explicitly captured:
    • Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
    • Data timespan (if time is a dimension)
    • Data geospatial footprint (if geospatial is a dimension)
    • Data volume (both in total GB and in total number of rows)
    • Determining dataset overlap
• Capturing opinions – Current star ratings are based on:
    • Data consistency, volume, and persistence
    • Data coverage (time and space)
    • Data precision (time and space)
    • Data "genuineness" (synthesized data is penalized)
    • Data distribution (e.g., we may have extremely precise geospatial data, but if there are only 40
      unique geospatial points in the data, the geospatial aspects aren't that interesting)
    • Data dimensionality (higher dimensionality with reasonable distributions on each dimension is
      preferred)
Quantitative Data Holdings
[Figure: annotated screenshot of a data holdings record. Labeled fields:]
    • Name of the data source
    • Initial reviewer
    • Opinion of data quality
    • Description and notes on data source as well as collection information
    • Collection start / end dates, if known
    • Geospatial coverage
    • Source where data was acquired
    • Data handling requirements
    • Date that statistics were last collected on data
    • Location of data on FTP site
    • Data format
    • Size of data (storage space and rows)
Quantitative Data Holdings
Armed Conflict Location and Events Dataset (ACLED)                KDD 2003 Data
AIS Ship Data                                                     KDD 2005 Data
Atmospherics Reports                                              Kiva Data
BrightKite Data                                                   Landscan Data
Classified Ads                                                    LiveJournal Data
CNN                                                               Meme Tracker
Digital Terrain Elevation Data (DTED)                             Meme Twitter TS
Enron Data                                                        NFL Plays
Epinions Data                                                     Night Lights Data
EU Email                                                          Open Data Airtraffic accidents
Facebook                                                          Open Street Maps
Flickr Data                                                       Panoramio Data
Flight Information Data                                           Patent Citations Data
Four Square Data                                                  Photobucket Data
Friend Feed Data                                                  Picasa Web Albums Data
Geolife Data                                                      Processed Employment Data
Gowalla Data                                                      Scamper Data
International Conference on Weblogs and Social Media (ICWSM) Data ISVG
Identica Data                                                     Twitter
IMDB Data                                                         UNDP
Knowledge Discovery and Data Mining (KDD) Tools Competition       Weather Data
                                                                  Webgraphs
                                                                  YouTube
Quantitative Data Competencies
Panoramio / Flickr – Metadata on uploaded public photos provides excellent geospatial and
temporal resolution, along with user information. An estimated 250 million rows of photo metadata,
with over 150 million already gathered.
AIS – Ship tracking data that provides ship 'pings' as ships progress in movement. Precise time and
geospatial information provided. 50 million records and counting.
OpenStreetMaps – Over 2 billion geospatial points of mapping enthusiasts' tracks across the
world. Time and userid information also included.
Gowalla / Brightkite – About 11 million Foursquare-style check-ins with user, location, and
time information provided.

Example Proxy Problems:
   •   Discovering "holes" in the data where photos are no longer taken, to detect avoided areas
   •   Discovering relationships and links based on co-occurrence between users in time / space
   •   Tracking and analyzing movement patterns on a local and global scale
   •   Analyzing image data for changes in the same locations
   •   Detecting differences in photo activity in an area over time
   •   Detecting events based on abnormal photo activity behavior
   •   Mapping UserIds across data sources to create a unified analytic picture
   •   Detecting home range for each user
   •   Defining patterns of life by routine activities and movement
   •   Tracking language usage in areas to determine abnormal language presence in an area
   •   Local vs. tourist movement analysis and extraction
   •   Trending of location popularity

UNCLASSIFIED
Quantitative Data Competencies
Twitter – Sampled ongoing collection of social media tweets with UserId and time.
Some tweets also carry precise location data, but this is not the norm. Collection pulls
roughly 1 to 2 million tweets per day.

Example Proxy Problems:
   • Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain
     neighborhood)
   • Discovery of correlated trends (e.g., finding that people posting about a certain topic in an
     area correlates to higher crime in that area)
   • Tracking sentiment on certain topics and issues
   • Tracking language usage in areas to determine abnormal language presence in an area
Quantitative Data Competencies
• How can we infer movement patterns from vast amounts of what appears to
  be just point data collected in time and associated with an identifier (e.g.,
  UserId, bank account)?
• Technique is applicable to Twitter, Foursquare, and many other sources

[Figure: Volume plot of photos binned by area on log scale. Paris as seen from Flickr over all time.]
Quantitative Data Competencies
 1.    Goal: to catch active movement between locations a small distance apart
 2.    Typically two to around a dozen points chained together
 3.    Located in a small area, but with a definite path through the area
 4.    Sampled in rapid succession (less than X seconds between points)
 5.    Thousands or millions of micro-paths make a full path to view

[Figure: A micropath example. Photos taken at 2012-08-15 12:34:59, 12:35:11, and 12:35:25 chain into one
micropath. The segment to the next photo, taken at 12:37:25, is ignored (120 seconds between points).
Further photos follow at 12:37:35 and 12:37:46, and one segment is ignored because the velocity is too fast.
A second panel shows tracks for Person A, Person B, and Person C, with segments of 10 seconds and
3 seconds, and a common path pattern forming.]

Overlay thousands / millions of these tiny micropaths together and you get…
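The chaining rules above can be sketched in a few lines. This is an illustrative reading of the slide, not the original implementation: it assumes one time-sorted track per identifier, uses a 60-second timeout and an 80 km/hour velocity cutoff (the values quoted for the Paris view in this deck), and measures distance with the haversine formula. The names `haversine_km` and `micropaths` are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two coordinates."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def micropaths(track, timeout_s=60, max_kmh=80.0):
    """Split one identifier's track into micropaths.

    track: time-sorted list of (epoch_seconds, lat, lon).
    A segment is kept only if it spans at most timeout_s seconds and its
    implied speed is at most max_kmh; otherwise the chain is broken.
    Chains with fewer than two points are dropped.
    """
    paths, current = [], list(track[:1])
    for prev, point in zip(track, track[1:]):
        dt = point[0] - prev[0]
        keep = 0 < dt <= timeout_s
        if keep:
            km = haversine_km(prev[1], prev[2], point[1], point[2])
            keep = km / (dt / 3600.0) <= max_kmh
        if keep:
            current.append(point)
        else:
            if len(current) >= 2:
                paths.append(current)
            current = [point]
    if len(current) >= 2:
        paths.append(current)
    return paths
```

Running this per identifier and drawing every returned segment with low opacity is one way to produce the overlay effect described on the slide.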
Quantitative Data Competencies
   View of Paris using a 60-second segment timeout and 80 km/hour cutoff on Flickr data

[Figure: micro-path map of Paris. Labeled landmarks: Arc de Triomphe, Place de la Concorde, Louvre,
Eiffel Tower, Notre Dame. Annotations:]
   • Apparent typical approach pathway to the Arc
   • Place de la Concorde typically approached from southern direction
   • Harder to see, but you can see the typical approach / exit pathways from Notre Dame
   • Red strip appears to be line of sight to the Eiffel Tower
Quantitative Data Competencies
   Aggregate micro-pathing on a world of photo metadata with no speed, time, or distance restrictions

[Figure: world map of aggregated micro-paths.]
Quantitative Data Competencies
   AIS ship tracking micro-path blanket with no time / space filters

[Figure: map with labels: Japan's south coast; China's coast with high levels of activity; coast of Taiwan.]
Quantitative Data Competencies
[Figure panels: Flickr Paris 2004 changes vs 2005; Flickr Paris 2011 changes vs 2010.]

Hh [HIGH, high]: an increase between Xt1 -> Xt2 relative to the respective (Xt1, Xt2)
reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase
in one's own values (dxi) at location i between t1 and t2 relative to the change
in neighboring values (dy). high reflects a modest increase in dy relative to the
values of dx.

lL [low, LOW]: a decrease between Xt1 -> Xt2 relative to the respective
(Xt1, Xt2) reference distribution, where t1, t2 belong to T. low reflects a modest
decrease in one's own values (dxi) at location i between t1 and t2 relative to the
change in neighboring values (dy). LOW reflects a strong decrease in the
neighboring values of dx.

Neighbors are defined with the spatially lagged variable Wy, as the eight
nearest observations.
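The legend above leaves the exact thresholds unstated, so the sketch below is one possible reading rather than the method used for the maps. It assumes the first letter describes a location's own change (dx), the second describes the mean change of its eight nearest neighbors (the spatial lag, dy), and the upper-case letter marks whichever of the two changed more in magnitude. The names `k_nearest` and `directional_labels` are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    xi, yi = points[i]
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi))
    return others[:k]


def directional_labels(points, x_t1, x_t2, k=8):
    """Label each location with a two-letter code such as 'Hh' or 'lL'.

    points: list of (x, y) coordinates; x_t1, x_t2: values at times t1, t2.
    'h'/'l' mark an increase/decrease; upper case marks the larger of the
    own change and the neighborhood change (an assumed convention).
    """
    dx = [after - before for before, after in zip(x_t1, x_t2)]
    labels = []
    for i in range(len(points)):
        neighbors = k_nearest(points, i, k)
        dy = sum(dx[j] for j in neighbors) / len(neighbors)  # spatial lag of dx
        own = "h" if dx[i] >= 0 else "l"
        nbr = "h" if dy >= 0 else "l"
        if abs(dx[i]) >= abs(dy):
            own = own.upper()
        else:
            nbr = nbr.upper()
        labels.append(own + nbr)
    return labels
```

The neighbor search here is a brute-force scan, which is fine for a small illustration. A spatial index would be the natural replacement at the data volumes quoted earlier in the deck.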
Quantitative Data Competencies
Paris
[Figure: number of distinct photographers by day in year. Annotations: New Year provides lots of photos;
Bastille Day; recurrent red strips show the recurring weekend.]
Quantitative Data Competencies
Caracas
[Figure: number of distinct photographers by day in year. Annotations: 5-day Carnival celebration;
some interesting dates for low-volume activity.]
Image from www.flickr.com/photos/globovision/6911554143
Quantitative Data Competencies
Airline Flight Data Anomaly Detection
During an unusual event, such as the winter storm shown below, the ARIMA model still follows the
pattern but doesn't match as well. The areas where the red and black don't match are
where unusual events have occurred.

[Figure: two time-series panels, y-axis "ZeroFill" (0 to 40), x-axis "Index", with a tick label at 02-13.]

[Figure: Plot of the count of points where the difference between the expected number of
flights leaving an airport, based on the model, and the actual observed number of flights was
statistically significant.]
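The idea of flagging points where the model and the observations disagree can be illustrated with a much simpler stand-in. The sketch below is not ARIMA: it fits a first-order autoregressive model, x[t] = c + phi * x[t-1], by least squares and flags points whose one-step-ahead residual exceeds a multiple of the residual standard deviation. The names `fit_ar1` and `ar1_anomalies` and the threshold of 3 are assumptions made for illustration.

```python
import math


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx if sxx else 0.0
    return mean_curr - phi * mean_prev, phi


def ar1_anomalies(series, threshold=3.0):
    """Indices whose residual is larger than threshold * residual std."""
    c, phi = fit_ar1(series)
    residuals = [series[t] - (c + phi * series[t - 1]) for t in range(1, len(series))]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    if sigma == 0:
        return []  # perfectly predictable series, nothing to flag
    return [t for t, r in enumerate(residuals, start=1) if abs(r) > threshold * sigma]
```

Because each key (for example, each airport) is fitted independently, a function like `ar1_anomalies` can be mapped over keys in parallel. A production version would swap in a real ARIMA fit from a statistics library.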
Quantitative Data Competencies
 Raw data file:
 Each line is a comma-separated list of values.

 key1, timestamp, value
 key2, timestamp, value
 …

 Cloud-backed transformation

 Vector file:
 Each line has a key and a comma-separated list of values.

 Key1 2.4,3.4,0.99,…
 Key2 3.4,4.3,1.0,0.6….
 …..

 Correlation analytic
 For each vector calculate the correlation to each other vector. We use a Pearson
 correlation.

           Key1     Key2     Key3     Key4
 Key1       -       0.93     0.43     0.001
 Key2       -        -       -0.5     -0.03
 Key3       -        -        -       0.32
 Key4       -        -        -        -

 Implemented in:
 • Python (RAM)
 • Hive
 • Mahout
 • Spark
 • Giraph
 • Cascalog
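As an in-memory illustration of the two steps on this slide, the sketch below groups raw "key, timestamp, value" lines into one vector per key and then computes the Pearson correlation for every pair of keys. It assumes timestamps sort correctly as strings and that all vectors end up the same length and aligned in time; the function names are hypothetical and this is not the deck's own code.

```python
import math
from collections import defaultdict
from itertools import combinations


def to_vectors(lines):
    """Group 'key, timestamp, value' lines into a time-ordered vector per key."""
    rows = defaultdict(list)
    for line in lines:
        key, timestamp, value = (field.strip() for field in line.split(","))
        rows[key].append((timestamp, float(value)))
    return {key: [v for _, v in sorted(pairs)] for key, pairs in rows.items()}


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def pairwise_correlations(vectors):
    """Upper triangle of the correlation matrix, keyed by (key1, key2)."""
    return {
        (k1, k2): pearson(vectors[k1], vectors[k2])
        for k1, k2 in combinations(sorted(vectors), 2)
    }
```

The number of pairs grows as n(n-1)/2, which is why the brute-force version is a natural candidate for a cluster engine once the key count gets large.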
Quantitative Data Competencies
   Approximation engine for the O(n²) correlation matrix problem

   Technique based on Google Correlate

[Diagram: Training Engine and Test Engine, running on Spark.]

  Approximation provides orders of magnitude of speedup when compared to
  equivalent brute-force methods. The technique works best for highly
  correlated items and uses a series of data projections, unsupervised
  learning, and vector quantization to provide dimensionality reduction for
  incoming complex vectors.
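The slide does not give the details of the approximation engine, so the following covers only the projection idea, under stated assumptions. Each vector is centered and scaled to unit length, so that squared distance equals 2 - 2r for correlation r, and then mapped to fewer dimensions with a random Gaussian projection. The learning and vector quantization steps are omitted. The function names, the output dimension, and the seed are all hypothetical.

```python
import math
import random


def standardize(v):
    """Center v and scale it to unit length (zero vector if v is constant)."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm else centered


def make_projection(dim_in, dim_out, seed=0):
    """Random Gaussian projection matrix with dim_out rows."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)] for _ in range(dim_out)]


def project(matrix, v):
    """Standardize v, then apply the projection."""
    z = standardize(v)
    return [sum(w * x for w, x in zip(row, z)) for row in matrix]


def nearest_key(matrix, query, candidates):
    """Key of the candidate closest to the query in the projected space."""
    q = project(matrix, query)

    def squared_distance(key):
        p = project(matrix, candidates[key])
        return sum((a - b) ** 2 for a, b in zip(q, p))

    return min(candidates, key=squared_distance)
```

In the projected space a small distance suggests a high correlation, which matches the slide's remark that the technique works best for highly correlated items. Candidates found this way would normally be re-checked with an exact Pearson correlation.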

Horizontal Integration of Big Intelligence Data
 
Bill Ontology Summit (08 feb 1400hrs) v2
Bill Ontology Summit (08 feb 1400hrs) v2Bill Ontology Summit (08 feb 1400hrs) v2
Bill Ontology Summit (08 feb 1400hrs) v2
 
Data Tactics dhs introduction to cloud technologies wtc
Data Tactics dhs introduction to cloud technologies wtcData Tactics dhs introduction to cloud technologies wtc
Data Tactics dhs introduction to cloud technologies wtc
 
Multi Discipline Intelligence Production Teams 1
Multi Discipline Intelligence Production Teams 1Multi Discipline Intelligence Production Teams 1
Multi Discipline Intelligence Production Teams 1
 
Data Tactics and Nervve Integrated Big Data v3
Data Tactics and Nervve Integrated Big Data v3Data Tactics and Nervve Integrated Big Data v3
Data Tactics and Nervve Integrated Big Data v3
 
Data Tactics Open Source Brief
Data Tactics Open Source BriefData Tactics Open Source Brief
Data Tactics Open Source Brief
 

Capabilities Brief Analytics

  • 1. DT Core Analytical Competencies
      Data Engineering
      ⁻ Data Architecture Design and Development
      ⁻ Large Scale Enterprise Architecture and Design
      ⁻ Migrate, Extract, Transform, and Load Data
      ⁻ Spatial, Multi-Domain, and Cloud-Based Data Services
      Analytics – Quantitative
      ⁻ Data Transformation and Ingestion
      ⁻ Dissemination and Reporting Tools
      ⁻ Data Mining, Exploitation, and Correlation Tools
      ⁻ Spatial Data Mining and Geographic Knowledge Discovery
      Data Tactics Corporation Proprietary and Confidential Material
  • 2. DT Core Analytical Competencies
      The Team:
      Graduates of top-tier universities, including Stanford, Caltech, and MIT, with ties to these and to local universities.
      Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and Social Sciences.
      Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamic systems, and control theory.
      Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models, and HAC models.
      Technical Competencies include
  • 3. Data Tactics Analytics Cell
  • 4. Analytics Competencies
      • Time Series Analytics (i)
          • Applying the ARIMA model in a parallelized environment to provide anomaly detection
      • Correlation Analytics (ii)
          • Brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
      • Aggregation Analytics (iii)
          • Aggregate micro-pathing
          • Repurposing data to analyze and display movement patterns
          • Dwell time calculations
          • Analytic to discover areas of interest based on movement activity
      • Graph Analytics (iv)
          • Discovering social interaction models and paradigms within network data
      [Figure panels (i) to (iv); panel (i) is a time-series plot with axes labeled "ZeroFill" and "Index"]
  • 5. Analytics Competencies
      • Directional Spatio-Temporal Analytics (i)
          • Compare distributions, with a focus on changes in the morphology of the distribution and on the mobility of individual observations within the distribution, over the same period of time and over space (Wy)
      • Local Classification (ii)
          • Non-self-similarities and self-similarities; within-group and between-group correlations
      • Ecological Analytics (ii)
          • Regression Modeling
          • Spatial Regression
          • Spatial Regime Models
          • HAC Models
      [Figure panels (i) and (ii)]
  • 6. Data Tactics Data Repository
  • 7. Quantitative Data Competencies
      • Proxy problems definition – Different problems lead to different questions, which lead to different data sets. The acceptability of a data source is judged against the definition of the proxy problems.
      • Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted for collection. However, different proxy problems require different key dimensions.
      • Capturing scope – The following was explicitly captured:
          • Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
          • Data timespan (if time is a dimension)
          • Data geospatial footprint (if geospatial is a dimension)
          • Data volume (both in total GB and in total number of rows)
      • Determining dataset overlap
      • Capturing opinions – Current star ratings are based on:
          • Data consistency, volume, and persistence
          • Data coverage (time and space)
          • Data precision (time and space)
          • Data "genuineness" (synthesized data is penalized)
          • Data distribution (e.g., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
          • Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
  • 8. Quantitative Data Holdings
      Fields recorded for each data source:
      • Name of the Data Source
      • Date that statistics were last collected
      • Initial reviewer on data
      • Location of data on FTP site
      • Data format
      • Opinion of Data Quality
      • Source where data was acquired
      • Collection start / end dates – if known
      • Description and notes on data source as well as collection information
      • Size of Data (storage space and rows)
      • Geospatial coverage
      • Data handling requirements
  • 9. Quantitative Data Holdings
      Armed Conflict Location and Events Dataset (ACLED); AIS Ship Data; Atmospherics Reports; BrightKite Data; Classified Ads; CNN; Digital Terrain Elevation Data (DTED); Enron Data; Epinions Data; EU Email; Facebook; Flickr Data; Flight Information Data; Four Square Data; Friend Feed Data; Geolife Data; Gowalla Data; International Conference on Weblogs and Social Media (ICWSM) Data; Identica Data; IMDB Data; Knowledge Discovery and Data (KDD) Mining Tools Competition
      KDD 2003 Data; KDD 2005 Data; Kiva Data; Landscan Data; LiveJournal Data; Meme Tracker; Meme Twitter TS; NFL Plays; Night Lights Data; Open Data Airtraffic accidents; Open Street Maps; Panoramio Data; Patent Citations Data; Photobucket Data; Picasa Web Albums Data; Processed Employment Data; Scamper Data; ISVG; Twitter; UNDP; Weather Data; Webgraphs; Youtube
  • 10. Quantitative Data Competencies
      Panoramio / Flickr – Metadata on uploaded public photos provides excellent geospatial and temporal resolution, along with user information. Estimated 250 million rows of photo metadata, with over 150 million already gathered.
      AIS – Ship tracking data that provides ship 'pings' as the ships progress in movement. Precise time and geospatial information provided. 50 million records and counting.
      OpenStreetMaps – Over 2 billion geospatial points of mapping enthusiasts' tracks across the world. Time and userid information also included.
      Gowalla / Brightkite – About 11 million FourSquare-style check-ins with user, location, and time information provided.
      Example Proxy Problems:
      • Discovering "holes" in the data where photos are no longer taken, to detect avoided areas
      • Discovering relationships and links based on co-occurrence between users in time / space
      • Tracking and analyzing movement patterns on a local and global scale
      • Analyzing image data for changes in the same locations
      • Detecting differences in photo activity in an area over time
      • Detecting events based on abnormal photo activity behavior
      • Mapping UserIds across data sources to create a unified analytic picture
      • Detecting home range for each user
      • Defining patterns of life by routine activities and movement
      • Tracking language usage in areas to determine abnormal language presence in an area
      • Local vs tourist movement analysis and extraction
      • Trending of location popularity
      UNCLASSIFIED
  • 11. Quantitative Data Competencies
      Twitter – Sampled ongoing collection of social media tweets with UserId and time. Some tweets also carry precise location data, but this is not the norm. Collection pulls roughly 1-2 million tweets per day.
      Example Proxy Problems:
      • Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain neighborhood)
      • Discovery of correlated trends (e.g., finding that people posting about a certain topic in an area correlates to higher crime in that area)
      • Tracking sentiment on certain topics and issues
      • Tracking language usage in areas to determine abnormal language presence in an area
  • 12. Quantitative Data Competencies
      • How can we infer movement patterns from vast amounts of what appears to be just point data collected in time and associated with an identifier (e.g., UserId, bank account)?
      • Technique is applicable to Twitter, FourSquare, and many other sources
      [Figure: Paris as seen from Flickr over all time; volume plot of photos binned by area on a log scale]
  • 13. Quantitative Data Competencies
      1. Goal: to catch active movement between locations a small distance apart
      2. Typically two to around a dozen points chained together
      3. Located in a small area, but with a definite path through the area
      4. Sampled in rapid succession (less than X seconds between points)
      5. Thousands or millions of micro-paths make a full path to view
      [Figure: micropath example. Photos taken at 2012-08-15 12:34:59, 12:35:11, 12:35:25, 12:37:25, 12:37:35, and 12:37:46. One segment is ignored because there are 120 seconds between points; another is ignored because the velocity is too fast. Other labels: Person A, Person B, Person C; 10 seconds; 3 seconds; a common path pattern forming.]
      Overlay thousands / millions of these tiny micropaths together and you get…
  • 14. Quantitative Data Competencies
      View of Paris using a 60 second segment timeout and 80 km/hour cutoff on Flickr data
      [Figure annotations:]
      • Arc de Triomphe: apparent typical approach pathway to the Arc
      • Place de la Concorde: typically approached from the southern direction
      • Louvre
      • Notre Dame: harder to see, but you can see the typical approach / exit pathways from Notre Dame
      • Eiffel Tower: red strip appears to be line of sight to the Eiffel Tower
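The micro-path rules on the two slides above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Data Tactics implementation: the function names, the `(datetime, lat, lon)` record layout, and the sample coordinates are hypothetical. Only the 60 second timeout, the 80 km/hour cutoff, and the timestamps come from the slides.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def micro_paths(points, timeout_s=60, max_kmh=80):
    """Chain one identifier's points into micro-paths.

    points -- list of (datetime, lat, lon) tuples for a single user id.
    A segment is ignored, and the chain is broken, when the time gap exceeds
    timeout_s or the implied speed exceeds max_kmh. Zero time gaps are
    treated as too fast.
    """
    points = sorted(points)
    paths, current = [], []
    for prev, cur in zip(points, points[1:]):
        dt = (cur[0] - prev[0]).total_seconds()
        dist = haversine_km(prev[1], prev[2], cur[1], cur[2])
        speed = dist / (dt / 3600) if dt > 0 else math.inf
        if dt <= timeout_s and speed <= max_kmh:
            if not current:
                current = [prev]
            current.append(cur)
        else:
            if current:
                paths.append(current)
            current = []
    if current:
        paths.append(current)
    return paths

def at(clock):
    """Helper: timestamp on 2012-08-15, the date used in the slide's figure."""
    return datetime.strptime("2012-08-15 " + clock, "%Y-%m-%d %H:%M:%S")

# Timestamps from the slide; coordinates are made up for illustration.
track = [
    (at("12:34:59"), 48.8738, 2.2950),
    (at("12:35:11"), 48.8739, 2.2951),
    (at("12:35:25"), 48.8740, 2.2952),
    (at("12:37:25"), 48.8741, 2.2953),  # 120 s after the previous point
    (at("12:37:35"), 48.8742, 2.2954),
    (at("12:37:46"), 48.8743, 2.2955),
]
paths = micro_paths(track)
```

With these inputs the 120 second gap splits the track into two micro-paths of three points each. Overlaying the segments from many users is what would produce the aggregate view.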
  • 15. Quantitative Data Competencies
      Aggregate micro-pathing on a world of photo metadata with no speed, time, or distance restrictions
  • 16. Quantitative Data Competencies
      AIS ship tracking micro-path blanket with no time / space filters
      [Figure annotations: Japan's south coast; China's coast with high levels of activity; coast of Taiwan]
  • 17. Quantitative Data Competencies
      [Figure panels: Flickr Paris 2004 changes vs 2005; Flickr Paris 2011 changes vs 2010]
      Hh: [HIGH, high]: an increase between X_t1 -> X_t2 relative to the respective (X_t1, X_t2) reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase of one's own values (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). high reflects a modest increase of dy relative to values of dx.
      lL: [low, LOW]: a decrease between X_t1 -> X_t2 relative to the respective (X_t1, X_t2) reference distribution, where t1, t2 belong to T. low reflects a modest decrease of one's own values (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). LOW reflects a strong decrease of the neighboring values (dy) relative to dx.
      In both cases, neighbors are defined with the spatially lagged variable Wy, as the eight nearest observations.
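One way to read the Hh / lL definitions above is as a comparison of each location's own change with the change in its spatial lag. The sketch below is a simplified interpretation, not the analytic behind the slides: it works on raw values rather than a reference distribution, and the labels `hH`, `Ll`, and `mixed`, the function names, and the toy grid are all assumptions added for illustration.

```python
import math

def spatial_lag(coords, values, k=8):
    """Mean value of each location's k nearest neighbours (needs >= 2 locations)."""
    lag = []
    for i, ci in enumerate(coords):
        nearest = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(ci, coords[j]),
        )[:k]
        lag.append(sum(values[j] for j in nearest) / len(nearest))
    return lag

def classify_changes(coords, x_t1, x_t2, k=8):
    """Label each location by own change (dx) versus neighbours' change (dy)."""
    wy1 = spatial_lag(coords, x_t1, k)
    wy2 = spatial_lag(coords, x_t2, k)
    labels = []
    for x1, x2, l1, l2 in zip(x_t1, x_t2, wy1, wy2):
        dx, dy = x2 - x1, l2 - l1
        if dx > 0 and dy > 0:
            # Own increase at least as strong as the neighbours': HIGH, high.
            labels.append("Hh" if dx >= dy else "hH")
        elif dx < 0 and dy < 0:
            # Own decrease no stronger than the neighbours': low, LOW.
            labels.append("lL" if dx >= dy else "Ll")
        else:
            labels.append("mixed")
    return labels

# Toy 3x3 grid; index 4 is the centre cell and has all 8 others as neighbours.
coords = [(x, y) for y in range(3) for x in range(3)]
before = [1.0] * 9
after_up = [2.0] * 9
after_up[4] = 6.0          # centre grows much faster than its neighbours
after_down = [-4.0] * 9
after_down[4] = 0.0        # centre shrinks less than its neighbours

up = classify_changes(coords, before, after_up)
down = classify_changes(coords, before, after_down)
```

Here the centre cell is labeled `Hh` in the first comparison and `lL` in the second, matching the two cases defined on the slide.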
  • 18. Quantitative Data Competencies
      [Figure: Paris, number of distinct photographers by day in year. Annotations: New Year provides lots of photos; Bastille Day; recurrent red strips show the recurring weekend.]
  • 19. Quantitative Data Competencies
      [Figure: Caracas, number of distinct photographers by day in year. Annotations: 5 day Carnival celebration; some interesting dates for low volume activity.]
      Image from www.flickr.com/photos/globovision/6911554143
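The quantity plotted on the two slides above, the number of distinct photographers per day in year, is a simple aggregation over photo metadata. A minimal sketch, assuming each record is a hypothetical `(user_id, datetime)` pair:

```python
from collections import defaultdict
from datetime import datetime

def distinct_users_by_day(records):
    """Map day-of-year (1..366) to the number of distinct user ids seen that day.

    records -- iterable of (user_id, datetime) pairs
    """
    users = defaultdict(set)
    for user_id, taken in records:
        users[taken.timetuple().tm_yday].add(user_id)
    return {day: len(ids) for day, ids in sorted(users.items())}

# Made-up sample records for illustration.
records = [
    ("u1", datetime(2011, 1, 1, 0, 5)),
    ("u1", datetime(2011, 1, 1, 0, 9)),    # same user, same day: counted once
    ("u2", datetime(2011, 1, 1, 1, 30)),
    ("u3", datetime(2011, 7, 14, 22, 0)),  # Bastille Day, day 195 of 2011
]
counts = distinct_users_by_day(records)
```

Counting distinct users rather than photos keeps one prolific photographer from dominating a day's total.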
  • 20. Quantitative Data Competencies
      Airline Flight Data Anomaly Detection
      During an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black don't match are where unusual events have occurred.
      [Figure: two time-series panels, y-axis "ZeroFill" (0 to 40), x-axis "Index" (02-13)]
      [Figure: plot of the count of points where the difference between the number of flights expected to leave an airport under the model and the number actually observed was statistically significant]
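The idea of flagging points where the model and the observations disagree can be illustrated with a deliberately small stand-in. The sketch below is not a full ARIMA fit: it swaps in ARIMA(1,1,0) without a constant, estimated by least squares on the first differences, and flags one-step forecast errors larger than three standard deviations. The function name, the threshold, and the synthetic flight counts are assumptions; a production version would more likely use a statistics library and a seasonal model.

```python
import statistics

def arima110_anomalies(series, threshold=3.0):
    """Return indices whose one-step forecast error is unusually large.

    Needs at least three observations. Differences the series once, fits the
    AR(1) coefficient on the differences by least squares, then compares each
    observation with its one-step-ahead forecast.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    num = sum(d1 * d0 for d0, d1 in zip(diffs, diffs[1:]))
    den = sum(d0 * d0 for d0 in diffs[:-1])
    phi = num / den if den else 0.0
    residuals = {}
    for t in range(2, len(series)):
        # diffs[t - 2] is the change observed between t-2 and t-1.
        forecast = series[t - 1] + phi * diffs[t - 2]
        residuals[t] = series[t] - forecast
    sigma = statistics.pstdev(list(residuals.values()))
    return [t for t, r in residuals.items() if sigma and abs(r) > threshold * sigma]

# Synthetic departures per interval with a repeating pattern (made-up data).
flights = [[38, 40, 42, 40][i % 4] for i in range(60)]
flights[30] = 0  # hypothetical storm interval with no departures
anomalies = arima110_anomalies(flights)
```

The storm interval at index 30 is flagged. Because the plain standard deviation is inflated by the outlier itself, a robust scale estimate would be a reasonable refinement.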
  • 21. Quantitative Data Competencies
      Raw data file: each line is a comma-separated list of values.
          key1, timestamp, value
          key2, timestamp, value
          …
      Cloud-backed transformation
      Vector file: each line has a key and a comma-separated list of values.
          Key1 2.4,3.4,0.99,…
          Key2 3.4,4.3,1.0,0.6….
          …
      Correlation analytic: for each vector, calculate the correlation to each other vector. We use a Pearson correlation.
                  Key1    Key2    Key3    Key4
          Key1    -       0.93    0.43    0.001
          Key2    -       -       -0.5    -0.03
          Key3    -       -       -       .32
          Key4    -       -       -       -
      Implemented in:
      • Python (RAM)
      • Hive
      • Mahout
      • Spark
      • Giraph
      • Cascalog
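The in-memory Python variant of the correlation analytic could look roughly like the sketch below. It assumes the vector file separates the key from the comma-separated values with whitespace, as the sample lines on the slide suggest; the function names and the sample vectors are illustrative only.

```python
import math
from itertools import combinations

def parse_vector_file(lines):
    """Parse 'key v1,v2,...' lines into {key: [floats]}."""
    vectors = {}
    for line in lines:
        key, values = line.split(maxsplit=1)
        vectors[key] = [float(v) for v in values.split(",")]
    return vectors

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (NaN if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")

def pairwise_correlations(vectors):
    """Upper triangle of the correlation matrix: n * (n - 1) / 2 pairs."""
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }

# Made-up vector file contents for illustration.
lines = ["Key1 1,2,3,4", "Key2 2,4,6,8", "Key3 4,3,2,1"]
matrix = pairwise_correlations(parse_vector_file(lines))
```

Only the upper triangle is computed because the matrix is symmetric, which mirrors the dashes in the table above. The pair count still grows quadratically with the number of keys.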
  • 22. Quantitative Data Competencies
      Approximation engine for the O(n²) correlation matrix problem
      Technique based on Google Correlate
      [Diagram labels: Training Engine; Test Engine; Spark]
      Approximation provides orders of magnitude of speedup when compared to equivalent brute force methods. The technique works best for highly correlated items and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.
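The combination of projections, unsupervised learning, and vector quantization described above can be illustrated with a toy index. This is a loose sketch of the general idea only, not Google Correlate's algorithm or the Data Tactics engine: it uses a random Gaussian projection and plain k-means, and every name, parameter, and sample vector is an assumption. A query is compared exactly only against vectors that fall in the same quantization bucket.

```python
import math
import random

def unit_normalise(v):
    """Centre and scale to unit length, so Pearson r becomes a dot product."""
    mean = sum(v) / len(v)
    centred = [a - mean for a in v]
    norm = math.sqrt(sum(a * a for a in centred)) or 1.0
    return [a / norm for a in centred]

def project(v, basis):
    """Reduce dimensionality by projecting onto each basis row."""
    return [sum(a * b for a, b in zip(v, row)) for row in basis]

def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

def build_index(vectors, dims=4, clusters=2, iters=10, seed=0):
    """'Training': project every vector, then quantize with k-means."""
    rng = random.Random(seed)
    length = len(next(iter(vectors.values())))
    basis = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(dims)]
    units = {k: unit_normalise(v) for k, v in vectors.items()}
    projected = {k: project(u, basis) for k, u in units.items()}
    centroids = [list(p) for p in rng.sample(list(projected.values()), clusters)]
    for _ in range(iters):  # Lloyd's algorithm
        groups = [[] for _ in centroids]
        for p in projected.values():
            groups[nearest(p, centroids)].append(p)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    buckets = [[] for _ in centroids]
    for k, p in projected.items():
        buckets[nearest(p, centroids)].append(k)
    return {"basis": basis, "centroids": centroids, "buckets": buckets, "units": units}

def query(index, vector):
    """'Test': exact correlation only against the query's own bucket."""
    u = unit_normalise(vector)
    bucket = index["buckets"][nearest(project(u, index["basis"]), index["centroids"])]
    scored = [(sum(a * b for a, b in zip(u, index["units"][k])), k) for k in bucket]
    if not scored:
        return None, float("nan")
    r, key = max(scored)
    return key, r

# Made-up vectors for illustration.
vectors = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [5, 5, 3, 3, 1, 0],
}
index = build_index(vectors)
best_key, best_r = query(index, [10, 20, 30, 40, 50, 60])
```

The query is a scaled copy of `a`, so it normalises to the same unit vector, lands in the same bucket, and is returned with a correlation of about 1. Weakly correlated pairs can fall into different buckets and be missed, which is consistent with the slide's remark that the technique works best for highly correlated items.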