SlideShare une entreprise Scribd logo
1  sur  16
Understanding Big Data
Applications and Architectures
1st JTC 1 SGBD Meeting
SDSC San Diego March 19 2014
Geoffrey Fox
Judy Qiu
Shantenu Jha (Rutgers)
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
51 Detailed Use Cases: Contributed July-September 2013
Covers goals, data features such as 3 V’s, software, hardware
• http://bigdatawg.nist.gov/usecases.php
• https://bigdatacoursespring2014.appspot.com/course (Section 5)
• Government Operation(4): National Archives and Records Administration, Census Bureau
• Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search,
Digital Materials, Cargo shipping (as in UPS)
• Defense(3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd
Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source
experiments
• Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake,
Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate
simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry
(microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy(1): Smart grid 2
26 Features for each use case
Would like to capture “essence of
these use cases”
“small” kernels, mini-apps
Or Classify applications into patterns
Do it from HPC background not database view
point
i.e. focus on cases with detailed analytics
What are “mini-Applications”
• Use for benchmarks of computers and software (is my
parallel compiler any good?)
• In parallel computing, this is well established
– Linpack for measuring performance to rank machines in Top500
(changing?)
– NAS Parallel Benchmarks (originally a pencil and paper
specification to allow optimal implementations; then MPI library)
– Other specialized Benchmark sets keep changing and used to
guide procurements
• Last 2 NSF hardware solicitations had NO preset benchmarks –
perhaps as no agreement on key applications for clouds and
data intensive applications
– Berkeley dwarfs capture different structures that any approach
to parallel computing must address
– Templates used to capture parallel computing patterns
• I’ll let experts comment on database benchmarks like TPC
HPC Benchmark Classics
• Linpack or HPL: Parallel LU factorization for solution of
linear equations
• NPB version 1: Mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss Seidel
7 Original Berkeley Dwarfs (Colella)
1. Structured Grids (including locally structured
grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
8. Note “vaguer” than NPB
13 Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• MapReduce
• Combinational Logic
• Graph Traversal
• Dynamic Programming
• Backtrack and Branch-and-Bound
• Graphical Models
• Finite State Machines
First 6 of these correspond to
Colella’s original.
Monte Carlo dropped
N-body methods are a subset of
Particle
Note a little inconsistent in that
MapReduce is a programming
model and spectral method is a
numerical method
Need multiple facets!
Distributed Computing MetaPatterns I
Jha, Cole, Katz, Parashar, Rana, Weissman
Distributed Computing MetaPatterns II
Jha, Cole, Katz, Parashar, Rana, Weissman
Distributed Computing MetaPatterns III
Jha, Cole, Katz, Parashar, Rana, Weissman
Problem Architecture Facet of Ogres (Meta or
MacroPattern)
i. Pleasingly Parallel – as in Blast, Protein docking, imagery
ii. Local Analytics or Machine Learning – ML or filtering
pleasingly parallel as in bio-imagery, radar images (really
just pleasingly parallel but sophisticated local analytics)
iii. Global Analytics or Machine Learning seen in LDA,
Clustering etc. with parallel ML over nodes of system
iv. SPMD (Single Program Multiple Data)
v. Bulk Synchronous Processing: well defined compute-
communication phases
vi. Fusion: Knowledge discovery often involves fusion of
multiple methods.
vii. Workflow (often used in fusion)
Core Analytics Facet of Ogres (microPattern)
i. Search/Query
ii. Local Machine Learning – pleasingly parallel
iii. Summarizing statistics
iv. Recommender Systems (Collaborative Filtering)
v. Outlier Detection (iORCA)
vi. Clustering (many methods),
vii. LDA (Latent Dirichlet Allocation) or variants like PLSI (Probabilistic
Latent Semantic Indexing),
viii. SVM and Linear Classifiers (Bayes, Random Forests),
ix. PageRank, (Find leading eigenvector of sparse matrix)
x. SVD (Singular Value Decomposition),
xi. Learning Neural Networks (Deep Learning),
xii. MDS (Multidimensional Scaling),
xiii. Graph Structure Algorithms (seen in search of RDF Triple stores),
xiv. Network Dynamics - Graph simulation Algorithms (epidemiology)
Matrix
Algebra
Global
Optimization
Analytics Features Facet of Ogres
• These core analytics/kernels can be classified by features
like
• (a) Flops per byte;
• (b) Communication Interconnect requirements;
• (c) Is application (graph) constant or dynamic
• (d) Is communication BSP or Asynchronous
• (e) Are algorithms Iterative or not?
• (f) Are data points in metric or non-metric spaces
Application Class Facet of Ogres
• (a) Search and query
• (b) Maximum Likelihood,
• (c) 2 minimizations,
• (d) Expectation Maximization (often Steepest descent)
• (e) Global Optimization (Variational Bayes)
• (f) Agents, as in epidemiology (swarm approaches)
• (g) GIS (Geographical Information Systems).
Data Source Facet of Ogres
• (i) SQL,
• (ii) NOSQL based,
• (iii) Other Enterprise data systems (10 examples from Bob Marcus)
• (iv) Set of Files (as managed in iRODS),
• (v) Internet of Things,
• (vi) Streaming and
• (vii) HPC simulations.
• Before data gets to compute system, there is often an initial data
gathering phase which is characterized by a block size and timing. Block
size varies from month (Remote Sensing, Seismic) to day (genomic) to
seconds or lower (Real time control, streaming)
• There are storage/compute system styles: Shared, Dedicated,
Permanent, Transient
• Other characteristics are need for permanent auxiliary/comparison
datasets and these could be interdisciplinary implying nontrivial data
movement/replication
Lessons / Insights
• Ogres classify Big Data applications by multiple
facets – each with several exemplars and features
– Guide to breadth and depth of Big Data
– Does your architecture/software support all the ogres?
• Add database exemplars
• In parallel computing, the simple analytic kernels
dominate mindshare even though agreed limited
• Section 5 of my class
https://bigdatacoursespring2014.appspot.com/preview
classifies 51 use cases with these facets

Contenu connexe

Tendances

High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Saliya Ekanayake
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
Lewis Crawford
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
MLconf
 

Tendances (20)

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 

Similaire à Classification of Big Data Use Cases by different Facets

Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226
Nick Kypreos
 
On the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-proceduresOn the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-procedures
Armando Guevara
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
butest
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
Attila Barta
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1
wang yaohui
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
Shubham Singh
 

Similaire à Classification of Big Data Use Cases by different Facets (20)

High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226
 
On the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-proceduresOn the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-procedures
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1
 
Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
BigData
BigDataBigData
BigData
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 

Plus de Geoffrey Fox

Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Geoffrey Fox
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
Geoffrey Fox
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Geoffrey Fox
 

Plus de Geoffrey Fox (15)

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online Education
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC Technology
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and Education
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Remarks on MOOC's
Remarks on MOOC'sRemarks on MOOC's
Remarks on MOOC's
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a Service
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Dernier (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Classification of Big Data Use Cases by different Facets

  • 1. Understanding Big Data Applications and Architectures 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Geoffrey Fox Judy Qiu Shantenu Jha (Rutgers) gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington
  • 2. 51 Detailed Use Cases: Contributed July-September 2013 Covers goals, data features such as 3 V’s, software, hardware • http://bigdatawg.nist.gov/usecases.php • https://bigdatacoursespring2014.appspot.com/course (Section 5) • Government Operation(4): National Archives and Records Administration, Census Bureau • Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS) • Defense(3): Sensors, Image surveillance, Situation Assessment • Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity • Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets • The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source experiments • Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan • Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors • Energy(1): Smart grid 2 26 Features for each use case
  • 3. Would like to capture “essence of these use cases” “small” kernels, mini-apps Or Classify applications into patterns Do it from HPC background not database view point i.e. focus on cases with detailed analytics
  • 4. What are “mini-Applications” • Use for benchmarks of computers and software (is my parallel compiler any good?) • In parallel computing, this is well established – Linpack for measuring performance to rank machines in Top500 (changing?) – NAS Parallel Benchmarks (originally a pencil and paper specification to allow optimal implementations; then MPI library) – Other specialized Benchmark sets keep changing and used to guide procurements • Last 2 NSF hardware solicitations had NO preset benchmarks – perhaps as no agreement on key applications for clouds and data intensive applications – Berkeley dwarfs capture different structures that any approach to parallel computing must address – Templates used to capture parallel computing patterns • I’ll let experts comment on database benchmarks like TPC
  • 5. HPC Benchmark Classics • Linpack or HPL: Parallel LU factorization for solution of linear equations • NPB version 1: Mainly classic HPC solver kernels – MG: Multigrid – CG: Conjugate Gradient – FT: Fast Fourier Transform – IS: Integer sort – EP: Embarrassingly Parallel – BT: Block Tridiagonal – SP: Scalar Pentadiagonal – LU: Lower-Upper symmetric Gauss Seidel
  • 6. 7 Original Berkeley Dwarfs (Colella) 1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) 2. Unstructured Grids 3. Fast Fourier Transform 4. Dense Linear Algebra 5. Sparse Linear Algebra 6. Particles 7. Monte Carlo 8. Note “vaguer” than NPB
  • 7. 13 Berkeley Dwarfs • Dense Linear Algebra • Sparse Linear Algebra • Spectral Methods • N-Body Methods • Structured Grids • Unstructured Grids • MapReduce • Combinational Logic • Graph Traversal • Dynamic Programming • Backtrack and Branch-and-Bound • Graphical Models • Finite State Machines First 6 of these correspond to Colella’s original. Monte Carlo dropped N-body methods are a subset of Particle Note a little inconsistent in that MapReduce is a programming model and spectral method is a numerical method Need multiple facets!
  • 8. Distributed Computing MetaPatterns I Jha, Cole, Katz, Parashar, Rana, Weissman
  • 9. Distributed Computing MetaPatterns II Jha, Cole, Katz, Parashar, Rana, Weissman
  • 10. Distributed Computing MetaPatterns III Jha, Cole, Katz, Parashar, Rana, Weissman
  • 11. Problem Architecture Facet of Ogres (Meta or MacroPattern) i. Pleasingly Parallel – as in Blast, Protein docking, imagery ii. Local Analytics or Machine Learning – ML or filtering pleasingly parallel as in bio-imagery, radar images (really just pleasingly parallel but sophisticated local analytics) iii. Global Analytics or Machine Learning seen in LDA, Clustering etc. with parallel ML over nodes of system iv. SPMD (Single Program Multiple Data) v. Bulk Synchronous Processing: well defined compute- communication phases vi. Fusion: Knowledge discovery often involves fusion of multiple methods. vii. Workflow (often used in fusion)
  • 12. Core Analytics Facet of Ogres (microPattern) i. Search/Query ii. Local Machine Learning – pleasingly parallel iii. Summarizing statistics iv. Recommender Systems (Collaborative Filtering) v. Outlier Detection (iORCA) vi. Clustering (many methods), vii. LDA (Latent Dirichlet Allocation) or variants like PLSI (Probabilistic Latent Semantic Indexing), viii. SVM and Linear Classifiers (Bayes, Random Forests), ix. PageRank, (Find leading eigenvector of sparse matrix) x. SVD (Singular Value Decomposition), xi. Learning Neural Networks (Deep Learning), xii. MDS (Multidimensional Scaling), xiii. Graph Structure Algorithms (seen in search of RDF Triple stores), xiv. Network Dynamics - Graph simulation Algorithms (epidemiology) Matrix Algebra Global Optimization
  • 13. Analytics Features Facet of Ogres • These core analytics/kernels can be classified by features like • (a) Flops per byte; • (b) Communication Interconnect requirements; • (c) Is application (graph) constant or dynamic • (d) Is communication BSP or Asynchronous • (e) Are algorithms Iterative or not? • (f) Are data points in metric or non-metric spaces
  • 14. Application Class Facet of Ogres • (a) Search and query • (b) Maximum Likelihood, • (c) 2 minimizations, • (d) Expectation Maximization (often Steepest descent) • (e) Global Optimization (Variational Bayes) • (f) Agents, as in epidemiology (swarm approaches) • (g) GIS (Geographical Information Systems).
  • 15. Data Source Facet of Ogres • (i) SQL, • (ii) NOSQL based, • (iii) Other Enterprise data systems (10 examples from Bob Marcus) • (iv) Set of Files (as managed in iRODS), • (v) Internet of Things, • (vi) Streaming and • (vii) HPC simulations. • Before data gets to compute system, there is often an initial data gathering phase which is characterized by a block size and timing. Block size varies from month (Remote Sensing, Seismic) to day (genomic) to seconds or lower (Real time control, streaming) • There are storage/compute system styles: Shared, Dedicated, Permanent, Transient • Other characteristics are need for permanent auxiliary/comparison datasets and these could be interdisciplinary implying nontrivial data movement/replication
  • 16. Lessons / Insights • Ogres classify Big Data applications by multiple facets – each with several exemplars and features – Guide to breadth and depth of Big Data – Does your architecture/software support all the ogres? • Add database exemplars • In parallel computing, the simple analytic kernels dominate mindshare even though agreed limited • Section 5 of my class https://bigdatacoursespring2014.appspot.com/preview classifies 51 use cases with these facets

Notes de l'éditeur

  1. Big dwarfs are OgresImplement Ogres in ABDS+
  2. Big dwarfs are OgresImplement Ogres in ABDS+