High Performance Computing and Big Data
Geoffrey Fox, August 22, 2016
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
Big Data Institute, Seoul National University, Korea
Abstract
• We propose a hybrid software stack in which large-scale data systems for both
research and commercial applications run on the commodity (Apache)
Big Data Stack (ABDS), augmented with High Performance Computing (HPC)
enhancements, typically to improve performance. We give several examples
taken from bio- and financial informatics.
• We look in detail at parallel and distributed run-times, including MPI from
HPC and Apache Storm, Heron, Spark and Flink from ABDS, stressing that
one needs to distinguish the different needs of parallel (tightly coupled) and
distributed (loosely coupled) systems.
• We also study "Java Grande", or the principles that allow Java codes to
perform as fast as those written in more traditional HPC languages. We also
note the differences between capability (individual jobs using many nodes)
and capacity (lots of independent jobs) computing.
• We discuss how this HPC-ABDS concept allows one to discuss the
convergence of Big Data, Big Simulation, Cloud and HPC Systems.
See http://hpc-abds.org/kaleidoscope/
Why Connect (“Converge”) Big Data and HPC
• Two major trends in computing systems are
– Growth in high performance computing (HPC) with an international exascale
initiative (China in the lead)
– Big data phenomenon with an accompanying cloud infrastructure of well
publicized dramatic and increasing size and sophistication.
• Note "Big Data" is largely an industry initiative, although the software used is often open
source
– So the HPC label overlaps with "research", e.g. the HPC community is largely
responsible for Astronomy and Accelerator (LHC, Belle, BEPC ...) data analysis
• Merge HPC and Big Data to get
– More efficient sharing of large scale resources running simulations and data
analytics
– Higher performance Big Data algorithms
– Richer software environment for research community building on many big
data tools
– Easier sustainability model for HPC – HPC does not have resources to build
and maintain a full software stack
Convergence Points (Nexus) for
HPC-Cloud-Big Data-Simulation
• Nexus 1: Applications – Divide use cases into Data and
Model and compare characteristics separately in these two
components with 64 Convergence Diamonds (features)
• Nexus 2: Software – High Performance Computing (HPC)
Enhanced Big Data Stack HPC-ABDS. 21 Layers adding high
performance runtime to Apache systems (Hadoop is fast!).
Establish principles to get good performance from Java or C
programming languages
• Nexus 3: Hardware – Use Infrastructure as a Service IaaS and
DevOps to automate deployment of software defined systems
on hardware designed for functionality and performance e.g.
appropriate disks, interconnect, memory
SPIDAL Project
Datanet: CIF21 DIBBs: Middleware and
High Performance Analytics Libraries for
Scalable Data Science
• NSF14-43054 started October 1, 2014
• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)
• A co-design project: Software, algorithms, applications
Software: MIDAS
HPC-ABDS
Co-designing Building Blocks
Collaboratively
Main Components of SPIDAL Project
• Design and Build Scalable High Performance Data Analytics Library
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable
Analytics for:
– Domain specific data analytics libraries – mainly from project.
– Add Core Machine learning libraries – mainly from community.
– Performance of Java and MIDAS Inter- and Intra-node.
• NIST Big Data Application Analysis – features of data intensive Applications
deriving 64 Convergence Diamonds. Application Nexus.
• HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High
Performance Computing) and the rich functionality of the commodity Apache Big
Data Stack. Software Nexus
• MIDAS: Integrating Middleware – from project.
• Applications: Biomolecular Simulations, Network and Computational Social
Science, Epidemiology, Computer Vision, Geographical Information Systems,
Remote Sensing for Polar Science and Pathology Informatics, Streaming for
robotics, streaming stock analytics
• Implementations: HPC as well as clouds (OpenStack, Docker); convergence via a
common DevOps tool. Hardware Nexus
Application Nexus
Use-case Data and Model
NIST Collection
Big Data Ogres
Convergence Diamonds
Data and Model in Big Data and Simulations I
• Need to discuss Data and Model as problems have both
intermingled, but we can get insight by separating which allows
better understanding of Big Data - Big Simulation
“convergence” (or differences!)
• The Model is a user construction and it has a “concept”,
parameters and gives results determined by the computation.
We use term “model” in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
– For clustering, the model parameters are cluster centers while the data
is set of points to be clustered
– For queries, the model is structure of database and results of this query
while the data is whole database queried and SQL query
– For deep learning with ImageNet, the model is chosen network with
model parameters as the network link weights. The data is set of images
used for training or classification
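• A minimal sketch (plain NumPy, not a SPIDAL code) of this Data/Model split for clustering:
the Data (points) stays fixed while the Model (cluster centers) is updated each iteration.

# Minimal k-means sketch illustrating the Data/Model split:
# Data = fixed set of points; Model = cluster centers updated every iteration.
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]   # initial Model
    for _ in range(iterations):
        # assign every Data point to its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update the Model: each center becomes the mean of its assigned points
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

data = np.random.default_rng(1).normal(size=(1000, 2))    # Data: points to be clustered
model, labels = kmeans(data, k=5)                          # Model: 5 cluster centers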
Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
– Model can be formulation with particle dynamics or partial
differential equations defined by parameters such as particle
positions and discretized velocity, pressure, density values
– Data could be small when just boundary conditions
– Data large with data assimilation (weather forecasting) or
when data visualizations are produced by simulation
• Big Data implies Data is large but Model varies in size
– e.g. LDA with many topics or deep learning has a large
model
– Clustering or Dimension reduction can be quite small in
model size
• Data often static between iterations (unless streaming); Model
parameters vary between iterations
Online Use Case Form: http://hpc-abds.org/kaleidoscope/survey/
51 Detailed Use Cases: Contributed July-September 2013
Covers goals, data features such as 3 V’s, software, hardware
• Government Operation(4): National Archives and Records Administration, Census Bureau
• Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search,
Digital Materials, Cargo shipping (as in UPS)
• Defense(3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd
Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source
experiments
• Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake,
Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation
datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to
watersheds), AmeriFlux and FLUXNET gas sensors
• Energy(1): Smart grid
• Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf
with common set of 26 features recorded for each use-case; “Version 2” being prepared
26 features recorded for each use case; the collection is biased toward science.
Sample Features of 51 Use Cases I
• PP (26) “All” Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce MR (add MRStat below for full count)
• MRStat (7) Simple version of MR where key computations are simple
reduction as found in statistical averages such as histograms and
averages
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making;
could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed
this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (Independent for each parallel entity) –
application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, ...
– Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief
Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can
call this EGO or Exascale Global Optimization, requiring scalable parallel algorithms
• Workflow (51) Universal
• GIS (16) Geotagged data and often displayed in ESRI, Microsoft Virtual
Earth, Google Earth, GeoServer etc.
• HPC(5) Classic large-scale simulation of cosmos, materials, etc. generating
(visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities
represented as agents
7 Computational Giants of
NRC Massive Data Analysis Report
1) G1: Basic Statistics e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations e.g. Linear Programming
6) G6: Integration e.g. LDA and other GML
7) G7: Alignment Problems e.g. BLAST
http://www.nap.edu/catalog.php?record_id=18374 Big Data Models?
HPC (Simulation) Benchmark Classics
• Linpack or HPL: Parallel LU factorization
for solution of linear equations; HPCG
• NPB version 1: Mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss Seidel
Simulation Models
13 Berkeley Dwarfs
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and
Branch-and-Bound
12) Graphical Models
13) Finite State Machines
First 6 of these correspond to Colella’s
original. (Classic simulations)
Monte Carlo dropped.
N-body methods are a subset of
Particle in Colella.
Note the list is a little inconsistent in that
MapReduce is a programming model
while spectral methods are a numerical
method.
Need multiple facets to classify use
cases!
Largely Models for Data or Simulation
Classifying Use cases
Classifying Use Cases
• The Big Data Ogres built on a collection of 51 big data uses gathered by
the NIST Public Working Group where 26 properties were gathered for each
application.
• This information was combined with other studies including the Berkeley
dwarfs, the NAS parallel benchmarks and the Computational Giants of
the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features divided into four views that
could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (Macro pattern); Execution
Features (Micro patterns); Data Source and Style; and finally the
Processing View or runtime features.
• We generalized this approach to integrate Big Data and Simulation
applications into a single classification looking separately at Data and
Model with the total facets growing to 64 in number, called convergence
diamonds, and split between the same 4 views.
• A mapping of facets into work of the SPIDAL project has been given.
[Figure: 4 Ogre Views and 50 Facets. Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Data Source and Style View: Geospatial Information System, HPC Simulations, Internet of Things, Metadata/Provenance, Shared/Dedicated/Transient/Permanent, Archived/Batched/Streaming, HDFS/Lustre/GPFS, Files/Objects, Enterprise Data Model, SQL/NoSQL/NewSQL. Execution View: Performance Metrics; Flops per Byte; Memory I/O; Execution Environment and Core libraries; Volume; Velocity; Variety; Veracity; Communication Structure; Data Abstraction; Metric or Non-Metric; Regular or Irregular; Dynamic or Static; Iterative or Simple. Processing View: Micro-benchmarks, Local Analytics, Global Analytics, Base Statistics, Search/Query/Index, Classification, Learning, Optimization Methodology, Streaming, Alignment, Linear Algebra Kernels, Graph Algorithms, Visualization, Recommendations]
64 Features in 4 views for Unified Classification of Big Data
and Simulation Applications
[Figure: Convergence Diamonds Views and Facets, 64 facets split between Data (D) and Model (M) across the same 4 views. Problem Architecture View and Data Source and Style View as above (Archived/Batched/Streaming further split into S1-S5). Execution View facets appear separately for Data and Model: Data Volume and Model Size, Data Velocity, Data and Model Variety, Veracity, Communication Structure, Data and Model Abstraction, Data and Model Metric/Non-Metric, Regular/Irregular, Dynamic/Static, Iterative/Simple, Performance Metrics, Flops per Byte / Memory IO / Flops per watt, Execution Environment and Core libraries. Processing View splits into Big Data Processing Diamonds (Micro-benchmarks, Local and Global Analytics/Informatics/Simulations, Recommender Engine, Base Data Statistics, Data Search/Query/Index, Data Classification, Learning, Optimization Methodology, Streaming Data Algorithms, Data Alignment, Linear Algebra Kernels with many subclasses, Graph Algorithms, Visualization, Core Libraries) and Simulation (Exascale) Processing Diamonds (Iterative PDE Solvers, Multiscale Method, Spectral Methods, N-body Methods, Particles and Fields, Evolution of Discrete Systems, Nature of mesh if used)]
Examples in Problem Architecture View PA
• The facets in the Problem architecture view include 5 very common ones
describing synchronization structure of a parallel job:
– MapOnly or Pleasingly Parallel (PA1): the processing of a collection of
independent events;
– MapReduce (PA2): independent calculations (maps) followed by a final
consolidation via MapReduce;
– MapCollective (PA3): parallel machine learning dominated by scatter,
gather, reduce and broadcast;
– MapPoint-to-Point (PA4): simulations or graph processing with many
local linkages in points (nodes) of studied system.
– MapStreaming (PA5): The fifth important problem architecture is seen in
recent approaches to processing real-time data.
– We do not focus on pure shared-memory architectures (PA-6) but look at
hybrid architectures with clusters of multicore nodes, and find important
performance issues dependent on the node programming model.
• Most of our codes are SPMD (PA-7) and BSP (PA-8).
6 Forms of
MapReduce
Describes
Architecture of
- Problem (Model
reflecting data)
- Machine
- Software
2 important
variants (software)
of Iterative
MapReduce and
Map-Streaming
a) “In-place” HPC
b) Flow for model
and data
[Figure: the six forms. 1) Map-Only (Pleasingly Parallel); 2) Classic MapReduce (Input, map, reduce); 3) Iterative MapReduce or Map-Collective (Input, map, reduce with Iterations); 4) Map Point-to-Point Communication (iterations over local or graph linkages); 5) Map-Streaming (maps fed by brokers handling Events); 6) Shared-Memory Map & Communication]
Examples in Execution View EV
• The Execution view is a mix of facets describing either data or model; PA
was largely the overall Data+Model
• EV-M14 is Complexity of model (O(N2) for N points) seen in the non-
metric space models EV-M13 such as one gets with DNA sequences.
• EV-M11 describes iterative structure distinguishing Spark, Flink, and Harp
from the original Hadoop.
• The facet EV-M8 describes the communication structure which is a focus
of our research as much data analytics relies on collective communication
which is in principle understood but we find that significant new work is
needed compared to basic HPC releases which tend to address point to
point communication.
• The model size EV-M4 and data volume EV-D4 are important in describing
the algorithm performance as just like in simulation problems, the grain size
(the number of model parameters held in the unit – thread or process – of
parallel computing) is a critical measure of performance.
Examples in Data View DV
• We can highlight DV-5 streaming where there is a lot of recent
progress;
• DV-9 categorizes our Biomolecular simulation application with
data produced by an HPC simulation
• DV-10 is Geospatial Information Systems covered by our
spatial algorithms.
• DV-7 provenance, is an example of an important feature that
we are not covering.
• The data storage and access facets DV-3 and DV-4 are covered in our
Pilot-Data work.
• The Internet of Things DV-8 is not a focus of our project
although our recent streaming work relates to this and our
addition of HPC to Apache Heron and Storm is an example of
the value of HPC-ABDS to IoT.
Examples in Processing View PV
• The Processing view PV characterizes algorithms and is only Model (no
Data features) but covers both Big data and Simulation use cases.
• Graph PV-M13 and Visualization PV-M14 covered in SPIDAL.
• PV-M15 directly describes SPIDAL which is a library of core and other
analytics.
• This project covers many aspects of PV-M4 to PV-M11 as these
characterize the SPIDAL algorithms (such as optimization, learning,
classification).
– We are of course NOT addressing PV-M16 to PV-M22 which are
simulation algorithm characteristics and not applicable to data analytics.
• Our work largely addresses Global Machine Learning PV-M3 although
some of our image analytics are local machine learning PV-M2 with
parallelism over images and not over the analytics.
• Many of our SPIDAL algorithms have linear algebra PV-M12 at their core;
one nice example is multi-dimensional scaling MDS which is based on
matrix-matrix multiplication and conjugate gradient.
Comparison of Data Analytics with Simulation I
• Simulations (models) produce big data as visualizations of results – they
are a data source
– Or consume often smallish data to define a simulation problem
– HPC simulation in (weather) data assimilation is data + model
• Pleasingly parallel often important in both
• Both are often SPMD and BSP
• Non-iterative MapReduce is a major big data paradigm
– not a common simulation paradigm except where “Reduce” summarizes
pleasingly parallel execution as in some Monte Carlos
• Big Data often has large collective communication
– Classic simulation has a lot of smallish point-to-point messages
– Motivates MapCollective model
• Simulations characterized often by difference or differential operators
leading to nearest neighbor sparsity
• Some important data analytics can be sparse as in PageRank and “Bag of words”
algorithms but many involve full matrix algorithm
Comparison of Data Analytics with Simulation II
• There are similarities between some graph problems and particle
simulations with a particular cutoff force.
– Both are MapPoint-to-Point problem architecture
• Note many big data problems are “long range force” (as in
gravitational simulations) as all points are linked.
– Easiest to parallelize. Often full matrix algorithms
– e.g. in DNA sequence studies, distance (i, j) defined by BLAST,
Smith-Waterman, etc., between all sequences i, j.
– Opportunity for “fast multipole” ideas in big data. See NRC report
• Current Ogres/Diamonds do not have facets to designate the underlying
hardware: GPU vs. Many-core (Xeon Phi) vs. Multi-core. These only
define how the maps are processed and keep the map-X structure fixed; maybe
this should change, as the ability to exploit vector or SIMD parallelism could
be a model facet.
Comparison of Data Analytics with Simulation III
• In image-based deep learning, neural network weights are block sparse
(corresponding to links to pixel blocks) but can be formulated as full
matrix operations on GPUs and MPI in blocks.
• In HPC benchmarking, Linpack being challenged by a new sparse
conjugate gradient benchmark HPCG, while I am diligently using non-
sparse conjugate gradient solvers in clustering and Multi-dimensional
scaling.
• Simulations tend to need high precision and very accurate results –
partly because of differential operators
• Big Data problems often don’t need high accuracy as seen in trend to
low precision (16 or 32 bit) deep learning networks
– There are no derivatives and the data has inevitable errors
• Note parallel machine learning (GML not LML) can benefit from HPC
style interconnects and architectures as seen in GPU-based deep
learning
– So commodity clouds not necessarily best
Some use of Analytics
Clustering
• The SPIDAL Library includes several clustering algorithms with
sophisticated features
– Deterministic Annealing
– Radius cutoff in cluster membership
– Elkan's algorithm using the triangle inequality
• They also cover two important cases
– Points are vectors – algorithm is O(N) for N points
– Points are not vectors – all we know is the distance δ(i, j) between each pair of
points i and j; algorithm is O(N²) for N points
• We find visualization important to judge quality of clustering
• As data typically not 2D or 3D, we use dimension reduction to project data
so we can then view it
• Have a browser viewer WebPlotViz that replaces an older Windows system
2D Vector Clustering with cutoff at 3σ
LCMS Mass Spectrometer Peak Clustering. Charge 2 sample with 10.9 million points
and 420,000 clusters visualized in WebPlotViz
Orange stars: points outside all clusters; yellow circles: cluster centers
Dimension Reduction
• Principal Component Analysis (linear mapping) and Multidimensional
Scaling MDS (nonlinear and applicable to non-Euclidean spaces) are
methods to map abstract spaces to three dimensions for visualization
• Both run well in parallel and give great results
• Semimetric spaces have pairwise distances δ(i, j) defined between points i and j in the
space
• But the data is typically in a high dimensional or non-vector space, so use
dimension reduction. Associate each point i with a vector Xi in a Euclidean
space of dimension K so that δ(i, j) ≈ d(Xi, Xj), where d(Xi, Xj) is the Euclidean
distance between the mapped points i and j in the K dimensional space.
• K = 3 natural for visualization but other values interesting
• Principal Component analysis is best known dimension reduction approach
but a) linear b) requires original points in a vector space
• There are many other nonlinear vector space methods such as GTM
Generative Topographic Mapping
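• A small hedged sketch of this mapping using scikit-learn's standard SMACOF-based MDS
(not the SPIDAL WDA-SMACOF code below): take a precomputed distance matrix δ(i, j) and
map it to K = 3 dimensions for visualization.

# Sketch: map a precomputed pairwise distance matrix to 3D for visualization.
# Uses scikit-learn's SMACOF-based MDS, not the SPIDAL WDA-SMACOF implementation.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 50))                                 # points in a high-dimensional space
delta = np.linalg.norm(X_high[:, None] - X_high[None, :], axis=2)   # delta(i, j)

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
X3 = mds.fit_transform(delta)          # X_i in 3D with d(X_i, X_j) approximating delta(i, j)
print(X3.shape, "final stress:", mds.stress_)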
WDA-SMACOF “Best” MDS
• MDS minimizes the Stress σ(X) with pairwise distances δ(i, j):
σ(X) = Σ_{i<j≤N} weight(i,j) × (δ(i, j) − d(Xi, Xj))²
• SMACOF is a clever Expectation Maximization method that chooses a good steepest
descent
• Improved by Deterministic Annealing, gradually reducing the Temperature (a
distance scale); DA does not impact compute time much and gives DA-
SMACOF
– Deterministic Annealing like Simulated Annealing but no Monte Carlo
• Classic SMACOF is O(N²) for uniform weights and O(N³) for non-trivial weights,
but we get nonuniform weights from
– The preferred Sammon method weight(i,j) = 1/δ(i, j), or
– Missing distances put in as weight(i,j) = 0
• Use conjugate gradient – converges in 5-100 iterations – a big gain for
matrix with a million rows. This removes factor of N in time complexity and
gives WDA-SMACOF
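• For concreteness, a few illustrative NumPy lines evaluating the weighted stress σ(X) above,
with Sammon weights weight(i,j) = 1/δ(i,j) and weight 0 for missing distances (this is only the
stress evaluation, not the parallel WDA-SMACOF solver itself).

# Evaluate sigma(X) = sum_{i<j} weight(i,j) * (delta(i,j) - d(X_i, X_j))^2
# with Sammon weights 1/delta and weight 0 for missing distances (illustrative only).
import numpy as np

def weighted_stress(delta, X, missing_mask=None):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)           # d(X_i, X_j) in the mapped space
    w = np.where(delta > 0, 1.0 / np.maximum(delta, 1e-12), 0.0)  # Sammon weight 1/delta
    if missing_mask is not None:
        w = np.where(missing_mask, 0.0, w)                        # missing distances get weight 0
    iu = np.triu_indices(len(X), k=1)                             # sum over pairs i < j only
    return np.sum(w[iu] * (delta[iu] - d[iu]) ** 2)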
446K sequences
~100 clusters
Note distorted shapes
probably due to imperfect
distance measures e.g.
position correlated with
sequence length
Fungi -- 4 Classic Clustering Methods plus Species
Coloring
Heatmap of original distance vs 3D Euclidean
Distances for Sequences and Stocks
• One can visualize the quality of dimension reduction by comparing, as a scatterplot
or heatmap, the distances δ(i, j) before and after mapping to 3D.
• Perfection is a diagonal straight line, and the results seem good in general
Proteomics Example
Stock Market Example
Spherical Phylograms (MSA and SWG distances)
Labelled by species; MDS places the same species in two groupings.
RAxML result visualized in FigTree.
Quality of 3D Phylogenetic Tree
• 3 different MDS implementations and 3 different distance
measures
• EM-SMACOF is basic SMACOF for MDS
• LMA was the previous best method, using a Levenberg-Marquardt
nonlinear χ² solver
• WDA-SMACOF finds best result
Sum of branch lengths of the Spherical Phylogram
generated in 3D space on two datasets
[Bar charts: Sum of Branches of the Spherical Phylogram on the 599nts and 999nts datasets, for MSA, SWG and NW distances, comparing WDA-SMACOF, LMA and EM-SMACOF]
HTML5 web viewer WebPlotViz
• Supports visualization of 3D point sets (typically derived by mapping from
abstract spaces) for streaming and non-streaming case
– Simple data management layer
– 3D web visualizer with various capabilities such as defining color
schemes, point sizes, glyphs, labels
• Core Technologies
– MongoDB management
– Play Server side framework
– Three.js
– WebGL
– JSON data objects
– Bootstrap Javascript web pages
• Open Source
http://spidal-gw.dsc.soic.indiana.edu/
• ~10,000 lines of extra code
[Architecture diagram: Front-end view in the browser (plot visualization and time-series animation with Three.js) talks to Web Request Controllers (Play Framework) to upload data and request plots; the server converts the upload format to JSON plots and stores them in the Data Layer (MongoDB)]
Stock Daily Data Streaming Example
• Example is collection of around 7000 distinct stocks with daily values
available at ~2750 distinct times
– Clustering as provided by Wall Street – Dow Jones set of 30
stocks, S&P 500, various ETF’s etc.
• The Center for Research in Security Prices (CRSP) database, accessed through
the Wharton Research Data Services (WRDS) web interface
• Available for free to Indiana University students for research
• Daily stock prices from 2004 Jan 01 to 2015 Dec 31 in the form of a
CSV file
• We use the information
– ID, Date, Symbol, Factor to Adjust Volume, Factor to Adjust Price,
Price, Outstanding Stocks
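• A hedged pandas sketch of the relative-change quantity plotted on the following slides; the
file name and column names ("Date", "Symbol", "Price") are illustrative, not the exact
CRSP/WRDS CSV schema.

# Relative change of each stock from its baseline (first recorded) price.
# File and column names are illustrative, not the exact CRSP CSV schema.
import pandas as pd

df = pd.read_csv("daily_prices.csv", parse_dates=["Date"])     # hypothetical export of the CSV
df = df.sort_values(["Symbol", "Date"])
baseline = df.groupby("Symbol")["Price"].transform("first")    # e.g. first trading day of Jan 2004
df["RelativeChange"] = (df["Price"] - baseline) / baseline     # 0.10 corresponds to +10%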
Relative Changes in Stock Values using one-day values, measured from January 2004 and
plotted starting one year later (January 2005).
Filled circles are final values.
[Figure labels: Apple, Mid Cap, Energy, S&P, Dow Jones, Finance; Origin = 0% change; +10% and +20% contours]
Relative Changes in Stock Values using one-day values; expansion of the previous data.
[Figure labels: Mid Cap, Energy, S&P, Dow Jones, Finance; Origin = 0% change; +10% contour]
Algorithm Challenge
• The NRC Massive Data Analysis report stresses the importance of finding O(N)
or O(N log N) algorithms for O(N²) problems
– N is the number of points
• This is well understood for O(N²) simulation problems where there is a long
range force, as in gravitational (cosmology) simulations for N stars or
galaxies
• Such simulations are governed by equations that allow a systematic "multipole
expansion" with O(N) as the first term plus corrections
– Has been used successfully in parallel for 25 years
• O(N²) big data problems don't have a systematic practical approach, even
though there is a qualitative argument shown on the next slide.
• The work w(i,j) is labelled by two indices i and j, each running from 1 to N.
• If points i and j are near each other, we need to perform accurate calculations
• If they are far apart, we can use approximations and, for example, replace points in a far
away cluster of M particles by their cluster center weighted by M
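• An illustrative sketch of this idea (not a SPIDAL algorithm): compute exact terms w(i,j) for
points near a cluster, but replace a far-away cluster of M points by its centroid weighted by M.
Unlike gravity there is no theorem controlling the error, as the next figure stresses.

# Centroid approximation sketch: exact pairwise terms for nearby clusters,
# one centroid term weighted by M for far-away clusters (qualitative idea only;
# the "near" branch includes the i = j self term for simplicity).
import numpy as np

def approximate_pair_sum(points, labels, kernel, cutoff):
    centers = {c: points[labels == c].mean(axis=0) for c in np.unique(labels)}
    sizes = {c: int((labels == c).sum()) for c in centers}
    total = 0.0
    for x in points:
        for c, center in centers.items():
            if np.linalg.norm(x - center) < cutoff:        # near: exact O(M) terms
                total += sum(kernel(x, y) for y in points[labels == c])
            else:                                          # far: one term, weighted by cluster size M
                total += sizes[c] * kernel(x, center)
    return total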
[Figure annotation: O(N²) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut. This is hard as there is no Gauss theorem and no multipole expansion, and the points are really in a 1000-dimensional space, since clustering was done before the 3D projection. The O(N²) green-green and purple-purple interactions have value, but the green-purple ones are "wasted". "Clean" sample of 446K points: O(N²) reduced to O(N) times the cluster size.]
Software Nexus
Application Layer On
Big Data Software Components for
Programming and Data Processing On
HPC for runtime On
IaaS and DevOps Hardware and Systems
• HPC-ABDS
• MIDAS
• Java Grande
HPC-ABDS
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA,
Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j,
H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables,
CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT,
Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook
Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco,
Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty,
ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon
SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB,
H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers; over 350 software packages (January 29, 2016)
Functionality of 21 HPC-ABDS Layers
1) Message Protocols:
2) Distributed Coordination:
3) Security & Privacy:
4) Monitoring:
5) IaaS Management from HPC to
hypervisors:
6) DevOps:
7) Interoperability:
8) File systems:
9) Cluster Resource
Management:
10) Data Transport:
11) A) File management
B) NoSQL
C) SQL
12) In-memory databases & caches /
Object-relational mapping / Extraction
Tools
13) Inter process communication
Collectives, point-to-point, publish-
subscribe, MPI:
14) A) Basic Programming model and
runtime, SPMD, MapReduce:
B) Streaming:
15) A) High level Programming:
B) Frameworks
16) Application and Analytics:
17) Workflow-Orchestration:
Lesson of the large number (350): this is a rich software
environment that HPC cannot "compete" with; we need to
use it, not regenerate it.
Note level 13, Inter process communication, was added.
HPC-ABDS SPIDAL Project Activities
• Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated
with Heron/Flink and Cloudmesh on HPC cluster
• Level 16: Applications: Datamining for molecular dynamics, Image processing
for remote sensing and pathology, graphs, streaming, bioinformatics, social
media, financial informatics, text mining
• Level 16: Algorithms: Generic and custom for applications SPIDAL
• Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop,
Spark, Flink. Improve Inter- and Intra-node performance; science data structures
• Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark,
Flink, Giraph) using HPC runtime technologies, Harp
• Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory
• Level 11: Data management: Hbase and MongoDB integrated via use of Beam
and other Apache tools; enhance Hbase
• Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark,
Hadoop; integrate Storm and Heron with Slurm
• Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability
Green is MIDAS
Black is SPIDAL
Java Grande
Revisited on 3 data analytics codes
Clustering
Multidimensional Scaling
Latent Dirichlet Allocation
all sophisticated algorithms
Java MPI performs better than FJ Threads
128 24 core Haswell nodes on SPIDAL 200K DA-MDS Code
[Chart legend: Best FJ Threads intra-node with MPI inter-node; Best MPI, inter- and intra-node; MPI inter/intra-node with Java not optimized. Vertical axis: speedup compared to 1 process per node on 48 nodes]
Investigating Process and Thread Models
• FJ Fork Join Threads lower
performance than Long
Running Threads LRT
• Results
– Large effects for Java
– Best affinity is process
and thread binding to
cores - CE
– At best LRT mimics
performance of “all
processes”
• 6 Thread/Process Affinity
Models
Java and C K-Means, LRT-FJ and LRT-BSP, with different
affinity patterns over varying threads and processes.
[Charts: Java and C; 10⁶ points and 1,000 centers on 16 nodes; 10⁶ points with 50k and 500k centers on 16 nodes]
Java versus C Performance
• C and Java Comparable with Java doing better on larger problem
sizes
• All data from one million point dataset with varying number of centers
on 16 nodes 24 core Haswell
Performance Dependence on Number
of Cores inside node (16 nodes total)
• Long-Running
Theads LRT Java
– All Processes
– All Threads
internal to node
– Hybrid – Use one
process per chip
• Fork Join Java
– All Threads
– Hybrid – Use one
process per chip
• Fork Join C
– All Threads
• All MPI internode
HPC-ABDS
DataFlow and In-place Runtime
HPC-ABDS Parallel Computing
• Both simulations and data analytics use similar parallel computing ideas
• Both do decomposition of both model and data
• Both tend to use SPMD and often use BSP (Bulk Synchronous Processing)
• One has computing (called maps in big data terminology) and
communication/reduction (more generally collective) phases
• Big data thinks of problems as multiple linked queries even when queries
are small and uses dataflow model
• Simulation uses dataflow for multiple linked applications but small steps
such as iterations are done in place
• Reduction in HPC (MPIReduce) done as optimized tree or pipelined
communication between same processes that did computing
• Reduction in Hadoop or Flink done as separate map and reduce processes
using dataflow
– This leads to 2 forms (In-Place and Flow) of Map-X mentioned earlier
• Interesting Fault Tolerance issues highlighted by Hadoop-MPI comparisons
– not discussed here!
Illustration of In-Place AllReduce in MPI
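• A minimal mpi4py sketch (Python, not the Java/Harp codes discussed here) of the in-place
pattern: the same processes that compute partial model contributions also combine them with
Allreduce, and only model parameters are communicated.

# In-place AllReduce: every rank computes a partial sum over its slice of the Data,
# then Allreduce combines the Model contribution among the same processes.
# Run with e.g.:  mpirun -n 4 python allreduce_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_points = np.random.default_rng(rank).normal(size=(1000, 2))  # this rank's slice of the Data
partial_sum = local_points.sum(axis=0)                              # partial Model contribution

global_sum = np.zeros_like(partial_sum)
comm.Allreduce(partial_sum, global_sum, op=MPI.SUM)                 # optimized tree/pipelined reduction
center = global_sum / (size * len(local_points))                    # every rank now holds the global mean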
Breaking Programs into Parts
Coarse Grain
Dataflow
HPC or ABDS
Fine Grain Parallel Computing
Data/model parameter decomposition
Kmeans Clustering Flink and MPI
one million 2D points fixed; various # centers
24 cores on 16 nodes
• MPI designed for fine grain case and typical of parallel computing
used in large scale simulations
– Only change in model parameters are transmitted
– In-place implementation
• Dataflow typical of distributed or Grid computing paradigms
– Data sometimes and model parameters certainly transmitted
– Caching in iterative MapReduce avoids data communication and
in fact systems like TensorFlow, Spark or Flink are called dataflow
but usually implement “model-parameter” flow
• We quantify this by an overhead analysis on next slide that works
for “in-place” runtimes. Flow implementations have additional
sources of overhead that we know are large but haven’t studied as
quantitatively
HPC-ABDS Parallel Computing I
• Overheads are given by similar formulae for big data and
simulations
Overhead f = (1 / Model parameter size in each map)^n ×
(Typical hardware communication cost / Typical computing cost)
• The index n > 0 depends on the communication structure
– n = 0.5 for matrix problems; n = 1 for O(N²) problems
• Large f: intra-job reduction, such as Kmeans clustering,
where one has center changes at the end of each iteration
• Small f: inter-job reduction, as at the end of a query, as seen in
workflow
• Increasing the grain size (= model parameter size in each map)
decreases the overhead, as n > 0
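• A tiny helper (illustrative, with made-up cost numbers) evaluating the formula above; it shows
f falling as the grain size (model parameters per map) grows, since n > 0.

# Overhead f = (1 / grain_size)^n * (communication cost / compute cost);
# the cost values below are made up for illustration.
def overhead(grain_size, n, comm_cost, compute_cost):
    return (1.0 / grain_size) ** n * (comm_cost / compute_cost)

# n = 0.5 for matrix-like problems; a larger grain size gives a smaller overhead
print(overhead(grain_size=10_000, n=0.5, comm_cost=1.0, compute_cost=0.01))     # 1.0
print(overhead(grain_size=1_000_000, n=0.5, comm_cost=1.0, compute_cost=0.01))  # 0.1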
HPC-ABDS Parallel Computing II
• For a given application, need to understand:
– Are we using Data Flow or “Model-parameter” Flow
– Requirements of compute/communication ratio
• Inefficient to use same runtime mechanism independent of
characteristics
– Use In-Place or Flow Software implementations
• Classic Dataflow is approach of Spark and Flink so need to add
parallel in-place computing as done by Harp for Hadoop
– TensorFlow also uses In-Place technology
• HPC-ABDS plan is to keep current user interfaces (say to Spark
Flink Hadoop Storm Heron) and transparently use HPC to improve
performance exploiting added level 13 in HPC-ABDS
• We have done this to Hadoop (next Slide), Spark, Storm, Heron
– Working on further HPC integration with ABDS
HPC-ABDS Parallel Computing III
Harp (Hadoop Plugin) brings HPC to ABDS
• Basic Harp: Iterative HPC communication; scientific data abstractions
• Careful support of distributed data AND distributed model
• Avoids parameter server approach but distributes model over worker nodes
and supports collective communication to bring global model to each node
• Applied first to Latent Dirichlet Allocation LDA with large model and data
[Diagrams: MapReduce Model (Map tasks feeding a Shuffle into Reduce tasks) vs. MapCollective Model (Map tasks linked by Collective Communication); Harp sits as a plugin alongside MapReduce V2 on YARN, supporting both MapReduce and MapCollective applications. Datasets: Clueweb, enwiki, Bi-gram]
Latent Dirichlet Allocation on 100 Haswell nodes: red is Harp (lgs and rtt)
Harp LDA on Big Red II
Supercomputer (Cray)
[Charts: Execution Time (hours) and Parallel Efficiency vs. number of nodes]
Harp LDA on Juliet (Intel Haswell)
• Big Red II: tested on 25, 50, 75, 100 and 125
nodes; each node uses 32 parallel threads;
Gemini interconnect
• Juliet: tested on 10, 15, 20, 25, 30 nodes;
each node uses 64 parallel threads on 36 core
Intel Haswell nodes (each with 2 chips);
Infiniband interconnect
Harp LDA Scaling Tests
Corpus: 3,775,554 Wikipedia documents,
Vocabulary: 1 million words;
Topics: 10k topics; alpha: 0.01; beta: 0.01;
iteration: 200
Streaming Applications and
Technology
Adding HPC to Storm & Heron for Streaming
[Robotics applications: a robot with a laser range finder builds a map from robot data (Simultaneous Localization and Mapping); robots need to avoid collisions when they move (N-Body Collision Avoidance). Also: time-series data visualization in real time, mapping high-dimensional data to a 3D visualizer, applied to stock market data tracking 6000 stocks]
Data Pipeline: a gateway sends data to pub-sub message brokers (RabbitMQ, Kafka) and to persisting storage, feeding streaming workflows; hosted on HPC and an OpenStack cloud; end-to-end delays without any processing are less than 10 ms.
Streaming Workflows: a stream application with some tasks running in parallel; multiple streaming workflows; Apache Heron and Storm. Storm does not support "real parallel processing" within bolts, so we add optimized inter-bolt communication.
Improvement of Storm (Heron) using HPC
communication algorithms
End of Software Discussion
Workflow in HPC-ABDS
• HPC is familiar with Taverna, Pegasus, Kepler, Galaxy etc., and
• ABDS has many workflow systems, with recent Apache
systems being Crunch, NiFi and Beam (the open source version
of Google Cloud Dataflow)
– Use ABDS for sustainability reasons?
– ABDS approaches are better integrated than HPC approaches with
ABDS data management like Hbase and are optimized for distributed
data.
• Heron, Spark and Flink provide distributed dataflow runtime
which is needed for workflow
• Beam uses Spark or Flink as runtime and supports streaming
and batch data
• Needs more study
Automatic parallelization
• Database community looks at big data job as a dataflow of (SQL) queries
and filters
• Apache projects like Pig, MRQL and Flink aim at automatic query
optimization by dynamic integration of queries and filters including
iteration and different data analytics functions
• Going back to ~1993, High Performance Fortran HPF compilers optimized
set of array and loop operations for large scale parallel execution of
optimized vector and matrix operations
• HPF worked fine for initial simple regular applications but ran into trouble
for cases where parallelism hard (irregular, dynamic)
• Will same happen in Big Data world?
• It is straightforward to parallelize k-means clustering, but sophisticated
algorithms like Elkan's method (using the triangle inequality) and fuzzy
clustering are much harder (but not used much NOW)
• Will Big Data technology run into HPF-style trouble with growing use of
sophisticated data analytics?
Infrastructure Nexus
IaaS
DevOps
Cloudmesh
Constructing HPC-ABDS Exemplars
• This is one of next steps in NIST Big Data Working Group
• Jobs are defined hierarchically as a combination of Ansible (preferred over Chef or
Puppet as Python) scripts
• Scripts are invoked on Infrastructure (Cloudmesh Tool)
• INFO 524 "Big Data Open Source Software Projects", an IU Data Science class,
required the final project to be defined in Ansible, and a decent grade required that the script
worked (on NSF Chameleon and FutureSystems)
– 80 students gave 37 projects with ~15 pretty good such as
– “Machine Learning benchmarks on Hadoop with HiBench”, Hadoop/Yarn, Spark,
Mahout, Hbase
– “Human and Face Detection from Video”, Hadoop (Yarn), Spark, OpenCV,
Mahout, MLLib
• Build up curated collection of Ansible scripts defining use cases for
benchmarking, standards, education
https://docs.google.com/document/d/1INwwU4aUAD_bj-XpNzi2rz3qY8rBMPFRVlx95k0-xc4
• Fall 2015 class INFO 523 introductory data science class was less constrained;
students just had to run a data science application but catalog interesting
– 140 students: 45 Projects (NOT required) with 91 technologies, 39 datasets
Cloudmesh Interoperability DevOps Tool
• Model: Define software configuration with tools like Ansible (Chef,
Puppet); instantiate on a virtual cluster
• Save scripts not virtual machines and let script build applications
• Cloudmesh is an easy-to-use command line program/shell and portal to
interface with heterogeneous infrastructures taking script as input
– It first defines virtual cluster and then instantiates script on it
– It has several common Ansible defined software built in
• Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud
supported clouds as well as classic HPC and Docker infrastructures
– Has an abstraction layer that makes it possible to integrate other IaaS
frameworks
• Managing VMs across different IaaS providers is easier
• Demonstrated interaction with various cloud providers:
– FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera,
AWS, Azure, virtualbox
• Status: AWS, and Azure, VirtualBox, Docker need improvements; we
focus currently on SDSC Comet and NSF resources that use OpenStack
HPC Cloud Interoperability Layer
Cloudmesh Architecture
• We define a basic virtual cluster which is a set of instances with a common security context
• We then add basic tools including languages Python Java etc.
• Then add management tools such as Yarn, Mesos, Storm, Slurm etc …..
• Then add roles for different HPC-ABDS PaaS subsystems such as Hbase, Spark
– There will be dependencies e.g. Storm role uses Zookeeper
• Any one project picks some of HPC-ABDS PaaS Ansible roles and adds >=1 SaaS that are
specific to their project and for example read project data and perform project analytics
• E.g. there will be an OpenCV role used in Image processing applications
Software
Engineering Process
Summary of Big Data - Big Simulation
Convergence?
HPC-Clouds convergence? (easier than converging higher levels in
stack)
Can HPC continue to do it alone?
Convergence Diamonds
HPC-ABDS Software on differently optimized hardware
infrastructure
• Applications, Benchmarks and Libraries
– 51 NIST Big Data Use Cases, 7 Computational Giants of the NRC Massive Data Analysis,
13 Berkeley dwarfs, 7 NAS parallel benchmarks
– Unified discussion by separately discussing data & model for each application;
– 64 facets– Convergence Diamonds -- characterize applications
– Characterization identifies hardware and software features for each application across big
data, simulation; “complete” set of benchmarks (NIST)
• Software Architecture and its implementation
– HPC-ABDS: Cloud-HPC interoperable software: performance of HPC (High Performance
Computing) and the rich functionality of the Apache Big Data Stack.
– Added HPC to Hadoop, Storm, Heron, Spark; could add to Beam and Flink
– Could work in Apache model contributing code
• Run same HPC-ABDS across all platforms but “data management” nodes have different
balance in I/O, Network and Compute from “model” nodes
– Optimize to data and model functions as specified by convergence diamonds
– Do not optimize for simulation and big data
• Convergence Language: Make C++, Java, Scala, Python (R) … perform well
• Training: Students prefer to learn Big Data rather than HPC
• Sustainability: research/HPC communities cannot afford to develop everything (hardware and
software) from scratch
General Aspects of Big Data HPC Convergence
Typical Convergence Architecture
• Running same HPC-ABDS software across all platforms but data
management machine has different balance in I/O, Network and Compute
from “model” machine
– Note data storage approach: HDFS v. Object Store v. Lustre style file
systems is still rather unclear
• The Model behaves similarly whether from Big Data or Big Simulation.
[Diagram: rows of paired C (compute/model) and D (data) nodes plus rows of C-only nodes, illustrating the Data Management machine and the Model machine for Big Data and Big Simulation]
Top 10 team coordinator interview questions and answersTop 10 team coordinator interview questions and answers
Top 10 team coordinator interview questions and answers
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 
Financial aspects of marketing management
Financial aspects of marketing managementFinancial aspects of marketing management
Financial aspects of marketing management
 
Moving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life StoryMoving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life Story
 
Progeny LIMS
Progeny LIMSProgeny LIMS
Progeny LIMS
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data Applications
 
Getting Past No
Getting Past NoGetting Past No
Getting Past No
 
IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)
 
Matrix Effect
Matrix EffectMatrix Effect
Matrix Effect
 
The purpose and Benefits of setting high standards for your work
The purpose and Benefits of setting high standards for your work The purpose and Benefits of setting high standards for your work
The purpose and Benefits of setting high standards for your work
 
GRE Computer Raw Conversion Table
GRE Computer Raw Conversion TableGRE Computer Raw Conversion Table
GRE Computer Raw Conversion Table
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop Implementation
 

Similaire à Hybrid HPC-ABDS Software Stack for Big Data and HPC Applications

Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningGianvito Siciliano
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
 
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...inside-BigData.com
 
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Geoffrey Fox
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptxRATISHKUMAR32
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 

Similaire à Hybrid HPC-ABDS Software Stack for Big Data and HPC Applications (20)

Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
 
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
 
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptx
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 

Plus de Geoffrey Fox

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...Geoffrey Fox
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Geoffrey Fox
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online EducationGeoffrey Fox
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Geoffrey Fox
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityGeoffrey Fox
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyGeoffrey Fox
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationGeoffrey Fox
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
 
Classification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different FacetsClassification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different FacetsGeoffrey Fox
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a ServiceGeoffrey Fox
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data StackGeoffrey Fox
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Geoffrey Fox
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2Geoffrey Fox
 

Plus de Geoffrey Fox (20)

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online Education
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC Technology
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and Education
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
 
Classification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different FacetsClassification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different Facets
 
Remarks on MOOC's
Remarks on MOOC'sRemarks on MOOC's
Remarks on MOOC's
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a Service
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Hybrid HPC-ABDS Software Stack for Big Data and HPC Applications

  • 7. Main Components of SPIDAL Project
• Design and build a scalable high performance data analytics library
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): scalable analytics for:
– Domain-specific data analytics libraries – mainly from the project
– Core machine learning libraries – mainly from the community
– Performance of Java and MIDAS inter- and intra-node
• NIST Big Data Application Analysis – features of data-intensive applications deriving 64 Convergence Diamonds. Application Nexus.
• HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Software Nexus.
• MIDAS: Integrating Middleware – from the project
• Applications: biomolecular simulations, network and computational social science, epidemiology, computer vision, geographical information systems, remote sensing for polar science, pathology informatics, streaming for robotics, streaming stock analytics
• Implementations: HPC as well as clouds (OpenStack, Docker); convergence with a common DevOps tool. Hardware Nexus.
  • 8. Application Nexus: Use-case Data and Model; NIST Collection; Big Data Ogres; Convergence Diamonds
  • 9. Data and Model in Big Data and Simulations I
• Need to discuss Data and Model separately: problems have both intermingled, but separating them gives insight and allows a better understanding of Big Data – Big Simulation “convergence” (or differences!)
• The Model is a user construction; it has a “concept” and parameters, and gives results determined by the computation. We use the term “model” in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model:
– For clustering, the model parameters are the cluster centers while the data is the set of points to be clustered (see the sketch below)
– For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried plus the SQL query
– For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights; the data is the set of images used for training or classification
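To make the Data/Model split concrete, here is a minimal, self-contained Java sketch (illustrative only, not SPIDAL code; the class and method names are invented for this example). The data is a fixed array of points; the model is the array of cluster centers that each K-means iteration updates.

    import java.util.Arrays;
    import java.util.Random;

    // Illustrative only: Data = fixed set of points, Model = cluster centers updated each iteration.
    public class KMeansDataModel {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            double[][] data = new double[1000][2];              // Data: points to be clustered (static per iteration)
            for (double[] p : data) { p[0] = rnd.nextDouble(); p[1] = rnd.nextDouble(); }

            double[][] model = { data[0].clone(), data[1].clone(), data[2].clone() }; // Model: 3 cluster centers

            for (int iter = 0; iter < 20; iter++) {              // model parameters vary between iterations
                double[][] sum = new double[model.length][2];
                int[] count = new int[model.length];
                for (double[] p : data) {
                    int best = nearest(p, model);
                    sum[best][0] += p[0]; sum[best][1] += p[1];
                    count[best]++;
                }
                for (int k = 0; k < model.length; k++)
                    if (count[k] > 0) { model[k][0] = sum[k][0] / count[k]; model[k][1] = sum[k][1] / count[k]; }
            }
            System.out.println("Final centers (model): " + Arrays.deepToString(model));
        }

        static int nearest(double[] p, double[][] centers) {    // assign a data point to its closest model center
            int best = 0; double bestD = Double.MAX_VALUE;
            for (int k = 0; k < centers.length; k++) {
                double dx = p[0] - centers[k][0], dy = p[1] - centers[k][1];
                double d = dx * dx + dy * dy;
                if (d < bestD) { bestD = d; best = k; }
            }
            return best;
        }
    }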
  • 10. Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
– The model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values
– Data could be small when it is just boundary conditions
– Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Big Data implies the Data is large, but the Model varies in size
– e.g. LDA with many topics or deep learning has a large model
– Clustering or dimension reduction can be quite small in model size
• Data is often static between iterations (unless streaming); model parameters vary between iterations
  • 12. 51 Detailed Use Cases: contributed July–September 2013; cover goals, data features such as the 3 V’s, software, hardware (26 features recorded for each use case; biased to science)
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
• Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf with a common set of 26 features recorded for each use case; “Version 2” being prepared
  • 13. Sample Features of 51 Use Cases I
• PP (26) “All” Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce (add MRStat below for the full count)
• MRStat (7) Simple version of MR where the key computations are simple reductions, as found in statistical averages such as histograms and means (see the sketch below)
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
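As an illustration of the MRStat pattern above, here is a small Java sketch (hypothetical, not drawn from any of the 51 use cases; all names are invented, and it uses a Java 16+ record): independent maps over data blocks emit partial (sum, count) pairs, and a simple associative reduction combines them into a global average.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative MRStat-style pattern: maps emit partial statistics, a simple reduction combines them.
    public class MrStatAverage {
        record Partial(double sum, long count) {}

        static Partial map(double[] block) {                    // one independent "map" over a data block
            double s = 0; for (double v : block) s += v;
            return new Partial(s, block.length);
        }
        static Partial reduce(Partial a, Partial b) {           // simple associative reduction
            return new Partial(a.sum() + b.sum(), a.count() + b.count());
        }
        public static void main(String[] args) {
            List<double[]> blocks = Arrays.asList(new double[]{1, 2, 3}, new double[]{4, 5}, new double[]{6});
            Partial total = blocks.stream().map(MrStatAverage::map)
                                  .reduce(new Partial(0, 0), MrStatAverage::reduce);
            System.out.println("Mean = " + total.sum() / total.count());  // prints 3.5
        }
    }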
  • 14. Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (independent for each parallel entity) – an application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS
– Large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt; can be called EGO or Exascale Global Optimization with a scalable parallel algorithm
• Workflow (51) Universal
• GIS (16) Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
  • 15. 7 Computational Giants of NRC Massive Data Analysis Report (http://www.nap.edu/catalog.php?record_id=18374) – Big Data Models?
1) G1: Basic Statistics, e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g. Linear Programming
6) G6: Integration, e.g. LDA and other GML
7) G7: Alignment Problems, e.g. BLAST
  • 16. HPC (Simulation) Benchmark Classics – Simulation Models
• Linpack or HPL: parallel LU factorization for the solution of linear equations; HPCG
• NPB version 1: mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer Sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss-Seidel
  • 17. 13 Berkeley Dwarfs – Largely Models for Data or Simulation
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines
• The first 6 of these correspond to Colella’s original dwarfs (classic simulations); Monte Carlo is dropped; N-body methods are a subset of Particle in Colella.
• Note the list is a little inconsistent in that MapReduce is a programming model while spectral method is a numerical method. Need multiple facets to classify use cases!
  • 19. Classifying Use Cases
• The Big Data Ogres are built on a collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were recorded for each application.
• This information was combined with other studies including the Berkeley Dwarfs, the NAS Parallel Benchmarks and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features, divided into four views, that can be used to categorize and distinguish between applications.
• The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View, or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64; these are called Convergence Diamonds and are split between the same 4 views.
• A mapping of facets onto the work of the SPIDAL project has been given.
  • 20. Ogre Views and 50 Facets
• [Figure: the 50 Ogre facets arranged in four views – Problem Architecture View (Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow, Iterative/Simple); Execution View (performance metrics, flops per byte / memory I/O, execution environment and core libraries, volume, velocity, variety, veracity, communication structure, data abstraction, metric/non-metric, regular/irregular, dynamic/static); Data Source and Style View (GIS, HPC simulations, Internet of Things, metadata/provenance, shared/dedicated/transient/permanent, archived/batched/streaming, HDFS/Lustre/GPFS, files/objects, enterprise data model, SQL/NoSQL/NewSQL); Processing View (visualization, graph algorithms, linear algebra kernels, alignment, streaming, optimization methodology, learning, classification, search/query/index, base statistics, global and local analytics, micro-benchmarks, recommendations)]
  • 21. 64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications
• [Figure: the 64 Convergence Diamond facets in the same four views (Problem Architecture, Execution, Data Source and Style, Processing); each facet is labeled as Data (D), Model (M) or both; the Processing View separates Big Data processing diamonds (local and global analytics, recommender engines, base data statistics, streaming data algorithms, optimization methodology, learning, classification, search/query/index, alignment, linear algebra kernels, graph algorithms, visualization, core libraries) from Simulation (exascale) processing diamonds (multiscale methods, iterative PDE solvers, particles and fields, N-body methods, spectral methods, evolution of discrete systems, nature of mesh if used)]
  • 22. Examples in Problem Architecture View PA
• The facets in the Problem Architecture view include 5 very common ones describing the synchronization structure of a parallel job:
– MapOnly or Pleasingly Parallel (PA1): the processing of a collection of independent events
– MapReduce (PA2): independent calculations (maps) followed by a final consolidation via MapReduce
– MapCollective (PA3): parallel machine learning dominated by scatter, gather, reduce and broadcast
– MapPoint-to-Point (PA4): simulations or graph processing with many local linkages between the points (nodes) of the studied system
– MapStreaming (PA5): the fifth important problem architecture, seen in recent approaches to processing real-time data
– We do not focus on pure shared-memory architectures (PA6) but look at hybrid architectures with clusters of multicore nodes, and find important performance issues dependent on the node programming model
• Most of our codes are SPMD (PA7) and BSP (PA8).
  • 23. 6 Forms of MapReduce
• Describes the architecture of the Problem (Model reflecting data), the Machine, and the Software
• Two important (software) variants of Iterative MapReduce and Map-Streaming: a) “in-place” HPC; b) flow for model and data
• [Figure: the six forms – 1) Map-Only (Pleasingly Parallel); 2) Classic MapReduce (input, map, reduce); 3) Iterative MapReduce or Map-Collective (input, map, reduce, iterations); 4) Map Point-to-Point Communication (local graph); 5) Map-Streaming (maps, brokers, events); 6) Shared-Memory Map & Communication]
  • 24. Examples in Execution View EV
• The Execution view is a mix of facets describing either data or model; PA was largely the overall Data+Model.
• EV-M14 is complexity of the model (O(N²) for N points), seen in the non-metric space models EV-M13 such as one gets with DNA sequences.
• EV-M11 describes iterative structure, distinguishing Spark, Flink and Harp from the original Hadoop.
• The facet EV-M8 describes the communication structure, a focus of our research: much data analytics relies on collective communication, which is in principle understood, but we find significant new work is needed compared to basic HPC releases, which tend to address point-to-point communication.
• The model size EV-M4 and data volume EV-D4 are important in describing algorithm performance: just as in simulation problems, the grain size (the number of model parameters held in the unit – thread or process – of parallel computing) is a critical measure of performance.
  • 25. Examples in Data View DV
• We can highlight DV-5, streaming, where there is a lot of recent progress.
• DV-9 categorizes our biomolecular simulation application, with data produced by an HPC simulation.
• DV-10 is Geospatial Information Systems, covered by our spatial algorithms.
• DV-7, provenance, is an example of an important feature that we are not covering.
• The data storage and access facets DV-3 and DV-4 are covered in our pilot data work.
• The Internet of Things, DV-8, is not a focus of our project, although our recent streaming work relates to it, and our addition of HPC to Apache Heron and Storm is an example of the value of HPC-ABDS to IoT.
  • 26. Examples in Processing View PV
• The Processing view PV characterizes algorithms and is only Model (no Data features), but it covers both Big Data and Simulation use cases.
• Graph (PV-M13) and Visualization (PV-M14) are covered in SPIDAL.
• PV-M15 directly describes SPIDAL, which is a library of core and other analytics.
• This project covers many aspects of PV-M4 to PV-M11, as these characterize the SPIDAL algorithms (such as optimization, learning, classification).
– We are of course NOT addressing PV-M16 to PV-M22, which are simulation algorithm characteristics and not applicable to data analytics.
• Our work largely addresses Global Machine Learning (PV-M3), although some of our image analytics are Local Machine Learning (PV-M2), with parallelism over images and not over the analytics.
• Many of our SPIDAL algorithms have linear algebra (PV-M12) at their core; one nice example is multidimensional scaling (MDS), which is based on matrix-matrix multiplication and conjugate gradient.
  • 27. Comparison of Data Analytics with Simulation I
• Simulations (models) produce big data as visualization of results – they are a data source
– Or they consume often smallish data to define a simulation problem
– HPC simulation with (weather) data assimilation is data + model
• Pleasingly parallel is often important in both
• Both are often SPMD and BSP
• Non-iterative MapReduce is a major big data paradigm
– Not a common simulation paradigm, except where “Reduce” summarizes pleasingly parallel execution, as in some Monte Carlos
• Big Data often has large collective communication
– Classic simulation has a lot of smallish point-to-point messages
– Motivates the MapCollective model
• Simulations are often characterized by difference or differential operators, leading to nearest-neighbor sparsity
• Some important data analytics can be sparse, as in PageRank and “bag of words” algorithms, but many involve full-matrix algorithms
  • 28. Comparison of Data Analytics with Simulation II
• There are similarities between some graph problems and particle simulations with a particular cutoff force.
– Both have the MapPoint-to-Point problem architecture
• Note many big data problems are “long-range force” (as in gravitational simulations), as all points are linked.
– Easiest to parallelize. Often full-matrix algorithms
– e.g. in DNA sequence studies, the distance δ(i, j) is defined by BLAST, Smith-Waterman, etc., between all sequences i, j.
– Opportunity for “fast multipole” ideas in big data. See the NRC report.
• Current Ogres/Diamonds do not have facets to designate the underlying hardware: GPU vs. many-core (Xeon Phi) vs. multi-core, as these define how maps are processed; they keep the map-X structure fixed; maybe this should change, as the ability to exploit vector or SIMD parallelism could be a model facet.
  • 29. Comparison of Data Analytics with Simulation III
• In image-based deep learning, neural network weights are block sparse (corresponding to links to pixel blocks) but can be formulated as full-matrix operations on GPUs and MPI in blocks.
• In HPC benchmarking, Linpack is being challenged by a new sparse conjugate gradient benchmark, HPCG, while I am diligently using non-sparse conjugate gradient solvers in clustering and multidimensional scaling.
• Simulations tend to need high precision and very accurate results – partly because of differential operators.
• Big Data problems often don’t need high accuracy, as seen in the trend to low-precision (16 or 32 bit) deep learning networks
– There are no derivatives and the data has inevitable errors
• Note parallel machine learning (GML, not LML) can benefit from HPC-style interconnects and architectures, as seen in GPU-based deep learning
– So commodity clouds are not necessarily best
  • 30. Some use of Analytics
  • 31. Clustering
• The SPIDAL library includes several clustering algorithms with sophisticated features
– Deterministic annealing
– Radius cutoff in cluster membership
– Elkan’s algorithm using the triangle inequality (the pruning idea is sketched below)
• They also cover two important cases
– Points are vectors – algorithm is O(N) for N points
– Points are not vectors – all we know is the distance δ(i, j) between each pair of points i and j; algorithm is O(N²) for N points
• We find visualization important to judge the quality of clustering
• As data is typically not 2D or 3D, we use dimension reduction to project the data so we can then view it
• We have a browser viewer, WebPlotViz, that replaces an older Windows system
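The following is a minimal Java sketch of the triangle-inequality pruning idea behind Elkan-style assignment (an illustration of the principle only, not the SPIDAL implementation; class, method and variable names are invented). If the distance between a point's current center and another center is at least twice the point-to-current-center distance, the other center cannot be closer, so its distance to the point need not be computed.

    // Illustrative sketch of Elkan-style pruning via the triangle inequality; not SPIDAL code.
    public class TrianglePruning {
        // Nearest center for x, skipping centers ruled out by d(c_current, c_k) >= 2 * d(x, c_current).
        static int assign(double[] x, double[][] centers, int current, double[][] centerCenterDist) {
            double dCurrent = dist(x, centers[current]);
            int best = current; double bestD = dCurrent;
            for (int k = 0; k < centers.length; k++) {
                if (k == current) continue;
                if (centerCenterDist[current][k] >= 2 * dCurrent) continue; // triangle inequality: c_k cannot win
                double d = dist(x, centers[k]);                             // only computed when not pruned
                if (d < bestD) { bestD = d; best = k; }
            }
            return best;
        }
        static double dist(double[] a, double[] b) {
            double s = 0; for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
            return Math.sqrt(s);
        }
        public static void main(String[] args) {
            double[][] centers = { {0, 0}, {10, 0}, {0, 10} };
            double[][] c2c = new double[3][3];                  // precomputed center-to-center distances
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++) c2c[i][j] = dist(centers[i], centers[j]);
            System.out.println(assign(new double[]{1, 1}, centers, 0, c2c));  // prints 0; centers 1 and 2 are pruned
        }
    }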
  • 32. 2D Vector Clustering with cutoff at 3 σ 328/28/2016 LCMS Mass Spectrometer Peak Clustering. Charge 2 sample with 10.9 million points and 420,000 clusters visualized in WebPlotViz. Orange star – outside all clusters; yellow circles – cluster centers
  • 33. Dimension Reduction • Principal Component Analysis (linear mapping) and Multidimensional Scaling MDS (nonlinear and applicable to non-Euclidean spaces) are methods to map abstract spaces to three dimensions for visualization • Both run well in parallel and give great results • Semimetric spaces have pairwise distances δ(i, j) defined between points i and j in the space • But data is typically in a high dimensional or non vector space so use dimension reduction. Associate each point i with a vector Xi in a Euclidean space of dimension K so that δ(i, j) ≈ d(Xi, Xj) where d(Xi, Xj) is the Euclidean distance between mapped points i and j in the K dimensional space. • K = 3 is natural for visualization but other values are interesting • Principal Component Analysis is the best known dimension reduction approach but is a) linear and b) requires the original points to be in a vector space • There are many other nonlinear vector space methods such as GTM Generative Topographic Mapping 33
  • 34. WDA-SMACOF “Best” MDS • MDS minimizes Stress σ(X) over pairwise distances δ(i, j): σ(X) = Σ_{i<j} weight(i, j) (δ(i, j) - d(Xi, Xj))² • SMACOF is a clever Expectation Maximization method that chooses a good steepest descent • Improved by Deterministic Annealing gradually reducing Temperature ~ distance scale; DA does not impact compute time much and gives DA-SMACOF – Deterministic Annealing is like Simulated Annealing but with no Monte Carlo • Classic SMACOF is O(N²) for uniform weights and O(N³) for non-trivial weights, but we get nonuniform weights from – The preferred Sammon method weight(i, j) = 1/δ(i, j) or – Missing distances put in as weight(i, j) = 0 • Use conjugate gradient – converges in 5-100 iterations – a big gain for a matrix with a million rows. This removes a factor of N in time complexity and gives WDA-SMACOF 348/28/2016
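To make the stress function above concrete, here is a minimal Java sketch (not the SPIDAL WDA-SMACOF code) that evaluates σ(X) for mapped 3D points given the original pairwise distances, using the Sammon weighting weight(i, j) = 1/δ(i, j) and treating non-positive stored distances as missing (weight 0); all names are illustrative.

```java
// Minimal sketch: evaluate MDS stress sigma(X) = sum_{i<j} weight(i,j) * (delta(i,j) - d(Xi,Xj))^2
public class StressSketch {
    // delta: N x N matrix of original (possibly non-Euclidean) distances
    // x: N x 3 array of mapped 3D coordinates
    static double stress(double[][] delta, double[][] x) {
        int n = delta.length;
        double sigma = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (delta[i][j] <= 0) continue;          // weight 0 for missing distances
                double w = 1.0 / delta[i][j];            // Sammon weighting
                double d = euclidean(x[i], x[j]);        // distance after mapping to 3D
                double diff = delta[i][j] - d;
                sigma += w * diff * diff;
            }
        }
        return sigma;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) { double t = a[k] - b[k]; s += t * t; }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] delta = {{0, 1, 2}, {1, 0, 1}, {2, 1, 0}};
        double[][] x = {{0, 0, 0}, {1, 0, 0}, {2, 0, 0}};   // a perfect embedding: stress = 0
        System.out.println("stress = " + stress(delta, x));
    }
}
```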
  • 35. 358/28/2016 446K sequences ~100 clusters Note distorted shapes probably due to imperfect distance measures e.g. position correlated with sequence length
  • 36. Fungi -- 4 Classic Clustering Methods plus Species Coloring 368/28/2016
  • 37. Heatmap of original distance vs 3D Euclidean Distances for Sequences and Stocks • One can visualize the quality of dimension reduction by comparing, as a scatterplot or heatmap, the distances δ(i, j) before and after mapping to 3D. • Perfection is a diagonal straight line and the results seem good in general 378/28/2016 Proteomics Example Stock Market Example
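Such heatmaps can be assembled by binning every pair’s original distance δ(i, j) against its mapped 3D distance d(Xi, Xj); the Java sketch below is an illustrative version of that binning (bin count and normalization chosen arbitrarily), not the actual plotting code used for these figures.

```java
// Illustrative sketch: 2D histogram of original distance delta(i,j) versus mapped
// 3D distance d(Xi,Xj); a concentration along the diagonal indicates a good embedding.
public class DistanceHeatmap {
    static long[][] heatmap(double[][] delta, double[][] x, int bins) {
        int n = delta.length;
        double maxDelta = 0, maxD = 0;
        double[][] d3 = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double s = 0;
                for (int k = 0; k < 3; k++) { double t = x[i][k] - x[j][k]; s += t * t; }
                d3[i][j] = Math.sqrt(s);
                maxDelta = Math.max(maxDelta, delta[i][j]);
                maxD = Math.max(maxD, d3[i][j]);
            }
        long[][] counts = new long[bins][bins];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                int bx = Math.min(bins - 1, (int) (bins * delta[i][j] / maxDelta));  // original distance bin
                int by = Math.min(bins - 1, (int) (bins * d3[i][j] / maxD));         // 3D distance bin
                counts[bx][by]++;
            }
        return counts;
    }

    public static void main(String[] args) {
        double[][] delta = {{0, 1, 2}, {1, 0, 1}, {2, 1, 0}};
        double[][] x = {{0, 0, 0}, {1, 0, 0}, {2, 0, 0}};
        for (long[] row : heatmap(delta, x, 4)) System.out.println(java.util.Arrays.toString(row));
    }
}
```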
  • 38. 388/28/2016 Labelled by Species. MDS: the same species appears as two groupings
  • 39. Spherical Phylograms: RAxML result visualized in FigTree, using MSA or SWG distances (panels labelled MSA and SWG) 39 8/28/2016
  • 40. Quality of 3D Phylogenetic Tree • 3 different MDS implementations and 3 different distance measures • EM-SMACOF is basic SMACOF for MDS • LMA was the previous best method using a Levenberg-Marquardt nonlinear χ² solver • WDA-SMACOF finds the best result 408/28/2016 Charts: Sum of branch lengths of the Spherical Phylogram generated in 3D space on two datasets (599nts and 999nts), comparing WDA-SMACOF, LMA and EM-SMACOF for MSA, SWG and NW distances
  • 41. HTML5 web viewer WebPlotViz • Supports visualization of 3D point sets (typically derived by mapping from abstract spaces) for streaming and non-streaming cases – Simple data management layer – 3D web visualizer with various capabilities such as defining color schemes, point sizes, glyphs, labels • Core Technologies – MongoDB management – Play server-side framework – Three.js – WebGL – JSON data objects – Bootstrap Javascript web pages • Open Source http://spidal-gw.dsc.soic.indiana.edu/ • ~10,000 lines of extra code 418/28/2016 Architecture: browser front end (plot visualization & time series animation in Three.js) sends web requests to Play Framework controllers on the server, which upload plots in JSON format to, and request them from, a MongoDB data layer
  • 42. Stock Daily Data Streaming Example • Example is a collection of around 7000 distinct stocks with daily values available at ~2750 distinct times – Clustering as provided by Wall Street – Dow Jones set of 30 stocks, S&P 500, various ETF’s etc. • The Center for Research in Security Prices (CRSP) database accessed through the Wharton Research Data Services (wrds) web interface • Available free to Indiana University students for research • Daily stock prices from 2004 Jan 01 to 2015 Dec 31 in the form of a CSV file • We use the information – ID, Date, Symbol, Factor to Adjust Volume, Factor to Adjust Price, Price, Outstanding Stocks 8/28/2016 42
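The relative-change curves on the next two slides come from adjusted daily prices; a minimal Java sketch of that computation follows. The adjustment convention (dividing price by the price factor) and the baseline choice are assumptions made for illustration, not the exact CRSP field semantics.

```java
// Illustrative sketch: compute the percent change of an adjusted price series
// against its value at a baseline date (e.g. the first trading day of 2005).
// Field meanings and the adjustment convention are assumptions, not the CRSP spec.
public class StockRelativeChange {
    static double[] relativeChange(double[] price, double[] priceFactor, int baselineIndex) {
        double[] adjusted = new double[price.length];
        for (int t = 0; t < price.length; t++) {
            adjusted[t] = price[t] / priceFactor[t];        // assumed adjustment for splits/dividends
        }
        double base = adjusted[baselineIndex];
        double[] rel = new double[price.length];
        for (int t = 0; t < price.length; t++) {
            rel[t] = 100.0 * (adjusted[t] - base) / base;   // percent change from the baseline value
        }
        return rel;
    }

    public static void main(String[] args) {
        double[] price  = {100, 102, 99, 110};
        double[] factor = {1, 1, 1, 1};
        for (double r : relativeChange(price, factor, 0)) System.out.printf("%.1f%%%n", r);
    }
}
```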
  • 43. Relative Changes in Stock Values using one-day values, measured from January 2004 and starting after one year (January 2005). Filled circles are final values 438/28/2016 Labelled trajectories: Apple, Mid Cap, Energy, S&P, Dow Jones, Finance; reference lines at Origin (0% change), +10%, +20%
  • 44. Relative Changes in Stock Values using one-day values – expansion of the previous data 448/28/2016 44 Labelled trajectories: Mid Cap, Energy, S&P, Dow Jones, Finance; reference lines at Origin (0% change), +10%
  • 45. Algorithm Challenge • The NRC Massive Data Analysis report stresses the importance of finding O(N) or O(N log N) algorithms for O(N²) problems – N is the number of points • This is well understood for O(N²) simulation problems where there is a long range force, as in gravitational (cosmology) simulations for N stars or galaxies • Simulations are governed by equations that allow a systematic “multipole expansion” with O(N) as the first term plus corrections – Has been used successfully in parallel for 25 years • O(N²) big data problems don’t have a systematic practical approach even though there is a qualitative argument shown on the next slide. • The work w(i, j) is labelled by two indices i and j each running from 1 to N. • If points i and j are near each other, we need to perform accurate calculations • If they are far apart, we can use approximations and, for example, replace points in a far away cluster of M particles by their cluster center weighted by M 45
  • 46. 46 O(N²) interactions between the green and purple clusters should be representable by centroids as in Barnes-Hut. Hard as there is no Gauss theorem and no multipole expansion, and the points are really in a 1000-dimensional space as they are clustered before the 3D projection. O(N²) green-green and purple-purple interactions have value but green-purple ones are “wasted”. For the “clean” sample of 446K sequences, O(N²) is reduced to O(N) times the cluster size
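A minimal sketch of the centroid idea from the last two slides: pairwise terms against a far-away cluster of M points are replaced by a single term against the cluster center weighted by M. This only illustrates the counting argument; it is not a working fast-multipole method, since as noted there is no multipole expansion to control the error in these high-dimensional spaces. The kernel f below is arbitrary and purely illustrative.

```java
// Illustration of replacing O(N^2) pairwise work by cluster-center terms: the exact
// sum of pairwise "interactions" f(x, y) between two point sets is approximated by
// interactions with the other set's centroid, weighted by its size, when the sets are far apart.
import java.util.function.BiFunction;

public class CentroidApproximation {
    static double[] centroid(double[][] pts) {
        double[] c = new double[pts[0].length];
        for (double[] p : pts)
            for (int k = 0; k < c.length; k++) c[k] += p[k] / pts.length;
        return c;
    }

    // Exact O(N*M) sum over all cross pairs
    static double exact(double[][] a, double[][] b, BiFunction<double[], double[], Double> f) {
        double s = 0;
        for (double[] p : a) for (double[] q : b) s += f.apply(p, q);
        return s;
    }

    // Approximate: each point of a interacts once with b's centroid, weighted by |b|
    static double approx(double[][] a, double[][] b, BiFunction<double[], double[], Double> f) {
        double[] cb = centroid(b);
        double s = 0;
        for (double[] p : a) s += b.length * f.apply(p, cb);
        return s;
    }

    public static void main(String[] args) {
        BiFunction<double[], double[], Double> f = (p, q) -> {
            double d2 = 0; for (int k = 0; k < p.length; k++) { double t = p[k] - q[k]; d2 += t * t; }
            return 1.0 / Math.sqrt(d2);                        // 1/distance, like a long-range force
        };
        double[][] green  = {{0, 0}, {1, 0}, {0, 1}};          // one cluster
        double[][] purple = {{100, 100}, {101, 100}, {100, 101}}; // a far-away cluster
        System.out.println("exact  = " + exact(green, purple, f));
        System.out.println("approx = " + approx(green, purple, f));
    }
}
```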
  • 47. Software Nexus • Application Layer • on Big Data Software Components for Programming and Data Processing • on HPC for runtime • on IaaS and DevOps Hardware and Systems • HPC-ABDS • MIDAS • Java Grande 8/28/2016 47
  • 48. 488/28/2016 HPC-ABDS Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies Cross- Cutting Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination : Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, 
NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53 21 layers Over 350 Software Packages January 29 2016
  • 49. Functionality of 21 HPC-ABDS Layers 1) Message Protocols: 2) Distributed Coordination: 3) Security & Privacy: 4) Monitoring: 5) IaaS Management from HPC to hypervisors: 6) DevOps: 7) Interoperability: 8) File systems: 9) Cluster Resource Management: 10) Data Transport: 11) A) File management B) NoSQL C) SQL 498/28/2016 12) In-memory databases & caches / Object-relational mapping / Extraction Tools 13) Inter process communication Collectives, point-to-point, publish-subscribe, MPI: 14) A) Basic Programming model and runtime, SPMD, MapReduce: B) Streaming: 15) A) High level Programming: B) Frameworks 16) Application and Analytics: 17) Workflow-Orchestration: Lesson of the large number (350): this is a rich software environment that HPC cannot “compete” with – we need to use it, not regenerate it. Note level 13 Inter process communication added
  • 50. HPC-ABDS SPIDAL Project Activities • Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Heron/Flink and Cloudmesh on HPC cluster • Level 16: Applications: Datamining for molecular dynamics, Image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining • Level 16: Algorithms: Generic and custom for applications SPIDAL • Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop, Spark, Flink. Improve Inter- and Intra-node performance; science data structures • Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies, Harp • Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory • Level 11: Data management: Hbase and MongoDB integrated via use of Beam and other Apache tools; enhance Hbase • Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm • Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability 50 Green is MIDAS Black is SPIDAL 8/28/2016
  • 51. Java Grande Revisited on 3 data analytics codes: Clustering, Multidimensional Scaling, Latent Dirichlet Allocation – all sophisticated algorithms 518/28/2016
  • 52. Java MPI performs better than FJ Threads: 128 24-core Haswell nodes on the SPIDAL 200K DA-MDS Code 528/28/2016 Curves: best FJ Threads intra-node with MPI inter-node; best MPI inter- and intra-node; MPI inter/intra-node with Java not optimized. Speedup compared to 1 process per node on 48 nodes
  • 53. Investigating Process and Thread Models 538/28/2016 • FJ (Fork Join) Threads give lower performance than Long Running Threads LRT • Results – Large effects for Java – Best affinity is process and thread binding to cores - CE – At best LRT mimics performance of “all processes” • 6 Thread/Process Affinity Models
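The sketch below contrasts the two Java threading styles in the simplest possible form: a fork-join style parallel region spawned every iteration versus long-running threads that persist across iterations and synchronize at a barrier (BSP style). It deliberately omits the affinity (core binding) settings that drive much of the measured difference; class and method names are illustrative.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.stream.IntStream;

// Illustrative contrast of Fork-Join (new task set per parallel region) versus
// Long Running Threads (LRT, BSP style: threads persist across iterations and
// meet at a barrier). Core binding / affinity is not shown here.
public class ThreadModels {
    static final int THREADS = 4, ITERATIONS = 100;
    static final double[] work = new double[1_000_000];

    // Fork-Join style: a new parallel region (common ForkJoinPool tasks) per iteration
    static void forkJoinStyle() {
        for (int iter = 0; iter < ITERATIONS; iter++) {
            IntStream.range(0, THREADS).parallel().forEach(ThreadModels::compute);
        }
    }

    // LRT/BSP style: THREADS long-running threads, one barrier per iteration
    static void lrtStyle() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(THREADS);
        Thread[] threads = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                try {
                    for (int iter = 0; iter < ITERATIONS; iter++) {
                        compute(id);
                        barrier.await();               // bulk-synchronous step
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
    }

    static void compute(int threadId) {                // each thread owns a slice of the data
        int chunk = work.length / THREADS, start = threadId * chunk;
        for (int i = start; i < start + chunk; i++) work[i] += Math.sqrt(i);
    }

    public static void main(String[] args) throws InterruptedException {
        forkJoinStyle();
        lrtStyle();
    }
}
```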
  • 54. Java and C K-Means LRT-FJ and LRT-BSP with different affinity patterns over varying threads and processes. 8/28/2016 Java and C panels: 10⁶ points and 1000 centers on 16 nodes; 10⁶ points with 50k and 500k centers performance on 16 nodes
  • 55. Java versus C Performance • C and Java Comparable with Java doing better on larger problem sizes • All data from one million point dataset with varying number of centers on 16 nodes 24 core Haswell 558/28/2016
  • 56. Performance Dependence on Number of Cores inside node (16 nodes total) • Long-Running Threads LRT Java – All Processes – All Threads internal to node – Hybrid – Use one process per chip • Fork Join Java – All Threads – Hybrid – Use one process per chip • Fork Join C – All Threads • All MPI internode 568/28/2016
  • 57. HPC-ABDS DataFlow and In-place Runtime 578/28/2016
  • 58. HPC-ABDS Parallel Computing • Both simulations and data analytics use similar parallel computing ideas • Both do decomposition of both model and data • Both tend to use SPMD and often use BSP Bulk Synchronous Processing • One has computing (called maps in big data terminology) and communication/reduction (more generally collective) phases • Big data thinks of problems as multiple linked queries even when queries are small and uses a dataflow model • Simulation uses dataflow for multiple linked applications but small steps such as iterations are done in place • Reduction in HPC (MPI Reduce) is done as an optimized tree or pipelined communication between the same processes that did the computing • Reduction in Hadoop or Flink is done as separate map and reduce processes using dataflow – This leads to 2 forms (In-Place and Flow) of Map-X mentioned earlier • Interesting Fault Tolerance issues highlighted by Hadoop-MPI comparisons – not discussed here! 588/28/2016
  • 59. Illustration of In-Place AllReduce in MPI 598/28/2016
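A minimal sketch of the in-place pattern the figure illustrates, written against the Open MPI Java bindings (package mpi); exact method names and signatures vary across MPI Java implementations, so treat the calls as assumptions rather than the SPIDAL code. Each rank sums its partial K-means center contributions into a buffer and a single allReduce combines them in place across the same processes that did the computing, with no separate reduce stage or dataflow step.

```java
import mpi.MPI;

// Sketch of in-place AllReduce (assuming the Open MPI Java bindings; method names
// may differ in other MPI Java implementations). Each rank accumulates partial
// K-means center sums locally, then one allReduce combines them in place across
// the same processes that did the computing.
public class InPlaceAllReduce {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        int size = MPI.COMM_WORLD.getSize();

        int k = 10, dim = 3;
        double[] centerSums = new double[k * dim + k];   // per-center coordinate sums plus counts

        // ... each rank assigns its slice of the points to centers and adds the point
        // coordinates and counts into centerSums (omitted in this sketch) ...
        centerSums[0] = rank;                            // placeholder "partial contribution"

        // In-place sum over all ranks; afterwards every rank holds the global sums
        MPI.COMM_WORLD.allReduce(centerSums, centerSums.length, MPI.DOUBLE, MPI.SUM);

        if (rank == 0) {
            System.out.println("ranks = " + size + ", global sum of first entry = " + centerSums[0]);
        }
        MPI.Finalize();
    }
}
```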
  • 60. Breaking Programs into Parts 608/28/2016 Coarse Grain Dataflow (HPC or ABDS) versus Fine Grain Parallel Computing (data/model parameter decomposition)
  • 61. K-means Clustering with Flink and MPI: one million 2D points fixed; various numbers of centers; 24 cores on 16 nodes 618/28/2016
  • 62. • MPI is designed for the fine grain case and is typical of parallel computing used in large scale simulations – Only changes in model parameters are transmitted – In-place implementation • Dataflow is typical of distributed or Grid computing paradigms – Data sometimes and model parameters certainly transmitted – Caching in iterative MapReduce avoids data communication and in fact systems like TensorFlow, Spark or Flink are called dataflow but usually implement “model-parameter” flow • We quantify this by an overhead analysis on the next slide that works for “in-place” runtimes. Flow implementations have additional sources of overhead that we know are large but haven’t studied as quantitatively 625/17/2016 HPC-ABDS Parallel Computing I
  • 63. • Overheads are given by similar formulae for big data and simulations: Overhead f = (1 / Model parameter size in each map)^n × (Typical hardware communication cost / Typical computing cost) • Index n > 0 depends on communication structure – n = 0.5 for matrix problems; n = 1 for O(N²) problems • Large f: Intra-job reduction such as K-means clustering where one has center changes at the end of each iteration • Small f: Inter-job reduction as at the end of a query as seen in workflow • Increasing the grain size = Model parameter size in each map decreases the overhead as n > 0 635/17/2016 HPC-ABDS Parallel Computing II
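As a purely illustrative numerical example of this formula (made-up values, taking n = 1): if each map holds 3,000 model parameters (say 1,000 three-dimensional cluster centers) and the hardware communication/computation cost ratio is of order 10, then f ≈ (1/3000) × 10 ≈ 0.003 and the communication overhead is negligible; with only 10 parameters per map the same ratio gives f ≈ (1/10) × 10 ≈ 1, and communication swamps the computation – which is why increasing the grain size reduces the overhead.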
  • 64. • For a given application, need to understand: – Are we using Data Flow or “Model-parameter” Flow – Requirements of compute/communication ratio • Inefficient to use same runtime mechanism independent of characteristics – Use In-Place or Flow Software implementations • Classic Dataflow is approach of Spark and Flink so need to add parallel in-place computing as done by Harp for Hadoop – TensorFlow also uses In-Place technology • HPC-ABDS plan is to keep current user interfaces (say to Spark Flink Hadoop Storm Heron) and transparently use HPC to improve performance exploiting added level 13 in HPC-ABDS • We have done this to Hadoop (next Slide), Spark, Storm, Heron – Working on further HPC integration with ABDS 645/17/2016 HPC-ABDS Parallel Computing III
  • 65. Harp (Hadoop Plugin) brings HPC to ABDS • Basic Harp: Iterative HPC communication; scientific data abstractions • Careful support of distributed data AND distributed model • Avoids parameter server approach but distributes model over worker nodes and supports collective communication to bring global model to each node • Applied first to Latent Dirichlet Allocation LDA with large model and data 658/28/2016 Diagram: MapReduce model (maps feeding a shuffle into reduce tasks) versus MapCollective model (collective communication directly among maps); Harp sits beside MapReduce V2 on YARN, supporting both MapReduce and MapCollective applications
  • 66. 66 8/28/2016 Latent Dirichlet Allocation on 100 Haswell nodes: red is Harp (lgs and rtt). Datasets: Clueweb, enwiki, Bi-gram Clueweb
  • 67. 67 Harp LDA Scaling Tests: Harp LDA on Big Red II Supercomputer (Cray) and on Juliet (Intel Haswell); plots show execution time (hours) and parallel efficiency versus number of nodes • Big Red II: tested on 25, 50, 75, 100 and 125 nodes; each node uses 32 parallel threads; Gemini interconnect • Juliet: tested on 10, 15, 20, 25, 30 nodes; each node uses 64 parallel threads on 36 core Intel Haswell nodes (each with 2 chips); Infiniband interconnect • Corpus: 3,775,554 Wikipedia documents, Vocabulary: 1 million words; Topics: 10k topics; alpha: 0.01; beta: 0.01; iteration: 200 8/28/2016
  • 69. Adding HPC to Storm & Heron for Streaming • Robotics applications: robots need to avoid collisions when they move – N-Body Collision Avoidance and Simultaneous Localization and Mapping (robot with a laser range finder; map built from robot data) • Time series data visualization in real time: map high dimensional data to a 3D visualizer; apply to stock market data tracking 6000 stocks 8/28/2016 69
  • 70. Data Pipeline hosted on HPC and OpenStack cloud • A gateway sends data to pub-sub message brokers (RabbitMQ, Kafka) and to persistent storage • Streaming workflows (Apache Heron and Storm) are stream applications with some tasks running in parallel; multiple streaming workflows are supported • End-to-end delay without any processing is less than 10 ms • Storm does not support “real parallel processing” within bolts – add optimized inter-bolt communication (a minimal topology is sketched below) 8/28/2016 70
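For concreteness, below is a minimal Storm topology of the kind used in such pipelines, written against the Storm 2.x Java API (class names, stream names and field names are made up); the HPC-optimized inter-bolt communication discussed above is an addition to, not part of, stock Storm.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Minimal Storm topology sketch (illustrative names): one spout emitting
// sensor-like readings and one bolt processing them with 4 parallel tasks.
public class StreamingSketch {

    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long count = 0;

        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(count++, Math.random()));   // (id, reading)
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "reading"));
        }
    }

    public static class ProcessBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            double scaled = input.getDoubleByField("reading") * 100.0;  // stand-in for real work
            collector.emit(new Values(input.getLongByField("id"), scaled));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "scaled"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", new SensorSpout(), 1);
        builder.setBolt("process", new ProcessBolt(), 4)      // 4 parallel bolt tasks
               .shuffleGrouping("source");

        try (LocalCluster cluster = new LocalCluster()) {      // in-process cluster for testing
            cluster.submitTopology("stream-sketch", new Config(), builder.createTopology());
            Thread.sleep(5000);
        }
    }
}
```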
  • 71. 718/28/2016 Improvement of Storm (Heron) using HPC communication algorithms
  • 72. End of Software Discussion 728/28/2016
  • 73. Workflow in HPC-ABDS • HPC is familiar with Taverna, Pegasus, Kepler, Galaxy etc., and • ABDS has many workflow systems with recent Apache systems being Crunch, NiFi and Beam (the open source version of Google Cloud Dataflow) – Use ABDS for sustainability reasons? – ABDS approaches are better integrated than HPC approaches with ABDS data management like Hbase and are optimized for distributed data. • Heron, Spark and Flink provide the distributed dataflow runtime which is needed for workflow • Beam uses Spark or Flink as its runtime and supports streaming and batch data (a minimal pipeline is sketched below) • Needs more study 738/28/2016
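To show the Beam programming model, here is a minimal pipeline in Java (essentially the standard word-count pattern): the same code can be executed by the direct, Flink or Spark runners, selected through pipeline options at launch time. File paths are placeholders.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

// Minimal Beam pipeline sketch (word count): written once, then executed by
// whichever runner (direct, Flink, Spark, ...) the pipeline options select.
public class BeamSketch {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("Read", TextIO.read().from("input.txt"))                      // placeholder input path
         .apply("Split", FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply("Count", Count.perElement())
         .apply("Format", MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("Write", TextIO.write().to("counts"));                        // placeholder output prefix

        p.run().waitUntilFinish();
    }
}
```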
  • 74. Automatic parallelization • The database community looks at a big data job as a dataflow of (SQL) queries and filters • Apache projects like Pig, MRQL and Flink aim at automatic query optimization by dynamic integration of queries and filters including iteration and different data analytics functions • Going back to ~1993, High Performance Fortran HPF compilers optimized sets of array and loop operations for large scale parallel execution of optimized vector and matrix operations • HPF worked fine for initial simple regular applications but ran into trouble for cases where parallelism is hard (irregular, dynamic) • Will the same happen in the Big Data world? • It is straightforward to parallelize k-means clustering, but sophisticated algorithms like Elkan’s method (using the triangle inequality – sketched below) and fuzzy clustering are much harder (but not used much NOW) • Will Big Data technology run into HPF-style trouble with growing use of sophisticated data analytics? 748/28/2016
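To see why Elkan-style K-means is harder to treat automatically than plain K-means, here is a minimal Java sketch of just the pruning test it adds: the distance from a point to a candidate center is never computed when the triangle inequality already rules that center out. The per-point bound bookkeeping carried across iterations – the part that complicates parallelization – is omitted.

```java
// Sketch of the triangle-inequality pruning at the heart of Elkan's K-means:
// if d(assignedCenter, otherCenter) >= 2 * d(point, assignedCenter), then
// otherCenter cannot be closer to the point, so its distance is never computed.
public class ElkanPruning {
    static double dist(double[] a, double[] b) {
        double s = 0; for (int k = 0; k < a.length; k++) { double t = a[k] - b[k]; s += t * t; }
        return Math.sqrt(s);
    }

    // Returns the index of the nearest center, skipping pruned distance computations
    static int assign(double[] point, double[][] centers, double[][] centerDist) {
        int best = 0;
        double dBest = dist(point, centers[0]);
        for (int c = 1; c < centers.length; c++) {
            if (centerDist[best][c] >= 2 * dBest) continue;   // pruned by the triangle inequality
            double d = dist(point, centers[c]);
            if (d < dBest) { dBest = d; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}, {0.5, 0.5}};
        double[][] centerDist = new double[centers.length][centers.length];
        for (int i = 0; i < centers.length; i++)
            for (int j = 0; j < centers.length; j++)
                centerDist[i][j] = dist(centers[i], centers[j]);
        System.out.println("nearest center = " + assign(new double[]{0.2, 0.1}, centers, centerDist));
    }
}
```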
  • 76. Constructing HPC-ABDS Exemplars • This is one of the next steps in the NIST Big Data Working Group • Jobs are defined hierarchically as a combination of Ansible scripts (preferred over Chef or Puppet as it is Python based) • Scripts are invoked on Infrastructure (Cloudmesh Tool) • The INFO 524 “Big Data Open Source Software Projects” IU Data Science class required the final project to be defined in Ansible, and a decent grade required that the script worked (on NSF Chameleon and FutureSystems) – 80 students gave 37 projects with ~15 pretty good such as – “Machine Learning benchmarks on Hadoop with HiBench”, Hadoop/Yarn, Spark, Mahout, Hbase – “Human and Face Detection from Video”, Hadoop (Yarn), Spark, OpenCV, Mahout, MLLib • Build up a curated collection of Ansible scripts defining use cases for benchmarking, standards, education https://docs.google.com/document/d/1INwwU4aUAD_bj-XpNzi2rz3qY8rBMPFRVlx95k0-xc4 • The Fall 2015 introductory data science class INFO 523 was less constrained; students just had to run a data science application but the resulting catalog is interesting – 140 students: 45 Projects (NOT required) with 91 technologies, 39 datasets 768/28/2016
  • 77. Cloudmesh Interoperability DevOps Tool • Model: Define software configuration with tools like Ansible (Chef, Puppet); instantiate on a virtual cluster • Save scripts, not virtual machines, and let the script build applications • Cloudmesh is an easy-to-use command line program/shell and portal to interface with heterogeneous infrastructures, taking a script as input – It first defines the virtual cluster and then instantiates the script on it – It has several common Ansible-defined software stacks built in • Supports OpenStack, AWS, Azure, SDSC Comet, VirtualBox, libcloud-supported clouds as well as classic HPC and Docker infrastructures – Has an abstraction layer that makes it possible to integrate other IaaS frameworks • Managing VMs across different IaaS providers is easier • Demonstrated interaction with various cloud providers: – FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure, VirtualBox • Status: AWS, Azure, VirtualBox and Docker need improvements; we focus currently on SDSC Comet and NSF resources that use OpenStack 77 HPC Cloud Interoperability Layer 8/28/2016
  • 78. Cloudmesh Architecture • We define a basic virtual cluster which is a set of instances with a common security context • We then add basic tools including languages Python, Java etc. • Then add management tools such as Yarn, Mesos, Storm, Slurm etc. • Then add roles for different HPC-ABDS PaaS subsystems such as Hbase, Spark – There will be dependencies, e.g. the Storm role uses Zookeeper • Any one project picks some of the HPC-ABDS PaaS Ansible roles and adds one or more SaaS roles that are specific to their project and, for example, read project data and perform project analytics • E.g. there will be an OpenCV role used in image processing applications 788/28/2016 Software Engineering Process
  • 79. Summary of Big Data - Big Simulation Convergence? • HPC-Clouds convergence? (easier than converging higher levels in the stack) • Can HPC continue to do it alone? • Convergence Diamonds • HPC-ABDS Software on differently optimized hardware infrastructure 8/28/2016 79
  • 80. • Applications, Benchmarks and Libraries – 51 NIST Big Data Use Cases, 7 Computational Giants of the NRC Massive Data Analysis, 13 Berkeley dwarfs, 7 NAS parallel benchmarks – Unified discussion by separately discussing data & model for each application; – 64 facets– Convergence Diamonds -- characterize applications – Characterization identifies hardware and software features for each application across big data, simulation; “complete” set of benchmarks (NIST) • Software Architecture and its implementation – HPC-ABDS: Cloud-HPC interoperable software: performance of HPC (High Performance Computing) and the rich functionality of the Apache Big Data Stack. – Added HPC to Hadoop, Storm, Heron, Spark; could add to Beam and Flink – Could work in Apache model contributing code • Run same HPC-ABDS across all platforms but “data management” nodes have different balance in I/O, Network and Compute from “model” nodes – Optimize to data and model functions as specified by convergence diamonds – Do not optimize for simulation and big data • Convergence Language: Make C++, Java, Scala, Python (R) … perform well • Training: Students prefer to learn Big Data rather than HPC • Sustainability: research/HPC communities cannot afford to develop everything (hardware and software) from scratch General Aspects of Big Data HPC Convergence 8/28/2016 80
  • 81. Typical Convergence Architecture • Running the same HPC-ABDS software across all platforms but the data management machine has a different balance in I/O, Network and Compute from the “model” machine – Note data storage approach: HDFS v. Object Store v. Lustre style file systems is still rather unclear • The Model behaves similarly whether from Big Data or Big Simulation. 818/28/2016 Diagram: a grid of C+D (compute plus data) nodes forms the Data Management machine and a grid of C (compute) nodes forms the Model machine, for Big Data and Big Simulation