Big data analytics_7_giants_public_24_sep_2013

Big Data Analytics beyond Hadoop
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus

Contents
Introduction
• Characterization of “7 giants”
Limitations of Hadoop for analytics
Introduction to Berkeley data analytics stack – Spark
Real-time analytics with Twitter’s Storm
GraphLab – graph processing for Internet-like graphs

Introduction: 7 Giants
National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 201
Giant 1: Basic statistics
• Mean, median, variance, counting operations
• O(N) operations.
• Embarrassingly parallel – perfect for Hadoop MR.
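Why Giant 1 fits MapReduce so well can be sketched in a few lines. The example below is an illustration only, assuming the data is already split into partitions held as in-memory Python lists; the function names are hypothetical and no Hadoop API is used. Each partition is summarised independently (the map step) and the summaries are added together (the reduce step).

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: partial summaries add component-wise, in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Mean and population variance from independently computed summaries."""
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Three partitions standing in for three input splits (made-up data).
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
mean, variance = mean_and_variance(partitions)
```

The median does not decompose into additive summaries this simply, and the sum-of-squares formula can lose precision on large values, so this sketch covers only the mean, variance and counting cases.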
Giant 2: Linear algebra computations
• Linear systems, eigenvalue problems, inverses from linear regression and Principal Component Analysis (PCA)
• Linear regression is doable over Hadoop
• PCA is difficult, so is kernel regression or kernel PCA

Giant 3: Generalized N-body problems
• Distances/kernels between points or sets of points
• Computation complexity is O(N^2) or O(N^3)
• Range search, nearest neighbour search, non-linear reduction methods
• K-means clustering, kernel SVM, kernel discriminant analysis
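The O(N^2) cost of Giant 3 comes from comparing every point with every other point. A minimal brute-force nearest-neighbour sketch, written only to show where the quadratic cost arises (the function name and sample points are assumptions, not taken from the talk):

```python
import math

def nearest_neighbours(points):
    """For each point, return the index of its closest other point.

    Every pair is compared, so the cost is O(N^2) distance evaluations.
    """
    result = []
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbour
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best_j, best_d = j, d
        result.append(best_j)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
nn = nearest_neighbours(points)
```

Because each point needs distances to points that may live in any partition, the work does not split into independent per-partition tasks the way Giant 1 does.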
Giant 4: Graph theoretic computations
• Computations on graphs – centrality, commute distances, ranking
• Statistical model is a graph – inferencing

Giant 5: Optimization problems
• Objective/loss/cost/energy function – maximizing/minimizing
• Stochastic approaches
• Linear/quadratic programming
• Conjugate gradient descent
• All-reduce paradigm is required [AA11]
[AA11] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford: A Reliable Effective Terascale Linear Learning System. CoRR abs/1110.4198 (2011).
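The all-reduce pattern mentioned for Giant 5 can be illustrated with a single-process simulation. This is a sketch under stated assumptions, not the system described in [AA11]: the workers are plain Python lists, the model is a one-parameter least-squares fit, and `all_reduce_sum` is a hypothetical stand-in for a real collective operation. The point is that every worker receives the same global gradient and so keeps an identical model replica without a central reducer writing results back to disk.

```python
def local_gradient(w, shard):
    """Unnormalised squared-error gradient for y ~ w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def all_reduce_sum(values):
    """Simulated all-reduce: every worker ends up holding the global sum."""
    total = sum(values)
    return [total for _ in values]

def train(shards, steps=200, lr=0.01):
    """Gradient descent where each worker updates its own model replica."""
    n = sum(len(s) for s in shards)
    weights = [0.0 for _ in shards]  # one replica per worker
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        reduced = all_reduce_sum(grads)  # the only communication step
        weights = [w - lr * g / n for w, g in zip(weights, reduced)]
    return weights

# Made-up data following y = 3x, split across three simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
weights = train(shards)
```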

Giant 6: Integration problems
• Bayesian inference or random effects models
• Quadrature approaches for low dimension integration
• Markov Chain Monte Carlo (MCMC) for high dimension integration [CA03]
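The two integration approaches named for Giant 6 can be contrasted on a toy problem. The sketch below uses a one-dimensional standard normal density purely for illustration; the function names, step size and sample count are assumptions. The trapezoid rule is one simple quadrature method, and random-walk Metropolis is one simple MCMC method.

```python
import math
import random

def trapezoid(f, a, b, n=1000):
    """Quadrature: trapezoid rule on n equal intervals over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def metropolis(log_density, start, steps, step_size=1.0, seed=0):
    """MCMC: random-walk Metropolis sampler for an unnormalised log density."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_density(proposal) - log_density(x)
        # Accept uphill moves always, downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples

def log_std_normal(x):
    """Log of the unnormalised standard normal density."""
    return -0.5 * x * x

# Quadrature recovers the normalising constant sqrt(2*pi) on a grid.
area = trapezoid(lambda x: math.exp(-0.5 * x * x), -8.0, 8.0)
# MCMC estimates expectations from samples, with no grid at all.
samples = metropolis(log_std_normal, start=0.0, steps=20000)
sample_mean = sum(samples) / len(samples)
```

A grid with 1000 points per axis is cheap in one dimension but needs 1000^d points in d dimensions, which is why sampling takes over as the dimension grows.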
Giant 7: Alignment problems
• Image deduplication, catalog cross matching, multiple sequence alignments
• Linear algebra
• Dynamic programming/Hidden Markov Models
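As one concrete instance of the dynamic programming approach to alignment, the following sketch computes a global pairwise alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and multiple sequence alignment as named above is considerably harder than this pairwise case.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]

identical = alignment_score("GATTACA", "GATTACA")
one_gap = alignment_score("ACGT", "AGT")
```

Each table cell depends on its left, upper and diagonal neighbours, so the computation has a wavefront dependency rather than independent per-record work.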

Limitations of Hadoop for big data analytics
• Giant 1 (basic statistics) is perfect for Hadoop.
• Giants 2 (linear algebra), 3 (N-body), 5 (optimization) – Spark from UC Berkeley is efficient.
  • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations – Dremel/Drill.
• Data sets that are not embarrassingly parallel?
  • Giant 4 – graph processing – GraphLab, Pregel, Giraph.

ML realizations: 3 Generational view

Iterative ML Algorithms
• What are iterative algorithms?
  • Those that need communication among the computing entities
  • Examples – neural networks, PageRank algorithms, network traffic analysis
• Conjugate gradient descent
  • Commonly used to solve systems of linear equations
  • [CB09] tried implementing CG on dense matrices
  • DAXPY – multiplies vector x by constant a and adds y.
  • DDOT – dot product of 2 vectors
  • MatVec – multiplies matrix by vector, produces a vector.
  • 1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.
• Other iterative algorithms – fast Fourier transform, block tridiagonal
[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific
computing, Technical Report, University of California, Computer Science Department, 2009.
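The three primitives listed above are enough to express a complete conjugate gradient solver. The sketch below is a plain single-machine version for a small symmetric positive definite system, not the MapReduce implementation from [CB09]; the test matrix is made up. In this textbook formulation each iteration makes six primitive calls (one MatVec, two DDOT, three DAXPY), which is consistent with the count of 6 MRs per iteration quoted above.

```python
def daxpy(a, x, y):
    """DAXPY: return a * x + y for scalar a and vectors x, y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """DDOT: dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(A, x):
    """MatVec: multiply matrix A (list of rows) by vector x."""
    return [ddot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # search direction
    rs_old = ddot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:  # squared residual norm small enough
            break
        Ap = matvec(A, p)                  # MatVec
        alpha = rs_old / ddot(p, Ap)       # DDOT
        x = daxpy(alpha, p, x)             # DAXPY
        r = daxpy(-alpha, Ap, r)           # DAXPY
        rs_new = ddot(r, r)                # DDOT
        p = daxpy(rs_new / rs_old, p, r)   # DAXPY
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

If every one of those six calls becomes a separate MapReduce job that reads and writes its vectors through the file system, the per-iteration overhead dominates, which is the limitation the slide describes.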

Berkeley Big-data Analytics Stack
Stack components (from the diagram):
• Mesos: Cluster Management
• Hadoop Distributed File System
• Tachyon: Distributed In-memory File System
• Spark: Computing Paradigm
• Shark: SQL Abstraction
• Spark Streaming
• Bagel/GraphX: Graph Processing
Notes:
• Mesos – similar to Nimbus used by Storm, but more sophisticated.
• Tachyon: DFS – could be replaced by HDFS.
• Spark – built as a computing paradigm over resilient distributed data sets.
• Shark – comparable to Impala

Spark: Third Generation ML Realization
• Resilient distributed data sets (RDDs)
  • Read-only collection of objects partitioned across a cluster
  • Can be rebuilt if a partition is lost.
• Operations on RDDs
  • Transformations – map, flatMap, reduceByKey, sort, join, partitionBy
  • Actions – foreach, reduce, collect, count, lookup
• Programmer can build RDDs by
  1. Reading a file in HDFS
  2. Parallelizing a Scala collection – divided into slices.
  3. Transforming an existing RDD – specifying operations such as map, filter
  4. Changing the persistence of an RDD – cache it, or apply a save action, which saves to HDFS.
• Shared variables
  • Broadcast variables, accumulators
[MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark:
cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10
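The RDD ideas above (read-only partitions, lazy transformations, actions, caching, and rebuilding a lost partition from lineage) can be mimicked in a few lines of single-process Python. `MiniRDD` is a hypothetical toy written for this illustration; it is not the Spark API and it omits distribution, shuffles and shared variables entirely.

```python
import functools

class MiniRDD:
    """Toy single-process stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source, num_partitions, lineage=()):
        self._source = source          # partition index -> list of records
        self.num_partitions = num_partitions
        self._lineage = lineage        # recorded transformations, in order
        self._persist = False
        self._cache = {}               # partition index -> computed records

    @classmethod
    def parallelize(cls, data, num_slices):
        """Build from a local collection, divided into slices."""
        data = list(data)
        return cls(lambda i: data[i::num_slices], num_slices)

    # Transformations only record lineage; nothing is computed yet.
    def map(self, f):
        return MiniRDD(self._source, self.num_partitions,
                       self._lineage + (("map", f),))

    def filter(self, pred):
        return MiniRDD(self._source, self.num_partitions,
                       self._lineage + (("filter", pred),))

    def cache(self):
        """Mark partitions to be kept in memory once computed."""
        self._persist = True
        return self

    def lose_partition(self, i):
        """Simulate a failure that drops one cached partition."""
        self._cache.pop(i, None)

    def _compute(self, i):
        """Replay the lineage for partition i from the original source."""
        records = self._source(i)
        for kind, f in self._lineage:
            if kind == "map":
                records = [f(r) for r in records]
            else:
                records = [r for r in records if f(r)]
        return records

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        records = self._compute(i)     # rebuilt from lineage, not from a replica
        if self._persist:
            self._cache[i] = records
        return records

    # Actions trigger computation.
    def collect(self):
        return [r for i in range(self.num_partitions) for r in self.partition(i)]

    def reduce(self, f):
        return functools.reduce(f, self.collect())

even_squares = (MiniRDD.parallelize(range(10), 3)
                .map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0)
                .cache())
before = sorted(even_squares.collect())
even_squares.lose_partition(1)         # the lost partition is recomputed on demand
after = sorted(even_squares.collect())
total = even_squares.reduce(lambda a, b: a + b)
```

Recording how a partition was derived, rather than replicating its contents, is what lets an iterative job keep its working set in memory and still recover from a failure.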

Data Flow in Spark and Hadoop

Logistic Regression: Spark VS Hadoop
http://spark-project.org

Spark Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates VS on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive
• ML for connection analysis and video streaming optimization.
Quantifind
• Movie and video game companies can predict success of new releases
• Moved from Hadoop to Spark and able to run ML in seconds, instead of hours.

Instance of Architecture for Internet Traffic
Analysis Use Case

K-means Clustering Algorithm:
Mahout VS ML Over Storm

GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
• Goals – targeted at machine learning.
  • Model graph dependencies, be asynchronous, iterative, dynamic.
• Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
• Update functions – live on each vertex
  • Transform data in the scope of the vertex.
  • Can choose to trigger neighbours (for example, only if rank changes drastically)
  • Run asynchronously till convergence – no global barrier.
• Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
• GraphLab provides varying levels of consistency: parallelism VS consistency.
• Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
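The update-function model described above can be sketched with PageRank, where a vertex reschedules its neighbours only when its own rank moves by more than a tolerance. This is a single-threaded simulation written for illustration: the function name, the FIFO scheduler, the unnormalised rank formula and the tiny graph are all assumptions, and none of GraphLab's actual API or consistency models are reproduced.

```python
from collections import deque

def async_pagerank(out_edges, damping=0.85, tol=1e-6):
    """PageRank driven by per-vertex updates and dynamic scheduling.

    out_edges maps every vertex to its list of out-neighbours.
    """
    vertices = list(out_edges)
    in_edges = {v: [] for v in vertices}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)        # scheduler: vertices awaiting an update
    scheduled = set(vertices)
    while queue:                   # no global barrier between "iterations"
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: read the scope of v, write only rank[v].
        incoming = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
        new_rank = (1.0 - damping) + damping * incoming
        change = abs(new_rank - rank[v])
        rank[v] = new_rank
        # Trigger neighbours only if the rank changed noticeably.
        if change > tol:
            for w in out_edges[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = async_pagerank(graph)
```

Vertices whose inputs have settled drop out of the schedule early, so work concentrates where values are still changing instead of sweeping the whole graph every round.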

GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
• GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
• Most graph-parallel abstractions assume small neighbourhoods – low-degree vertices
• But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
  • Hard to partition power-law graphs; high-degree vertices limit parallelism.
• GraphLab 2 provides a new way of partitioning power-law graphs
  • Edges are tied to machines; vertices (esp. high-degree ones) span machines
• Execution split into 3 phases:
  • Gather, apply and scatter.
• Triangle counting on Twitter graph
  • Hadoop MR took 423 minutes on 1536 machines
  • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines)
[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph:
Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI '12).
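The combination of edge placement and gather, apply, scatter phases can be sketched on a small example. The code below labels connected components by propagating the minimum vertex id; it is a single-process illustration in which "machines" are Python lists, edges are placed round-robin (an assumption made for simplicity, not the placement strategy from [1]), and all function names are hypothetical.

```python
def place_edges(edges, num_machines):
    """Vertex-cut: every edge lives on exactly one machine."""
    machines = [[] for _ in range(num_machines)]
    for index, edge in enumerate(edges):
        machines[index % num_machines].append(edge)  # round-robin (assumption)
    return machines

def vertex_spans(machines):
    """Set of machines on which each vertex appears."""
    spans = {}
    for m, local_edges in enumerate(machines):
        for u, v in local_edges:
            spans.setdefault(u, set()).add(m)
            spans.setdefault(v, set()).add(m)
    return spans

def connected_components(edges, num_machines):
    """Label each vertex with the smallest vertex id in its component."""
    machines = place_edges(edges, num_machines)
    label = {v: v for edge in edges for v in edge}
    active = set(label)
    while active:
        # Gather: each machine builds partial minima from its local edges only.
        partials = []
        for local_edges in machines:
            partial = {}
            for u, v in local_edges:
                if u in active:
                    partial[u] = min(partial.get(u, label[v]), label[v])
                if v in active:
                    partial[v] = min(partial.get(v, label[u]), label[u])
            partials.append(partial)
        # Apply: combine the partials for each vertex and update its label.
        changed = set()
        for v in active:
            best = min([p[v] for p in partials if v in p] + [label[v]])
            if best < label[v]:
                label[v] = best
                changed.add(v)
        # Scatter: vertices whose label changed activate their neighbours.
        active = set()
        for local_edges in machines:
            for u, v in local_edges:
                if u in changed:
                    active.add(v)
                if v in changed:
                    active.add(u)
    return label

edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 4)]
spans = vertex_spans(place_edges(edges, 2))
labels = connected_components(edges, 2)
```

Because the gather result is an associative combination of per-machine partials, a high-degree vertex can have its neighbourhood processed on several machines in parallel, with only the small partial values crossing machine boundaries.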

Thank You!
• Mail
vijay.sa@impetus.co.in
• LinkedIn
http://in.linkedin.com/in/vijaysrinivasagneeswaran
• Blogs
blogs.impetus.com
• Twitter
@a_vijaysrinivas
