Big Data Analytics beyond
Hadoop
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Contents
• Introduction: characterization of the “7 giants”
• Limitations of Hadoop for analytics
• Introduction to the Berkeley data analytics stack: Spark
• Real-time analytics with Twitter’s Storm
• GraphLab: graph processing for Internet-like graphs
Introduction: 7 Giants
National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 201
Giant 1: Basic
statistics
Mean, median, variance,
counting operations
O(N) operations.
Embarrassingly parallel –
perfect for Hadoop MR.
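As a rough illustration of why these statistics are embarrassingly parallel, the sketch below computes a mean and variance from per-partition summaries that can be merged in any order. It is plain Python rather than Hadoop code, and the function names (`map_partition`, `combine`, `mean_and_variance`) are hypothetical. The median is not covered here, since it cannot be assembled from simple sums like these.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: partial summaries merge by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Merge all partition summaries, then derive mean and population variance."""
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Three partitions standing in for data blocks on different nodes.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = mean_and_variance(partitions)
```

Because `combine` is associative and commutative, each partition is read exactly once and the summaries can be merged on any node, which is what makes a single map-reduce pass sufficient.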
Giant 2: Linear
Algebra
computations
Linear systems,
eigenvalue problems,
inverses from linear
regression and Principal
Component Analysis
(PCA)
Linear regression
is doable over
Hadoop
PCA is difficult, so is kernel
regression or kernel PCA
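One way to see why linear regression fits the map-reduce model: a least-squares line needs only a handful of sums, and each partition can contribute its share independently. The sketch below is a minimal single-feature version in plain Python, not Hadoop code; `partition_stats` and `fit_line` are hypothetical names.

```python
def partition_stats(points):
    """Map step: sufficient statistics for one partition of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)

def fit_line(partitions):
    """Reduce step: add the statistics, then solve the normal equations."""
    totals = [0.0] * 5
    for part in partitions:
        for i, value in enumerate(partition_stats(part)):
            totals[i] += value
    n, sx, sy, sxx, sxy = totals
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Points on the line y = 2x + 1, split across two partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(partitions)
```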
Giant 3: Generalized
N-body problems
Distances/kernels
between points or
sets of points
Computational
complexity is O(N^2)
or O(N^3)
Range search,
nearest neighbour search,
non-linear reduction methods
K-means clustering,
Kernel SVM, Kernel
discriminant
analysis
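The quadratic cost mentioned above shows up directly in a brute-force nearest neighbour search, where every point is compared against every other point. The following is a small sketch in plain Python; `nearest_neighbours` is a hypothetical helper, and practical systems typically use tree- or hash-based indexes to avoid the full double loop.

```python
import math

def nearest_neighbours(points):
    """For each point, return the index of its nearest other point (O(N^2))."""
    result = []
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)  # Euclidean distance between the two points
            if d < best_d:
                best_j, best_d = j, d
        result.append(best_j)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
nn = nearest_neighbours(points)
```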
Giant 4: Graph
theoretic
computations
Computations on
graphs – centrality,
commute distances,
ranking
Statistical model is a
graph – inferencing
[AA11] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford: A Reliable Effective
Terascale Linear Learning System. CoRR abs/1110.4198(2011).
Giant 5:
Optimization
problems
Objective/loss/cost/energy
function
maximizing/minimizing
Stochastic
approaches
Linear/quadratic
programming
Conjugate gradient
descent
All-reduce
paradigm is
required [AA11]
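To make the all-reduce idea concrete, the sketch below simulates it in a single process: each worker computes a gradient on its own data shard, the gradients are summed, and every worker receives the same total before taking an identical step. This is a toy illustration in plain Python, not the system described in [AA11]; `all_reduce_sum`, `gradient`, the shard data, and the step size are all assumptions.

```python
def all_reduce_sum(worker_vectors):
    """Sum the vectors element-wise and hand every worker a copy of the total."""
    total = [sum(column) for column in zip(*worker_vectors)]
    return [list(total) for _ in worker_vectors]

def gradient(w, shard):
    """Gradient of 0.5 * sum((w*x - y)^2) with respect to the scalar weight w."""
    return sum((w * x - y) * x for x, y in shard)

# Two workers, each holding a shard of points drawn from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    local = [[gradient(w, shard)] for shard in shards]
    reduced = all_reduce_sum(local)
    # Every worker now holds the same global gradient and applies the same update.
    w -= 0.05 * reduced[0][0]
```

Plain map-reduce returns the reduced value to a single reducer, so the broadcast back to all workers costs an extra job per iteration; all-reduce folds both steps into one collective operation.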
Giant 6:
Integration
problems
Bayesian inference or
random effects
models
Quadrature
approaches for low
dimension integration
Markov Chain Monte
Carlo (MCMC) for
high dimension
integration [CA03]
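As a hedged illustration of the MCMC idea, the sketch below uses a random-walk Metropolis sampler to estimate an expectation under a standard normal density. It is one-dimensional only to keep it short; the function names, step size, and sample count are assumptions, and real Bayesian workloads would use a dedicated library.

```python
import math
import random

def metropolis(log_density, n_samples, step=2.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log density."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        log_ratio = log_density(proposal) - log_density(x)
        # Accept uphill moves always, downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples

def log_std_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x

samples = metropolis(log_std_normal, 50000)
# Monte Carlo estimate of E[x^2], which is 1 for a standard normal.
second_moment = sum(s * s for s in samples) / len(samples)
```

Each sample depends on the previous one, which is why MCMC chains are inherently sequential and awkward to express as independent map tasks.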
Giant 7:
Alignment
problems
Image deduplication,
catalog cross
matching, multiple
sequence alignments
Linear algebra
Dynamic
programming/Hidden
Markov Models
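A small example of the dynamic programming used in alignment problems is the edit distance between two sequences, which fills a table where each cell depends on its left, upper, and diagonal neighbours. The sketch below is a standard Levenshtein distance in plain Python, shown as the simplest pairwise case rather than a full multiple sequence alignment.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

distance = edit_distance("kitten", "sitting")
```

The dependency of each cell on its neighbours is what makes these problems harder to split into independent tasks than the basic statistics of Giant 1.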
Limitations of Hadoop for big
data analytics
Giant 1 is perfect for Hadoop.
Giants 2 (linear algebra), 3 (N-body), 5
(optimization): Spark from UC Berkeley
is efficient.
Logistic regression, Kernel SVMs,
Conjugate gradient descent,
collaborative filtering, Gibbs sampling,
Alternating least squares.
Interactive/On-the-fly data processing
– Storm.
OLAP – data cube operations.
Dremel/Drill
Data sets that are not embarrassingly
parallel?
Giant 4 (graph processing):
GraphLab, Pregel, Giraph
ML realizations: 3 Generational view
Iterative ML Algorithms
• What are iterative algorithms?
  – Those that need communication among the computing entities
  – Examples: neural networks, PageRank algorithms, network traffic analysis
• Conjugate gradient descent
  – Commonly used to solve systems of linear equations
  – [CB09] tried implementing CG on dense matrices
  – DAXPY: multiplies vector x by constant a and adds y.
  – DDOT: dot product of 2 vectors
  – MatVec: multiplies a matrix by a vector, producing a vector.
  – 1 MR per primitive: 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.
• Other iterative algorithms: fast Fourier transform, block tridiagonal
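The three primitives named above are enough to write conjugate gradient, which makes the cost argument easy to see. The sketch below is a minimal in-memory version in plain Python for a symmetric positive definite matrix; it is not the [CB09] implementation, and the function names and the example system are assumptions.

```python
def daxpy(a, x, y):
    """Return a*x + y for scalar a and vectors x, y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [ddot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x starting at zero
    p = list(r)              # search direction
    rs_old = ddot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        Ap = matvec(A, p)                 # primitive 1
        alpha = rs_old / ddot(p, Ap)      # primitive 2
        x = daxpy(alpha, p, x)            # primitive 3
        r = daxpy(-alpha, Ap, r)          # primitive 4
        rs_new = ddot(r, r)               # primitive 5
        p = daxpy(rs_new / rs_old, p, r)  # primitive 6
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

The loop body makes six primitive calls, matching the count of six MapReduce jobs per iteration quoted above when each primitive is its own job.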
[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific
computing, Technical Report, University of California, Computer Science Department, 2009.
Berkeley Big-data Analytics Stack
Components shown in the stack diagram:
Mesos: Cluster Management
Hadoop Distributed File System
Tachyon: Distributed In-memory File System
Spark: Computing Paradigm
Shark: SQL Abstraction
Spark Streaming
Bagel/GraphX: Graph Processing
• Mesos – similar to Nimbus used by Storm, but more sophisticated.
• Tachyon: DFS – could be replaced by HDFS.
• Spark – built as a computing paradigm over resilient distributed datasets.
• Shark – comparable to Impala
Spark: Third Generation ML Realization
• Resilient distributed datasets (RDDs)
  – Read-only collection of objects partitioned across a cluster
  – Can be rebuilt if a partition is lost.
• Operations on RDDs
  – Transformations: map, flatMap, reduceByKey, sort, join, partitionBy
  – Actions: foreach, reduce, collect, count, lookup
• Programmer can build RDDs by
  1. Reading a file in HDFS
  2. Parallelizing a Scala collection: divide it into slices.
  3. Transforming an existing RDD: specify operations such as map, filter
  4. Changing the persistence of an existing RDD: cache it, or use a save action that writes to HDFS.
• Shared variables
  – Broadcast variables, accumulators
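The RDD idea of lazy transformations plus lineage-based recovery can be sketched with a toy class. The code below is plain Python and deliberately not the Spark API: `ToyRDD`, `parallelize`, and `compute_partition` are hypothetical names that only mimic the concepts listed above.

```python
import functools

class ToyRDD:
    """Read-only partitioned collection that records how it was derived."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # base data: a list of lists
        self._lineage = lineage        # tuple of (kind, function) steps

    # Transformations are lazy: they only extend the lineage.
    def map(self, f):
        return ToyRDD(self._partitions, self._lineage + (("map", f),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage on its base data."""
        items = list(self._partitions[index])
        for kind, f in self._lineage:
            if kind == "map":
                items = [f(v) for v in items]
            else:
                items = [v for v in items if f(v)]
        return items

    # Actions trigger the actual computation.
    def collect(self):
        return [v for i in range(len(self._partitions))
                for v in self.compute_partition(i)]

    def count(self):
        return len(self.collect())

    def reduce(self, f):
        return functools.reduce(f, self.collect())

def parallelize(data, num_slices):
    """Divide a local collection into contiguous slices."""
    size = max(1, -(-len(data) // num_slices))  # ceiling division
    return ToyRDD([data[i:i + size] for i in range(0, len(data), size)])

rdd = parallelize(list(range(1, 11)), 3)
even_squares = rdd.filter(lambda v: v % 2 == 0).map(lambda v: v * v)
collected = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
```

Because each derived dataset keeps only its base partitions and the list of steps, a lost partition can be recomputed with `compute_partition` instead of being restored from a replica.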
[MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark:
cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10
Data Flow in Spark and Hadoop
Logistic Regression: Spark VS Hadoop
http://spark-project.org
Spark Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates VS on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive
• ML for connection analysis and video streaming optimization.
Quantifind
• Movie and video game companies can predict the success of new releases
• Moved from Hadoop to Spark and able to run ML in seconds, instead of hours.
Instance of Architecture for Internet Traffic
Analysis Use Case
K-means Clustering Algorithm:
Mahout VS ML Over Storm
GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
• Goals – targeted at machine learning.
  – Model graph dependencies, be asynchronous, iterative, dynamic.
• Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
• Update functions – live on each vertex
  – Transform data in the scope of the vertex.
  – Can choose to trigger neighbours (for example, only if the rank changes drastically)
  – Run asynchronously till convergence – no global barrier.
• Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
  – GraphLab provides varying levels of consistency. Parallelism VS consistency.
• Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
  – Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M.
Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the
cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
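The update-function model described on this slide can be sketched as a PageRank that re-schedules a vertex's neighbours only when its own rank changes noticeably, with no global barrier between rounds. The code below is a single-threaded toy in plain Python, not the GraphLab API; `dynamic_pagerank`, the damping factor, the tolerance, and the example graph are assumptions, and every vertex is assumed to appear as a key with at least one outgoing link.

```python
from collections import deque

def dynamic_pagerank(out_links, damping=0.85, tolerance=1e-6):
    """PageRank driven by a work queue of vertices that still need updating."""
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)
    scheduled = set(vertices)
    while queue:
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: reads the vertex's scope (its in-neighbours).
        new_rank = (1 - damping) + damping * sum(
            rank[u] / len(out_links[u]) for u in in_links[v])
        changed = abs(new_rank - rank[v]) > tolerance
        rank[v] = new_rank
        # Trigger neighbours only when the rank moved by more than the tolerance.
        if changed:
            for w in out_links[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = dynamic_pagerank(graph)
```

Vertices whose inputs have stopped changing drop out of the queue, so work concentrates on the parts of the graph that have not yet converged.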
GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
• GraphLab could not scale to the Altavista web graph 2002, 1.4B vertices, 6.7B edges.
• Most graph parallel abstractions assume small neighbourhoods – low degree vertices
• But natural graphs (LinkedIn, Facebook, Twitter) are power law graphs.
• Hard to partition power law graphs, high degree vertices limit parallelism.
• GraphLab provides a new way of partitioning power law graphs
  – Edges are tied to machines, vertices (esp. high degree ones) span machines
• Execution split into 3 phases:
  – Gather, apply and scatter.
• Triangle counting on Twitter graph
  – Hadoop MR took 423 minutes on 1536 machines
  – GraphLab 2 took 1.5 minutes on 1024 cores (64 machines)
[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph:
Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI '12).
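Triangle counting can be laid out loosely along the gather, apply, and scatter phases named above. The sketch below is a single-process illustration in plain Python, not the PowerGraph API; `triangle_count` is a hypothetical helper and it assumes each undirected edge is listed exactly once.

```python
def triangle_count(edges):
    """Count triangles in an undirected graph given as unique (u, v) pairs."""
    # Gather: every vertex collects the ids of its neighbours over its edges.
    gathered = {}
    for u, v in edges:
        gathered.setdefault(u, set()).add(v)
        gathered.setdefault(v, set()).add(u)

    # Apply: the gathered set becomes the vertex's stored state.
    neighbours = {vertex: frozenset(ids) for vertex, ids in gathered.items()}

    # Scatter: each edge counts the neighbours its two endpoints share.
    shared = sum(len(neighbours[u] & neighbours[v]) for u, v in edges)

    # Every triangle is seen once from each of its three edges.
    return shared // 3

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
triangles = triangle_count(edges)
```

The scatter step works per edge, which fits a partitioning where edges are tied to machines and a high-degree vertex's work is spread across them.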
Thank You!
• Mail
vijay.sa@impetus.co.in
• LinkedIn
http://in.linkedin.com/in/vijaysrinivasagneeswaran
• Blogs
blogs.impetus.com
• Twitter
@a_vijaysrinivas.

Contenu connexe

Tendances

TensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsTensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsSeldon
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC ConvergenceGeoffrey Fox
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018TigerGraph
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierDemai Ni
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
 
Graph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphGraph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphTigerGraph
 
Keras: A versatile modeling layer for deep learning
Keras: A versatile modeling layer for deep learningKeras: A versatile modeling layer for deep learning
Keras: A versatile modeling layer for deep learningDr. Ananth Krishnamoorthy
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudKhazret Sapenov
 
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsBlue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoopdbpublications
 

Tendances (20)

TensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsTensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative models
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management Frontier
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Graph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphGraph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise Graph
 
Keras: A versatile modeling layer for deep learning
Keras: A versatile modeling layer for deep learningKeras: A versatile modeling layer for deep learning
Keras: A versatile modeling layer for deep learning
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do PetróleoAplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsBlue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 

Similaire à Big data analytics_7_giants_public_24_sep_2013

Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics Farheen Nilofer
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analyticsFarheen Nilofer
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architecturesRaji Gogulapati
 

Similaire à Big data analytics_7_giants_public_24_sep_2013 (20)

Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
Slide 1
Slide 1Slide 1
Slide 1
 
Slide 1
Slide 1Slide 1
Slide 1
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Spark
SparkSpark
Spark
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Dernier (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Architecting Cloud Native Applications
Big data analytics_7_giants_public_24_sep_2013

  • 1. Big Data Analytics beyond Hadoop. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus
  • 2. Contents
      • Introduction: characterization of the “7 giants”
      • Limitations of Hadoop for analytics
      • Introduction to the Berkeley data analytics stack: Spark
      • Real-time analytics with Twitter’s Storm
      • GraphLab: graph processing for Internet-like graphs
  • 3. Introduction: 7 Giants
      • Source: National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 201
      • Giant 1: Basic statistics
          • Mean, median, variance, counting operations.
          • O(N) operations. Embarrassingly parallel, so perfect for Hadoop MR.
      • Giant 2: Linear algebra computations
          • Linear systems, eigenvalue problems, inverses from linear regression and Principal Component Analysis (PCA).
          • Linear regression is doable over Hadoop.
          • PCA is difficult, and so are kernel regression and kernel PCA.
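To illustrate why Giant 1 is called embarrassingly parallel, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names and the partition layout are assumptions made for illustration. Each partition is summarized independently (the map step), and the summaries are merged with an associative combine (the reduce step).

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: merge two partial summaries; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Return the mean and population variance over all partitions."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Hypothetical data split across three workers.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = mean_and_variance(partitions)
```

Because `combine` is associative, the partial summaries can be merged in any order, which is what lets a single MapReduce pass cover the whole data set. The median does not decompose this way and needs a different approach.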
  • 4. Introduction: 7 Giants
      • Giant 3: Generalized N-body problems
          • Distances/kernels between points or sets of points.
          • Computational complexity is O(N^2) or O(N^3).
          • Range search, nearest-neighbour search, non-linear reduction methods.
          • K-means clustering, kernel SVM, kernel discriminant analysis.
      • Giant 4: Graph-theoretic computations
          • Computations on graphs: centrality, commute distances, ranking.
          • The statistical model is itself a graph: inferencing.
  • 5. Introduction: 7 Giants
      • Giant 5: Optimization problems
          • Maximizing or minimizing an objective/loss/cost/energy function.
          • Stochastic approaches.
          • Linear/quadratic programming.
          • Conjugate gradient descent.
          • The all-reduce paradigm is required [AA11].
      • [AA11] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford: A Reliable Effective Terascale Linear Learning System. CoRR abs/1110.4198 (2011).
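As a rough illustration of the all-reduce paradigm mentioned for Giant 5, the following Python sketch simulates it in a single process. It is not the implementation from [AA11]; the tree layout, the function name and the gradient values are assumptions. Each simulated worker starts with a local gradient, and after the call every worker holds the same element-wise sum.

```python
def all_reduce_sum(local_vectors):
    """Simulate all-reduce: every worker ends up with the element-wise sum."""
    vectors = [list(v) for v in local_vectors]
    n = len(vectors)
    # Reduce phase: partial sums flow up a binary tree towards worker 0.
    step = 1
    while step < n:
        for i in range(0, n - step, 2 * step):
            vectors[i] = [a + b for a, b in zip(vectors[i], vectors[i + step])]
        step *= 2
    # Broadcast phase: worker 0 sends the total back to every worker.
    total = vectors[0]
    return [list(total) for _ in range(n)]

# Hypothetical local gradients computed by three workers.
local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(local_gradients)
```

The point of the pattern is that each worker can continue its next optimization step locally with the global sum, instead of writing intermediate results out and starting a new job.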
  • 6. Introduction: 7 Giants
      • Giant 6: Integration problems
          • Bayesian inference or random effects models.
          • Quadrature approaches for low-dimension integration.
          • Markov Chain Monte Carlo (MCMC) for high-dimension integration [CA03].
      • Giant 7: Alignment problems
          • Image deduplication, catalog cross matching, multiple sequence alignments.
          • Linear algebra.
          • Dynamic programming/Hidden Markov Models.
  • 7. Limitations of Hadoop for big data analytics
      • Giant 1 is perfect for Hadoop.
      • Giants 2 (linear algebra), 3 (N-body) and 5 (optimization): Spark from UC Berkeley is efficient.
          • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
      • Interactive/on-the-fly data processing: Storm.
      • OLAP (data cube operations): Dremel/Drill.
      • Data sets that are not embarrassingly parallel?
          • Giant 4 (graph processing): GraphLab, Pregel, Giraph.
  • 8. ML realizations: 3 Generational view
  • 9. Iterative ML Algorithms
      • What are iterative algorithms?
          • Those that need communication among the computing entities.
          • Examples: neural networks, PageRank algorithms, network traffic analysis.
      • Conjugate gradient (CG) descent
          • Commonly used to solve systems of linear equations.
          • [CB09] tried implementing CG on dense matrices with three primitives:
              • DAXPY: multiplies vector x by constant a and adds y.
              • DDOT: dot product of 2 vectors.
              • MatVec: multiplies a matrix by a vector, producing a vector.
          • 1 MR per primitive: 6 MRs per CG iteration and hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.
      • Other iterative algorithms: fast Fourier transform, block tridiagonal.
      • [CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
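The three primitives above can be put together in a minimal Python sketch of the textbook conjugate gradient method. This is not the [CB09] code: the function names, the call counter and the test matrix are assumptions. In this standard formulation each iteration makes 1 MatVec, 2 DDOT and 3 DAXPY calls, which would be consistent with the slide's figure of 6 MRs per iteration if each call were one MapReduce job.

```python
def daxpy(a, x, y):
    """DAXPY: return a * x + y for scalar a and vectors x, y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """DDOT: dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(A, x):
    """MatVec: multiply matrix A (a list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite matrix A.

    Returns (x, iterations, primitive_calls).
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # initial search direction
    rs_old = ddot(r, r)    # setup cost, not counted per iteration
    iterations = 0
    primitive_calls = 0
    while iterations < max_iter and rs_old > tol:
        Ap = matvec(A, p)                   # MatVec
        alpha = rs_old / ddot(p, Ap)        # DDOT
        x = daxpy(alpha, p, x)              # DAXPY
        r = daxpy(-alpha, Ap, r)            # DAXPY
        rs_new = ddot(r, r)                 # DDOT
        p = daxpy(rs_new / rs_old, p, r)    # DAXPY
        rs_old = rs_new
        iterations += 1
        primitive_calls += 6
    return x, iterations, primitive_calls

# Hypothetical 2x2 symmetric positive definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations, primitive_calls = conjugate_gradient(A, b)
```

Every primitive depends on the result of the previous one, so on MapReduce each call pays for job start-up and for writing its output before the next can begin. That dependency chain is why the cost adds up so quickly.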
  • 10. Berkeley Big-data Analytics Stack
      • Components shown in the stack diagram:
          • Hadoop Distributed File System
          • Tachyon: distributed in-memory file system
          • Mesos: cluster management
          • Spark: computing paradigm
          • Shark: SQL abstraction
          • Spark Streaming
          • Bagel/GraphX: graph processing
      • Mesos: similar to Nimbus used by Storm, but more sophisticated.
      • Tachyon: DFS; could be replaced by HDFS.
      • Spark: built as a computing paradigm over resilient distributed data sets.
      • Shark: comparable to Impala.
  • 11. Spark: Third Generation ML Realization
      • Resilient distributed data sets (RDDs)
          • Read-only collection of objects partitioned across a cluster.
          • Can be rebuilt if a partition is lost.
      • Operations on RDDs
          • Transformations: map, flatMap, reduceByKey, sort, join, partitionBy.
          • Actions: foreach, reduce, collect, count, lookup.
      • Programmer can build RDDs by:
          1. Reading a file in HDFS.
          2. Parallelizing a Scala collection: divide it into slices.
          3. Transforming an existing RDD: specify operations such as map, filter.
          4. Changing the persistence of an RDD: cache, or a save action that saves to HDFS.
      • Shared variables
          • Broadcast variables, accumulators.
      • [MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10.
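The RDD idea on this slide can be sketched with a toy, single-process Python class. This is not the Spark API: `ToyRDD`, `lose_partition` and the always-on cache are hypothetical names and simplifications. Transformations only record lineage, actions trigger computation, and a lost partition is rebuilt by replaying the lineage on its source data.

```python
from functools import reduce as fold

class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus recorded lineage."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions   # input data, never mutated
        self._lineage = lineage            # functions: partition -> partition
        self._cache = {}                   # computed partitions (a persisted RDD)

    # Transformations are lazy: they return a new ToyRDD with longer lineage.
    def map(self, f):
        step = lambda part: [f(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def _compute_partition(self, i):
        """Return partition i, replaying the lineage if it is not cached."""
        if i not in self._cache:
            part = self._source[i]
            for step in self._lineage:
                part = step(part)
            self._cache[i] = part
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one computed partition."""
        self._cache.pop(i, None)

    # Actions force computation and return a value to the caller.
    def collect(self):
        return [x for i in range(len(self._source))
                for x in self._compute_partition(i)]

    def count(self):
        return len(self.collect())

    def reduce(self, f):
        return fold(f, self.collect())

# Hypothetical data in two partitions.
rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
before = squares_of_evens.collect()
squares_of_evens.lose_partition(1)      # partition 1 is lost...
after = squares_of_evens.collect()      # ...and rebuilt from lineage
total = squares_of_evens.reduce(lambda a, b: a + b)
```

Rebuilding from lineage avoids replicating intermediate data, at the cost of recomputation after a failure.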
  • 12. Data Flow in Spark and Hadoop
  • 13. Logistic Regression: Spark vs. Hadoop (source: http://spark-project.org)
  • 14. Spark Use Cases
      • Ooyala
          • Uses Cassandra for video data personalization.
          • Pre-compute aggregates vs. on-the-fly queries.
          • Moved to Spark for ML and computing views.
          • Moved to Shark for on-the-fly queries: C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark.
      • Conviva
          • Uses Hive for repeatedly running ad-hoc queries on video data.
          • Optimized ad-hoc queries using Spark RDDs; found Spark is 30 times faster than Hive.
          • ML for connection analysis and video streaming optimization.
      • Quantifind
          • Movie and video game companies can predict the success of new releases.
          • Moved from Hadoop to Spark and is able to run ML in seconds instead of hours.
  • 15. Instance of Architecture for Internet Traffic Analysis Use Case
  • 17. GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
      • Goals: targeted at machine learning.
          • Model graph dependencies; be asynchronous, iterative and dynamic.
      • Data is associated with edges (weights, for instance) and vertices (user profile data, current interests, etc.).
      • Update functions live on each vertex.
          • Transform data in the scope of the vertex.
          • Can choose to trigger neighbours (for example, only if the rank changes drastically).
          • Run asynchronously till convergence, with no global barrier.
      • Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
          • GraphLab provides varying levels of consistency: parallelism vs. consistency.
      • Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM, etc.
          • Co-EM (Expectation Maximization) algorithm: 15x faster than Hadoop MR; on distributed GraphLab, only 0.3% of Hadoop execution time.
      • [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
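The update-function model on this slide can be sketched in a few lines of single-process Python. This is not the GraphLab API: the function name, the FIFO scheduler, the damping factor and the tolerance are assumptions. A PageRank update runs on one vertex at a time, reads only the data in its scope, and reschedules its out-neighbours only when its own rank changed noticeably. There is no global barrier between rounds.

```python
from collections import deque

def async_pagerank(out_links, damping=0.85, tolerance=1e-6):
    """Asynchronous PageRank driven by per-vertex update functions.

    out_links maps every vertex to the list of vertices it links to.
    Returns (rank, number_of_updates).
    """
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)        # scheduler: vertices waiting for an update
    scheduled = set(vertices)
    updates = 0

    while queue:
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: read the scope of v (ranks of its in-neighbours).
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1.0 - damping) + damping * incoming
        change = abs(new_rank - rank[v])
        rank[v] = new_rank
        updates += 1
        # Trigger neighbours only if the rank changed by more than tolerance.
        if change > tolerance:
            for w in out_links[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank, updates

# Hypothetical three-page web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank, updates = async_pagerank(graph)
```

Vertices whose neighbourhood has already settled drop out of the schedule early, so work concentrates on the parts of the graph that are still changing.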
  • 18. GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
      • GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
          • Most graph-parallel abstractions assume small neighbourhoods, that is, low-degree vertices.
          • But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
          • Power-law graphs are hard to partition, and high-degree vertices limit parallelism.
      • GraphLab 2 provides a new way of partitioning power-law graphs.
          • Edges are tied to machines; vertices (especially high-degree ones) span machines.
          • Execution is split into 3 phases: gather, apply and scatter.
      • Triangle counting on the Twitter graph
          • Hadoop MR took 423 minutes on 1536 machines.
          • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
      • [1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
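The gather, apply and scatter phases can be sketched in single-process Python. This is not the PowerGraph API: the function name and the choice of connected-components labelling as the vertex program are assumptions. In a real deployment the gather over a high-degree vertex's edges would be split across machines; here it is a plain loop.

```python
def gas_connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    label = {v: v for v in vertices}
    active = set(vertices)
    while active:
        next_active = set()
        for v in sorted(active):
            # Gather: combine neighbour data with an associative operation (min).
            gathered = min((label[u] for u in neighbours[v]), default=label[v])
            # Apply: update the vertex value from the gathered result.
            if gathered < label[v]:
                label[v] = gathered
                # Scatter: activate neighbours that may need to pick up the change.
                next_active.update(neighbours[v])
        active = next_active
    return label

# Hypothetical graph: two small components and one isolated vertex.
vertices = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (3, 4)]
labels = gas_connected_components(vertices, edges)
```

Because the gather step uses an associative operation, partial results computed on different machines can be merged in any order, which is what makes it possible for a high-degree vertex to span machines.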
  • 19. Thank You! • Mail vijay.sa@impetus.co.in • LinkedIn http://in.linkedin.com/in/vijaysrinivasagneeswaran • Blogs blogs.impetus.com • Twitter @a_vijaysrinivas.

Editor's notes

  1. Euclidean graph problems are hard to solve over Hadoop as they become generalized N-body problems.