SlideShare une entreprise Scribd logo
1  sur  10
Distributed Computing
Abstractions for Big Data
Cloud Analytics
Dr. Vijay Srinivas Agneeswaran
Director of Technology,
Data Sciences Group,
SapientNitro,
Bangalore.
© Tally Solutions Pvt. Ltd. All Rights Reserved 22
Big Data Computations
Computations/Operations
Giant 1 (simple stats) is perfect for
Hadoop 1.0.
Giants 2 (linear algebra), 3 (N-
body), 4 (optimization) Spark from
UC Berkeley is efficient.
Logistic regression, kernel SVMs,
conjugate gradient descent,
collaborative filtering, Gibbs
sampling, alternating least squares.
Example is social group-first
approach for consumer churn
analysis [2]
Interactive/On-the-fly data
processing – Storm.
OLAP – data cube operations.
Dremel/Drill
Data sets – not embarrassingly
parallel?
Deep Learning
Artificial Neural Networks/Deep
Belief Networks
Machine vision from Google [3]
Speech analysis from Microsoft
Giant 5 – Graph processing –
GraphLab, Pregel, Giraph
[1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups.
In: Proceedings of SIAM International Conference on Data Mining, 2010, S. 732-741
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W.
Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240
© Tally Solutions Pvt. Ltd. All Rights Reserved 33
3
What is Hadoop?
This diagram is from
http://2.bp.blogspot.com/_mbMKzl2KhSc/TOOF6OQp5fI/AAAAAAAABb8/l2gyhVhQw9Y/s1600/D-Hadoop+MapReduce+Architecture.png
© Tally Solutions Pvt. Ltd. All Rights Reserved 44
4
Data Flow in Spark and
Hadoop
© Tally Solutions Pvt. Ltd. All Rights Reserved 55
Comparison of Graph Processing Paradigms
Graph
Paradigm
Computation
(Synchronous or
Asynchronous)
Determinism
and Effects
Efficiency VS
Asynchrony
Efficiency of
Processing Power
Law Graphs.
GraphLab1 Asynchronous Deterministic –
serializable
computation.
Uses inefficient
locking protocols
Inefficient – Locking
protocol is unfair to
high degree vertices
Pregel Synchronous – BSP
based
Deterministic –
serial
computation.
NA Inefficient – curse of
slow jobs.
Piccolo Asynchronous Non-
deterministic –
non-serializable
computation.
Efficient, but
may lead to
incorrectness.
May be efficient.
GraphLab2
(PowerGraph)
Asynchronous Deterministic –
serializable
computation.
Uses parallel
locking for
efficient,
serializable and
asynchronous
computations.
Efficient – parallel
locking and other
optimizations for
processing natural
graphs.
GraphLab: Ideal Engine for Processing Graphs [YL12] – Dato.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a
framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
Goals – targeted at
machine learning.
• Model graph dependencies,
be asynchronous, iterative,
dynamic.
Data associated with edges (weights, for
instance) and vertices (user profile data,
current interests etc.).
Update functions – lives on each
vertex
• Transforms data in scope of vertex.
• Can choose to trigger neighbours (for
example only if Rank changes drastically)
• Run asynchronously till convergence – no
global barrier.
Consistency is important in ML
algorithms (some do not even converge
when there are inconsistent updates –
collaborative filtering).
• GraphLab – provides varying level of
consistency. Parallelism VS consistency.
Implemented several algorithms,
including ALS, K-means, SVM, Belief
propagation, matrix factorization, Gibbs
sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm
15x faster than Hadoop MR – on distributed
GraphLab, only 0.3% of Hadoop execution time.
© Tally Solutions Pvt. Ltd. All Rights Reserved 77
Other Abstractions
• Functional data science
• Exploration of recent algorithms
• EMPCA (Expectation-Maximization Principal component analysis)
• ADMM (Alternating Direction Method of Multipliers)
• Deep Learning
• Distributed computing abstractions for data science
- Useful for engineering RFPs and next gen architectures for clients.
• Iterative computations – Apache Spark or Flink or MPI or Concord
• Graph computations – TensorFlow, GraphLab, Titan.
• High performance computations – CUDA.
© Tally Solutions Pvt. Ltd. All Rights Reserved 88
Other Abstractions
• Adhoc queries on large data sets.
– Next gen SQL
• Approximate queries – Blink DB
• Adhoc queries – Apache Drill
• Distributed SQL – Spark SQL, Apache Tajo, HAWQ
• Data governance
– Important for all big data architectures.
• Security – policies for role based fine grained access control
• Encryption – at-rest data as well as data-in-motion.
• Tokenization/Obfuscation – data masking.
© Tally Solutions Pvt. Ltd. All Rights Reserved 99
Future Abstractions
• Internet of Things (IoT) Analytics
• Fog/Edge computing – strong devices, weakly connected.
• Data Lake Management
– next gen data management
• Hadoop based data lakes
• NewSQL data stores
© Tally Solutions Pvt. Ltd. All Rights Reserved 1010
Thank You!
Mail • vagneeswaran@sapient.com
LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs • blogs.impetus.com
Twitter • @a_vijaysrinivas.

Contenu connexe

Tendances

The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
Khazret Sapenov
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Sarah Aerni
 

Tendances (20)

Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
Graph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphGraph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise Graph
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsBlue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management Frontier
 
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do PetróleoAplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Plume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis LibraryPlume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis Library
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 
Nvidia gpu-application-catalog TESLA K80 GPU應用程式型錄
Nvidia gpu-application-catalog TESLA K80 GPU應用程式型錄Nvidia gpu-application-catalog TESLA K80 GPU應用程式型錄
Nvidia gpu-application-catalog TESLA K80 GPU應用程式型錄
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 

En vedette

En vedette (14)

Distributed Computing and Big Data
Distributed Computing and Big DataDistributed Computing and Big Data
Distributed Computing and Big Data
 
Arjun Sharma
Arjun SharmaArjun Sharma
Arjun Sharma
 
Blue brain technology
Blue brain technologyBlue brain technology
Blue brain technology
 
Discover the Benefits of Cloud Computing with Google Apps and Salesforce.com
Discover the Benefits of Cloud Computing with Google Apps and Salesforce.comDiscover the Benefits of Cloud Computing with Google Apps and Salesforce.com
Discover the Benefits of Cloud Computing with Google Apps and Salesforce.com
 
An introduction to linear algebra
An introduction to linear algebraAn introduction to linear algebra
An introduction to linear algebra
 
Linear Algebra
Linear AlgebraLinear Algebra
Linear Algebra
 
Blue brain Technology
Blue brain TechnologyBlue brain Technology
Blue brain Technology
 
Ad hoc networks
Ad hoc networksAd hoc networks
Ad hoc networks
 
Applications of linear algebra
Applications of linear algebraApplications of linear algebra
Applications of linear algebra
 
Ad-Hoc Networks
Ad-Hoc NetworksAd-Hoc Networks
Ad-Hoc Networks
 
Best Ever PPT Of Bluebrain
Best Ever PPT Of BluebrainBest Ever PPT Of Bluebrain
Best Ever PPT Of Bluebrain
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
 
THE INTERNET OF THINGS
THE INTERNET OF THINGSTHE INTERNET OF THINGS
THE INTERNET OF THINGS
 
Internet of Things and its applications
Internet of Things and its applicationsInternet of Things and its applications
Internet of Things and its applications
 

Similaire à Distributed computing abstractions_data_science_6_june_2016_ver_0.4

Similaire à Distributed computing abstractions_data_science_6_june_2016_ver_0.4 (20)

AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Data Science At Scale for IoT on the Pivotal Platform
Data Science At Scale for IoT on the Pivotal PlatformData Science At Scale for IoT on the Pivotal Platform
Data Science At Scale for IoT on the Pivotal Platform
 
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONSBIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 

Dernier

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 

Distributed computing abstractions_data_science_6_june_2016_ver_0.4

  • 1. Distributed Computing Abstractions for Big Data Cloud Analytics Dr. Vijay Srinivas Agneeswaran Director of Technology, Data Sciences Group, SapientNitro, Bangalore.
  • 2. © Tally Solutions Pvt. Ltd. All Rights Reserved 22 Big Data Computations Computations/Operations Giant 1 (simple stats) is perfect for Hadoop 1.0. Giants 2 (linear algebra), 3 (N- body), 4 (optimization) Spark from UC Berkeley is efficient. Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. Example is social group-first approach for consumer churn analysis [2] Interactive/On-the-fly data processing – Storm. OLAP – data cube operations. Dremel/Drill Data sets – not embarrassingly parallel? Deep Learning Artificial Neural Networks/Deep Belief Networks Machine vision from Google [3] Speech analysis from Microsoft Giant 5 – Graph processing – GraphLab, Pregel, Giraph [1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013. [2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, S. 732-741 [3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240
  • 3. © Tally Solutions Pvt. Ltd. All Rights Reserved 33 3 What is Hadoop? This diagram is from http://2.bp.blogspot.com/_mbMKzl2KhSc/TOOF6OQp5fI/AAAAAAAABb8/l2gyhVhQw9Y/s1600/D-Hadoop+MapReduce+Architecture.png
  • 4. © Tally Solutions Pvt. Ltd. All Rights Reserved 44 4 Data Flow in Spark and Hadoop
  • 5. © Tally Solutions Pvt. Ltd. All Rights Reserved 55 Comparison of Graph Processing Paradigms Graph Paradigm Computation (Synchronous or Asynchronous) Determinism and Effects Efficiency VS Asynchrony Efficiency of Processing Power Law Graphs. GraphLab1 Asynchronous Deterministic – serializable computation. Uses inefficient locking protocols Inefficient – Locking protocol is unfair to high degree vertices Pregel Synchronous – BSP based Deterministic – serial computation. NA Inefficient – curse of slow jobs. Piccolo Asynchronous Non- deterministic – non-serializable computation. Efficient, but may lead to incorrectness. May be efficient. GraphLab2 (PowerGraph) Asynchronous Deterministic – serializable computation. Uses parallel locking for efficient, serializable and asynchronous computations. Efficient – parallel locking and other optimizations for processing natural graphs.
  • 6. GraphLab: Ideal Engine for Processing Graphs [YL12] – Dato. [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727. Goals – targeted at machine learning. • Model graph dependencies, be asynchronous, iterative, dynamic. Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.). Update functions – lives on each vertex • Transforms data in scope of vertex. • Can choose to trigger neighbours (for example only if Rank changes drastically) • Run asynchronously till convergence – no global barrier. Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates – collaborative filtering). • GraphLab – provides varying level of consistency. Parallelism VS consistency. Implemented several algorithms, including ALS, K-means, SVM, Belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc. • Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
  • 7. © Tally Solutions Pvt. Ltd. All Rights Reserved 77 Other Abstractions • Functional data science • Exploration of recent algorithms • EMPCA (Expectation-Maximization Principal component analysis) • ADMM (Alternating Direction Method of Multipliers) • Deep Learning • Distributed computing abstractions for data science - Useful for engineering RFPs and next gen architectures for clients. • Iterative computations – Apache Spark or Flink or MPI or Concord • Graph computations – TensorFlow, GraphLab, Titan. • High performance computations – CUDA.
  • 8. © Tally Solutions Pvt. Ltd. All Rights Reserved 88 Other Abstractions • Adhoc queries on large data sets. – Next gen SQL • Approximate queries – Blink DB • Adhoc queries – Apache Drill • Distributed SQL – Spark SQL, Apache Tajo, HAWQ • Data governance – Important for all big data architectures. • Security – policies for role based fine grained access control • Encryption – at-rest data as well as data-in-motion. • Tokenization/Obfuscation – data masking.
  • 9. © Tally Solutions Pvt. Ltd. All Rights Reserved 99 Future Abstractions • Internet of Things (IoT) Analytics • Fog/Edge computing – strong devices, weakly connected. • Data Lake Management – next gen data management • Hadoop based data lakes • NewSQL data stores
  • 10. © Tally Solutions Pvt. Ltd. All Rights Reserved 1010 Thank You! Mail • vagneeswaran@sapient.com LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran Blogs • blogs.impetus.com Twitter • @a_vijaysrinivas.