SPARK AND DEEP LEARNING FRAMEWORKS AT SCALE
Vartika Singh
2 © Cloudera, Inc. All rights reserved.
OBJECTIVE
• Enabling Machine Learning in the field
• Enablement and use case discovery
• Data and ML: what do we focus on?
• Typical data ingest architecture
• Extending Spark
• Deep Learning - how does it fit in?
• Hardware
DATA - MARKET PROPOSITION
Click Stream: Smart clicks, impressions, and conversions
Videos: Fraud, navigation, ad placement
Medical Data: Tumor detection, patient mortality, anomaly identification
City data: Planning, resource distribution
Wafer, Oil and gas data: Pipeline optimization, fault detection
?? ...
Ref: https://hbr.org/2017/05/whats-your-data-strategy
• Less than half of an organization’s structured data is actively used in making decisions
• Less than 1% of its unstructured data is analyzed or used at all
• More than 70% of employees have access to data they should not
• 80% of analysts’ time is spent simply discovering and preparing data
• Data breaches are common
• Rogue data sets propagate in silos
• Companies’ data technology often is not up to the demands put on it
Use case discovery
Model Serving
Hidden feedback loops
Undeclared consumer dependencies
Change in the external world
Ref: Hidden Technical Debt in Machine Learning ... - NIPS Proceedings
Is evolving Science
We are not very good at anticipating what the next emerging serious flaw will be.
What we’re missing is an engineering discipline with its principles of analysis and design.
Keep It Simple Stupid!
https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
Data
Processes
ML
● Deconstruct the problem.
● Democratize
● Paved Pathways
INTELLIGENT INFRASTRUCTURE!!!
CLOUDERA DATA SCIENCE WORKBENCH
OVERVIEW - PROJECTS
OVERVIEW - GPUS
OVERVIEW - WEBUIS
OVERVIEW - DISTRIBUTED COMPUTING WITH WORKERS
OTHER FEATURES
• Git
• S3/HDFS
EXPERIMENTS
• Create a snapshot of the model code, dependencies, and configuration necessary to train the model.
• Build and execute the training run in an isolated container.
• Track specified model metrics, performance, and model artifacts.
• Inspect, compare, or deploy prior models.
MODELS
• In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network - for example, each layer in the neural network may be assigned to a different machine.
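A minimal sketch of model parallelism, assuming a toy two-layer network. The `Worker` class and the weights are hypothetical stand-ins for machines that each hold one layer; in a real system the activations would travel over the network between them.

```python
# Toy model parallelism: each "machine" owns one layer of a single network.
# Worker and the layer weights below are hypothetical, for illustration only.

class Worker:
    """Stands in for one machine that holds the weights of one dense layer."""

    def __init__(self, name, weights, bias):
        self.name = name
        self.weights = weights  # one row of weights per output unit
        self.bias = bias

    def forward(self, inputs):
        # Weighted sum per output unit, followed by a ReLU activation.
        outputs = []
        for row, b in zip(self.weights, self.bias):
            z = sum(w * x for w, x in zip(row, inputs)) + b
            outputs.append(max(0.0, z))
        return outputs


def pipeline_forward(workers, inputs):
    """Pass activations from one worker to the next, layer by layer."""
    activations = inputs
    for worker in workers:
        activations = worker.forward(activations)
    return activations


machine_a = Worker("machine-a", weights=[[1.0, -1.0], [0.5, 0.5]], bias=[0.0, 0.0])
machine_b = Worker("machine-b", weights=[[1.0, 2.0]], bias=[0.5])

result = pipeline_forward([machine_a, machine_b], [2.0, 1.0])
print(result)
```

Each worker only ever sees its own layer's weights, which is the point of the technique: a network too large for one machine can still be evaluated, at the cost of moving activations between machines.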
• In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and the results from each machine are combined, for example by averaging gradients or parameters.
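A minimal sketch of data parallelism, assuming a one-weight linear model and gradient averaging as the combining step. Averaging is one common choice and is an assumption here, not something the slide specifies; the data and function names are hypothetical.

```python
# Toy data parallelism: every worker has the same model copy (a single weight w)
# and a different shard of the data; gradients are averaged to combine results.
# Gradient averaging is one common choice, assumed here for illustration.

def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    """Each shard yields a local gradient; the average drives one shared update."""
    local_grads = [gradient(w, shard) for shard in shards]  # one per worker
    avg_grad = sum(local_grads) / len(local_grads)
    return w - learning_rate * avg_grad


# Data generated from y = 3x, split into two equal shards (two workers).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, learning_rate=0.05)
print(round(w, 4))
```

With equal-sized shards, the averaged gradient equals the gradient over the full data set, so the workers collectively take the same step a single machine would.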
SPARK AND JNI
• OpenCV
• Tesseract
• Common Implementations using JavaCPP
Ref: https://github.com/bytedeco/javacpp
SPARK/HPC WORKLOADS
Gene Sequencing/ Assembling/ Analysis
• Data parallelism and statistical methods lie at the core of all DNA sequencing
workloads.
• Sequencing - Base calling
• Variant calling
• GATK - Can run on Spark
• Canu - Transform to PySpark workload using Python C extensions
• Analysis - HAIL
Ref: https://software.broadinstitute.org/gatk/
Ref: https://hail.is/
Ref: https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
HPC WORKLOADS
• Portions of the Hadoop ecosystem can open your grid to more users.
• PySpark allows a company that is using a legacy C++ grid to re-use its C++ library assets with little to no change. Python-to-C++ bindings result in minimal performance penalties.
• Cloudera Data Science Workbench (CDSW) allows data scientists to rapidly develop and visualize models with more involvement from the business.
• In infrastructures with direct-attached storage, Hadoop’s locality-based processing allows for fast, efficient movement of data between storage and compute.
• Deploying Hadoop on a portion or on all of your grid allows you to use the same tools on the grid that you would use on a cloud-based Hadoop cluster.
DEEP LEARNING IN BIG DATA
• A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we can observe.
• Deep learning solves this central problem via representation learning by introducing representations that are expressed in terms of other, simpler representations.
BIOINFORMATICS
• Protein Structure
• Gene Expression Regulation
• Protein Classification
• Anomaly Classification
• Segmentation
BIOINFORMATICS: THE NATURE OF DATA
• Complex and expensive data acquisition processes limit the size of bioinformatics datasets.
• Significantly unequal class distributions
• In clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group.
• Visualization
• Multimodal Deep Learning
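One common way to compensate for the unequal class distributions noted above is to weight each class inversely to its frequency during training. The slides do not name a specific remedy, so the sketch below is an assumption; the label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label set: 90 control samples, 10 treatment samples.
labels = ["control"] * 90 + ["treatment"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, the rare treatment class contributes as much total weight to the loss as the much larger control class.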
IOT
• A time series is a sequence of regular time-ordered observations
• Example: stock prices, weather readings, smartphone sensor data
• Challenges
• Large scale streaming data
• Heterogeneity
• Time and space correlation
• High-noise data
• Near-real-time (NRT) decisions on multimodal data
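Before a time series can feed a model, it is usually cut into fixed-length windows paired with the value to predict. A minimal sketch, assuming evenly spaced readings; the function name, `window`, `horizon`, and the sensor values are all hypothetical.

```python
def sliding_windows(series, window, horizon=1):
    """Split a time-ordered series into (input window, future value) pairs."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        pairs.append((inputs, target))
    return pairs


readings = [20.1, 20.4, 20.9, 21.3, 21.0, 20.7]  # hypothetical sensor readings
pairs = sliding_windows(readings, window=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```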
IOT DEVICES
• Network compression
• Convert to sparse network
• Not general enough
• Factors to consider
• Running time
• Energy consumption
• Architectural considerations
• FFL are much faster than convolution layers in CNN
• Activation functions (ReLU is more time-efficient than Tanh, which is more time-efficient than Sigmoid)
• CNNs use less storage than DNNs due to fewer stored parameters in convolutional layers
• Accelerators
• Tinymotes
• Fog Computing
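The storage point above follows from simple parameter arithmetic: a convolutional layer shares one small kernel across the whole image, while a fully connected layer needs a weight per input-output pair. A sketch with hypothetical layer sizes:

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a convolutional layer; independent of image size."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Hypothetical sizes: a 32x32 RGB image feeding either layer type.
dense = dense_params(32 * 32 * 3, 1024)  # flattened image to 1024 units
conv = conv2d_params(3, 3, 3, 64)        # 3x3 kernels, 3 channels in, 64 out
print(dense, conv)
```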
NLP
• Word Embeddings: GloVe, Word2Vec
• RNN -> LSTMs -> Attention Mechanism
• Applications
• Sentiment analysis
• Gene sequencing
• Natural language generation
DEEP LEARNING - THE HYPERPARAMETERS
• Architecture
• How many layers
• How many nodes/filters
• Which type
• Data
• Batch size
• Size of filters
• Number of steps the memory cells will learn
• Training
• Regularization
• Learning rate
• Gradient expressions
• Init policy
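The hyperparameters listed above multiply quickly into a large search space. A minimal sketch of enumerating a grid, assuming three hypothetical hyperparameters and values; each resulting configuration is independent, which is what makes tuning easy to distribute.

```python
from itertools import product

# Hypothetical search space; names and values are for illustration only.
search_space = {
    "num_layers": [2, 4],
    "batch_size": [32, 64, 128],
    "learning_rate": [0.1, 0.01],
}


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(grid(search_space))
print(len(configs))
```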
TRANSFER LEARNING
• Deep neural networks trained on natural images exhibit a curious phenomenon in common:
• In the first layer they learn features similar to Gabor filters and color blobs.
• Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks.
• Initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
• The effectiveness of feature transfer is expected to decline as the base and target tasks become less similar.
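A minimal sketch of the simplest form of feature transfer, assuming the transferred layer is kept frozen and only a new output layer is trained on the target task. The weights, data, and function names are hypothetical; real pipelines would use a deep learning framework rather than hand-written updates.

```python
def relu(z):
    return max(0.0, z)


def features(base_weights, x):
    """Frozen feature extractor: one ReLU unit per row of transferred weights."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in base_weights]


def train_head(base_weights, data, steps=200, learning_rate=0.05):
    """Fit only the new output weights; base_weights are never updated."""
    head = [0.0] * len(base_weights)
    for _ in range(steps):
        for x, y in data:
            f = features(base_weights, x)
            error = sum(h * fi for h, fi in zip(head, f)) - y
            # Squared-error gradient with respect to the head weights only.
            head = [h - learning_rate * 2.0 * error * fi for h, fi in zip(head, f)]
    return head


# Hypothetical weights "transferred" from a base task.
base_weights = [[1.0, 0.0], [0.0, 1.0]]
frozen_copy = [row[:] for row in base_weights]

# Hypothetical target task: y = 2*x0 + 1*x1 on non-negative inputs.
target_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
head = train_head(base_weights, target_data)
print([round(h, 3) for h in head])
```

Fine-tuning, as mentioned above, would go one step further and also update the transferred weights, typically with a small learning rate.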
SPARK DEEP LEARNING PIPELINES
• Transfer learning
• Distributed hyperparameter tuning
• Deploying models in SQL
DISTRIBUTED TRAINING - WHEN TO DO IT
• Distributed training isn’t free
• Setup time
• Continue to train your networks on a single machine, until the training time
becomes prohibitive
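A back-of-the-envelope way to judge when distribution pays off is to compare single-machine time against an idealized distributed estimate. The cost model and every number below are assumptions for illustration, not measurements.

```python
def single_machine_time(epochs, seconds_per_epoch):
    """Total training time on one machine, in seconds."""
    return epochs * seconds_per_epoch


def distributed_time(epochs, seconds_per_epoch, workers,
                     setup_seconds, sync_seconds_per_epoch):
    """Ideal 1/workers compute speedup plus fixed setup and per-epoch sync cost."""
    return setup_seconds + epochs * (seconds_per_epoch / workers + sync_seconds_per_epoch)


# Hypothetical jobs and overheads.
small = dict(epochs=10, seconds_per_epoch=60)
large = dict(epochs=100, seconds_per_epoch=3600)
overhead = dict(workers=4, setup_seconds=1800, sync_seconds_per_epoch=30)

for name, job in (("small", small), ("large", large)):
    print(name, single_machine_time(**job), distributed_time(**job, **overhead))
```

Under these assumed numbers the small job is slower when distributed, because setup and synchronization dominate, while the large job finishes several times faster.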
OPERATIONAL IMPLICATIONS
• Model exploration using small data
• Computational limits
• Irreducible errors
• Predictable
DNN
• Neurons and synapses
• Compute the weighted sum for each layer
• Compute the gradient of the loss relative to the filter inputs
• Compute the gradient of the loss relative to the weights
M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” arXiv preprint arXiv:1712.04301v1 [cs.NI], 2017.
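The three computations for a DNN layer (weighted sum, gradient relative to the inputs, gradient relative to the weights) can be sketched for a single fully connected layer without bias. The weights, inputs, and the choice of loss (the sum of the outputs) are hypothetical.

```python
def forward(weights, inputs):
    """Weighted sum for each output unit of one layer."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def backward(weights, inputs, grad_outputs):
    """Given dLoss/dOutputs, return (dLoss/dInputs, dLoss/dWeights)."""
    grad_inputs = [
        sum(weights[j][i] * grad_outputs[j] for j in range(len(weights)))
        for i in range(len(inputs))
    ]
    grad_weights = [[g * x for x in inputs] for g in grad_outputs]
    return grad_inputs, grad_weights


weights = [[1.0, 2.0], [3.0, 4.0]]
inputs = [0.5, -1.0]
outputs = forward(weights, inputs)

# Assume the loss is the sum of the outputs, so dLoss/dOutputs is all ones.
grad_inputs, grad_weights = backward(weights, inputs, [1.0, 1.0])
print(outputs, grad_inputs, grad_weights)
```

The backward pass needs the layer's inputs again, so a framework has to keep them in memory between the forward and backward computations.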
DEEP LEARNING AT SCALE
• Backpropagation requires intermediate outputs of the network to be preserved for the backward computation, so training has higher storage requirements than inference.
• Because gradients are used for hill-climbing, the precision requirement for training is generally higher than for inference.
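A small illustration of why training is sensitive to precision: a small gradient step can vanish entirely when the weight is stored in 16-bit floating point. The sketch uses the standard library's `struct` formats `'e'` (FP16) and `'f'` (FP32); the weight and update values are hypothetical.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the given IEEE 754 format ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 1.0
update = 1e-4  # hypothetical small gradient step

fp16_result = round_to("e", weight + update)
fp32_result = round_to("f", weight + update)

print("FP16 keeps the update:", fp16_result != weight)
print("FP32 keeps the update:", fp32_result != weight)
```

Inference only needs the final weights to be accurate enough for a forward pass, so it tolerates this rounding far better than an accumulation of many small updates does.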
DEEP LEARNING AT SCALE
• A significant amount of effort has been put into developing deep learning systems that can scale to very large models and large training sets
• Large models in the literature are now top performers in supervised visual recognition tasks
• Can even learn to detect objects when trained from unlabeled images alone
• The very largest of these systems are able to train neural networks with over 1 billion trainable parameters
HARDWARE FOR DNN
• Intel Knights Landing CPU features special vector instructions for deep learning
• Nvidia Pascal GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single-precision core for faster deep learning computation
• Systems have also been built specifically for DNN processing, such as Nvidia DGX-1 and Facebook’s Big Basin custom DNN server
• DNN inference has also been demonstrated on various embedded System-on-Chips (SoC) such as Nvidia Tegra and Samsung Exynos, as well as on FPGAs
GPU SUPPORT IN YARN
• As of now, only Nvidia GPUs are supported by YARN
• YARN node managers have to be pre-installed with Nvidia drivers.
• When Docker is used as the container runtime, nvidia-docker 1.0 needs to be installed (the nvidia-docker version currently supported by YARN).
• https://issues.apache.org/jira/browse/YARN-3926
• https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, Efficient Processing of Deep Neural Networks: A Tutorial and Survey
ACCELERATORS FOR TEMPORAL ARCHITECTURES
• The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, which can lead to either inefficiency in storage or a complex memory access pattern
• There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL) and GPUs (e.g., cuBLAS, cuDNN) that optimize matrix multiplications
• The matrix multiplications on these platforms can be further sped up by applying computational transforms to the data to reduce the number of multiplications
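The redundancy in the input feature map matrix can be seen in a small sketch that unrolls image patches into rows (often called im2col) and then computes the convolution as a matrix-vector product. The helper names, image, and kernel are hypothetical; optimized libraries do this with far more care.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix."""
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(rows)
        for c in range(cols)
    ]


def conv_as_matmul(image, kernel):
    """Valid cross-correlation computed as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    return [sum(p * w for p, w in zip(patch, flat_kernel)) for patch in im2col(image, k)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]

patches = im2col(image, 2)
outputs = conv_as_matmul(image, kernel)
stored_values = sum(len(patch) for patch in patches)
print(stored_values, "values stored for", 9, "input pixels")
print(outputs)
```

The patch matrix holds more values than the original image because overlapping patches copy the same pixels, which is the storage inefficiency described above.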
ACCELERATORS FOR SPATIAL ARCHITECTURES
• For DNNs, the bottleneck for processing is in the memory access
• Accelerators, such as spatial architectures, provide an opportunity to reduce the energy cost of data movement by introducing several levels of local memory hierarchy with different energy cost
• The multiple levels of memory hierarchy help to improve energy efficiency by providing low-cost data accesses
1) How do you collect your data?
2) Where do your data scientists play?
3) Let’s talk to the business
THANK YOU

Contenu connexe

Tendances

Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18Cloudera, Inc.
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloudera, Inc.
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseCloudera, Inc.
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformCloudera, Inc.
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersCloudera, Inc.
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissanceCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...Cloudera, Inc.
 

Tendances (20)

Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Cloudera SDX
Cloudera SDXCloudera SDX
Cloudera SDX
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the Enterprise
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera cluster
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data Platform
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game Changers
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...
Cloudera Fast Forward Labs: The Vision and the Challenge of Applied Machine L...
 

Similaire à Spark and Deep Learning Frameworks at Scale 7.19.18

Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
The Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningThe Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningCloudera, Inc.
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedCloudera, Inc.
 
Machine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationMachine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationDataWorks Summit
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
From Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationFrom Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationCloudera, Inc.
 
Parallel/Distributed Deep Learning and CDSW
Parallel/Distributed Deep Learning and CDSWParallel/Distributed Deep Learning and CDSW
Parallel/Distributed Deep Learning and CDSWDataWorks Summit
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitRafael Arana
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in EnterpriseJosh Yeh
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghData Con LA
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningYogesh Sharma
 
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Ed Dodds
 
110307 cloud security requirements gourley
110307 cloud security requirements gourley110307 cloud security requirements gourley
110307 cloud security requirements gourleyGovCloud Network
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchCloudera, Inc.
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndCloudera, Inc.
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptxRATISHKUMAR32
 

Similaire à Spark and Deep Learning Frameworks at Scale 7.19.18 (20)

Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Federated Learning
Federated LearningFederated Learning
Federated Learning
 
The Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningThe Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine Learning
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: Exposed
 
Machine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationMachine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to Implementation
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
 
From Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationFrom Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your Organization
 
Parallel/Distributed Deep Learning and CDSW
Parallel/Distributed Deep Learning and CDSWParallel/Distributed Deep Learning and CDSW
Parallel/Distributed Deep Learning and CDSW
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks Summit
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in Enterprise
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
 
110307 cloud security requirements gourley
110307 cloud security requirements gourley110307 cloud security requirements gourley
110307 cloud security requirements gourley
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science Workbench
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made Easy
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to End
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptx
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Cloudera, Inc.
 
Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Cloudera, Inc.
 

Spark and Deep Learning Frameworks at Scale 7.19.18

  • 1. SPARK AND DEEP LEARNING FRAMEWORKS AT SCALE Vartika Singh
  • 3. OBJECTIVE
      • Enabling machine learning in the field
      • Enablement and use case discovery
      • Data and ML: what do we focus on?
      • Typical data ingest architecture
      • Extending Spark
      • Deep learning: how does it fit in?
      • Hardware
  • 5. DATA - MARKET PROPOSITION
      • Click stream: smart clicks, impressions and conversions
      • Videos: fraud, navigation, ad placement
      • Medical data: tumor detection, patient mortality, anomaly identification
      • City data: planning, resource distribution
      • Wafer, oil and gas data: pipeline optimization, fault detection
      • ??: ...
  • 7. Ref: https://hbr.org/2017/05/whats-your-data-strategy
      • Less than half of an organization’s structured data is actively used in making decisions
      • Less than 1% of its unstructured data is analyzed or used at all
      • More than 70% of employees have access to data they should not
      • 80% of analysts’ time is spent simply discovering and preparing data
      • Data breaches are common
      • Rogue data sets propagate in silos
      • Companies’ data technology often is not up to the demands put on it
  • 9. Use case discovery; model serving; hidden feedback loops; undeclared consumer dependencies; change in the external world. Ref: Hidden Technical Debt in Machine Learning ... - NIPS Proceedings
  • 11. Is an evolving science
      • We are not very good at anticipating what the next emerging serious flaw will be.
      • What we’re missing is an engineering discipline with its principles of analysis and design.
      • Keep It Simple Stupid!
      https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
  • 13. Data, Processes, ML
      • Deconstruct the problem
      • Democratize
      • Paved pathways
  • 14. INTELLIGENT INFRASTRUCTURE!!!
  • 15. CLOUDERA DATA SCIENCE WORKBENCH
  • 16. OVERVIEW - PROJECTS
  • 17. OVERVIEW - GPUS
  • 18. OVERVIEW - WEBUIS
  • 19. OVERVIEW - DISTRIBUTED COMPUTING WITH WORKERS
  • 20. OTHER FEATURES
      • Git
      • S3/HDFS
  • 21. EXPERIMENTS
      • Create a snapshot of the model code, dependencies, and configuration necessary to train the model.
      • Build and execute the training run in an isolated container.
      • Track specified model metrics, performance, and model artifacts.
      • Inspect, compare, or deploy prior models.
  • 22. MODELS
  • 23. In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network. For example, each layer in the neural network may be assigned to a different machine.
  • 24. In data parallelism, every machine holds a complete copy of the model; each machine gets a different portion of the data, and the results from each are then combined (for example, by averaging gradients or parameters).
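As an illustration of the data-parallel idea on slide 24, here is a minimal single-process sketch. It assumes a linear model trained with mean squared error and simulates the "machines" as equal-sized shards in a Python list; the function names (`gradient`, `data_parallel_step`) and the choice of gradient averaging as the combination rule are assumptions made for the example, not something the slides specify.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous step: every worker holds the same w, computes a
    gradient on its own shard, and the gradients are averaged."""
    grads = [gradient(w, X_shard, y_shard) for X_shard, y_shard in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Four simulated workers, each seeing a different quarter of the data.
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, shards)
```

With equal-sized shards, the averaged gradient equals the gradient over the full dataset, so this behaves like ordinary full-batch gradient descent while each worker only touches its own portion of the data.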
  • 26. SPARK AND JNI
      • OpenCV
      • Tesseract
      • Common implementations using JavaCPP
      Ref: https://github.com/bytedeco/javacpp
  • 27. SPARK/HPC WORKLOADS: Gene sequencing, assembly, and analysis
      • Data parallelism and statistical methods lie at the core of all DNA sequencing workloads.
      • Sequencing: base calling
      • Variant calling
      • GATK: can run on Spark
      • Canu: transform to a PySpark workload using Python C extensions
      • Analysis: Hail
      Ref: https://software.broadinstitute.org/gatk/
      Ref: https://hail.is/
      Ref: https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
  • 28. HPC WORKLOADS
      • Portions of the Hadoop ecosystem can open your grid to more users.
      • PySpark allows a company that is using a legacy C++ grid to reuse its C++ library assets with little to no change. Python-to-C++ bindings result in minimal performance penalties.
      • Cloudera Data Science Workbench (CDSW) allows data scientists to rapidly develop and visualize models with more involvement from the business.
      • In infrastructures with direct-attached storage, Hadoop’s locality-based processing allows for fast, efficient movement of data between storage and compute.
      • Deploying Hadoop on a portion or on all of your grid allows you to use the same tools on the grid that you would use on a cloud-based Hadoop cluster.
  • 29. DEEP LEARNING IN BIG DATA
      • A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we can observe.
      • Deep learning solves this central problem via representation learning by introducing representations that are expressed in terms of other, simpler representations.
  • 30. BIOINFORMATICS
      • Protein structure
      • Gene expression regulation
      • Protein classification
      • Anomaly classification
      • Segmentation
  • 31. BIOINFORMATICS: THE NATURE OF DATA
      • Complex and expensive data acquisition processes limit the size of bioinformatics datasets.
      • Significantly unequal class distributions
      • In clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group.
      • Visualization
      • Multimodal deep learning
  • 32. IOT
      • A time series is a sequence of regular time-ordered observations
      • Examples: stock prices, weather readings, smartphone sensor data
      • Challenges
          • Large-scale streaming data
          • Heterogeneity
          • Time and space correlation
          • High-noise data
          • Near-real-time (NRT) decisions on multimodal data
  • 33. IOT DEVICES
      • Network compression
          • Convert to a sparse network
          • Not general enough
      • Factors to consider
          • Running time
          • Energy consumption
      • Architectural considerations
          • FFL are much faster than convolution layers in CNNs
          • Activation functions (ReLU is more time-efficient than tanh, which is more time-efficient than sigmoid)
          • CNNs use less storage than DNNs due to fewer stored parameters in convolutional layers
      • Accelerators
      • Tinymotes
      • Fog computing
  • 34. NLP
      • Word embeddings: GloVe, Word2Vec
      • RNN -> LSTMs -> Attention mechanism
      • Applications
          • Sentiment analysis
          • Gene sequencing
          • Natural language generation
  • 35. DEEP LEARNING - THE HYPERPARAMETERS
      • Architecture
          • How many layers
          • How many nodes/filters
          • Which type
      • Data
          • Batch size
          • Size of filters
          • Number of steps the memory of cells will learn
      • Training
          • Regularization
          • Learning rate
          • Gradient expressions
          • Init policy
  • 36. TRANSFER LEARNING
  • 37. TRANSFER LEARNING
      • Deep neural networks trained on natural images exhibit a curious phenomenon in common:
          • In the first layer they learn features similar to Gabor filters and color blobs.
          • Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks.
      • Initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
      • The effectiveness of feature transfer is expected to decline as the base and target tasks become less similar.
  • 38. SPARK DEEP LEARNING PIPELINES
      • Transfer learning
      • Distributed hyperparameter tuning
      • Deploying models in SQL
  • 39. DISTRIBUTED TRAINING - WHEN TO DO IT
      • Distributed training isn’t free
      • Setup time
      • Continue to train your networks on a single machine until the training time becomes prohibitive
  • 40. OPERATIONAL IMPLICATIONS
      • Model exploration using small data
      • Computational limits
      • Irreducible errors
      • Predictable
  • 41. DNN
      • Neurons and synapses
      • Compute the weighted sum for each layer
      • Compute the gradient of the loss relative to the filter inputs
      • Compute the gradient of the loss relative to the weights
      M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” arXiv preprint arXiv:1712.04301v1 [cs.NI], 2017.
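The three computations listed on slide 41 can be sketched for a single fully connected layer. This is a minimal example under stated assumptions: no activation function, a squared-error loss, and hypothetical helper names (`forward`, `backward`). The slide mentions filter inputs, which suggests a convolutional layer, but the same chain-rule pattern applies there.

```python
import numpy as np

def forward(x, W):
    """Weighted sum for one fully connected layer (no activation)."""
    return W @ x

def backward(x, W, grad_out):
    """Given dLoss/dOutput, return (dLoss/dInput, dLoss/dWeights)."""
    grad_x = W.T @ grad_out         # gradient relative to the layer inputs
    grad_W = np.outer(grad_out, x)  # gradient relative to the weights
    return grad_x, grad_W

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
target = np.array([1.0, -1.0])

y = forward(x, W)
loss = 0.5 * np.sum((y - target) ** 2)

# For this loss, dLoss/dOutput is simply (y - target).
grad_x, grad_W = backward(x, W, y - target)
```

The gradient relative to the inputs is what gets passed back to the previous layer, while the gradient relative to the weights is what the optimizer uses to update this layer.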
  • 42. DEEP LEARNING AT SCALE
      • First, backpropagation requires intermediate outputs of the network to be preserved for the backward computation, so training has increased storage requirements.
      • Second, because gradients are used for hill-climbing, the precision requirement for training is generally higher than for inference.
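One way to see why training can need more precision than inference is to accumulate many small updates in a low-precision format. The sketch below is illustrative only: the update size, step count, and the `accumulate` helper are assumptions chosen for the example, and it models a single weight rather than a real training run.

```python
import numpy as np

def accumulate(update, steps, dtype):
    """Repeatedly add a small update to a weight stored in the given dtype."""
    w = dtype(1.0)
    for _ in range(steps):
        w = dtype(w + dtype(update))
    return float(w)

# The same 1000 small updates applied in half and single precision.
w16 = accumulate(1e-4, 1000, np.float16)
w32 = accumulate(1e-4, 1000, np.float32)
```

Near 1.0 the spacing between adjacent float16 values is about 0.001, so each 0.0001 update rounds away and `w16` never moves, while `w32` ends up close to 1.1. A forward pass alone does not accumulate small increments this way, which is one reason reduced precision tends to be easier to apply to inference.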
  • 43. DEEP LEARNING AT SCALE
      • A significant amount of effort has been put into developing deep learning systems that can scale to very large models and large training sets
      • Large models in the literature are now top performers in supervised visual recognition tasks
      • They can even learn to detect objects when trained from unlabeled images alone
      • The very largest of these systems are able to train neural networks with over 1 billion trainable parameters
  • 44. HARDWARE FOR DNN
      • Intel Knights Landing CPU features special vector instructions for deep learning
      • Nvidia Pascal GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single-precision core for faster deep learning computation
      • Systems have also been built specifically for DNN processing, such as Nvidia DGX-1 and Facebook’s Big Basin custom DNN server
      • DNN inference has also been demonstrated on various embedded systems-on-chip (SoCs) such as Nvidia Tegra and Samsung Exynos, as well as on FPGAs
  • 45. GPU SUPPORT IN YARN
      • As of now, only Nvidia GPUs are supported by YARN
      • YARN node managers have to be pre-installed with Nvidia drivers.
      • When Docker is used as the container runtime, nvidia-docker 1.0 needs to be installed (the nvidia-docker version currently supported by YARN).
      • https://issues.apache.org/jira/browse/YARN-3926
      • https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
  • 46. Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”
  • 48. ACCELERATORS FOR TEMPORAL ARCHITECTURES
      • The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, which can lead to either inefficiency in storage or a complex memory access pattern
      • There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL) and GPUs (e.g., cuBLAS, cuDNN) that optimize for matrix multiplications
      • The matrix multiplications on these platforms can be further sped up by applying computational transforms to the data to reduce the number of multiplications
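The redundancy mentioned on slide 48 shows up directly when a convolution is lowered to a matrix multiplication. The sketch below uses the common "im2col" unrolling on a single-channel 2-D input with a square kernel, stride 1, and no padding; these restrictions and the helper names are assumptions made to keep the example short, and optimized libraries use far more elaborate implementations.

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of a 2-D input into one row of a matrix."""
    h, w = x.shape
    rows = [x[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1)
            for j in range(w - k + 1)]
    return np.array(rows)

def conv2d_as_matmul(x, kernel):
    """CONV-layer operation (valid cross-correlation) as one matrix product."""
    k = kernel.shape[0]  # assumes a square kernel
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (im2col(x, k) @ kernel.ravel()).reshape(out_h, out_w)

x = np.arange(16, dtype=np.float64).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

patches = im2col(x, 2)           # 9 patches of 4 values each
out = conv2d_as_matmul(x, kernel)
redundancy = patches.size / x.size
```

Here the 16-element input is expanded into a 9 x 4 patch matrix of 36 elements, because overlapping patches repeat the same input values. That duplication is the storage cost the slide refers to, paid in exchange for being able to call a tuned matrix-multiplication routine.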
  • 49. ACCELERATORS FOR SPATIAL ARCHITECTURES
      • For DNNs, the bottleneck for processing is in the memory access
      • Accelerators, such as spatial architectures, provide an opportunity to reduce the energy cost of data movement by introducing several levels of local memory hierarchy with different energy costs
      • The multiple levels of memory hierarchy help to improve energy efficiency by providing low-cost data accesses
  • 50. Questions
      1) How do you collect your data?
      2) Where do your data scientists play?
      3) Let’s talk to the business